Dynamic direct access of MSO query evaluation over strings

Pierre Bourhis, Florent Capelli, Stefan Mengel and Cristian Riveros

CRIL, Université d’Artois

ICDT 2025

March 2025

Variable Set Automata

Tagging positions in words

Automaton with variables tagging positions:

 
ww b a a b a a a b a b a a
Positions 1 2 3 4 5 6 7 8 9 10 11 12
Run 1 xx yy
Run 2 xx yy
Run 3 xx yy
A(w)\llbracket A \rrbracket(w) xx yy
(run 1) 2 4
3 4
(run 2) 5 8
6 8
7 8
(run 3) 9 10

Gist of the paper: given ww, build a data structure allowing to access each tuple in A(w)\llbracket A \rrbracket(w) efficiently.

Properties of vset automaton

  • Determinism¹: two edges going out of state qq have distinct labels.
Forbidden
Allowed
  • Functionality: every path from q0q_0 to a final state tags each variable exactly once.
Not functional
Functional version

For each state qq of a functional automaton, there is a set of variables XqX_q that are tagged before reaching qq.

¹: We actually do not need full determinism but only unambiguity: each tuple of A(w)\llbracket A \rrbracket(w) has exactly one corresponding run.

Normalization

Let AA be an vset automaton. One can construct a deterministic function vset automaton AA' such that A=A\llbracket A \rrbracket = \llbracket A' \rrbracket.

  • Intuition: automata over states 2QX2^{Q \cup X}.
  • AA' may be of size exp(A)exp(A)

In this talk: AA is considered constant, hence assumed to be deterministic and functional.

Direct Access for vset automata

Direct access queries

Fix a (deterministic functional) automaton A(x1,,xk)A(x_1,\dots,x_k) (considered constant).

A direct access query for a word wΣ*w \in \Sigma^*:

  • input integer kk,
  • output the kkth tuple of A(w)\llbracket A \rrbracket(w) or,
  • fails if k>#A(w)k > \#\llbracket A \rrbracket(w).

where A(w)\llbracket A \rrbracket(w) is ordered by lexicographical ordering on [n]X[n]^X.

A[3]\llbracket A \rrbracket[3]

A[10]\llbracket A \rrbracket[10]

A[5]\llbracket A \rrbracket[5]

A(w)\llbracket A \rrbracket(w) xx yy
2 4
3 4
5 8
6 8
7 8
9 10

Direct access complexity

Given wΣ*w \in \Sigma^* of length nn:

  • Precomputation phase: construct a data structure DwD_w in time p(n)p(n).
  • Access phase: given kk, output A(w)[k]\llbracket A \rrbracket(w)[k] in time a(n)a(n) using DwD_w.

Naive approach:

  • DwD_w is a materialization of A(w)\llbracket A \rrbracket(w) in an array. At least O(#A(w))O(\#\llbracket A \rrbracket(w))
  • Access phase: read the kkth entry of DwD_w. O(1)O(1)

Can we have better preprocessing without hurting access time too much?

Contribution

We show that we can solve direct access with:

  • Linear time precomputation O(n)O(n),
  • Polylogarithmic access time O(log2(n))O(\log^2(n)).

O()O(\cdot) notation hides constants depending on |A||A| but they are all polynomially bounded if AA is deterministic and unambiguous.

Data structure idea

Counting reduction

Let τ=A(w)[k]\tau = \llbracket A \rrbracket(w)[k].

τ(x1)\tau(x_1) is the first position p1p_1 for which #Ax1p1(w)k\#\llbracket A \rrbracket_{x_1 \leq p_1}(w) \geq k

x1x_1 xkx_k
<p1< p_1
p1p_1
kk p1p_1
p1p_1
>p1> p_1
  • Binary search to find p1p_1: O(logn)O(\log n) calls to computing #Ax1p(w)k\#\llbracket A \rrbracket_{x_1 \leq p}(w) \geq k by changing pp
  • We proceed recursively to find τ(x2)\tau(x_2): first value pp such that

#Ax1=p1,x2p(w)k\#\llbracket A \rrbracket_{x_1 = p_1, x_2 \leq p}(w) \geq k #Ax1<p1(w)-\#\llbracket A \rrbracket_{x_1 < p_1}(w)

Transition matrices

Ma[q,q]=1M_a[q,q'] = 1 if and only if there is a transition from qq to qq' labeled by aa.

(Ma1Man)[q,q]=(M_{a_1} \cdot \dots \cdot M_{a_n})[q,q'] = number of paths from qq to qq' reading w=a1anw = a_1 \dots a_n.

qfF(Ma1Man)[q0,qf]=#A(w)\sum_{q_f \in F} (M_{a_1} \cdot \dots \cdot M_{a_n})[q_0,q_f] = \#\llbracket A \rrbracket(w)

Max[q,q]=1M_a^{x}[q,q'] = 1 if and only if there is a transition from qq to qq' labeled by aa and xx has been tagged before qq' (which only depends on qq' by functionality).

qfF(Ma1MapxMan)[q0,qf]=#Axp(w)\sum_{q_f \in F} (M_{a_1} \cdot \dots \cdot M_{a_p}^x \cdot \dots \cdot M_{a_n})[q_0,q_f] = \#\llbracket A \rrbracket_{x \leq p}(w)

Maintaining products

To find τ(x1)\tau(x_1) We need to find the smallest pp such that:

#Ax1p(w)k\#\llbracket A \rrbracket_{x_1 \leq p}(w) \geq k

Binary search by computing the product:

qfF(Ma1MapxMan)[q0,qf]=#Axp(w)\sum_{q_f \in F} (M_{a_1} \cdot \dots \cdot M_{a_p}^x \cdot \dots \cdot M_{a_n})[q_0,q_f] = \#\llbracket A \rrbracket_{x \leq p}(w)

  • Recomputing this product is too costly: O(n)O(n) matrix multiplications each time.
  • Ma1MapxManM_{a_1} \cdot \dots \cdot M_{a_p}^x \cdot \dots \cdot M_{a_n} and Ma1MapxManM_{a_1} \cdot \dots \cdot M_{a_{p'}}^x \cdot \dots \cdot M_{a_n} are almost the same.
  • If Pi:=Ma1MaiP_i := M_{a_1} \cdot \dots \cdot M_{a_i} and Qj:=MajManQ_j := M_{a_j} \cdot \dots \cdot M_{a_n} have been precomputed for every i,ji,j:

Ma1MapxMan=Pp1MapxQp+1M_{a_1} \cdot \dots \cdot M_{a_p}^x \cdot \dots \cdot M_{a_n} = P_{p-1} \cdot M_{a_p}^x \cdot Q_{p+1}

Each product in the binary search can be computed with two products.

Final data structure

Data structure DD to represent a product A1ArA_1 \cdot \dots \cdot A_r in a semigroup such that one can quickly update DD so that it represents the product where AiA_i is replaced by BiB_i.

Dynamic word problem.

 
 

Dynamic words

Updates

In this setting, one can “update” word ww, e.g., insert a letter at the end.

Can we use the data structure for ww to construct a data structure for wa{w \cdot a} more efficiently than rerunning O(n)O(n)-precomputation?

We can update the word in O(logn)O(\log n).

Cut and paste update

Given data structures D1D_1 for w1w_1 and D2D_2 for w2w_2, k|w1|,i,j|w2|k \leq |w_1|, i,j \leq |w_2|:

w1=a1anw_1 = a_1 \dots a_n, w2=b1bnw_2 = b_1 \dots b_n

We can construct D1D_1' for w1w_1' and D2D_2' for w2w_2' where

w1=w_1' = a1aka_1 \dots a_k bibjb_i \dots b_j ak+1ana_{k+1} \dots a_n

w2=w_2' = b1bj1bj+1bmb_1 \dots b_{j-1} b_{j+1} \dots b_m

in time O(log(n+m))O(\log (n+m)).

Cut and paste application

Cut and paste covers:

  • Concatenation of words (cut w2w_2 completely and paste at the end of w1w_1)
  • Insertion of letter (create data structure for w2=aw_2 = a in O(1)O(1), cut it and paste it in w1w_1)
  • Remove substring (cut the substring)
  • Update letter (cut the letter and insert a new one in its place)

And so on!

Data structure idea

  • Maintain matrix product in a balanced tree
  • Use AVL tree operations to keep it balanced

Future work

  • Implementations:
    • reasonable data structures
    • updates may be cheap enough to maintain a query on a code base.
  • Generalizations:
    • Orders that are not lexicographical
    • MSO over trees.