A Simple Algorithm for Worst Case Optimal Join and
Sampling
Florent Capelli, Oliver Irwin, Sylvain Salvati
CRIL, Université d’Artois
Journées du GT DAAL 2025
13 May 2025
Joining like a pro
|   |   |
|---|---|
| 0 | 0 |
| 0 | 1 |
| 2 | 1 |

|   |   |
|---|---|
| 0 | 0 |
| 0 | 2 |
| 2 | 3 |

|   |   |
|---|---|
| 0 | 2 |
| 1 | 0 |
| 1 | 2 |
- Devise a query plan:
- Materialize the intermediate joins.
|   |   |   |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 2 |
| 0 | 1 | 0 |
| 0 | 1 | 2 |
| 2 | 1 | 3 |
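For illustration, here is a minimal Python sketch of this "pro" approach on the toy tables above. The relation and attribute names (R(x, y), S(x, z), T(y, z)) are assumptions chosen to match those tables, not necessarily the ones on the original slide.

```python
from collections import defaultdict

R = {(0, 0), (0, 1), (2, 1)}   # pairs (x, y)
S = {(0, 0), (0, 2), (2, 3)}   # pairs (x, z)
T = {(0, 2), (1, 0), (1, 2)}   # pairs (y, z)

def hash_join_on_x(r, s):
    """Materialize r(x, y) ⋈ s(x, z) as a set of (x, y, z) triples."""
    index = defaultdict(list)
    for x, z in s:
        index[x].append(z)
    return {(x, y, z) for x, y in r for z in index[x]}

intermediate = hash_join_on_x(R, S)   # materialized: may dwarf the final answer
answer = {(x, y, z) for (x, y, z) in intermediate if (y, z) in T}

print(sorted(intermediate))   # 5 materialized tuples
print(sorted(answer))         # 3 answers: (0, 0, 2), (0, 1, 0), (0, 1, 2)
```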
Joining like a brute
|   |   |
|---|---|
| 0 | 0 |
| 0 | 1 |
| 2 | 1 |

|   |   |
|---|---|
| 0 | 0 |
| 0 | 2 |
| 2 | 3 |

|   |   |
|---|---|
| 0 | 2 |
| 1 | 0 |
| 1 | 2 |
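For comparison, a minimal sketch of the "brute" approach on the same toy tables: fix the variables one at a time and prune every partial assignment that some relation already rules out. The variable order and names are again illustrative assumptions.

```python
R = {(0, 0), (0, 1), (2, 1)}   # constrains (x, y)
S = {(0, 0), (0, 2), (2, 3)}   # constrains (x, z)
T = {(0, 2), (1, 0), (1, 2)}   # constrains (y, z)

def brute_join():
    answers = []
    for x in {a for a, _ in R} & {a for a, _ in S}:    # x consistent with R and S
        for y in {b for a, b in R if a == x}:          # y consistent with R given x
            for z in {c for b, c in T if b == y}:      # z consistent with T given y
                if (x, z) in S:                        # final check against S
                    answers.append((x, y, z))
    return answers

print(sorted(brute_join()))   # [(0, 0, 2), (0, 1, 0), (0, 1, 2)]
```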
Disruptive poll
In theory, is it better to join like:
- A pro
- A brute
No confidence vote
- Go home, you are not qualified to talk about joins after saying dumb things like that.
- We are all theorists here, please tell us the whole story.
What is wrong with joining like a pro
- It is known that if the intermediate joins remain small, then a query plan computes the answers efficiently.
- Intermediate joins, however, may have many more answers than the query itself!
Worst scenario for query plans
Consider the triangle query on the domain $\{0, 1, \dots, n\}$ with:
- three relations, each consisting of all pairs that contain the value $0$ (about $2n$ tuples each).
Every query plan will materialize a table of size $\Theta(n^2)$, but the answer table will never be of size greater than $O(n)$.
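For concreteness, a standard family of instances with this behaviour (the precise relations below are an illustrative assumption) is, for the triangle query $R(x, y) \wedge S(y, z) \wedge T(z, x)$:

$$
R = S = T = (\{0\} \times \{0, \dots, n\}) \cup (\{0, \dots, n\} \times \{0\}),
\qquad |R| = |S| = |T| = 2n + 1 .
$$

Any pairwise join, e.g. $R(x, y) \bowtie S(y, z)$, contains every tuple $(a, 0, b)$ and hence has $\Theta(n^2)$ tuples, while the triangle itself only has the answers $(a, 0, 0)$, $(0, b, 0)$ and $(0, 0, c)$, i.e. $O(n)$ of them.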
And the brute?
Domain $\{0, 1, \dots, n\}$ with the same relations as before (every tuple contains the value $0$).
If we do the “else” branches efficiently (e.g. by reading values from one table), the algorithm makes only $O(n)$ recursive calls.
Worst-case optimal join
Ideal complexity: output $Q(D)$ in time $\tilde{O}(|D| + |Q(D)|)$ …
… unlikely to be possible.
Worst case optimal: output $Q(D)$ in time $\tilde{O}_Q(|D| + \mathsf{wc}(Q, N))$.
$N$ is the size of the largest input relation and $\tilde{O}$ ignores polylog factors.
$\tilde{O}_Q$: data complexity, i.e., the factor $f(Q)$ depending only on $Q$ is considered constant. Ideally, $f(Q)$ is a reasonable polynomial though.
Worst case value
Consider a join query $Q$ and all databases for $Q$ with a bound $N$ on the table size:
$$\mathcal{D}(Q, N) = \{ D \mid \text{every relation of } D \text{ has at most } N \text{ tuples} \}$$
and let:
$$\mathsf{wc}(Q, N) = \max_{D \in \mathcal{D}(Q, N)} |Q(D)|.$$
$\mathsf{wc}(Q, N)$ is the worst case: the size of the biggest answer set possible with query $Q$ and databases where each table is bounded by $N$.
Worst case examples
- Cartesian product of two relations: $\mathsf{wc}(Q, N) = N^2$.
- Similarly: a product of $k$ relations has $\mathsf{wc}(Q, N) = N^k$.
- Square query (the 4-cycle): $\mathsf{wc}(Q, N) = N^2$.
- Triangle query $R(x, y) \wedge S(y, z) \wedge T(z, x)$: $\mathsf{wc}(Q, N) = N^{3/2}$.
- The $n$-cycle: $\mathsf{wc}(Q, N) = N^{n/2}$.
We know how to compute $\rho^*(Q)$ such that $\mathsf{wc}(Q, N) = \Theta(N^{\rho^*(Q)})$, but we do not need it!
This is known as the AGM bound.
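For reference, the usual statement (with $\rho^*(Q)$ the optimal value of the fractional edge cover LP; this notation is not from the slides):

$$
\mathsf{wc}(Q, N) = \Theta\bigl(N^{\rho^*(Q)}\bigr),
\qquad
\rho^*(Q) = \min\Bigl\{ \sum_{R \in Q} x_R \;\Bigm|\; x_R \ge 0 \text{ and } \sum_{R \ni v} x_R \ge 1 \text{ for every variable } v \Bigr\}.
$$

For the triangle query, taking $x_R = x_S = x_T = \tfrac{1}{2}$ gives $\rho^* = \tfrac{3}{2}$, hence the $N^{3/2}$ above.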
Worst case optimal join (WCOJ) algorithms
A join algorithm is worst case optimal (wrt $\mathcal{D}(Q, N)$) if for every $Q$, $N$ and $D \in \mathcal{D}(Q, N)$, it computes $Q(D)$ in time
$$\tilde{O}\bigl(f(Q) \cdot (|D| + \mathsf{wc}(Q, N))\bigr).$$
- Data complexity model: $f(Q)$ is considered constant, hence hidden in the $\tilde{O}_Q$ notation.
- In this talk, $f(Q)$ will be a reasonable polynomial!
The DBMS approach is not worst case optimal (triangle example from
before).
Existing WCOJ algorithms
Rich literature:
- NPRR join (Ngo, Porat, Ré, Rudra, PODS12): usual join plans, but with relations partitioned into high/low degree tuples.
- Leapfrog Triejoin (Veldhuizen, ICDT14) and Generic Join (Ngo, PODS18): both branch and bound algorithms like ours, but with more complex analyses and data structures.
- PANDA (PODS17): handles complex database constraints; very complex, long analysis.
We prove the worst case optimality of the branch and bound algorithm
in an elementary way.
Problem statement
Given $Q$ and $D$, sample $t \in Q(D)$ with probability $1/|Q(D)|$, or fail if $Q(D) = \emptyset$.
Naive algorithm:
- materialize $Q(D)$ in a table,
- sample a row uniformly,
- output it.
Complexity using WCOJ: $\tilde{O}_Q(|D| + \mathsf{wc}(Q, N))$.
We can do better: (expected) time
$$\tilde{O}_Q\Bigl(|D| + \frac{\mathsf{wc}(Q, N)}{|Q(D)|}\Bigr)$$
PODS 23: [Deng, Lu, Tao] and [Kim, Ha, Fletcher, Han].
Let’s do a modular proof of this fact!
Revisiting the problem
Sampling answers reduces to sampling success leaves in a tree whose leaves are labeled success or failure: the tree of recursive calls of the branch and bound join, where success leaves are exactly the answers.
Sampling leaves, the easy way
- $s(v)$: the exact number of success leaves below node $v$ is known.
- Recursively sample uniformly a success leaf in the subtree of a child $c$, chosen with probability $s(c)/s(v)$.
- A success leaf will hence be sampled with probability $\prod s(c)/s(v) = 1/s(\mathsf{root})$ (the product telescopes). Uniform!
In our case, we do not know $s(v)$…
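A minimal Python sketch of this easy case, with a toy tree representation (nested lists with "success"/"failure" leaves); this only illustrates the sampling scheme, not the data structures of the talk.

```python
import random

# A node is either the string "success", the string "failure",
# or a list of child nodes.
def count_success(node):
    """Exact number of success leaves below a node."""
    if isinstance(node, str):
        return 1 if node == "success" else 0
    return sum(count_success(child) for child in node)

def sample_uniform(node):
    """Uniformly sample a success leaf (assumes at least one exists)."""
    if isinstance(node, str):
        return node
    weights = [count_success(child) for child in node]
    child = random.choices(node, weights=weights)[0]   # descend proportionally
    return sample_uniform(child)

tree = [["success", "failure"], "success", ["failure", ["success"]]]
print(sample_uniform(tree))   # reaches each of the 3 success leaves with probability 1/3
```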
Sampling leaves with a nice oracle
- $U(v)$: an upper bound on the number of success leaves below $v$ is known.
- Recursively sample a success leaf in the subtree of a child $c$, chosen with probability $U(c)/U(v)$.
- Fail with the remaining probability $1 - \sum_c U(c)/U(v)$, or upon encountering a failure leaf.
Only makes sense if $\sum_c U(c) \le U(v)$.
Las Vegas uniform sampling algorithm:
- each success leaf is output with probability $1/U(\mathsf{root})$ (taking $U = 1$ on success leaves),
- fails with probability $1 - s(r)/U(r)$, where $s(v)$ is the number of success leaves under $v$.
Repeat until output: $U(r)/s(r)$ expected calls, where $r$ is the root.
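A minimal sketch of the Las Vegas variant on the same toy representation; `bound` stands for the oracle and is assumed to overestimate the number of success leaves, to be superadditive, and to equal 1 on success leaves.

```python
import random

# A node is "success", "failure", or a list of children.
def one_attempt(node, bound):
    """One descent: returns a success leaf or None (failure)."""
    if isinstance(node, str):
        return node if node == "success" else None     # failure leaf: give up
    r = random.random() * bound(node)
    for child in node:
        if r < bound(child):
            return one_attempt(child, bound)
        r -= bound(child)
    return None   # explicit failure, taken with probability 1 - sum_c bound(c)/bound(node)

def las_vegas_sample(root, bound):
    """Repeat until success; uniform over success leaves (loops if there are none)."""
    while True:          # expected number of rounds: bound(root) / #success leaves
        leaf = one_attempt(root, bound)
        if leaf is not None:
            return leaf

def count_leaves(node):
    """A simple valid oracle: the total number of leaves below a node."""
    return 1 if isinstance(node, str) else sum(count_leaves(c) for c in node)

tree = [["success", "failure"], "success", ["failure", ["success"]]]
print(las_vegas_sample(tree, count_leaves))
```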
Upper bound oracles for conjunctive queries
- Node $v$: a partial assignment $\tau$ of the variables.
- Number of success leaves below $v$: the number of answers of $Q(D)$ extending $\tau$.
- For $U$: look for worst case bounds!
AGM bound: there exist positive rational numbers $(x_R)_{R \in Q}$ such that
$$|Q(D)| \;\le\; \prod_{R \in Q} |R|^{x_R}.$$
Define $U(\tau) := \prod_{R \in Q} |\sigma_\tau(R)|^{x_R}$, where $\sigma_\tau(R)$ is the set of tuples of $R$ compatible with $\tau$:
- it is an upper bound on the number of answers extending $\tau$,
- it is superadditive: $\sum_{d} U(\tau \cup \{x \mapsto d\}) \le U(\tau)$,
- its value at the root of the tree is $\prod_{R \in Q} |R|^{x_R} \le N^{\sum_R x_R}$, i.e. at most the worst case bound!
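A sketch of what such an oracle could look like for a partial assignment, following the AGM-style construction just described; the data layout and the choice of weights are illustrative assumptions.

```python
def agm_upper_bound(assignment, relations, weights):
    """Product over relations of (#tuples compatible with the partial
    assignment) ** x_R, where the x_R form a fractional edge cover.

    relations:  dict name -> (tuple of variables, set of tuples)
    weights:    dict name -> fractional edge cover weight x_R
    assignment: dict variable -> value (the partial assignment tau)
    """
    total = 1.0
    for name, (variables, tuples) in relations.items():
        compatible = sum(
            1 for t in tuples
            if all(assignment.get(v, val) == val for v, val in zip(variables, t))
        )
        total *= compatible ** weights[name]
    return total

# Example: triangle-shaped query with weights 1/2 on the toy tables from the start.
relations = {
    "R": (("x", "y"), {(0, 0), (0, 1), (2, 1)}),
    "S": (("x", "z"), {(0, 0), (0, 2), (2, 3)}),
    "T": (("y", "z"), {(0, 2), (1, 0), (1, 2)}),
}
weights = {"R": 0.5, "S": 0.5, "T": 0.5}
print(agm_upper_bound({}, relations, weights))          # root value: 3**1.5 ≈ 5.2
print(agm_upper_bound({"x": 0}, relations, weights))    # bound with x = 0 fixed: ≈ 3.46
```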
Wrapping up sampling
Given a superadditive function $U$ upper-bounding the number of success leaves in the tree at each node, we have:
Las Vegas uniform sampling algorithm:
- each leaf / answer is output with probability $1/U(\mathsf{root})$,
- fails with probability $1 - |Q(D)|/U(\mathsf{root})$.
Repeat until output: $U(\mathsf{root})/|Q(D)|$ expected calls.
Final complexity: binarize the tree to navigate it with logarithmic overhead per step:
$$\tilde{O}_Q\Bigl(|D| + \frac{\mathsf{wc}(Q, N)}{|Q(D)|}\Bigr)$$
Matches existing results, proof more modular.
Beyond Cardinality Constraints
Worst case and constraints
So far we have considered the worst case wrt this class: $\mathcal{D}(Q, N)$.
Each relation is subject to a cardinality constraint of size $N$.
What if we know that our instance has some extra properties (e.g., a functional dependency)?
- We know $D \in \mathcal{C}$.
- We want the join to run in $\tilde{O}_Q(|D| + \mathsf{wc}(Q, \mathcal{C}))$ where $\mathsf{wc}(Q, \mathcal{C}) = \max_{D' \in \mathcal{C}} |Q(D')|$.
In this case, we say that our algorithm is worst case optimal wrt $\mathcal{C}$.
Finer constraints can help
Take $Q = R(x, y) \wedge S(y, z)$.
We have: $\mathsf{wc}(Q, N) = N^2$.
- Let $\mathcal{C}$ be the class of databases where $|R|, |S| \le N$ and $S$ respects the functional dependency $y \to z$.
- $\mathsf{wc}(Q, \mathcal{C}) \le N$ because each tuple of $R$ can be extended to at most one solution.
Is our simple join worst case optimal for this class?
Short answer: yes, if $y$ is set before $z$.
Prefix closed classes
Recall the complexity of our algorithm: it is governed by the number of consistent partial assignments over each prefix of the variable order.
A class of databases $\mathcal{C}$ for $Q$ is prefix closed for an order $x_1 < \dots < x_n$ if, informally, for each $D \in \mathcal{C}$ and each prefix $x_1, \dots, x_i$ of the order, the restriction of $D$ to that prefix stays in the class.
$\mathcal{D}(Q, N)$ is prefix closed (for any order)!
Our algorithm is (almost) worst case optimal as long as we use an order for which $\mathcal{C}$ is prefix closed!
Acyclic functional dependencies
$\Delta$ is a set of functional dependencies:
- $G_\Delta$: vertices are the variables, with an edge $u \to v$ if $u \in X$ and $v = y$ for some $(X \to y) \in \Delta$.
- If $G_\Delta$ is acyclic, then let $x_1 < \dots < x_n$ be a topological sort of $G_\Delta$.
Then $\mathcal{C}_\Delta$ (the class of databases satisfying $\Delta$) is prefix closed for the order $x_1 < \dots < x_n$ (exactly the same proof as for cardinality constraints).
Hence our algorithm is worst case optimal wrt $\mathcal{C}_\Delta$ (as long as we follow this order).
We need to show that these functional dependencies transfer to the binarised setting, but it is almost immediate.
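A small sketch of how the variable order can be derived from an acyclic set of functional dependencies, following the graph construction above (`graphlib` is Python's standard topological-sorting module; the FDs themselves are made up for the example).

```python
from graphlib import TopologicalSorter

# Functional dependencies as pairs (X, y): the variables in X determine y.
fds = [({"x"}, "y"), ({"x", "y"}, "z")]          # illustrative, acyclic

# Graph on the variables: edge u -> v whenever u is in X and v = y for some FD X -> y.
# graphlib expects, for each node, the set of nodes that must come before it.
predecessors = {}
for lhs, rhs in fds:
    predecessors.setdefault(rhs, set()).update(lhs)
    for u in lhs:
        predecessors.setdefault(u, set())

# Raises CycleError if the FD graph is cyclic; otherwise any topological
# order is a valid variable order for the join.
order = list(TopologicalSorter(predecessors).static_order())
print(order)   # e.g. ['x', 'y', 'z']
```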
Degree constraints
A degree constraint is a constraint $(X, Y, d)$ where $X \subseteq Y$ are sets of variables.
A relation $R$ verifies the constraint if every tuple over $X$ has at most $d$ extensions to $Y$ in $R$.
- Cardinality constraint = degree constraint with $X = \emptyset$, $Y = \mathrm{var}(R)$ and $d = N$.
- Functional dependency $X \to y$ = degree constraint with $Y = X \cup \{y\}$ and $d = 1$.
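Written out (a standard formulation, assumed here): a relation $R$ satisfies the degree constraint $(X, Y, d)$ when

$$
\max_{t \,\in\, \pi_X(R)} \bigl|\, \pi_Y(\sigma_{X = t}(R)) \,\bigr| \;\le\; d ,
$$

so a cardinality constraint of size $N$ is $(\emptyset, \mathrm{var}(R), N)$, and a functional dependency $X \to y$ is $(X, X \cup \{y\}, 1)$.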
Acyclic degree constraints
Let $\Delta$ be a set of degree constraints.
- $G_\Delta$: vertices are the variables, with an edge $u \to v$ if $u \in X$ and $v \in Y \setminus X$ for some $(X, Y, d) \in \Delta$.
- If $G_\Delta$ is acyclic, then let $x_1 < \dots < x_n$ be a topological sort of $G_\Delta$.
Then $\mathcal{C}_\Delta$ (the class of databases satisfying $\Delta$) is prefix closed for the order $x_1 < \dots < x_n$ (exactly the same proof as for cardinality constraints).
Hence our algorithm is worst case optimal wrt $\mathcal{C}_\Delta$ (as long as we follow this order).
We need to show that these degree constraints transfer to the binarised setting, but it is almost immediate.
Bonus: sampling acyclic degree constraints
We can find a bound $B$ such that $|Q(D')| \le B$ for any $D' \in \mathcal{C}_\Delta$ (polymatroid bound).
Define $U(\tau)$ from it:
- an upper bound on the number of answers extending $\tau$, for any $D' \in \mathcal{C}_\Delta$,
- superadditive.
We have sampling with complexity
$$\tilde{O}_Q\Bigl(|D| + \frac{B}{|Q(D)|}\Bigr)$$
Conclusion
- Simple algorithms and analysis.
- Modular:
  - the join is worst-case optimal as soon as the class is prefix closed,
  - sampling runs in expected time $\tilde{O}_Q\bigl(|D| + \mathsf{wc}(Q, \mathcal{C})/|Q(D)|\bigr)$ as long as one can provide a superadditive upper bound.
Future work:
- Other classes such as:
  - cyclic FDs,
  - general systems of degree constraints (as in PANDA).
- Explore dynamic ordering: can we capture more classes?