Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable recursive CTE support by default #9554

Closed
alamb opened this issue Mar 11, 2024 · 6 comments · Fixed by #9619
Closed

Enable recursive CTE support by default #9554

alamb opened this issue Mar 11, 2024 · 6 comments · Fixed by #9619
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 11, 2024

Is your feature request related to a problem or challenge?

As part of #462, @matthewgapp implemented support for recursive Common Table Expressions (aka Recursive CTEs)

Here is an example of such a query:

https://github.com/apache/arrow-datafusion/blob/4cd3c433004a7a6825643d6b3911db720efe5f76/datafusion/sqllogictest/test_files/cte.slt#L44-L68

At the moment, to use recursive CTEs you must enable a config option:

set datafusion.execution.enable_recursive_ctes = true;

Describe the solution you'd like

I would like recursive CTEs to be enabled by default (and thus useable without a config option)

The only reason I know of at the moment that they are NOT enabled by default is because they might buffer an infinite amount of data (and thus exceed the total memory available to DataFusion)

Describe alternatives you've considered

I think the basic idea would be to

  1. Extend RecursiveQueryExec with a MemoryReservation
  2. Track the memory of any buffered batches, erroring if they can not be buffered
  3. Write a test in memory_limit.rs showing the limit being hit: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/memory_limit.rs
  4. Change the default value of

The main PR that added this feature was #8840

Additional context

No response

@alamb alamb added the enhancement New feature or request label Mar 11, 2024
@l1t1
Copy link

l1t1 commented Mar 12, 2024

it's a bit slow.

with recursive t(a) as
(select 1 as a
union all
select 1+a from t where a<10000)
select count(*) from t;
+----------+
| COUNT(*) |
+----------+
| 10000    |
+----------+
1 row in set. Query took 2.347 seconds.

duckdb 0.10 :

D with recursive t(a) as (select 1 as a union all select 1+a from t where a<10000) select count(*) from t;

Run Time (s): real 0.291 user 0.670804 sys 0.046800
D with recursive t(a) as (select 1 as a union all select 1+a from t where a<20000) select count(*) from t;

Run Time (s): real 0.602 user 1.310408 sys 0.140401
D with recursive t(a) as (select 1 as a union all select 1+a from t where a<40000) select count(*) from t;

Run Time (s): real 1.114 user 2.418016 sys 0.265202
D with recursive t(a) as (select 1 as a union all select 1+a from t where a<80000) select count(*) from t;

Run Time (s): real 2.326 user 5.163633 sys 0.374402

hyper 0.0.18161:

> with recursive t(a) as (select 1 as a union all select 1+a from t where a<10000) select count(*) from t;

0.061 s

> with recursive t(a) as (select 1 as a union all select 1+a from t where a<20000) select count(*) from t;

0.046 s

> with recursive t(a) as (select 1 as a union all select 1+a from t where a<40000) select count(*) from t;

0.052 s

> with recursive t(a) as (select 1 as a union all select 1+a from t where a<80000) select count(*) from t;

0.074 s

@jonahgao
Copy link
Member

it's a bit slow.

Disabling enable_round_robin_repartition and coalesce_batches will make it much faster.
@matthewgapp once mentioned this issue in PR #7581.

DataFusion CLI v36.0.0

❯ set datafusion.optimizer.enable_round_robin_repartition=false;
0 rows in set. Query took 0.002 seconds.

❯ set datafusion.execution.coalesce_batches=false;
0 rows in set. Query took 0.002 seconds.

❯ with recursive t(a) as
(select 1 as a
union all
select 1+a from t where a<10000)
select count(*) from t;
+----------+
| COUNT(*) |
+----------+
| 10000    |
+----------+
1 row in set. Query took 0.194 seconds.

@Dandandan
Copy link
Contributor

Looks like there is a draft PR here addressing some of the performance issues / improvements: matthewgapp#2

@jonahgao
Copy link
Member

take

@jonahgao
Copy link
Member

I am interested in implementing this memory limit feature.

@alamb
Copy link
Contributor Author

alamb commented Mar 15, 2024

Thank you @jonahgao 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants