feat: implement aggregation and subquery plans to SQL #9606

devinjdangelo · 2024-03-14T13:35:22Z

Which issue does this PR close?

works on #8661

Rationale for this change

See issue

What changes are included in this PR?

Adds support for converting aggregation and subquery plan nodes to a SQL AST.

For example, one can now convert

Projection: p1.id, COUNT(*) AS cnt
  Aggregate: groupBy=[[p1.id]], aggr=[[COUNT(*)]]
    Projection: p1.id
      Inner Join:  Filter: p1.id = p2.id
        SubqueryAlias: p1
          TableScan: person
        SubqueryAlias: p2
          TableScan: person

to

SELECT p1.id, COUNT(*) AS cnt 
FROM (
    SELECT p1.id 
    FROM person AS p1 
    JOIN person AS p2 
    ON (p1.id = p2.id)
) 
GROUP BY p1.id

Are these changes tested?

Yes, new roundtrip tests are added

Are there any user-facing changes?

More plans supported

devinjdangelo · 2024-03-14T13:37:53Z

datafusion/sql/src/unparser/plan.rs

@@ -176,7 +250,9 @@ impl Unparser<'_> {
                )
            }
            LogicalPlan::Aggregate(_agg) => {


I'm currently assuming all aggregate plan nodes are preceded by a projection plan node. These two nodes are then handled together in the projection node block above.

@alamb I was wondering if you know:

If an Aggregate plan node can ever NOT be preceded by a projection plan node?

If so, under what circumstances and what does that mean intuitively?

This does indeed become more tricky. I guess the schema computed at every layer of the plan passes up enough information for execution but maybe not for our Unparser use-case. The specific columns being selected are no longer needed/known at the aggregation step.

I wonder if it's worth creating a structure next to the plan that builds up schema-like data for serialization purposes.

@devinjdangelo
I have done a lot of work on this problem independently - although I think this approach of trying to rebuild the sqlparser-rs AST and then using it to generate SQL is nicer.

Regarding your questions: you start to see all kinds of crazy plans when you start trying to unwind aggregations:

// Projection: pc.contacttypeid, ctypename, nocontacts
//   Sort: COUNT(*) AS nocontacts DESC NULLS FIRST
//     Projection: pc.contacttypeid, pc.name AS ctypename, COUNT(*) AS nocontacts, COUNT(*)
//       Filter: COUNT(*) >= Int64(100)
//         Aggregate: groupBy=[[pc.contacttypeid, pc.name]], aggr=[[COUNT(*)]]

Thank you @seddonm1! You are right. I cooked up a test case that looks similar to the plan you show.

Sort: COUNT(*) ASC NULLS LAST
  Projection: person.id, COUNT(*), person.first_name
    Filter: COUNT(*) > Int64(5)
      Aggregate: groupBy=[[person.first_name, person.id]], aggr=[[COUNT(*)]]
        TableScan: person

select id, count(*), first_name from person group by first_name, id having count(*)>5 order by count(*)

This case will of course fail in the PR. I think that is OK, as we can work on support for more complex aggregations and additional edge cases we find in future PRs.
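To make the assumption concrete, here is a minimal sketch on a toy plan type. `Plan`, `aggregates_sit_under_projection`, `supported_plan`, and `having_plan` are hypothetical names for illustration, not DataFusion's `LogicalPlan` or anything in this PR. It checks whether every aggregate node is the direct input of a projection. That holds for the plan in the PR description but not for the HAVING query above, where a Filter sits between the two nodes.

```rust
// Toy stand-in for a logical plan; NOT DataFusion's `LogicalPlan`.
// Joins and expressions are omitted, only the node nesting is modeled.
#[derive(Debug)]
enum Plan {
    TableScan(&'static str),
    Aggregate(Box<Plan>),
    Filter(Box<Plan>),
    Projection(Box<Plan>),
    Sort(Box<Plan>),
}

// Returns true when every Aggregate node is the direct input of a Projection.
fn aggregates_sit_under_projection(plan: &Plan, parent_is_projection: bool) -> bool {
    match plan {
        Plan::TableScan(_) => true,
        Plan::Aggregate(input) => {
            parent_is_projection && aggregates_sit_under_projection(input, false)
        }
        Plan::Projection(input) => aggregates_sit_under_projection(input, true),
        Plan::Filter(input) | Plan::Sort(input) => {
            aggregates_sit_under_projection(input, false)
        }
    }
}

// Projection -> Aggregate -> Projection -> TableScan (PR description, join elided).
fn supported_plan() -> Plan {
    Plan::Projection(Box::new(Plan::Aggregate(Box::new(Plan::Projection(
        Box::new(Plan::TableScan("person")),
    )))))
}

// Sort -> Projection -> Filter -> Aggregate -> TableScan (the HAVING query).
fn having_plan() -> Plan {
    Plan::Sort(Box::new(Plan::Projection(Box::new(Plan::Filter(Box::new(
        Plan::Aggregate(Box::new(Plan::TableScan("person"))),
    ))))))
}
```

Calling the check with `parent_is_projection = false` at the root accepts `supported_plan()` and rejects `having_plan()`. A check of this kind could let the unparser return a clear "not yet supported" error instead of emitting wrong SQL.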

It will get interesting as we will need to start writing logic that peeks up and down the plan 🤔

I can put my current code on a gist if that would help. I had to do multiple nested matching conditions to be able to traverse and extract nested data.

What would be ideal is if this could drive changes in the LogicalPlan nodes to help unparse them.

That would be great! Also feel free to open a PR to DataFusion if you are interested in collaborating on this functionality.

devinjdangelo · 2024-03-14T13:39:14Z

datafusion/sql/src/unparser/plan.rs

-
-                self.select_to_sql_recursively(p.input.as_ref(), query, select, relation)
+                // A second projection implies a derived tablefactor
+                if !select.already_projected() {


This block should probably be refactored into multiple helper functions to reduce cognitive load here...

There are currently 4 paths under consideration, depending on whether a projection node is encountered multiple times and whether it is followed by an aggregation node.
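As one possible way to read the four paths, the sketch below names each combination explicitly. `ProjectionPath` and `classify_projection` are hypothetical. Only `already_projected` echoes the diff, and the description of each variant is a reading of this comment rather than the PR's actual control flow.

```rust
// Hypothetical labels for the four combinations described in the comment.
#[derive(Debug, PartialEq)]
enum ProjectionPath {
    // First projection seen, input is not an aggregate.
    First,
    // First projection seen, input is an aggregate: handle both nodes together.
    FirstWithAggregate,
    // A projection was already emitted: this one implies a derived table.
    Derived,
    // Derived table whose projection also sits on an aggregate.
    DerivedWithAggregate,
}

// Maps the two conditions onto one of the four paths.
fn classify_projection(already_projected: bool, input_is_aggregate: bool) -> ProjectionPath {
    match (already_projected, input_is_aggregate) {
        (false, false) => ProjectionPath::First,
        (false, true) => ProjectionPath::FirstWithAggregate,
        (true, false) => ProjectionPath::Derived,
        (true, true) => ProjectionPath::DerivedWithAggregate,
    }
}
```

Matching on the tuple keeps the case analysis exhaustive, so the compiler flags any combination left unhandled if a third condition is added later. Each arm could then dispatch to its own helper function.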

devinjdangelo · 2024-03-14T14:29:43Z

cc @backkem if you have a moment to review

backkem

I'm also not sure if the assumption on the plan structure always holds up. Outside of this, the change looks good to me. I guess we can build out more complex cases as needed in follow-up PRs.

alamb · 2024-03-14T23:51:21Z

I plan to review this PR carefully tomorrow

alamb · 2024-03-15T10:48:17Z

In general, I don't think it will be possible to convert all LogicalPlans to SQL, as LogicalPlans can be built programmatically and some will perform calculations that are not possible to express in SQL.

Therefore, I think the best testing strategy is to do round trip tests that verify any pattern the DataFusion SQL planner makes can be converted back to SQL
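The shape of such a round trip test can be sketched generically. Everything here is hypothetical: `roundtrip` takes the planner and unparser as plain functions, and `toy_plan` / `toy_unparse` are trivial stand-ins (uppercased tokens), not DataFusion's SQL planner or `Unparser`. Comparing the two plans for equality, rather than the SQL strings, is an assumption of this sketch; the PR's tests may compare differently.

```rust
// Runs sql -> plan -> sql -> plan and checks that both plans are equal.
// Returns the regenerated SQL on success.
fn roundtrip<P, F, G>(sql: &str, plan_sql: F, unparse: G) -> Result<String, String>
where
    P: std::fmt::Debug + PartialEq,
    F: Fn(&str) -> Result<P, String>,
    G: Fn(&P) -> Result<String, String>,
{
    let plan = plan_sql(sql)?;
    let regenerated = unparse(&plan)?;
    let replanned = plan_sql(&regenerated)?;
    if plan == replanned {
        Ok(regenerated)
    } else {
        Err(format!("plans differ: {:?} vs {:?}", plan, replanned))
    }
}

// Toy "planner": the plan is just the list of uppercased tokens.
fn toy_plan(sql: &str) -> Result<Vec<String>, String> {
    Ok(sql.split_whitespace().map(|t| t.to_uppercase()).collect())
}

// Toy "unparser": joins the tokens back with single spaces.
fn toy_unparse(plan: &Vec<String>) -> Result<String, String> {
    Ok(plan.join(" "))
}
```

Comparing plans tolerates harmless differences in the generated SQL text, such as whitespace or keyword case, while still catching a change in meaning.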

This case will of course fail in the PR. I think that is OK, as we can work on support for more complex aggregations and additional edge cases we find in future PRs.

I think this is a wise strategy.

alamb

Thanks @devinjdangelo @backkem and @seddonm1

It is really exciting to see this functionality progress

impl agg and subquery plans to sql

e1da531

github-actions bot added the sql SQL Planner label Mar 14, 2024

devinjdangelo commented Mar 14, 2024

backkem approved these changes Mar 14, 2024

alamb mentioned this pull request Mar 14, 2024

DataFusion weekly project plan (Andrew Lamb) - March 11, 2024 #9555

Closed

5 tasks

alamb approved these changes Mar 15, 2024

alamb merged commit 219de5f into apache:main Mar 15, 2024
23 checks passed

devinjdangelo mentioned this pull request Mar 15, 2024

Improve Robustness of Unparser Testing and Implementation #9623

Merged
