Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement aggregation and subquery plans to SQL #9606

Merged
merged 1 commit into from
Mar 15, 2024

Conversation

devinjdangelo
Copy link
Contributor

Which issue does this PR close?

works on #8661

Rationale for this change

See issue

What changes are included in this PR?

Adds support for aggregation and subquery plan nodes converting to a SQL AST

For example, one can now convert

Projection: p1.id, COUNT(*) AS cnt
  Aggregate: groupBy=[[p1.id]], aggr=[[COUNT(*)]]
    Projection: p1.id
      Inner Join:  Filter: p1.id = p2.id
        SubqueryAlias: p1
          TableScan: person
        SubqueryAlias: p2
          TableScan: person

to

SELECT p1.id, COUNT(*) AS cnt 
FROM (
    SELECT p1.id 
    FROM person AS p1 
    JOIN person AS p2 
    ON (p1.id = p2.id)
) 
GROUP BY p1.id

Are these changes tested?

Yes, new roundtrip tests are added

Are there any user-facing changes?

More plans supported

@github-actions github-actions bot added the sql SQL Planner label Mar 14, 2024
@@ -176,7 +250,9 @@ impl Unparser<'_> {
)
}
LogicalPlan::Aggregate(_agg) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently assuming all aggregate plan nodes are preceded by a projection plan node. These two nodes are then handled together simultaneously in the above projection node block.

@alamb I was wondering if you know:

  1. If an Aggregate plan node can ever NOT be preceded by a projection plan node?
  2. If so, under what circumstances and what does that mean intuitively?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed become more tricky. I guess the schema computed at every layer of the plan passes up enough information for execution but maybe not for our Unparser use-case. The specific columns being selected are no longer needed/known at the aggregation step.

I wonder if it's worth creating a structure next to the plan that builds up schema-like data for serialization purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devinjdangelo
I have done a lot of work on this problem independently - although I think this approach of trying to rebuild the sqlparser-rs AST and then using it to generate SQL is nicer.

Regarding your questions you start to see all kinds of crazy plans when you start trying to unwind aggregations:

// Projection: pc.contacttypeid, ctypename, nocontacts
//   Sort: COUNT(*) AS nocontacts DESC NULLS FIRST
//     Projection: pc.contacttypeid, pc.name AS ctypename, COUNT(*) AS nocontacts, COUNT(*)
//       Filter: COUNT(*) >= Int64(100)
//         Aggregate: groupBy=[[pc.contacttypeid, pc.name]], aggr=[[COUNT(*)]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @seddonm1! You are right. I cooked up a test case that looks similar to the plan you show.

Sort: COUNT(*) ASC NULLS LAST
  Projection: person.id, COUNT(*), person.first_name
    Filter: COUNT(*) > Int64(5)
      Aggregate: groupBy=[[person.first_name, person.id]], aggr=[[COUNT(*)]]
        TableScan: person
select id, count(*), first_name 
from person 
group by first_name, id 
having count(*)>5 
order by count(*)

This case will of course fail in the PR. I think that is OK, as we can work on support for more complex aggregations and additional edge cases we find in future PRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will get interesting as we will need to start writing logic that peaks up and down the plan 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can put my current code on a gist if that would help. i had to do multiple nested matching conditions to be able to traverse and extract nested data.

What would be ideal is if this could drive changes in the LogicalPlan nodes to help unparse them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great! Also feel free to open a PR to DataFusion if you are interested in collaborating on this functionality.


self.select_to_sql_recursively(p.input.as_ref(), query, select, relation)
// A second projection implies a derived tablefactor
if !select.already_projected() {
Copy link
Contributor Author

@devinjdangelo devinjdangelo Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block should probably be refactored into multiple helper functions to reduce cognitive load here...

There are currently 4 paths under consideration depending on if a projection node is encountered multiple times and if it is followed by an aggregation node or not.

@devinjdangelo
Copy link
Contributor Author

cc @backkem if you have a moment to review

Copy link
Contributor

@backkem backkem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure if the assumption on the plan structure always holds up. Outside of this, the change looks good to me. I guess we can build out more complex cases as needed in follow-up PRs.

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

I plan to review this PR carefully tomorrow

@alamb
Copy link
Contributor

alamb commented Mar 15, 2024

In general, I don't think it will be possible to convert all LogicalPlans to SQL as LogicalPlans can be built programatically and some will perfom calculations that are not possible to express in SQL

Therefore, I think the best testing strategy is to do round trip tests that verify any pattern the DataFusion SQL planner makes can be converted back to SQL

This case will of course fail in the PR. I think that is OK, as we can work on support for more complex aggregations and additional edge cases we find in future PRs.

I think this is a wise strategy.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devinjdangelo @backkem and @seddonm1

It is really exciting to see this functionality progress

@alamb alamb merged commit 219de5f into apache:main Mar 15, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants