-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement aggregation and subquery plans to SQL #9606
Conversation
@@ -176,7 +250,9 @@ impl Unparser<'_> { | |||
) | |||
} | |||
LogicalPlan::Aggregate(_agg) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently assuming all aggregate plan nodes are preceded by a projection plan node. These two nodes are then handled together simultaneously in the above projection node block.
@alamb I was wondering if you know:
- If an Aggregate plan node can ever NOT be preceded by a projection plan node?
- If so, under what circumstances and what does that mean intuitively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does indeed become more tricky. I guess the schema computed at every layer of the plan passes up enough information for execution but maybe not for our Unparser use-case. The specific columns being selected are no longer needed/known at the aggregation step.
I wonder if it's worth creating a structure next to the plan that builds up schema-like data for serialization purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devinjdangelo
I have done a lot of work on this problem independently - although I think this approach of trying to rebuild the sqlparser-rs AST and then using it to generate SQL is nicer.
Regarding your questions you start to see all kinds of crazy plans when you start trying to unwind aggregations:
// Projection: pc.contacttypeid, ctypename, nocontacts
// Sort: COUNT(*) AS nocontacts DESC NULLS FIRST
// Projection: pc.contacttypeid, pc.name AS ctypename, COUNT(*) AS nocontacts, COUNT(*)
// Filter: COUNT(*) >= Int64(100)
// Aggregate: groupBy=[[pc.contacttypeid, pc.name]], aggr=[[COUNT(*)]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @seddonm1! You are right. I cooked up a test case that looks similar to the plan you show.
Sort: COUNT(*) ASC NULLS LAST
Projection: person.id, COUNT(*), person.first_name
Filter: COUNT(*) > Int64(5)
Aggregate: groupBy=[[person.first_name, person.id]], aggr=[[COUNT(*)]]
TableScan: person
select id, count(*), first_name
from person
group by first_name, id
having count(*)>5
order by count(*)
This case will of course fail in the PR. I think that is OK, as we can work on support for more complex aggregations and additional edge cases we find in future PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will get interesting as we will need to start writing logic that peaks up and down the plan 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can put my current code on a gist if that would help. i had to do multiple nested matching conditions to be able to traverse and extract nested data.
What would be ideal is if this could drive changes in the LogicalPlan nodes to help unparse
them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great! Also feel free to open a PR to DataFusion if you are interested in collaborating on this functionality.
|
||
self.select_to_sql_recursively(p.input.as_ref(), query, select, relation) | ||
// A second projection implies a derived tablefactor | ||
if !select.already_projected() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block should probably be refactored into multiple helper functions to reduce cognitive load here...
There are currently 4 paths under consideration depending on if a projection node is encountered multiple times and if it is followed by an aggregation node or not.
cc @backkem if you have a moment to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure if the assumption on the plan structure always holds up. Outside of this, the change looks good to me. I guess we can build out more complex cases as needed in follow-up PRs.
I plan to review this PR carefully tomorrow |
In general, I don't think it will be possible to convert all Therefore, I think the best testing strategy is to do round trip tests that verify any pattern the DataFusion SQL planner makes can be converted back to SQL
I think this is a wise strategy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @devinjdangelo @backkem and @seddonm1
It is really exciting to see this functionality progress
Which issue does this PR close?
works on #8661
Rationale for this change
See issue
What changes are included in this PR?
Adds support for aggregation and subquery plan nodes converting to a SQL AST
For example, one can now convert
to
Are these changes tested?
Yes, new roundtrip tests are added
Are there any user-facing changes?
More plans supported