-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11077] [SQL] Join elimination in Catalyst #9089
Conversation
Do not eliminate referential integrity full outer joins, or inner joins where foreign key is nullable. Require foreign keys to reference unique columns.
This is necessary to support aliased self joins and multiple foreign keys with the same referent.
Instead just leave the KeyHint unresolved.
Previously we stored its name as part of referencedAttr, requiring a catalog lookup.
Test build #43619 has finished for PR 9089 at commit
|
They were references to the join elimination logic in Teradata, which is really just a standard optimization rule.
Test build #43620 has finished for PR 9089 at commit
|
@marmbrus I addressed your comments from the review about a month ago:
There were a few comments that didn't make sense on second thought:
Finally, the new DataFrame methods should probably be marked as alpha somehow, but I'm not sure of the best way. Maybe a new ScalaDoc group? cc @rxin, @jkbradley |
We can tag them as Experimental (even though the entire DataFrame API is experimental!) |
attributeRewrites.get(referencedAttr).getOrElse(referencedAttr)) | ||
case other => other | ||
} | ||
KeyHint((keys ++ newKeys).distinct, child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just use newKeys
here? Why do we need to keep old keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good eye! This is to accommodate future self-joins. If we got rid of the old foreign keys, a future self-join would not recognize that the new keys applied to it, because the attributes would have been rewritten. I just added a comment noting this.
There's a unit test that covers this (fails if you remove the old keys).
@rxin Thanks, I added the Experimental tags. |
Test build #43633 has finished for PR 9089 at commit
|
Jenkins, retest this please. |
Test build #43638 has finished for PR 9089 at commit
|
Calling uniqueKey on a DataFrame throws out the column names. Is that intended? |
This reverts commit 5071759.
@jkbradley Oops, thanks for catching that. I introduced it in 5071759 because I misunderstood the function of |
Test build #43757 has finished for PR 9089 at commit
|
Test build #43758 has finished for PR 9089 at commit
|
@ankurdave Np, thanks for the fix. Btw, should the fix be accompanied by a unit test to catch that issue? |
Test build #45232 has finished for PR 9089 at commit
|
Build finished. 5912 tests run, 0 skipped, 0 failed. |
Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. |
Join elimination is a query optimization where certain joins can be eliminated when followed by projections that only keep columns from one side of the join, and when certain columns are known to be unique or foreign keys. This can be very useful for queries involving views and machine-generated queries.
This PR adds join elimination by (1) supporting unique and foreign key hints in logical plans, (2) adding methods in the DataFrame API to let users provide these hints, and (3) adding an optimizer rule that eliminates unique key outer joins and referential integrity joins when followed by an appropriate projection.
This change is described in detail here: https://docs.google.com/document/d/1-YgQSQywHfAo4PhAT-zOOkFZtVcju99h3dYQq-i9GWQ/edit?usp=sharing