Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11077] [SQL] Join elimination in Catalyst #9089

Closed
wants to merge 26 commits into from

Conversation

ankurdave
Copy link
Contributor

Join elimination is a query optimization where certain joins can be eliminated when followed by projections that only keep columns from one side of the join, and when certain columns are known to be unique or foreign keys. This can be very useful for queries involving views and machine-generated queries.

This PR adds join elimination by (1) supporting unique and foreign key hints in logical plans, (2) adding methods in the DataFrame API to let users provide these hints, and (3) adding an optimizer rule that eliminates unique key outer joins and referential integrity joins when followed by an appropriate projection.

This change is described in detail here: https://docs.google.com/document/d/1-YgQSQywHfAo4PhAT-zOOkFZtVcju99h3dYQq-i9GWQ/edit?usp=sharing

@SparkQA
Copy link

SparkQA commented Oct 13, 2015

Test build #43619 has finished for PR 9089 at commit 578797c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
    • sealed abstract class Key
    • case class UniqueKey(attr: Attribute) extends Key
    • case class ForeignKey(

They were references to the join elimination logic in Teradata, which is
really just a standard optimization rule.
@SparkQA
Copy link

SparkQA commented Oct 13, 2015

Test build #43620 has finished for PR 9089 at commit 7c7357b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
    • sealed abstract class Key
    • case class UniqueKey(attr: Attribute) extends Key
    • case class ForeignKey(

@ankurdave
Copy link
Contributor Author

@marmbrus I addressed your comments from the review about a month ago:

  1. Foreign key references now store the referenced relation directly as a logical plan rather than requiring a catalog lookup.
  2. We now use semanticEquals and AttributeSet for attributes instead of ==.

There were a few comments that didn't make sense on second thought:

  1. Move the attribute equivalence check in ForeignKeyFinder to a method on LogicalPlan. We thought this would simplify the logic, but it turned out not to (still need to maintain the disjoint-set data structure, and the logic gets split between LogicalPlan and Project).
  2. Move foreign key attribute resolution to its own rule that runs at the end of analysis. This would work fine, but it seems to fit well within ResolveReferences.

Finally, the new DataFrame methods should probably be marked as alpha somehow, but I'm not sure of the best way. Maybe a new ScalaDoc group?

cc @rxin, @jkbradley

@rxin
Copy link
Contributor

rxin commented Oct 13, 2015

We can tag them as Experimental (even though the entire DataFrame API is experimental!)

attributeRewrites.get(referencedAttr).getOrElse(referencedAttr))
case other => other
}
KeyHint((keys ++ newKeys).distinct, child)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just use newKeys here? Why do we need to keep old keys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye! This is to accommodate future self-joins. If we got rid of the old foreign keys, a future self-join would not recognize that the new keys applied to it, because the attributes would have been rewritten. I just added a comment noting this.

There's a unit test that covers this (fails if you remove the old keys).

@ankurdave
Copy link
Contributor Author

@rxin Thanks, I added the Experimental tags.

@SparkQA
Copy link

SparkQA commented Oct 13, 2015

Test build #43633 has finished for PR 9089 at commit 55bb135.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
    • sealed abstract class Key
    • case class UniqueKey(attr: Attribute) extends Key
    • case class ForeignKey(

@ankurdave
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Oct 13, 2015

Test build #43638 has finished for PR 9089 at commit 55bb135.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
    • sealed abstract class Key
    • case class UniqueKey(attr: Attribute) extends Key
    • case class ForeignKey(

@jkbradley
Copy link
Member

Calling uniqueKey on a DataFrame throws out the column names. Is that intended?

@ankurdave
Copy link
Contributor Author

@jkbradley Oops, thanks for catching that. I introduced it in 5071759 because I misunderstood the function of transformExpressionsDown. Should be fixed now.

@SparkQA
Copy link

SparkQA commented Oct 15, 2015

Test build #43757 has finished for PR 9089 at commit e1ec23d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
    • sealed abstract class Key
    • case class UniqueKey(attr: Attribute) extends Key
    • case class ForeignKey(

@SparkQA
Copy link

SparkQA commented Oct 15, 2015

Test build #43758 has finished for PR 9089 at commit 0cd8a91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
    • sealed abstract class Key
    • case class UniqueKey(attr: Attribute) extends Key
    • case class ForeignKey(

@jkbradley
Copy link
Member

@ankurdave Np, thanks for the fix. Btw, should the fix be accompanied by a unit test to catch that issue?

@SparkQA
Copy link

SparkQA commented Nov 6, 2015

Test build #45232 has finished for PR 9089 at commit 5abceae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode\n * sealed abstract class Key\n * case class UniqueKey(attr: Attribute) extends Key\n * case class ForeignKey(\n

@AmplabJenkins
Copy link

Build finished. 5912 tests run, 0 skipped, 0 failed.

@rxin
Copy link
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants