[SPARK-11077] [SQL] Join elimination in Catalyst #9089

ankurdave · 2015-10-13T04:42:36Z

Join elimination is a query optimization where certain joins can be eliminated when followed by projections that only keep columns from one side of the join, and when certain columns are known to be unique or foreign keys. This can be very useful for queries involving views and machine-generated queries.

This PR adds join elimination by (1) supporting unique and foreign key hints in logical plans, (2) adding methods in the DataFrame API to let users provide these hints, and (3) adding an optimizer rule that eliminates unique key outer joins and referential integrity joins when followed by an appropriate projection.

This change is described in detail here: https://docs.google.com/document/d/1-YgQSQywHfAo4PhAT-zOOkFZtVcju99h3dYQq-i9GWQ/edit?usp=sharing
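
For readers who want a concrete picture of when the rewrite is safe, here is a minimal sketch in plain Scala collections. It is not Spark or Catalyst code: `Order`, `Customer`, `leftOuterJoin`, and `innerJoin` are hypothetical names invented for the illustration. It shows that projecting only the left-side columns after a left outer join on a unique key returns the left input unchanged, and that this stops holding once the key is not unique or an inner join meets a null foreign key.

```scala
// Toy model using plain Scala collections; NOT Spark or Catalyst code.
// Every name below is hypothetical and exists only for this illustration.
object JoinEliminationSketch {
  final case class Customer(id: Int, name: String)
  final case class Order(orderId: Int, customerId: Option[Int], amount: Double)

  // customers.id is unique; orders.customerId is a nullable foreign key.
  val customers: Seq[Customer] = Seq(Customer(1, "a"), Customer(2, "b"))
  val orders: Seq[Order] = Seq(
    Order(10, Some(1), 5.0),
    Order(11, Some(2), 7.5),
    Order(12, None, 1.0)
  )

  // Left outer join on orders.customerId = customers.id.
  def leftOuterJoin(os: Seq[Order], cs: Seq[Customer]): Seq[(Order, Option[Customer])] =
    os.flatMap { o =>
      val matches = cs.filter(c => o.customerId.contains(c.id))
      if (matches.isEmpty) Seq((o, Option.empty[Customer]))
      else matches.map(c => (o, Option(c)))
    }

  // Inner join on the same condition.
  def innerJoin(os: Seq[Order], cs: Seq[Customer]): Seq[(Order, Customer)] =
    for { o <- os; c <- cs if o.customerId.contains(c.id) } yield (o, c)

  // Projection that keeps only the columns of the left side.
  val viaOuterJoin: Seq[Order] = leftOuterJoin(orders, customers).map(_._1)

  // With a nullable foreign key, the inner join drops order 12.
  val viaInnerJoin: Seq[Order] = innerJoin(orders, customers).map(_._1)

  // If the join column is not unique, the outer join duplicates order 10.
  val nonUniqueCustomers: Seq[Customer] = customers :+ Customer(1, "dup")
  val viaNonUniqueJoin: Seq[Order] = leftOuterJoin(orders, nonUniqueCustomers).map(_._1)
}
```

In this toy data, `viaOuterJoin` equals `orders`, so the join could be skipped, while `viaInnerJoin` and `viaNonUniqueJoin` differ from `orders`, which is why the rule needs the uniqueness and nullability conditions.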

SparkQA · 2015-10-13T04:50:44Z

Test build #43619 has finished for PR 9089 at commit 578797c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

SparkQA · 2015-10-13T06:48:07Z

Test build #43620 has finished for PR 9089 at commit 7c7357b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

ankurdave · 2015-10-13T07:06:14Z

@marmbrus I addressed your comments from the review about a month ago:

- Foreign key references now store the referenced relation directly as a logical plan rather than requiring a catalog lookup.
- We now use semanticEquals and AttributeSet for attributes instead of ==.
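
As a rough illustration of why `==` can be the wrong comparison for attributes, the sketch below uses a toy `Attr` class. It is not Catalyst's `Attribute`. The assumption is that structural equality looks at every field, including cosmetic ones such as a qualifier, while a semantic comparison looks only at the identity of the column. The `Attr` class, its fields, and `keyIds` are hypothetical.

```scala
// Toy model only; these are NOT Catalyst's classes.
object SemanticEqualsSketch {
  // Hypothetical stand-in for a resolved attribute. `exprId` identifies the
  // column; `name` and `qualifier` are treated as cosmetic here.
  final case class Attr(name: String, exprId: Long, qualifier: Option[String] = None) {
    // Compare column identity only, ignoring the cosmetic fields.
    def semanticEquals(other: Attr): Boolean = exprId == other.exprId
  }

  val plain: Attr = Attr("id", exprId = 1L)
  val aliased: Attr = Attr("id", exprId = 1L, qualifier = Some("c"))

  // Case-class equality compares all fields, so the qualifier makes these differ.
  val structurallyEqual: Boolean = plain == aliased

  // The identity-based comparison treats them as the same column.
  val semanticallyEqual: Boolean = plain.semanticEquals(aliased)

  // A set keyed on exprId, in the spirit of an attribute set: membership
  // does not depend on the cosmetic fields either.
  val keyIds: Set[Long] = Set(plain.exprId)
  val foundInSet: Boolean = keyIds.contains(aliased.exprId)
}
```

Under these assumptions, a key declared on `plain` would be missed by `==` when the plan refers to `aliased`, but found by the identity-based check.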

There were a few comments that didn't make sense on second thought:

- Move the attribute equivalence check in ForeignKeyFinder to a method on LogicalPlan. We thought this would simplify the logic, but it turned out not to (still need to maintain the disjoint-set data structure, and the logic gets split between LogicalPlan and Project).
- Move foreign key attribute resolution to its own rule that runs at the end of analysis. This would work fine, but it seems to fit well within ResolveReferences.

Finally, the new DataFrame methods should probably be marked as alpha somehow, but I'm not sure of the best way. Maybe a new ScalaDoc group?

cc @rxin, @jkbradley

rxin · 2015-10-13T07:18:33Z

We can tag them as Experimental (even though the entire DataFrame API is experimental!)

viirya · 2015-10-13T07:25:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+                          attributeRewrites.get(referencedAttr).getOrElse(referencedAttr))
+                      case other => other
+                    }
+                    KeyHint((keys ++ newKeys).distinct, child)


Can't we just use newKeys here? Why do we need to keep old keys?

ankurdave · 2015-10-13

Good eye! This is to accommodate future self-joins. If we got rid of the old foreign keys, a future self-join would not recognize that the new keys applied to it, because the attributes would have been rewritten. I just added a comment noting this.

There's a unit test that covers this (fails if you remove the old keys).
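
One way to read the `(keys ++ newKeys).distinct` expression is sketched below with toy types. This is an interpretation of the explanation above, not the PR's code: `KeySketch`, `rewriteKeys`, and the `keepOld` flag are hypothetical, and keys are reduced to the id of the attribute they cover.

```scala
// Toy model only; NOT the Analyzer code from this PR.
object KeepOldKeysSketch {
  // Hypothetical: a key is identified only by the attribute id it covers.
  final case class KeySketch(attrId: Long)

  // Rewrite each key through `rewrites` (old attribute id -> fresh id),
  // optionally keeping the original keys alongside the rewritten ones.
  def rewriteKeys(keys: Seq[KeySketch], rewrites: Map[Long, Long], keepOld: Boolean): Seq[KeySketch] = {
    val newKeys = keys.map(k => KeySketch(rewrites.getOrElse(k.attrId, k.attrId)))
    if (keepOld) (keys ++ newKeys).distinct else newKeys
  }

  def covers(keys: Seq[KeySketch], attrId: Long): Boolean =
    keys.exists(_.attrId == attrId)

  val keys: Seq[KeySketch] = Seq(KeySketch(1L))
  // Assumed rewrite: attribute 1 receives the fresh id 101 on one side of a self-join.
  val rewrites: Map[Long, Long] = Map(1L -> 101L)

  val onlyNew: Seq[KeySketch] = rewriteKeys(keys, rewrites, keepOld = false)
  val merged: Seq[KeySketch] = rewriteKeys(keys, rewrites, keepOld = true)

  // When nothing is rewritten, `distinct` collapses the duplicate entries.
  val unchanged: Seq[KeySketch] = rewriteKeys(keys, Map.empty, keepOld = true)
}
```

In this sketch `merged` covers both attribute 1 and attribute 101, whereas `onlyNew` loses attribute 1, which is the situation the reply describes for a later self-join.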

ankurdave · 2015-10-13T07:56:06Z

@rxin Thanks, I added the Experimental tags.

SparkQA · 2015-10-13T08:16:30Z

Test build #43633 has finished for PR 9089 at commit 55bb135.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

ankurdave · 2015-10-13T08:18:27Z

Jenkins, retest this please.

SparkQA · 2015-10-13T10:42:52Z

Test build #43638 has finished for PR 9089 at commit 55bb135.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

jkbradley · 2015-10-14T20:10:25Z

Calling uniqueKey on a DataFrame throws out the column names. Is that intended?

ankurdave · 2015-10-15T00:17:39Z

@jkbradley Oops, thanks for catching that. I introduced it in 5071759 because I misunderstood the function of transformExpressionsDown. Should be fixed now.

SparkQA · 2015-10-15T02:16:10Z

Test build #43757 has finished for PR 9089 at commit e1ec23d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

SparkQA · 2015-10-15T02:58:16Z

Test build #43758 has finished for PR 9089 at commit 0cd8a91.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

jkbradley · 2015-10-16T18:02:19Z

@ankurdave Np, thanks for the fix. Btw, should the fix be accompanied by a unit test to catch that issue?

SparkQA · 2015-11-06T20:47:13Z

Test build #45232 has finished for PR 9089 at commit 5abceae.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class KeyHint(newKeys: Seq[Key], child: LogicalPlan) extends UnaryNode
- sealed abstract class Key
- case class UniqueKey(attr: Attribute) extends Key
- case class ForeignKey(

AmplabJenkins · 2015-11-06T20:48:07Z

Build finished. 5912 tests run, 0 skipped, 0 failed.

rxin · 2016-06-15T22:04:11Z

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

ankurdave added 19 commits August 3, 2015 22:33

Eliminate outer join before project

4f52877

Use KeyHint to do join elimination

ae46ab0

Add foreign keys

df9ef14

Alias-aware join elimination + bugfixes

b22f702

Propagate foreign keys through Join operator

9072cb7

Remove key hints after join elimination

f430ea2

Support inner joins based on referential integrity

1302531

Correctness fixes for join elimination

35949f5

Do not eliminate referential integrity full outer joins, or inner joins where foreign key is nullable. Require foreign keys to reference unique columns.

Do key hint resolution during analysis

945e523

This is necessary to support aliased self joins and multiple foreign keys with the same referent.

Don't crash when foreign key refers to unresolved relation

504c9d8

Instead just leave the KeyHint unresolved.

Fix JoinEliminationSuite

83c8ff9

Merge remote-tracking branch 'apache-spark/master' into GraphFrames

0b0b840

Fix KeyHintSuite after merge

9150dda

In ForeignKey, store referencedRelation as logical plan

873b322

Previously we stored its name as part of referencedAttr, requiring a catalog lookup.

Use semanticEquals for Attributes

98e0b5e

Remove TODOs

d43a2c0

Add more comments

f4e7e01

Merge remote-tracking branch 'apache-spark/master' into GraphFrames

49b196e

Use SharedSQLContext in KeyHintSuite

578797c

Remove long URLs

7c7357b

They were references to the join elimination logic in Teradata, which is really just a standard optimization rule.

viirya reviewed Oct 13, 2015

ankurdave added 3 commits October 13, 2015 00:36

Fix override of KeyHint#transformExpressions{Up,Down}

5071759

Declare new DataFrame methods extra-experimental

ec2b80b

Explain why we keep old keys in self-join rewrite

55bb135

Revert "Fix override of KeyHint#transformExpressions{Up,Down}"

e1ec23d

This reverts commit 5071759.

Update transformExpressions override comments

0cd8a91

Merge remote-tracking branch 'apache-spark/master' into SPARK-11077-JoinElimination

5abceae

marmbrus mentioned this pull request Mar 24, 2016

[SPARK-14032] [SQL] Eliminate Unnecessary Distinct/Aggregate #11854

Closed

gatorsmile mentioned this pull request Mar 24, 2016

[SPARK-14112] [SQL] [WIP] Unique Constraints over a Set of AttributeReferences #11930

Closed

asfgit closed this in 1a33f2e Jun 15, 2016
