
[SPARK-14032] [SQL] Eliminate Unnecessary Distinct/Aggregate #11854

Closed

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

Distinct is an expensive operation. If possible, we should avoid it. This PR eliminates Distinct (the Aggregate that implements Distinct) when the child operators can guarantee value uniqueness.

For example, in the following TPC-DS query 38, the left child of the first Intersect is already Distinct, so we can remove the top Distinct after converting Intersect to Left-semi join + Distinct.

select count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
) hot_cust

Note: since we do not have cardinality information, we cannot conclude whether the Distinct on the right child of Intersect can be removed; that depends entirely on the data. This PR only removes the top Distinct.

Use a simplified query to show the effect of this PR:

df.distinct().intersect(df).intersect(df)

Before the fix, the optimized plan is:

Aggregate [id#37,value#38], [id#37,value#38]
+- Join LeftSemi, Some(((id#37 <=> id#64) && (value#38 <=> value#65)))
   :- Aggregate [id#37,value#38], [id#37,value#38]
   :  +- Join LeftSemi, Some(((id#37 <=> id#57) && (value#38 <=> value#58)))
   :     :- Aggregate [id#37,value#38], [id#37,value#38]
   :     :  +- LocalRelation [id#37,value#38], [[id1,1],[id1,1],[id,1],[id1,2]]
   :     +- LocalRelation [id#57,value#58], [[id1,1],[id1,1],[id,1],[id1,2]]
   +- LocalRelation [id#64,value#65], [[id1,1],[id1,1],[id,1],[id1,2]]

After the fix, the optimized plan is:

Join LeftSemi, Some(((id#37 <=> id#64) && (value#38 <=> value#65)))
:- Join LeftSemi, Some(((id#37 <=> id#57) && (value#38 <=> value#58)))
:  :- Aggregate [id#37,value#38], [id#37,value#38]
:  :  +- LocalRelation [id#37,value#38], [[id1,1],[id1,1],[id,1],[id1,2]]
:  +- LocalRelation [id#57,value#58], [[id1,1],[id1,1],[id,1],[id1,2]]
+- LocalRelation [id#64,value#65], [[id1,1],[id1,1],[id,1],[id1,2]]

How was this patch tested?

Added a few test cases
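
For reference, a minimal sketch of how such an elimination could be written as a Catalyst rule. This is not the actual patch; the rule name EliminateRedundantDistinct and the isDistinct helper are illustrative only:

import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: remove a purely deduplicating Aggregate whose
// child is already known to produce distinct rows.
object EliminateRedundantDistinct extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // An Aggregate whose aggregate list equals its grouping list only
    // deduplicates, so it is redundant on top of a distinct child.
    case Aggregate(grouping, aggs, child)
        if aggs == grouping && isDistinct(child) =>
      child
  }

  // Conservative: claim distinctness only for operators guaranteed
  // to emit unique rows.
  private def isDistinct(plan: LogicalPlan): Boolean = plan match {
    case _: Distinct => true
    case Aggregate(grouping, aggs, _) => aggs == grouping
    case _ => false
  }
}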

Inline review comment on the following diff hunk:

// propagate the distinct property from the child
@tailrec
Member Author


Another solution is to add an isDistinct property to LogicalPlan. However, maintaining that property could be more expensive than the recursive @tailrec check used here. In the future, if the physical plan needs the isDistinct property, we can rewrite it that way. This is actually a very valuable property for runtime algorithm optimization. Thanks!
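
A sketch of the @tailrec approach described above, with illustrative operator cases (Filter recurses into its child because filtering preserves uniqueness):

import scala.annotation.tailrec
import org.apache.spark.sql.catalyst.plans.logical._

// Illustrative sketch: propagate the distinct property from the child
// by walking the plan on demand, rather than caching a flag per node.
@tailrec
def isDistinct(plan: LogicalPlan): Boolean = plan match {
  // These operators always emit unique rows.
  case _: Distinct | _: Intersect | _: Except => true
  // A purely deduplicating Aggregate also emits unique rows.
  case Aggregate(grouping, aggs, _) => aggs == grouping
  // Filter removes rows but never introduces duplicates.
  case Filter(_, child) => isDistinct(child)
  case _ => false
}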

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53641 has finished for PR 11854 at commit 96d9d4e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53642 has finished for PR 11854 at commit dddc78b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53753 has finished for PR 11854 at commit bae2c86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@sameeragarwal I haven't looked deeply into this PR, but I think this might be a good candidate use case for the newly introduced constraints facilities?

@gatorsmile
Member Author

@liancheng Yeah, you are right. We could also put it into Constraints if we introduce a new expression, e.g. IsDistinct, that is used for Constraints only. :)

@gatorsmile
Member Author

also CC @marmbrus @yhuai

@sameeragarwal
Member

Yes, that's a great idea. The current constraints framework, however, is limited to per-row constraints, whereas constraints like IsDistinct and (say) isSorted are per-attribute constraints. We should definitely support per-attribute constraints as well, but those may require a different set of per-operator propagation rules.

@gatorsmile
Member Author

@sameeragarwal Should we do it now, or do you have another plan? Thanks!

@sameeragarwal
Member

I think it'd be great to have it. However, as Michael had suggested earlier, it'd be nice to first come up with a set of candidate queries that'd potentially benefit from these optimizations in order to better motivate the kind of per-attribute constraints we need to track. I think q38 is an excellent example. Do you have some others in mind?

@gatorsmile
Member Author

So far, no. Actually, the idea for this PR came up while I was fixing a JIRA related to TPC-DS Q38.

Generally, IsDistinct can also benefit the physical execution of queries, so this could have a broader impact. In an RDBMS, unique constraints are very important for query optimization and runtime. I believe the same applies to Spark SQL, even though Spark SQL has neither constraint enforcement nor unique constraints.

@liancheng
Contributor

@sameeragarwal Thanks for the explanation! (One question: it seems that per-attribute constraints are not enough, since ordering and distinctness can be properties of a group of attributes.)

@gatorsmile
Member Author

Agreed with @liancheng: distinctness is defined over a set of attributes, and ordering also needs to consider the sequence of the attributes.

@sameeragarwal
Member

Sorry for the confusion -- these attribute constraints should still be on a per-operator basis (i.e., part of the QueryPlan). What I meant was that they can track attribute-specific properties (instead of just row-specific properties) such as distinctness of a set of attributes.
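
To make this concrete, here is a purely hypothetical sketch of how such per-operator attribute constraints could be modeled; none of these types exist in Spark:

import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical constraint types: distinctness holds over an unordered
// set of attributes, while sortedness depends on the attribute
// sequence, as noted above.
sealed trait AttributeConstraint
case class IsDistinct(attributes: Set[Attribute]) extends AttributeConstraint
case class IsSorted(attributes: Seq[Attribute]) extends AttributeConstraint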

@marmbrus
Contributor

I agree with @sameeragarwal that this probably doesn't belong in constraints, since it's a cross-row concern. However, as he says, it would be nice to come up with a general mechanism to reason about uniqueness and other cross-row constraints as a function of a given QueryPlan. For example, #9089 also proposes such an API.

@gatorsmile
Member Author

I see. In #9089, the key can only contain a single attribute.

Will try to define a function in QueryPlan for uniqueness ASAP. There is a network outage here right now; hopefully service will be back soon. Thanks!

@gatorsmile
Member Author

Added a function distinctSet to QueryPlan. It returns the set of attributes whose combination uniquely identifies a row. Maybe I should create a separate PR for this alone and add a few test cases to cover its correctness.

@gatorsmile
Member Author

The motivation for distinctSet is to obtain the uniqueness constraint from the child operators. The output of Distinct, Intersect, Except, and Aggregate (iff its aggregate expressions are identical to its grouping expressions) is always guaranteed to be unique, so parent operators can use this fact for query optimization.
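
A hedged sketch of what distinctSet could compute under that rule; the actual code in this PR may differ:

import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical._

// Sketch only: return the attributes whose combination is known to
// uniquely identify each output row, or the empty set when no
// guarantee can be made.
def distinctSet(plan: LogicalPlan): AttributeSet = plan match {
  case _: Distinct | _: Intersect | _: Except => plan.outputSet
  case Aggregate(grouping, aggs, _) if aggs == grouping => plan.outputSet
  case _ => AttributeSet.empty
}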

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54003 has finished for PR 11854 at commit 7d95bc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54013 has finished for PR 11854 at commit 7d95bc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Actually, I would like to see the following in the form of a design doc (check out the constraints JIRA):

  • what interesting cross-row properties we want to reason about
  • how they are useful
  • a couple of API options for representing all of them

@gatorsmile
Member Author

Yeah, completely agree. Will do it after DDL-related PRs are completed.

Thanks!

@srowen
Member

srowen commented Jun 23, 2016

@gatorsmile can this be closed for now?

@gatorsmile
Member Author

This requires some discussion about how to add and use distinctness in the optimizer. Will do it in the next release. Thanks!

@gatorsmile gatorsmile closed this Jun 23, 2016