
[SPARK-14032] [SQL] Eliminate Unnecessary Distinct/Aggregate #11854

Closed

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

Distinct is an expensive operation. If possible, we should avoid it. This PR eliminates Distinct (the Aggregate that implements Distinct) when the child operators can guarantee value uniqueness.

For example, in the following TPC-DS query 38, the left child of the first Intersect is already Distinct, so we can remove the top Distinct after converting Intersect to Left-semi join + Distinct.

select count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
) hot_cust

Note: since we do not have cardinality information, we cannot conclude whether the Distinct on the right child of Intersect can be removed; that depends entirely on the data. This PR only removes the top Distinct.

Use a simplified query to show the effect of this PR:

df.distinct().intersect(df).intersect(df)

Before the fix, the optimized plan is:

Aggregate [id#37,value#38], [id#37,value#38]
+- Join LeftSemi, Some(((id#37 <=> id#64) && (value#38 <=> value#65)))
   :- Aggregate [id#37,value#38], [id#37,value#38]
   :  +- Join LeftSemi, Some(((id#37 <=> id#57) && (value#38 <=> value#58)))
   :     :- Aggregate [id#37,value#38], [id#37,value#38]
   :     :  +- LocalRelation [id#37,value#38], [[id1,1],[id1,1],[id,1],[id1,2]]
   :     +- LocalRelation [id#57,value#58], [[id1,1],[id1,1],[id,1],[id1,2]]
   +- LocalRelation [id#64,value#65], [[id1,1],[id1,1],[id,1],[id1,2]]

After the fix, the optimized plan is:

Join LeftSemi, Some(((id#37 <=> id#64) && (value#38 <=> value#65)))
:- Join LeftSemi, Some(((id#37 <=> id#57) && (value#38 <=> value#58)))
:  :- Aggregate [id#37,value#38], [id#37,value#38]
:  :  +- LocalRelation [id#37,value#38], [[id1,1],[id1,1],[id,1],[id1,2]]
:  +- LocalRelation [id#57,value#58], [[id1,1],[id1,1],[id,1],[id1,2]]
+- LocalRelation [id#64,value#65], [[id1,1],[id1,1],[id,1],[id1,2]]

How was this patch tested?

Added a few test cases
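
For reference, a minimal sketch of how such an elimination could be written as a Catalyst rule. This is not the actual patch; the rule name EliminateRedundantDistinct and the isDistinct helper are illustrative only:

import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Distinct, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: remove a purely deduplicating Aggregate whose
// child is already known to produce distinct rows.
object EliminateRedundantDistinct extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // An Aggregate whose aggregate list equals its grouping list only
    // deduplicates, so it is redundant on top of a distinct child.
    case Aggregate(grouping, aggs, child)
        if aggs == grouping && isDistinct(child) =>
      child
  }

  // Conservative: claim distinctness only for operators guaranteed
  // to emit unique rows.
  private def isDistinct(plan: LogicalPlan): Boolean = plan match {
    case _: Distinct => true
    case Aggregate(grouping, aggs, _) => aggs == grouping
    case _ => false
  }
}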

Inline review comment on the following diff hunk:

// propagate the distinct property from the child
@tailrec
Member Author


Another solution is to add an isDistinct property to LogicalPlan. However, maintaining that property could be more expensive than the recursive @tailrec check used here. In the future, if the physical plan needs the isDistinct property, we can rewrite it that way. This is actually a very valuable property for runtime algorithm optimization. Thanks!
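
A sketch of the @tailrec approach described above, with illustrative operator cases (Filter recurses into its child because filtering preserves uniqueness):

import scala.annotation.tailrec
import org.apache.spark.sql.catalyst.plans.logical._

// Illustrative sketch: propagate the distinct property from the child
// by walking the plan on demand, rather than caching a flag per node.
@tailrec
def isDistinct(plan: LogicalPlan): Boolean = plan match {
  // These operators always emit unique rows.
  case _: Distinct | _: Intersect | _: Except => true
  // A purely deduplicating Aggregate also emits unique rows.
  case Aggregate(grouping, aggs, _) => aggs == grouping
  // Filter removes rows but never introduces duplicates.
  case Filter(_, child) => isDistinct(child)
  case _ => false
}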

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53641 has finished for PR 11854 at commit 96d9d4e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53642 has finished for PR 11854 at commit dddc78b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53753 has finished for PR 11854 at commit bae2c86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@sameeragarwal I haven't looked deeply into this PR, but I think this might be a good candidate use case for the newly introduced constraints facilities?

@gatorsmile
Member Author

@liancheng Yeah, you are right. We could also put it into Constraints if we introduce a new expression, e.g. IsDistinct, that is used for Constraints only. :)

@gatorsmile
Member Author

also CC @marmbrus @yhuai

@sameeragarwal
Member

Yes, that's a great idea. The current constraints framework, however, is limited to per-row constraints, whereas constraints like IsDistinct and (say) isSorted are per-attribute constraints. We should definitely support per-attribute constraints as well, but those may require a different set of per-operator propagation rules.

@gatorsmile
Member Author

@sameeragarwal Should we do it now, or do you have another plan? Thanks!

@sameeragarwal
Member

I think it'd be great to have it. However, as Michael had suggested earlier, it'd be nice to first come up with a set of candidate queries that'd potentially benefit from these optimizations in order to better motivate the kind of per-attribute constraints we need to track. I think q38 is an excellent example. Do you have some others in mind?

@gatorsmile
Member Author

So far, no. Actually, the idea for this PR came up while I was fixing a JIRA related to TPC-DS Q38.

Generally, IsDistinct can also benefit the physical execution of queries, so this could have a broader impact. In an RDBMS, unique constraints are very important for query optimization and runtime. I believe the same applies to Spark SQL, even though Spark SQL has neither constraint enforcement nor unique constraints.

@liancheng
Contributor

@sameeragarwal Thanks for the explanation! (One question: it seems that per-attribute constraints are not enough, since ordering and distinctness can be properties of a group of attributes.)

@gatorsmile
Member Author

Agreed with @liancheng: distinctness is defined over a set of attributes, and ordering also needs to consider the sequence of the attributes.

@sameeragarwal
Member

Sorry for the confusion -- these attribute constraints should still be on a per-operator basis (i.e., part of the QueryPlan). What I meant was that they can track attribute-specific properties (instead of just row-specific properties) such as distinctness of a set of attributes.
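
To make this concrete, here is a purely hypothetical sketch of how such per-operator attribute constraints could be modeled; none of these types exist in Spark:

import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical constraint types: distinctness holds over an unordered
// set of attributes, while sortedness depends on the attribute
// sequence, as noted above.
sealed trait AttributeConstraint
case class IsDistinct(attributes: Set[Attribute]) extends AttributeConstraint
case class IsSorted(attributes: Seq[Attribute]) extends AttributeConstraint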

@marmbrus
Contributor

I agree with @sameeragarwal that this probably doesn't belong in constraints, since it's a cross-row concern. However, as he says, it would be nice to come up with a general mechanism to reason about uniqueness and other cross-row constraints as a function of a given QueryPlan. For example, #9089 also proposes such an API.

@gatorsmile
Member Author

I see. In #9089, the key can only contain a single attribute.

Will try to define a function in QueryPlan for uniqueness ASAP. There is a network outage here right now; hopefully service will be back soon. Thanks!

@gatorsmile
Member Author

Added a function distinctSet to QueryPlan. It returns the set of attributes whose combination uniquely identifies a row. Maybe I should create a separate PR for this alone and add a few test cases to cover its correctness.

@gatorsmile
Member Author

The motivation for distinctSet is to obtain the uniqueness constraint from the child operators. The output of Distinct, Intersect, Except, and Aggregate (iff its aggregate expressions are identical to its grouping expressions) is always guaranteed to be unique, so parent operators can use this fact for query optimization.
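
A hedged sketch of what distinctSet could compute under that rule; the actual code in this PR may differ:

import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical._

// Sketch only: return the attributes whose combination is known to
// uniquely identify each output row, or the empty set when no
// guarantee can be made.
def distinctSet(plan: LogicalPlan): AttributeSet = plan match {
  case _: Distinct | _: Intersect | _: Except => plan.outputSet
  case Aggregate(grouping, aggs, _) if aggs == grouping => plan.outputSet
  case _ => AttributeSet.empty
}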

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54003 has finished for PR 11854 at commit 7d95bc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54013 has finished for PR 11854 at commit 7d95bc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Actually, I would like to see the following in the form of a design doc (check out the constraints JIRA):

  • what interesting cross-row properties we want to reason about
  • how they are useful
  • a couple of API options for representing all of them

@gatorsmile
Member Author

Yeah, completely agree. Will do it after DDL-related PRs are completed.

Thanks!

@srowen
Member

srowen commented Jun 23, 2016

@gatorsmile can this be closed for now?

@gatorsmile
Member Author

This requires some discussion about how to add and use distinctness in the optimizer. Will do it in the next release. Thanks!

@gatorsmile gatorsmile closed this Jun 23, 2016