-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14032] [SQL] Eliminate Unnecessary Distinct/Aggregate #11854
Conversation
} | ||
|
||
// propagate the distinct property from the child | ||
@tailrec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another solution is to add a property isDistinct
to LogicalPlan
. However, it could be expensive for recursive calls, compared with the @tailrec
. In the future, if the physical plan will use the property isDistinct
, we can rewrite it. Actually, this is a very critical property at runtime algorithm optimization. Thanks!
Test build #53641 has finished for PR 11854 at commit
|
Test build #53642 has finished for PR 11854 at commit
|
Test build #53753 has finished for PR 11854 at commit
|
@sameeragarwal I haven't looked deeply into this PR, but I think this might be a good candidate use case for the newly introduced constraints facilities? |
@liancheng Yeah, you are right. We also can put it into the Constraints, if we can introduce a new expression, like, |
yes, that's a great idea. The current constraints framework however is just limited to per-row constraints whereas constraints like |
@sameeragarwal Should we do it now? Or you have the other plan? Thanks! |
I think it'd be great to have it. However, as Michael had suggested earlier, it'd be nice to first come up with a set of candidate queries that'd potentially benefit from these optimizations in order to better motivate the kind of per-attribute constraints we need to track. I think q38 is an excellent example. Do you have some others in mind? |
So far, nope. Actually, the idea of this PR comes out when I fixing a JIRA related to TPC-DS Q38. Generally, |
@sameeragarwal Thanks for the explanation! (One question is that, it seems that per-attribute constraints are not enough since ordering and distinctness can be properties of a group of attributes.) |
Agree with @liancheng Distinctness is defined for a set of attributes. Ordering also needs to consider the sequence of the attributes. |
Sorry for the confusion -- these attribute constraints should still be on a per-operator basis (i.e., part of the |
I agree with @sameeragarwal that this probably doesn't belong in constraints since its a cross row concern. However, as he says, it would be nice to come up with general mechanism to reason about uniqueness and other cross row constraints as a function of a given |
I see. In the #9089, the key can only contain a single attribute. Will try to define a function in QueryPlan for uniqueness ASAP. Network outage now. Hopefully service will be back soon. Thanks! |
Added a function |
The motivation of this function |
Test build #54003 has finished for PR 11854 at commit
|
retest this please |
Test build #54013 has finished for PR 11854 at commit
|
I actually maybe like the see the following more in the form of a design doc (check out the constraints JRIA):
|
Yeah, completely agree. Will do it after DDL-related PRs are completed. Thanks! |
@gatorsmile can this be closed for now? |
This requires some discussions about how to add/use |
What changes were proposed in this pull request?
Distinct
is an expensive operation. If possible, we should avoid it. This PR is to eliminateDistinct
(theAggregate
forDistinct
) when the child operators can guarantee the value uniqueness,For example, in the following TPC-DS query 38, the left child of the first
Intersect
isDistinct
, and thus, we can remove the topDistinct
after convertingIntersect
toLeft-semi
+Distinct
.Note: Since we do not have the cardinality info, we are unable to conclude if the distinct of the right child can be removed. It totally depends on the data of the right child of
Intersect
. In this PR, we just remove the topDistinct
.Use a simplified query to show the effect of this PR:
Before the fix, the optimized plan is like
After the fix, the optimized plan is like
How was this patch tested?
Added a few test cases