-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30598][SQL] Detect equijoins better #27309
[SPARK-30598][SQL] Detect equijoins better #27309
Conversation
8184b1b
to
ca782e7
Compare
Test build #117190 has finished for PR 27309 at commit
|
Test build #117191 has finished for PR 27309 at commit
|
Test build #117195 has finished for PR 27309 at commit
|
This query (and the optimization) is useful for users? Actually, it seems pgsql don't allow it:
|
To double-check / restate my own understanding of the example query: For inner joins:If we had an inner join of the form SELECT * FROM t1 JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 then that's effectively a cross-join (because we'll push each For full outer joins:Given SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 then the optimization proposed in this PR would cause us to hash partition rows according to Prior to this patch, our only choice was to plan this query as a broadcast nested loop join (BNLJ): if we exceeded the broadcast size limit then this query would fail. As a result, this PR's changes end up raising the limit on the size of data we can query. However, I think that this change might slightly regress performance in cases where one side of the join is very small: Spark currently doesn't support broadcast hash join for full outer joins, so queries which previously could fit as broadcast nested loop joins would instead become sort-merge joins. Sidebar: a multi-pass approach:I think that it might be possible to rewrite SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 as (SELECT * from t1 CROSS JOIN t2 where t1.c2 = 2 AND t2.c2 = 2)
UNION ALL
(SELECT t1.*, <nulls> from t1 where not (t1.c2 <=> 2))
UNION ALL
(SELECT <nulls>, t2.* from t2 where not (t2.c2 <=> 2)) (note the use of null-safe equals) That has the advantage of avoiding a shuffle for the non-matching rows at the cost of needing to scan each join input twice (since we don't have a great way to emit multiple output streams from a single task). This could also be helpful in case Automatically picking that plan is hard without really good cost-based optimization, though. For left joins:AFAIK the potential drawbacks for full outer joins (loss of broadcast join in cases where data is really tiny) don't apply to left joins (since we'd still be able to plan broadcast hash joins), so it seems like this effectively raises the scale limit by giving us an alternative to broadcast nested loop join when the data is very large. For other join types:I haven't considered any other join types. Summary:For joins where columns of different tables are related via being equal to the same constant value, it looks like this PR's changes give us an alternative to BNLJ in situations where the data is very large. @peter-toth, do you have a motivating use-case / more realistic example of where this query pattern occurs? My initial feeling is that this seems like a pretty niche optimization and it's not clear whether this occurs often enough in real queries to warrant the added complexity and potential corner-cases. |
@maropu I don't know why pgsql doesn't allow it, but Spark SQL does and the query makes sense. IMHO the 2 queries: @JoshRosen thanks for the detailed comment. This is a niche optimization indeed. Let me share why I raised this PR. I have another WIP PR here: #24553 and in the last commit I started playing with enabling constant propagation on join conditions too. I believe it could be beneficial on some niche inner joins e.g. |
@maropu , @JoshRosen I updated the description to reflect where this PR has a benefit. I believe the 2 selects should generate the same plan using SMJ as it is safer and the required change is pretty small. Please let me know if the change makes sense. If you think it doesn't then will I close the PR. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
The improvement in this is PR can extract equalities from join conditions so that we can recognise implicit equijoins.
E.g. this example query:
has the following plan currently:
But if we take
where the equality between the sides explicitly defined (
t1.c2 = t2.c2
) the plan is:The second plan is better as SMJ doesn't have the broadcast size limitation as BNLJ do.
After this PR the implicit equalities are detected and the first query has the same plan as the second.
Why are the changes needed?
Improve stability.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing and new UTs.