[SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable #28123
Conversation
@cloud-fan @maropu @gatorsmile Could you check this PR and see what you think about the approach? Thanks in advance!
Test build #120822 has finished for PR 28123 at commit
Test build #120821 has finished for PR 28123 at commit
Does this optimization always work well? e.g., 100000 buckets vs. 2 buckets.
Not always, so this is enabled only if the coalescing config is set to true.
If so, I think we need a threshold config to turn off this optimization when the numbers of buckets on the two joined sides have a large gap.
also cc: @viirya
Test build #121238 has finished for PR 28123 at commit
Test build #124194 has finished for PR 28123 at commit
retest this please
```scala
isScanOperation(s.left) &&
  isScanOperation(s.right) &&
  satisfiesOutputPartitioning(s.leftKeys, s.left.outputPartitioning) &&
  satisfiesOutputPartitioning(s.rightKeys, s.right.outputPartitioning)
```
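For context, a rough sketch of what a check like `satisfiesOutputPartitioning` might look like — an illustration of the idea rather than the exact code in this PR: the scan's output partitioning has to be a hash partitioning whose expressions match the join keys, so that the bucketed layout lines up with the join.

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning}

object BucketCoalescingCheckSketch {
  // Sketch: the scan's output partitioning must be a HashPartitioning whose
  // expressions semantically match the join keys; otherwise the bucketed layout
  // does not line up with the join keys and coalescing cannot avoid the shuffle.
  def satisfiesOutputPartitioning(
      keys: Seq[Expression],
      partitioning: Partitioning): Boolean = partitioning match {
    case HashPartitioning(expressions, _) if expressions.length == keys.length =>
      expressions.forall(e => keys.exists(_.semanticEquals(e)))
    case _ => false
  }
}
```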
Not related to this PR, but just an idea: we don't need to do a bucketed scan at all if it can't save shuffles. This can increase parallelism.
yea good idea.
Is this a follow-up issue?
I can give it a shot after this PR.
yea, thanks!
> we don't need to do bucket scan at all if it can't save shuffles. This can increase parallelism.
@cloud-fan IMO there are other benefits to doing a bucketed scan even when it can't save a shuffle, e.g. bucket filter pushdown. So we probably need to take that into consideration before disabling bucketing.
Test build #124205 has finished for PR 28123 at commit
Test build #124229 has finished for PR 28123 at commit
Test build #124258 has finished for PR 28123 at commit
retest this please
Test build #124271 has finished for PR 28123 at commit
Thanks! I appreciate your hard work, @imback82! Merged to master. Also, thanks for the reviews, @cloud-fan @viirya!
Thanks @imback82 for making this change! Sorry for the late comment, just a few questions:

(1) Is there a reason why we don't cover ShuffledHashJoin as well? (We are seeing in production that people also use ShuffledHashJoin a lot for joining bucketed tables when one side is small.)

(2) Per this PR, the ordering property of coalesced bucket files is not preserved (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L317), but the ordering can be preserved through a sort-merge-style read of all the sorted bucket files. This can help when reading multiple partitions of a bucketed table.

(3) We are seeing in production that coalescing might hurt parallelism if the number of buckets is too small. Another way to avoid the shuffle and sort is to split/divide the table with fewer buckets. E.g. when joining tables t1 (8 buckets) and t2 (32 buckets), we can keep the number of tasks at 32, and each task reading t1 applies a runtime filter to keep only its portion of the table (dividing the table with fewer buckets). This has the downside of reading t1 more than once from multiple tasks, but if t1 is not too big, it's a good trade-off for more parallelism (and may be better than shuffling t1 directly).

We have been running the above 3 features for years at Facebook (https://databricks.com/session_eu19/spark-sql-bucketing-at-facebook), and I would like to make or help with the follow-up changes if this sounds reasonable to everyone. cc @imback82, @cloud-fan, @maropu, @viirya, @gatorsmile and @sameeragarwal.
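A minimal sketch of the split/divide idea in (3), assuming Spark's bucket id is `pmod(hash(col), numBuckets)` (the SQL `hash()` function uses the same Murmur3 hash as bucketing); the table and column names are hypothetical and this is not the implementation referenced above:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical tables: t1 bucketed by "key" into 8 buckets, t2 into 32 buckets.
val t1 = spark.table("t1")
val numBucketsLarge = 32

// Conceptually, the task that handles bucket i of the 32-bucket side re-reads t1
// with a runtime filter keeping only the rows that would fall into bucket i if t1
// had 32 buckets. Each t1 bucket file ends up being read 32/8 = 4 times (the
// bucket-count ratio), but no shuffle or sort of either side is needed.
def t1SliceFor(bucketId: Int) =
  t1.where(expr(s"pmod(hash(key), $numBucketsLarge) = $bucketId"))
```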
Thanks for your interest, @c21
As you also said in (3), I think that's because of the concern that coalescing might hurt parallelism. You can see the related discussion in the history: #28123 (comment)
I think that's a long-standing issue we have. Have you checked the discussion in SPARK-24528? If you're interested in the issue, you can revisit it there.
I had a rough POC a few months back where each row is filtered out based on its bucket id, but I never got a chance to run benchmarks. @c21 Do you have some numbers to share? I am wondering how reading multiple copies impacts the overall runtime.
Thank you @maropu and @imback82
For (1): I created a PR to cover shuffled hash join as well - #29079. Could you help take a look? Thanks.
We have some internal numbers from production, e.g. we are seeing around 50% CPU improvement for specific queries. Comparing divide (N extra reads of the input table, where N is the ratio between the two tables' bucket counts) with non-divide (1 extra shuffle write and 1 extra shuffle read): if the bucket ratio between the joined tables is 2 (e.g. joining 512 buckets with 1024 buckets), dividing can be better than non-dividing, as we only do 1 extra read of the input table but avoid 1 shuffle write and 1 shuffle read (though this also depends on the efficiency of the shuffle service). I think we can add a benchmark in

Re POC - I feel the overall approach looks good to me, but IMO we should do the coalesce/divide in a physical plan rule, not a logical plan rule. Also, I think we can probably leave the vectorization code path aside for now, as it introduces too much change to handle vectorization when applying the filter. Let me know if you are still on it, or I can help with this feature as well. Thanks.
Thanks @c21!
Yes, I agree. (The POC was done before this PR, which also started with logical plan but changed to physical plan rule per suggestion)
Without supporting
I will clean up the POC and create a PR with some benchmark numbers.
@imback82 sounds good. I am refactoring to make the coalescing cover shuffled hash join as well in #29079. I would like to run the same query as you did here and verify the performance numbers. Could you share which TPCDS query you ran, or the query you used if you modified the benchmark? Thanks.

Sorry, I might be missing something, but I feel that's not true. From my understanding, whole-stage codegen is row-based and cannot work with vectorization yet.
…applicable

### What changes were proposed in this pull request?

Based on a follow-up comment in #28123, we can coalesce buckets for shuffled hash join as well. The note here is that we only coalesce the buckets on the shuffled hash join stream side (i.e. the side not building the hash map), so we don't need to worry about OOM from coalescing multiple buckets into the one task that builds the hash map.

> If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.

Refactor the existing physical plan rule `CoalesceBucketsInSortMergeJoin` to `CoalesceBucketsInJoin`, to cover shuffled hash join as well. Refactor the existing unit test `CoalesceBucketsInSortMergeJoinSuite` to `CoalesceBucketsInJoinSuite`, likewise.

### Why are the changes needed?

Avoiding the shuffle when joining differently bucketed tables is also useful for shuffled hash join. In production, we are seeing users use shuffled hash join to join bucketed tables (setting `spark.sql.join.preferSortMergeJoin=false` to avoid the sort), and this can help avoid the shuffle when the numbers of buckets are not the same.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `CoalesceBucketsInJoinSuite` for verifying the shuffled hash join physical plan.

### Performance number per request from maropu

I was looking at TPCDS per suggestion from maropu. But I found most TPCDS queries do aggregation, only several do joins, and none of the input tables are bucketed. So I took the approach of testing a modified version of `TPCDS q93`:

```
SELECT ss_ticket_number, sr_ticket_number
FROM store_sales
JOIN store_returns
ON ss_ticket_number = sr_ticket_number
```

and made `store_sales` and `store_returns` bucketed tables.
Physical query plan without coalesce:

```
ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft
:- Exchange hashpartitioning(ss_ticket_number#109L, 4), true, [id=#67]
:  +- *(1) Project [ss_ticket_number#109L]
:     +- *(1) Filter isnotnull(ss_ticket_number#109L)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2
+- *(2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130]
   +- *(2) Filter isnotnull(sr_ticket_number#120L)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4
```

Physical query plan with coalesce:

```
ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft
:- *(1) Project [ss_ticket_number#109L]
:  +- *(1) Filter isnotnull(ss_ticket_number#109L)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2
+- *(2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130]
   +- *(2) Filter isnotnull(sr_ticket_number#120L)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4 (Coalesced to 2)
```

Run time improvement as 50% of wall clock time:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
shuffle hash join:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join coalesce bucket off                1541           1664         106          1.9         535.1       1.0X
shuffle hash join coalesce bucket on                 1060           1169          81          2.7         368.1       1.5X
```

Closes #29079 from c21/split-bucket.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
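For readers who want to approximate the setup above, a rough sketch follows; the raw source tables and data scale are assumptions, while the query, the bucketed table names, the bucket counts, and the configs come from the description and plans above:

```scala
// Sketch of the benchmark setup described above (bucket counts 2 and 4 match the
// SelectedBucketsCount values in the plans; the "raw_*" source tables are hypothetical).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")   // rule out broadcast join
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")  // prefer shuffled hash join
spark.conf.set("spark.sql.bucketing.coalesceBucketsInJoin.enabled", "true")

spark.table("raw_store_sales").write.format("parquet")
  .bucketBy(2, "ss_ticket_number").saveAsTable("store_sales")
spark.table("raw_store_returns").write.format("parquet")
  .bucketBy(4, "sr_ticket_number").saveAsTable("store_returns")

spark.sql("""
  SELECT ss_ticket_number, sr_ticket_number
  FROM store_sales JOIN store_returns
  ON ss_ticket_number = sr_ticket_number
""").explain()
// With coalescing on, the store_returns scan should report
// "SelectedBucketsCount: 4 out of 4 (Coalesced to 2)" and no Exchange appears.
```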
### What changes were proposed in this pull request?

Add `CoalesceBucketsInJoin` to the AQE `preprocessingRules`.

### Why are the changes needed?

Bucketed joins were previously optimized by `CoalesceBucketsInJoin` (#28123). But when using AQE, `CoalesceBucketsInJoin` cannot match because the top of the Spark plan is `AdaptiveSparkPlan`.

The code:

```
val spark = SparkSession.builder()
  .appName("BucketJoin")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", true)
  .config("spark.driver.memory", "4")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", true)
  .enableHiveSupport()
  .getOrCreate()

val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(4, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(2, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain()
```

Before the PR:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [i#50], [i#56], Inner
   :- Sort [i#50 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(i#50)
   :     +- FileScan parquet spark_catalog.default.t1[i#50,j#51,k#52] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4
   +- Sort [i#56 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(i#56, 4), ENSURE_REQUIREMENTS, [plan_id=78]
         +- Filter isnotnull(i#56)
            +- FileScan parquet spark_catalog.default.t2[i#56,j#57,k#58] Batched: true, Bucketed: false (disabled by query planner), DataFilters: [isnotnull(i#56)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>
```

After the PR:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [i#50], [i#56], Inner
   :- Sort [i#50 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(i#50)
   :     +- FileScan parquet spark_catalog.default.t1[i#50,j#51,k#52] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4 (Coalesced to 2)
   +- Sort [i#56 ASC NULLS FIRST], false, 0
      +- Filter isnotnull(i#56)
         +- FileScan parquet spark_catalog.default.t2[i#56,j#57,k#58] Batched: true, Bucketed: true, DataFilters: [isnotnull(i#56)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/shezhiming/gh/zzzzming_new/spark/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 2 out of 2
```

Additional notes: we don't add `CoalesceBucketsInJoin` to `AdaptiveSparkPlanExec#queryStageOptimizerRules` because those rules are not applied to the initial plan; instead, they are applied in the `createQueryStages()` method. And `createQueryStages()` is bottom-up, which causes the exchange that should be eliminated to first be wrapped in a layer of `ShuffleQueryStage`, making it unrecognizable to `CoalesceBucketsInJoin`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a UT.

Closes #40688 from zzzzming95/SPARK-43021.

Authored-by: zzzzming95 <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
When two bucketed tables with different numbers of buckets are joined, a full shuffle can be introduced.
This PR proposes to introduce bucket coalescing to eliminate the full shuffle when the following conditions are met (a configuration sketch follows the list):

- `spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled` is set to `true`.
- The ratio of the numbers of buckets does not exceed `spark.sql.bucketing.coalesceBucketsInSortMergeJoin.maxBucketRatio`.
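For illustration, a minimal sketch of how the feature might be exercised; only the two configs and the plan annotation come from this PR, while the table names, column, data, and bucket counts below are hypothetical:

```scala
// Hypothetical example: sort merge join an 8-bucket table with a 4-bucket table on "i".
import spark.implicits._

spark.conf.set("spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled", "true")
spark.conf.set("spark.sql.bucketing.coalesceBucketsInSortMergeJoin.maxBucketRatio", "4")

val df1 = (0 until 100).map(i => (i % 13, i.toString)).toDF("i", "k")
val df2 = (0 until 100).map(i => (i % 7, i.toString)).toDF("i", "k")
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")

// With coalescing, the 8-bucket side is conceptually read as 4 coalesced buckets
// (bucket b maps to b % 4), so the join needs no Exchange.
spark.table("t1").join(spark.table("t2"), "i").explain()
// The t1 scan should report: SelectedBucketsCount: 8 out of 8 (Coalesced to 4)
```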
Why are the changes needed?
Eliminating the full shuffle can benefit scenarios where two large tables are joined. Especially when the tables are already bucketed but differ in the number of buckets, we can take advantage of the existing bucketing.
Does this PR introduce any user-facing change?
If the bucket coalescing conditions explained above are met, a full shuffle can be eliminated (also note that you will see
SelectedBucketsCount: 8 out of 8 (Coalesced to 4)
in the physical plan).
How was this patch tested?
Added unit tests