Improve bucketed table write parallelism for Presto on Spark #15934
Conversation
```java
@@ -598,17 +612,50 @@ else if (redistributeWrites) {
        !source.getProperties().isCompatibleTablePartitioningWith(shufflePartitioningScheme.get().getPartitioning(), false, metadata, session) &&
        !(source.getProperties().isRefinedPartitioningOver(shufflePartitioningScheme.get().getPartitioning(), false, metadata, session) &&
                canPushdownPartialMerge(source.getNode(), partialMergePushdownStrategy))) {
    PartitioningScheme exchangePartitioningScheme = shufflePartitioningScheme.get();
    if (node.getTablePartitioningScheme().isPresent() && isPrestoSparkAssignBucketToPartitionForPartitionedTableWriteEnabled(session)) {
        int writerThreadsPerNode = getTaskPartitionedWriterCount(session);
```
In Presto on Spark we always create 1 task per bucket. This is generally fine, as we can further partition data using local partitioning. However, when writing to a bucketed table the data cannot be further partitioned, as all the data for a single bucket has to be written to a single file. Thus, for a fragment that contains only a partitioned table writer operator, we have to make sure we assign at least as many buckets per task as there are threads available (`getTaskPartitionedWriterCount`).
What makes this PR special is that the `bucketToPartition` is assigned during the planning phase, while everywhere else it is assigned during the scheduling phase. This creates a precedent: the plan now contains information about the physical partition assignment, such as the number of partitions and the mapping to buckets, before it gets to the scheduler (or the RDD translator in the case of Presto on Spark).
Another approach would be to "reverse engineer" the plan during the scheduling phase (which in the case of Presto on Spark is a translation to RDDs). By "reverse engineer" I mean traversing the distributed plan, deducing which fragment is a "partitioned table writer only" fragment, and then assigning the `bucketToPartition` accordingly.
I don't really like either of the two solutions. I decided to go with the current approach as it is less likely to break, although on the other hand it violates some assumptions that we currently have.
I would really love to hear your thoughts.
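To make the requirement above concrete, here is a minimal standalone sketch (a hypothetical helper, not the actual Presto code) of assigning each group of `writerThreadsPerNode` consecutive buckets to the same output partition, so that every writer task receives enough buckets to keep all of its local writer threads busy:

```java
import java.util.Arrays;

public class BucketToPartitionSketch
{
    // Assign consecutive runs of writerThreadsPerNode buckets to one partition.
    // With 8 buckets and 4 writer threads per node this yields 2 partitions,
    // each holding 4 buckets — one bucket per local writer thread.
    static int[] assignBucketToPartition(int bucketCount, int writerThreadsPerNode)
    {
        int[] bucketToPartition = new int[bucketCount];
        for (int bucket = 0; bucket < bucketCount; bucket++) {
            bucketToPartition[bucket] = bucket / writerThreadsPerNode;
        }
        return bucketToPartition;
    }

    public static void main(String[] args)
    {
        // 8 buckets, 4 writer threads per node
        System.out.println(Arrays.toString(assignBucketToPartition(8, 4)));
        // prints [0, 0, 0, 0, 1, 1, 1, 1]
    }
}
```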
I don't think I fully understand the problem we are trying to solve, so I might ask some dumb questions. Is it fair to say that because we need to "assign at least as many buckets per task as there are threads available", we would need to assign `bucketToPartition` at the planning phase so that we could do `addExchanges` accordingly?
Why do we need to assign at least as many buckets per task as there are threads available? My impression is that when creating writers, each writer knows the bucket it is trying to write, so even if we have more threads than buckets, the extra threads should simply be idle. What am I missing here?
> Is it fair to say that because we need to "assign at least as many buckets per task as there are threads available", we would need to assign `bucketToPartition` at the planning phase so that we could do `addExchanges` accordingly?

Yeah, this is exactly what is happening.
> Why do we need to assign at least as many buckets per task as there are threads available? My impression is that when creating writers, each writer knows the bucket it is trying to write, so even if we have more threads than buckets, the extra threads should simply be idle. What am I missing here?

It is usually not a problem in classic Presto, as it is multi-tenant: if one query doesn't use the available CPUs, other queries will. Resource allocation in Spark works differently. In Spark we are allocated a fixed number of CPUs; if we are not utilizing them, we are effectively wasting them.
The solution makes sense to me. It is an improvement compared with the previous `bucketToPartition` allocation.
```java
int bucket = 0;
int partition = 0;
while (bucket < bucketCount) {
    for (int i = 0; i < writerThreadsPerNode && bucket < bucketCount; i++) {
        bucketToPartition[bucket] = partition;
        bucket++;
    }
    partition++;
}
```
`bucketToPartition[bucket] = bucket / writerThreadsPerNode`
This will create a skew. Buckets assigned must be non co-divisible.
Let me give you an example. Let's say we have 4 buckets and 2 writer threads. That means we will end up with 2 partitions. Let's assume the buckets are assigned using `bucketToPartition[bucket] = bucket / writerThreadsPerNode`. Partition 0 will get Bucket 0 and Bucket 2, and Partition 1 will get Bucket 1 and Bucket 3. Then locally we also take the modulo when assigning buckets to threads: `Bucket 0 % 2 = Thread 0`, `Bucket 2 % 2 = Thread 0`. So we will end up with a single thread writing 2 buckets. What we want is for each thread to have a single bucket to write. So we want to assign Bucket 0 and Bucket 1 to Partition 0, and Bucket 2 and Bucket 3 to Partition 1, so that when buckets are later assigned to threads, each thread has a bucket to write.
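To make the local thread-assignment mechanics above concrete, here is a small standalone sketch (assuming, as described in the comment, that a bucket is mapped to a writer thread by `bucket % writerThreadsPerNode` — the helper names are hypothetical):

```java
public class LocalThreadSkew
{
    // Hypothetical model of the local step: a bucket is mapped to a writer
    // thread by taking the bucket number modulo the thread count.
    static int threadFor(int bucket, int writerThreadsPerNode)
    {
        return bucket % writerThreadsPerNode;
    }

    public static void main(String[] args)
    {
        int threads = 2;
        // Strided partition contents {0, 2}: both buckets collapse onto
        // thread 0 and thread 1 idles.
        System.out.println(threadFor(0, threads) + " " + threadFor(2, threads)); // 0 0
        // Contiguous partition contents {0, 1}: each thread gets one bucket.
        System.out.println(threadFor(0, threads) + " " + threadFor(1, threads)); // 0 1
    }
}
```

This illustrates why the desired outcome is for each partition to hold a contiguous run of buckets: only then does the local modulo step spread one bucket onto each writer thread.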
Please check again: it is a range assignment, not a modulo-based distribution.
Discussed offline. I'm being slow. `bucketToPartition[bucket] = bucket % writerThreadsPerNode` doesn't work; `bucketToPartition[bucket] = bucket / writerThreadsPerNode` does exactly what I'm doing, with extra cycles. Let me fix that.
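A quick sketch confirming the point settled above: the explicit while-loop from the diff and the closed-form `bucket / writerThreadsPerNode` produce the same `bucketToPartition` array (class and method names here are illustrative, not the actual Presto code):

```java
import java.util.Arrays;

public class AssignmentEquivalence
{
    // The loop from the diff: hand out writerThreadsPerNode consecutive
    // buckets per partition, then move to the next partition.
    static int[] loopAssign(int bucketCount, int writerThreadsPerNode)
    {
        int[] bucketToPartition = new int[bucketCount];
        int bucket = 0;
        int partition = 0;
        while (bucket < bucketCount) {
            for (int i = 0; i < writerThreadsPerNode && bucket < bucketCount; i++) {
                bucketToPartition[bucket] = partition;
                bucket++;
            }
            partition++;
        }
        return bucketToPartition;
    }

    // The closed form suggested in the review.
    static int[] divisionAssign(int bucketCount, int writerThreadsPerNode)
    {
        int[] bucketToPartition = new int[bucketCount];
        for (int bucket = 0; bucket < bucketCount; bucket++) {
            bucketToPartition[bucket] = bucket / writerThreadsPerNode;
        }
        return bucketToPartition;
    }

    public static void main(String[] args)
    {
        // Exhaustively compare the two over a small grid of inputs.
        for (int buckets = 1; buckets <= 16; buckets++) {
            for (int threads = 1; threads <= 8; threads++) {
                if (!Arrays.equals(loopAssign(buckets, threads), divisionAssign(buckets, threads))) {
                    throw new AssertionError("mismatch for " + buckets + " buckets, " + threads + " threads");
                }
            }
        }
        System.out.println("loop and division assignments match");
    }
}
```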
presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/AddExchanges.java
Move partitioning assignment to PrestoSparkQueryExecutionFactory. This allows simply following the number of partitions set in the bucketToPartition when creating a Spark partitioner, instead of running the logic of assigning the number of partitions twice.
When writing to a partitioned (bucketed) table, ensure that each writer node has enough buckets to write to efficiently utilize all available concurrent threads.
Force-pushed from f8f78bb to 68689dc.
Make sure each partitioned (bucketed) table writer task is assigned `task_partitioned_writer_count` buckets to keep all available writer threads busy. Currently only 1 bucket per task is assigned, which results in thread starvation.