Optimize S3 storage writing for MSQ durable storage #16481
Conversation
Were the tests run on a middle manager or an indexer?
@LakshSingla On Indexer.
@Akshat-Jain: I had some doubts regarding the changes:
Please let me know your thoughts, if that's possible! 👍
@rohangarg Thanks for the detailed questions!
Hope these answer your questions. Happy to discuss further on any of the above points if needed, thanks! :)
I have commented on a few readability concerns with the latest set of changes.
a. I am curious as to how it performs on a non-local cluster.
b. As I understand, the PR introduces concurrency and also modifies the chunk size. I wonder what the improvement would be if we fine-tuned the chunk size and did not have concurrency in the job. So, measuring the time of the jobs like:
Original changes + 100 MB chunk size
Original changes + 20 MB chunk size
Original changes + 5 MB chunk size
New changes
Though the latter can be pretty tedious, so original vs. new is what I am primarily concerned about.
What is meant by an output-heavy query?
In which scenario is this possible?
Thanks, I checked earlier but I couldn't understand their timings much. For instance, were you able to reason about why the serial upload is taking ~8 minutes for 450 MB? Also, the chunk size in those queries seemed a bit degenerate to me.
What I actually meant was that the threads which are running and writing to the output stream are currently also able to write to S3 on the same thread, which might not be the case with this change, and the writes might have to wait. Are we solving that scenario in this change? Further, that brings me to another point I wanted to raise: have we considered supporting parallel uploads via the execution framework (maybe an operator to fan out writes) instead of doing it on a per-storage basis? If so, can you please share the reasons for not choosing that? One more positive, if it can be done via the execution framework, is that it will then automatically support all storage connectors.
You can create summary statistics over the wait times and post them in the logs - hopefully that's good enough :)
Yes, it makes sense to restrict the number of concurrent files. But I would do that in the executor-related code itself instead of leaking it into the output stream.
You can determine that the file is small when not even one local chunk is fully filled and the output stream is closed. And once you've determined that a file is small, it can be uploaded with a direct PUT instead of a multipart upload.
The serial upload is uploading 5 MB chunks one by one, hence it takes an extremely long time.
So, I had to reduce the chunk size to the lowest possible value to be able to simulate the upload of a large number of chunks, as I can't do it otherwise due to limited system resources. But it shouldn't affect the comparison/improvement trend, hence we can still rely on this data to get some idea of the degree of savings.
Actually, it's the opposite. Previously, the writes to the output stream (local file) had to wait for the previous chunk write+upload to finish. Now, the write doesn't have to wait for the previous uploads to finish (there's a backpressure semantic to restrict too many writes, but that's different from the current point). Please let me know if this answers the question, happy to clarify this further!
No, we hadn't discussed any approach like that. cc: @cryptoe
Can you please clarify what you mean by "leaking it into the output stream"? Sorry, I didn't get this part.
But this would work only when there's a single chunk. How about when there are 5 chunks, for example? We'd have to wait for the last chunk to be processed before we can make the decision that the file is small.
Ok, although the bandwidth comes out to be less than about 1 MB/sec with this calculation. So, I'm not sure about the realistic nature of the test.
Removing backpressure from the situation: previously, if you had written a chunk, you would upload it and wait for the upload to finish before starting to write another chunk - and all of that happened without any extra wait time. Now, if you've written a chunk, you'll put it in an executor which might be consumed by the parts of other writers, and you'll wait in the queue for your upload to be initiated. My question was: is the new behavior expected and fine?
I meant that having the semaphore code directly in the output stream makes it seem more like a behavior of the output stream rather than a property of the uploading mechanism.
Yes, this would only work for a single chunk. I never recommended extending it any further than a single chunk. Further, a multi-part request would involve 3 API calls (which are chargeable) for small uploads, as compared to a PUT call which only makes 1 API call. Also, I think the multi-part API was intended more to be used as a resilient mechanism for large file uploads. So, it feels unnecessary to me for small files. In any case, it was a suggestion for optimization and you can feel free to ignore it if you want.
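For illustration, here is a minimal sketch of the suggested small-file fast path (hypothetical class and method names; this is the reviewer's suggestion, not something the PR necessarily implements): if the stream closes before even one chunk fills up and no multipart upload was started, issue a single PUT instead.

```java
import java.io.File;

// Hypothetical sketch of the suggested small-file fast path, not the PR's code.
public class SmallFileUploadSketch
{
  private final long chunkSizeBytes;
  private boolean multipartStarted = false;

  public SmallFileUploadSketch(long chunkSizeBytes)
  {
    this.chunkSizeBytes = chunkSizeBytes;
  }

  // Decision made on close(): if not even one chunk filled up and no multipart
  // upload was initiated, a single PUT (1 API call) is cheaper than a multipart
  // upload (initiate + uploadPart + complete = 3 API calls).
  public void close(File localChunk, long bytesWritten)
  {
    if (!multipartStarted && bytesWritten < chunkSizeBytes) {
      putObject(localChunk);
    } else {
      completeMultipartUpload();
    }
  }

  private void putObject(File file)
  {
    // Plain PutObject request for the whole (small) file.
  }

  private void completeMultipartUpload()
  {
    // Complete the multipart upload for the already-uploaded parts.
  }
}
```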
The data I mentioned was the average of 3+ readings. I can do it again (once all review comments are addressed and everyone is okay with the implementation).
Yes, the upload will happen sooner (despite the potential wait) than it would have happened in the original sequential code.
We don't want the main threads to be blocked by the upload threads (except for the last chunk, where we have to complete the multipart upload). Hence, the main thread cannot release the semaphore after upload() finishes, if that's what the suggestion was.
Cool, I agree then. The only counterpoint I have would be the maintainability angle. I'll discuss it with Karan offline if it makes sense to go for it. Either way, this can go in a separate follow-up PR, if it has to. Hope that works!
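A minimal sketch of the backpressure scheme described in this exchange (illustrative names, permit count, and pool size; not the actual implementation): the writer acquires a permit before handing off a finished chunk, and the permit is released on the upload thread once the upload completes, so the writing thread is never blocked on upload() itself.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the backpressure discussed above; the pool size and
// permit count are illustrative, not the PR's actual defaults.
public class BackpressureSketch
{
  private final Semaphore diskPermits;
  private final ExecutorService uploadExecutor = Executors.newFixedThreadPool(4);

  public BackpressureSketch(int maxChunksOnDisk)
  {
    this.diskPermits = new Semaphore(maxChunksOnDisk);
  }

  // Writer thread: blocks only if too many un-uploaded chunks are already on disk.
  public void onChunkFinished(Runnable uploadTask) throws InterruptedException
  {
    diskPermits.acquire();
    uploadExecutor.submit(() -> {
      try {
        uploadTask.run();
      }
      finally {
        // The upload thread releases the permit after the chunk is uploaded,
        // so the writer never waits for upload completion itself.
        diskPermits.release();
      }
    });
  }
}
```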
chunkSize = Math.max(chunkSize, s3ExportConfig.getChunkSize().getBytes());
}

return (int) (S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES / chunkSize);
Shouldn't the numerator be the overall disk size available for keeping chunks?
The value used here, S3OutputConfig.S3_MULTIPART_UPLOAD_MAX_PART_SIZE_BYTES, is 5 GB, which just seems to be the maximum object size limit imposed by S3.
@kfaraz I was trying to do it similar to how it's done in WorkerStorageParameters. IIUC, in WorkerStorageParameters we have a hard-coded size of 1 GB (MINIMUM_SUPER_SORTER_TMP_STORAGE_BYTES = 1_000_000_000L), which is used instead of computing the overall available disk space.
Also, I don't think we should use the overall disk size available for this calculation; we should rather have a hard limit. I chose 5 GB as that's the maximum chunk size allowed by S3. In the default case, the chunk size is 100 MB, so this means around 50 chunks allowed on disk at once, which seems fine?
Thoughts? cc: @cryptoe
5 GB doesn't seem like a bad default, but limiting it to 5 GB even when we might have much more disk available seems weird. Also, if 5 GB is the limit we want to use, we should define a separate constant rather than reuse the constant for the max part size.
Regarding "we should use the overall disk size available for this calculation": just to clarify, I have not suggested using the entire disk space for storing chunks, but rather the portion actually allocated for this purpose. That allocation may be a hard-coded value, a fixed percentage of the overall disk space, or even a config.
In any case, that amount of disk space should be passed as an argument to the computeMaxNumChunksOnDisk method.
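In other words, something along these lines (a sketch of the suggestion with assumed parameter names, not the merged code):

```java
// Sketch of the suggested signature: the disk space reserved for pending chunks
// (a hard-coded value, a percentage of the disk, or a config) is passed in
// explicitly instead of reusing the 5 GB max-part-size constant.
public class ChunkLimitSketch
{
  static int computeMaxNumChunksOnDisk(long allottedChunkStorageBytes, long chunkSizeBytes)
  {
    return (int) (allottedChunkStorageBytes / chunkSizeBytes);
  }
}
```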
@kfaraz I had a discussion with Karan about it. The conclusion was that we don't need to block this PR for this review comment, and a follow-up task could be to add a metric for gathering data on how long the processing threads are actually blocked on this constraint. We could then tune this accordingly based on the gathered data.
Hope that works!
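A hedged sketch of what that follow-up could look like (the class name and wiring are assumptions, not actual Druid metrics): time how long a processing thread waits for a disk permit and accumulate it for logging or emission.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the follow-up metric: measure how long processing
// threads block waiting for a disk permit, so the chunk limit can be tuned later.
public class PermitWaitTimerSketch
{
  private final Semaphore diskPermits;
  private final AtomicLong totalWaitNanos = new AtomicLong();

  public PermitWaitTimerSketch(int maxChunksOnDisk)
  {
    this.diskPermits = new Semaphore(maxChunksOnDisk);
  }

  public void acquirePermit() throws InterruptedException
  {
    final long start = System.nanoTime();
    diskPermits.acquire();
    totalWaitNanos.addAndGet(System.nanoTime() - start);
  }

  // Could be emitted as a metric or logged as a summary statistic on close().
  public long totalWaitMillis()
  {
    return TimeUnit.NANOSECONDS.toMillis(totalWaitNanos.get());
  }
}
```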
Sounds good. 👍🏻
Thanks for addressing the comments, @Akshat-Jain!
I have left a few more comments to clean up the tests, but they need not block this PR and may be addressed in a follow-up.
@kfaraz Thank you for the extremely exhaustive review comments, got to learn a LOT from them throughout this PR! 😄
Happy to help, @Akshat-Jain! 🙂
@Akshat-Jain, while looking into some test failures on one of my PRs, I noticed that a test in RetryableS3OutputStreamTest is flaky. It's also easily reproducible in the IDE: try running the test with the "Repeat Until Failure" configuration. I haven't looked into what's causing the flakiness, but it seems like there's an ordering mismatch in the uploaded S3 chunks. I'm not sure if there's a race condition in the upload executor added in this PR, but I thought I'd mention it in case this is a legitimate issue.
As part of #16481, we have started uploading the chunks in parallel. That means it's no longer guaranteed that the part which finished uploading last is less than or equal to the chunkSize (the final part could have finished uploading earlier). This made a test in RetryableS3OutputStreamTest flaky, where we were asserting that the final part should be smaller than the chunk size. This commit fixes the test, and also adds another test where the file size is such that all chunks would be of equal size.
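A sketch of the relaxed assertion being described (a hypothetical helper, not the actual RetryableS3OutputStreamTest code): with parallel uploads, the completion order of parts is not deterministic, so the test should only check part sizes and their sum, not which part finished last.

```java
import java.util.List;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

// Hypothetical helper: with parallel uploads, only sizes are deterministic,
// not the order in which parts finish uploading.
public class PartSizeAssertions
{
  static void assertPartSizes(List<Long> uploadedPartSizes, long totalBytesWritten, long chunkSize)
  {
    long sum = 0;
    for (long partSize : uploadedPartSizes) {
      assertTrue("each part is at most one chunk", partSize <= chunkSize);
      sum += partSize;
    }
    assertEquals("parts add up to the bytes written", totalBytesWritten, sum);
  }
}
```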
Description
This PR optimizes S3 storage writing for MSQ durable storage by uploading the chunks to S3 in a separate threadpool.
Previously, each chunk was created and uploaded to S3 sequentially, which isn't optimal.
This PR changes the logic such that the creation of chunks still happens in the same processing threadpool, but the uploading of chunks is offloaded to a separate threadpool.
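As a rough illustration of the idea (the class and method names below are hypothetical, not the PR's actual RetryableS3OutputStream or S3UploadManager code): the processing thread hands each finished chunk to a shared upload pool and only waits for outstanding uploads when the stream is closed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: finished chunks are uploaded by a shared background pool
// so the processing thread that creates chunks never blocks on an S3 upload.
public class ChunkUploaderSketch
{
  private final ExecutorService uploadExecutor = Executors.newFixedThreadPool(4);
  private final List<CompletableFuture<Void>> pendingUploads = new ArrayList<>();

  // Called by the processing thread as soon as a chunk file is fully written.
  public void submitChunk(File chunkFile, int partNumber)
  {
    pendingUploads.add(
        CompletableFuture.runAsync(() -> uploadPart(chunkFile, partNumber), uploadExecutor)
    );
  }

  // Called once from close(): this is the only point where the processing thread
  // waits, because completing the multipart upload needs every part to be done.
  public void awaitAllUploadsAndShutdown()
  {
    CompletableFuture.allOf(pendingUploads.toArray(new CompletableFuture[0])).join();
    uploadExecutor.shutdown();
  }

  private void uploadPart(File chunkFile, int partNumber)
  {
    // Placeholder for the actual S3 uploadPart call and local chunk cleanup.
  }
}
```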
Test Plan
Timing Data
The following section summarizes the query duration comparison for a few different queries. We compare the original duration with the new duration across a few combinations of parameters.
Note: trips_xaa is one of the sample datasources, with 60M rows. Additional details are captured in the Other Notes column of the timing tables.
Query 1
Original timings:
New timings:
Query 2
Original timings:
New timings:
Query 3
Original timings:
New timings:
Query 4
Original timings:
New timings:
Query 5
The optimization is much more significant for I/O-bound queries like this one, with durable storage enabled for storing intermediate stage outputs as well as the query results. This query returned 56.8 MB of data.
Original timings:
New timings:
Query 6
This is another I/O-bound query like the previous one. This query returned 454 MB of data.
Original timings:
New timings:
This PR has: