[v22.3.x] cloud_storage: parallelize remote_segment stop #9239
adcce7a to c9f7991
CONFLICT:
- omits changes to many_partitions_test, which doesn't have a tiered storage case on this branch
- omits the cloud segment merging flag, which is not in v22.3.x

The eviction loop is currently run serially and is waited for by each remote_partition upon stopping, which effectively serializes partition_manager stopping each of its partitions. This resulted in what looked like a hang, but was actually a series of waits to stop cloud segments.

This commit parallelizes segment stopping in both the eviction loop and in the remote_partition stop call. A test that previously took over 9 minutes to shut down a node now takes 10 seconds.

I also considered moving the eviction loop into the remote_partition to further reduce partitions waiting for each other to complete, but that wasn't necessary to avoid this issue.

This also adds a test case similar to the tiered storage many_partitions_test case that reproduces slow shutdown with a large number of partitions and a large number of segments. On my local workstation, the test consistently fails to finish shutting down in 30 seconds without these changes.

(cherry picked from commit b8ba09c)
c9f7991 to b14e4ad
@andrwng anything blocking this?
Thanks for the ping. Green CI and a review from the storage team. I retriggered CI, and will poke some folks once that passes.
@andrwng anything blocking triage of the failures, making this ready for review, and assigning reviewers?
Understood. CI failures are probably >50% of the reason I'm doing these reminders. My hope is that it'll settle. 🤞
Backport of #9207
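CONFLICT:
- omits changes to many_partitions_test, which doesn't have a tiered storage case on this branch
- omits the cloud segment merging flag, which is not in v22.3.x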
The eviction loop is currently run serially and is waited for by each remote_partition upon stopping, which effectively serializes partition_manager stopping each of its partitions. This resulted in what looked like a hang, but was actually a series of waits to stop cloud segments.
This commit parallelizes segment stopping in both the eviction loop and in the remote_partition stop call. A test that previously took over 9 minutes to shut down a node now takes 10 seconds.
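To illustrate why serial stopping adds up, here is a minimal Python sketch of the general idea. It is not the code in this PR: the real change is in C++, and `Segment`, `stop_serially`, `stop_in_parallel`, and `compare` are hypothetical names invented for this example. Each stop is modelled as a short asynchronous wait, so the serial version takes roughly the sum of the waits while the parallel version takes roughly the longest single wait.

```python
import asyncio
import time


class Segment:
    """Hypothetical stand-in for a cloud segment whose stop() has to wait."""

    def __init__(self, stop_delay: float) -> None:
        self.stop_delay = stop_delay
        self.stopped = False

    async def stop(self) -> None:
        # Simulate waiting for in-flight work to finish before stopping.
        await asyncio.sleep(self.stop_delay)
        self.stopped = True


async def stop_serially(segments):
    # Each stop is awaited before the next one starts, so waits accumulate.
    for seg in segments:
        await seg.stop()


async def stop_in_parallel(segments):
    # All stops are started together and awaited as a group.
    await asyncio.gather(*(seg.stop() for seg in segments))


async def timed(stop_fn, n: int, delay: float) -> float:
    # Build n fresh segments, stop them with stop_fn, return elapsed seconds.
    segments = [Segment(delay) for _ in range(n)]
    start = time.monotonic()
    await stop_fn(segments)
    elapsed = time.monotonic() - start
    assert all(seg.stopped for seg in segments)
    return elapsed


def compare(n: int = 20, delay: float = 0.01):
    # Returns (serial_seconds, parallel_seconds) for the same workload.
    serial = asyncio.run(timed(stop_serially, n, delay))
    parallel = asyncio.run(timed(stop_in_parallel, n, delay))
    return serial, parallel


if __name__ == "__main__":
    serial, parallel = compare()
    print(f"serial: {serial:.3f}s, parallel: {parallel:.3f}s")
```

The delays here are arbitrary; the sketch only shows the shape of the effect, not the 9 minute versus 10 second figures measured above.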
I also considered moving the eviction loop into the remote_partition to further reduce partitions waiting for each other to complete, but that wasn't necessary to avoid this issue.
This also adds a test case similar to the tiered storage many_partitions_test case that reproduces slow shutdown with a large number of partitions and a large number of segments. On my local workstation, the test consistently fails to finish shutting down in 30 seconds without these changes.
(cherry picked from commit b8ba09c)
Backports Required
Release Notes
Improvements