
[v22.3.x] cloud_storage: parallelize remote_segment stop #9239

Merged

Conversation

andrwng (Contributor) commented Mar 2, 2023

Backport of #9207

CONFLICT:

  • omits changes to many_partitions_test which doesn't have a tiered storage case on this branch

The eviction loop currently runs serially and is waited on by each remote_partition upon stopping, which effectively serializes the partition_manager stopping each of its partitions. This resulted in what looked like a hang but was actually a series of waits to stop cloud segments.

This commit parallelizes segment stopping in both the eviction loop and the remote_partition stop call. A test that previously took over 9 minutes to shut down a node now takes 10 seconds.
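For readers unfamiliar with the pattern, here is a minimal Seastar-flavored sketch (not the actual Redpanda implementation) contrasting the old serial shutdown with the parallelized one described above; the `remote_segment` type and its `stop()` body are simplified stand-ins.

```cpp
// Minimal sketch using only Seastar primitives; remote_segment here is a
// simplified stand-in for the real cloud_storage type.
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>

#include <memory>
#include <vector>

struct remote_segment {
    // In the real code, stop() tears down readers and waits for in-flight
    // background work; here it just resolves immediately.
    seastar::future<> stop() { return seastar::make_ready_future<>(); }
};

using segment_list = std::vector<std::unique_ptr<remote_segment>>;

// Before: each stop() is awaited before the next begins, so total shutdown
// time is the sum of all per-segment stop latencies.
seastar::future<> stop_segments_serially(segment_list segments) {
    return seastar::do_with(std::move(segments), [](segment_list& segs) {
        return seastar::do_for_each(
          segs, [](auto& seg) { return seg->stop(); });
    });
}

// After: all stop() futures are started up front and awaited together, so
// total shutdown time is roughly that of the slowest single stop.
seastar::future<> stop_segments_in_parallel(segment_list segments) {
    return seastar::do_with(std::move(segments), [](segment_list& segs) {
        return seastar::parallel_for_each(
          segs, [](auto& seg) { return seg->stop(); });
    });
}
```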

I also considered moving the eviction loop into the remote_partition to further reduce partitions waiting for each other to complete, but that wasn't necessary to avoid this issue.

This also adds a test case, similar to the tiered storage many_partitions_test case, that reproduces slow shutdown with a large number of partitions and segments. On my local workstation, the test consistently fails to finish shutting down within 30 seconds without these changes.

(cherry picked from commit b8ba09c)

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Shutting down a server with many hydrated tiered storage segments and partitions is now significantly faster.

jcsp changed the title from "cloud_storage: parallelize remote_segment stop" to "[v22.3.x] cloud_storage: parallelize remote_segment stop" on Mar 2, 2023
andrwng force-pushed the v22.3.x-tiered-storage-slow-shutdown branch from adcce7a to c9f7991 on March 2, 2023 at 22:08
CONFLICT:

  • omits changes to many_partitions_test which doesn't have a tiered storage case on this branch
  • omits the cloud segment merging flag, which is not in v22.3.x

andrwng force-pushed the v22.3.x-tiered-storage-slow-shutdown branch from c9f7991 to b14e4ad on March 3, 2023 at 01:47
BenPope added the kind/backport (PRs targeting a stable branch) label on Apr 14, 2023
BenPope added this to the v22.3.x-next milestone on Apr 14, 2023
BenPope (Member) commented Apr 14, 2023

@andrwng anything blocking this?

andrwng (Contributor, Author) commented Apr 14, 2023

> @andrwng anything blocking this?

Thanks for the ping. Green CI and a review from the storage team. I retriggered CI and will poke some folks once that passes.

BenPope (Member) commented Apr 18, 2023

@andrwng anything blocking triage of the failures, making this ready for review, and assigning reviewers?

BenPope (Member) commented Apr 18, 2023

> Nothing's led me to be suspicious of this change specifically, but there are a lot of failures in this branch (over the course of a few retries):

Understood. CI failures are probably >50% of the reason I'm doing these reminders. My hope is that it'll settle. 🤞

piyushredpanda (Contributor) commented Apr 25, 2023

The CI failures seem to be #9646 and #10024.

andrwng merged commit 14834dc into redpanda-data:v22.3.x on Apr 25, 2023
Labels: area/redpanda, kind/backport (PRs targeting a stable branch)