
[v22.3.x] cloud_storage: parallelize remote_segment stop #9239

Merged

Conversation

andrwng (Contributor) commented Mar 2, 2023

Backport of #9207

CONFLICT:

  • omits changes to many_partitions_test which doesn't have a tiered storage case on this branch

The eviction loop currently runs serially and is waited on by each remote_partition upon stopping, which effectively serializes the partition_manager stopping each of its partitions. This resulted in what looked like a hang but was actually a series of waits to stop cloud segments.

This commit parallelizes segment stopping in both the eviction loop and the remote_partition stop call. A test that previously took over 9 minutes to shut down a node now takes 10 seconds.
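For readers unfamiliar with the pattern, here is a minimal Seastar-flavored sketch (not the actual Redpanda implementation) contrasting the old serial shutdown with the parallelized one described above; the `remote_segment` type and its `stop()` body are simplified stand-ins.

```cpp
// Minimal sketch using only Seastar primitives; remote_segment here is a
// simplified stand-in for the real cloud_storage type.
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>

#include <memory>
#include <vector>

struct remote_segment {
    // In the real code, stop() tears down readers and waits for in-flight
    // background work; here it just resolves immediately.
    seastar::future<> stop() { return seastar::make_ready_future<>(); }
};

using segment_list = std::vector<std::unique_ptr<remote_segment>>;

// Before: each stop() is awaited before the next begins, so total shutdown
// time is the sum of all per-segment stop latencies.
seastar::future<> stop_segments_serially(segment_list segments) {
    return seastar::do_with(std::move(segments), [](segment_list& segs) {
        return seastar::do_for_each(
          segs, [](auto& seg) { return seg->stop(); });
    });
}

// After: all stop() futures are started up front and awaited together, so
// total shutdown time is roughly that of the slowest single stop.
seastar::future<> stop_segments_in_parallel(segment_list segments) {
    return seastar::do_with(std::move(segments), [](segment_list& segs) {
        return seastar::parallel_for_each(
          segs, [](auto& seg) { return seg->stop(); });
    });
}
```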

I also considered moving the eviction loop into the remote_partition to further reduce partitions waiting for each other to complete, but that wasn't necessary to avoid this issue.

This also adds a test case, similar to the tiered storage many_partitions_test case, that reproduces slow shutdown with a large number of partitions and segments. On my local workstation, the test consistently fails to finish shutting down within 30 seconds without these changes.

(cherry picked from commit b8ba09c)

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Shutting down a server with many hydrated tiered storage segments and partitions is now significantly faster.

jcsp changed the title from "cloud_storage: parallelize remote_segment stop" to "[v22.3.x] cloud_storage: parallelize remote_segment stop" on Mar 2, 2023
andrwng force-pushed the v22.3.x-tiered-storage-slow-shutdown branch from adcce7a to c9f7991 on March 2, 2023 at 22:08
CONFLICT:

  • omits changes to many_partitions_test which doesn't have a tiered storage case on this branch
  • omits the cloud segment merging flag, which is not in v22.3.x

andrwng force-pushed the v22.3.x-tiered-storage-slow-shutdown branch from c9f7991 to b14e4ad on March 3, 2023 at 01:47
BenPope added the kind/backport (PRs targeting a stable branch) label on Apr 14, 2023
BenPope added this to the v22.3.x-next milestone on Apr 14, 2023
BenPope (Member) commented Apr 14, 2023

@andrwng anything blocking this?

andrwng (Contributor, Author) commented Apr 14, 2023

> @andrwng anything blocking this?

Thanks for the ping. Green CI and a review from the storage team. I retriggered CI and will poke some folks once that passes.

BenPope (Member) commented Apr 18, 2023

@andrwng anything blocking triage of the failures, making this ready for review, and assigning reviewers?

BenPope (Member) commented Apr 18, 2023

> Nothing's led me to be suspicious of this change specifically, but there are a lot of failures in this branch (over the course of a few retries):

Understood. CI failures are probably >50% of the reason I'm doing these reminders. My hope is that it'll settle. 🤞

piyushredpanda (Contributor) commented Apr 25, 2023

The CI failures seem to be #9646 and #10024.

andrwng merged commit 14834dc into redpanda-data:v22.3.x on Apr 25, 2023
Labels: area/redpanda, kind/backport (PRs targeting a stable branch)