c/archival_stm: do not reset _last_replicate on timeout #15677
Conversation
For same-term syncing we store the replicate command's future so that it can be awaited in the next sync call. However, the `std::exchange(_last_replicate, std::nullopt)` before the wait is problematic, as it "forgets" the last replicate. If sync times out and we retry it, it behaves as if the last replicate command succeeded. This is not necessarily true, as the newly added test assertion shows.

Use a shared future for `_last_replicate` so that it can be awaited multiple times, and reset it only after it is resolved. This guarantees that sync will only return after the last replicate command has actually resolved.

The `ignore_ready_future` call is removed because a shared future does not issue a warning if it is never consumed (basically a revert of e221a0a).
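To make the fix concrete, here is a minimal, hypothetical Seastar sketch (not the archival_stm code itself; `last_replicate` and the timings are stand-ins): with `ss::shared_future`, a timed-out wait does not consume the pending result, so a retried sync can await the very same replicate.

```cpp
// Hypothetical sketch, not the archival_stm code: with a shared_future,
// a timed-out wait does not consume the pending result, so a retried
// sync() can wait for the very same replicate to actually resolve.
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/with_timeout.hh>
#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, []() -> seastar::future<> {
        // Stand-in for the pending replicate command (resolves after 100ms).
        seastar::shared_future<int> last_replicate(
          seastar::sleep(100ms).then([] { return 42; }));

        // First sync() attempt: 10ms budget, times out. With a plain
        // ss::future this wait would have consumed the result; the
        // shared_future keeps it alive for the retry below.
        try {
            co_await seastar::with_timeout(
              std::chrono::steady_clock::now() + 10ms,
              last_replicate.get_future());
        } catch (const seastar::timed_out_error&) {
            std::cout << "sync timed out, retrying\n";
        }

        // Retried sync(): a fresh future over the same shared state.
        auto r = co_await seastar::with_timeout(
          std::chrono::steady_clock::now() + 1s,
          last_replicate.get_future());
        std::cout << "replicate resolved with " << r << "\n";
    });
}
```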
Force-pushed from 8afb056 to 299e12d
/dt
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/42877#018c6e4c-2fb6-4178-8619-2ee6c53c4969
/cdt
If this should be in v23.3.1, then let's ensure we are backporting to v23.3.x now that it is branched.
/cdt
```diff
-    std::optional<ss::future<result<raft::replicate_result>>> _last_replicate;
+    struct last_replicate {
+        model::term_id term;
+        ss::shared_future<result<raft::replicate_result>> result;
```
nit: could you add a comment explaining the requirement of shared_future here?
Will do in a subsequent PR. Merging to get it before the next CDT run.
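For reference, such a comment might look roughly like this (a sketch assuming the types from the diff above; the wording is illustrative, not the eventual follow-up PR):

```cpp
struct last_replicate {
    model::term_id term;
    // ss::shared_future rather than ss::future: sync() can time out and be
    // retried, and can be called concurrently from several fibers, so the
    // pending replicate result must be awaitable more than once. A plain
    // future is single-consumer and would be lost after the first wait.
    ss::shared_future<result<raft::replicate_result>> result;
};
std::optional<last_replicate> _last_replicate;
```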
```diff
-    auto fut = std::exchange(_last_replicate, std::nullopt).value();
-
-    if (!fut.available()) {
+    if (!_last_replicate->result.available()) {
```
> Use a shared future for _last_replicate so that it can be awaited multiple times and reset it only after it is resolved.

Just trying to understand the remaining problematic scenarios. Does this generalize to: sync() may be called concurrently by different fibers?
Yes, it does. To give at least one example, the ListOffsets Kafka RPC causes a sync() call too.
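A hypothetical sketch of that concurrent-callers scenario in plain Seastar (names are illustrative, not archival_stm code): each caller takes its own future over the same pending replicate via `get_future()`.

```cpp
// Hypothetical sketch: two concurrent callers of sync() can both wait on
// the same pending replicate, because shared_future::get_future() hands
// each of them an independent future over the shared state.
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/when_all.hh>
#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, []() -> seastar::future<> {
        seastar::shared_future<int> last_replicate(
          seastar::sleep(50ms).then([] { return 7; }));
        // E.g. the archival fiber and a ListOffsets-triggered sync().
        auto [a, b] = co_await seastar::when_all_succeed(
          last_replicate.get_future(), last_replicate.get_future());
        std::cout << "both waiters saw " << a << " and " << b << "\n";
    });
}
```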
/backport v23.3.x
/backport v23.2.x
Failed to create a backport PR to v23.2.x branch. I tried:
Backports Required
Release Notes