
c/archival_stm: do not reset _last_replicate on timeout #15677

Merged 1 commit into redpanda-data:dev on Dec 19, 2023

Conversation

@nvartolomei (Contributor) commented Dec 15, 2023:

For same-term syncing we store the replicate command's future so that it can be awaited in the next sync call. However, the `std::exchange(_last_replicate, std::nullopt)` before the wait is problematic because it "forgets" the last replicate: if sync times out and we retry it, it behaves as if the last replicate command succeeded. This is not necessarily true, as the newly added test assertion shows.

Use a shared future for `_last_replicate` so that it can be awaited multiple times, and reset it only after it has resolved. This guarantees that sync returns only after the last replicate command has actually resolved.

The `ignore_ready_future` call is removed because a shared future does not issue a warning if it isn't consumed (essentially a revert of e221a0a).
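
For illustration, a minimal sketch of the before/after behaviour, assuming Seastar's `ss::shared_future` API. This is a simplification, not the actual archival STM code: `stm_sketch` and the `replicate_result_t` placeholder are made up, and the real patch also tracks the replicate's term.

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/shared_future.hh>

    #include <optional>

    namespace ss = seastar;

    // Placeholder for result<raft::replicate_result>; an assumption for
    // the sketch, not the real type.
    using replicate_result_t = int;

    struct stm_sketch {
        // Before the patch: a one-shot future, dropped via std::exchange
        // before the wait, so a timed-out-and-retried sync() found
        // std::nullopt and wrongly concluded the last replicate succeeded:
        //   std::optional<ss::future<replicate_result_t>> _last_replicate;

        // After the patch: a shared future can hand out many waiters, so a
        // retried sync() re-awaits the very same in-flight replicate.
        std::optional<ss::shared_future<replicate_result_t>> _last_replicate;

        ss::future<> sync() {
            if (_last_replicate && !_last_replicate->available()) {
                // Wait without resetting: if this wait times out at the
                // caller, the next sync() lands here again and keeps
                // waiting on the same replicate.
                co_await _last_replicate->get_future().discard_result();
            }
            // Reset only once the replicate has actually resolved.
            _last_replicate = std::nullopt;
        }
    };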

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@nvartolomei (Contributor Author):
/dt

@nvartolomei (Contributor Author):
/dt skip-redpanda-build

@nvartolomei (Contributor Author):
/cdt dt-repeat=3

@nvartolomei (Contributor Author):
/cdt

@piyushredpanda (Contributor):
If this should be in v23.3.1, then let's ensure we backport to v23.3.x now that it has been branched.

@nvartolomei (Contributor Author):
/cdt

@nvartolomei marked this pull request as ready for review on December 18, 2023, 12:41.
@nvartolomei (Contributor Author):
/cdt num_nodes=24

@piyushredpanda added this to the v23.3.1-rc5 milestone on Dec 19, 2023.
    - std::optional<ss::future<result<raft::replicate_result>>> _last_replicate;
    + struct last_replicate {
    +     model::term_id term;
    +     ss::shared_future<result<raft::replicate_result>> result;
    + };
A reviewer (Contributor) commented on the diff above:
nit: could you add a comment explaining the requirement of shared_future here?

@nvartolomei (Contributor Author) replied:
Will do in a subsequent PR. Merging to get it before the next CDT run.

    - auto fut = std::exchange(_last_replicate, std::nullopt).value();
    -
    - if (!fut.available()) {
    + if (!_last_replicate->result.available()) {
A reviewer (Contributor) commented on the diff above:

> Use a shared future for _last_replicate so that it can be awaited
> multiple times and reset it only after it is resolved.

Just trying to understand the remaining problematic scenarios. Does this generalize to: sync() may be called concurrently by different fibers?

@nvartolomei (Contributor Author) replied:
Yes, it does. To give at least one example, the ListOffsets Kafka RPC causes a sync() call too.
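
To illustrate that point, a hedged sketch reusing the `replicate_result_t` placeholder from above (the promise and names are made up): each `get_future()` call on a Seastar shared future returns an independent waiter, and all of them resolve once the single underlying replicate does.

    ss::promise<replicate_result_t> p; // stands in for the in-flight raft replicate
    ss::shared_future<replicate_result_t> shared(p.get_future());

    // e.g. the archiver's own sync() and a ListOffsets-triggered sync():
    ss::future<replicate_result_t> waiter_a = shared.get_future();
    ss::future<replicate_result_t> waiter_b = shared.get_future();

    p.set_value(42); // both waiters now resolve with the same value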

@nvartolomei merged commit 5f34eed into redpanda-data:dev on Dec 19, 2023; 20 of 22 checks passed.
@vbotbuildovich (Collaborator):
/backport v23.3.x

@vbotbuildovich (Collaborator):
/backport v23.2.x

@vbotbuildovich (Collaborator):
Failed to create a backport PR to the v23.2.x branch. I tried:

    git remote add upstream https://github.com/redpanda-data/redpanda.git
    git fetch --all
    git checkout -b backport-pr-15677-v23.2.x-525 remotes/upstream/v23.2.x
    git cherry-pick -x 299e12dfb241df84068395c1568f357a1e16ddda

Workflow run logs.
