
kvserver: runtime assertions for double-application and duplicate proposals #115771

Closed
erikgrinaker opened this issue Dec 7, 2023 · 25 comments · Fixed by #116319
Assignees: erikgrinaker
Labels
  • A-kv-replication: Relating to Raft, consensus, and coordination.
  • branch-release-23.2: Used to mark GA and release blockers, technical advisories, and bugs for 23.2.
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
  • T-kv: KV Team.

Comments

erikgrinaker (Contributor) commented Dec 7, 2023

In the context of #114421, we should ensure we have sufficient runtime assertions to detect double-application/replays of Raft proposals. This may already exist, but we should confirm. Ideally, this would be enabled in roachtests.

Jira issue: CRDB-34202

Epic CRDB-32846

erikgrinaker added the C-enhancement, A-kv-replication, and T-kv-replication labels on Dec 7, 2023

blathers-crl bot commented Dec 7, 2023

cc @cockroachdb/replication


blathers-crl bot commented Dec 7, 2023

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

erikgrinaker added the branch-release-23.2 label on Dec 7, 2023
pav-kv (Collaborator) commented Dec 7, 2023

We have this flag tracking whether application has happened, and safety checks in a few places asserting that it is not true. I think this is scoped to one sub-proposal, though, not the entire proposal.

After the refactoring that made it possible to have multiple ProposalData per single "logical" proposal, this flag exists in multiple copies, so I'm not absolutely sure that setting it to true prevents other parts of the pipeline from seeing it as false (if they read from a copied ProposalData).
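To make the concern concrete, here's a toy illustration of the copy hazard (not the real kvserver types; only the field name v2SeenDuringApplication is borrowed from the code quoted further down):

```go
package main

import "fmt"

// proposalData is a stand-in for kvserver's ProposalData; the flag plays the
// role of v2SeenDuringApplication.
type proposalData struct {
	id      string
	applied bool
}

func main() {
	orig := &proposalData{id: "abc123"}

	// A reproposal that copies the struct by value snapshots the flag.
	reproposal := *orig

	// Application marks the original as seen...
	orig.applied = true

	// ...but the copy still reports false, so a check that happens to read
	// the copied ProposalData would not notice the earlier application.
	fmt.Println(orig.applied, reproposal.applied) // true false
}
```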

pav-kv (Collaborator) commented Dec 7, 2023

Which again leads me to conclude that the reproposals area needs cleaner lifetime tracking. There are things that are "common" across all the (re-)proposals of a single proposal, and things that are unique to each. The previous design put everything into a single mutable struct; the new design puts everything into separate semi-immutable structs but copies/moves the shared things between them. The design we need: a shared part for the "logical" proposal, and a set of per-reproposal structs that link into it.

The "shared" proposal struct then makes it possible to track the entire state of a single proposal rather than piece it together from multiple structs. It also makes invariant checks more reliable, like the "applied only once" check. Issue #115759 is a stepping stone towards such a restructure - we can first build a solid set of invariants in test-only environment, plus cover it with tests, and then slowly transition it to better code with inline assertions.

A previous problem we hit because of this copying: #108775.
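A rough sketch of the shape being proposed here; every name below is made up for illustration and none of it exists in kvserver today:

```go
package kvsketch

import "fmt"

// sharedProposal holds the state common to all (re-)proposals of one logical
// client proposal, including the single place that records application.
type sharedProposal struct {
	commandID string
	applied   bool // set exactly once, checked on every application attempt
}

// reproposal holds per-attempt state and links back to the shared part
// instead of copying it.
type reproposal struct {
	shared            *sharedProposal
	leaseAppliedIndex uint64
}

// markApplied enforces the "applied only once" invariant on the shared part,
// so it holds no matter which reproposal reaches application first.
func (r *reproposal) markApplied() error {
	if r.shared.applied {
		return fmt.Errorf("proposal %s applied twice", r.shared.commandID)
	}
	r.shared.applied = true
	return nil
}
```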

erikgrinaker self-assigned this on Dec 13, 2023
erikgrinaker (Contributor, Author) commented:

Downgrading to non-blocker: this is test coverage.

erikgrinaker (Contributor, Author) commented:

The "shared" proposal struct then makes it possible to track the entire state of a single proposal rather than piece it together from multiple structs. It also makes invariant checks more reliable, like the "applied only once" check. Issue #115759 is a stepping stone towards such a restructure - we can first build a solid set of invariants in test-only environment, plus cover it with tests, and then slowly transition it to better code with inline assertions.

That's a longer-term change though. We need something in the next week to try to track down #114421. Thoughts on this, @pavelkalinnikov?

erikgrinaker (Contributor, Author) commented Dec 13, 2023

So we already have this assertion:

// INVARIANT: a proposal is consumed (i.e. removed from the proposals map)
// the first time it comes up for application. (If the proposal fails due
// to an illegal LeaseAppliedIndex, a new proposal might be spawned to
// retry but that proposal will be unrelated as far as log application is
// concerned).
//
// INVARIANT: local proposals are not in the proposals map while being
// applied, and they never re-enter the proposals map either during or
// afterwards.
//
// (propBuf.{Insert,ReinsertLocked} ignores proposals that have
// v2SeenDuringApplicationSet to make this true).
if cmd.proposal.v2SeenDuringApplication {
	err := errors.AssertionFailedf("ProposalData seen twice during application: %+v", cmd.proposal)
	logcrash.ReportOrPanic(d.r.AnnotateCtx(cmd.ctx), &d.r.store.ClusterSettings().SV, "%v", err)
	// ...
}

I don't think that does what we want though, because as you say, it only applies to a single sub-(re)proposal, not the logical client proposal.

I know Tobi added a bunch of other assertions that were later removed because they didn't actually hold as invariants. I'll try to dig those up just to see what we've already tried.

pav-kv (Collaborator) commented Dec 13, 2023

In a general sense, asserting/detecting double-apply situations is a costly problem. It basically requires a global map of all the proposal IDs ever, and checking that an applied command is not yet in this map.

On the leaseholder, we can reduce the scope of this checking to a single lease epoch. The Replica.mu.proposals map sort of plays the role of "all the proposal IDs ever" (scoped to this lease epoch), but it forgets proposals after application. And that's the "problem": what if there is another application soon after we forgot the proposal?

So we can't use this simple map-based ID check. Instead, we must rely on maths (e.g. the INVARIANT comments quoted above).
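As a toy illustration of that gap (hypothetical code, not the real Replica machinery): once the map forgets an ID on first application, a second application of the same ID is indistinguishable from a command proposed elsewhere.

```go
package main

import "fmt"

func main() {
	// In-flight local proposals by command ID, loosely modelling
	// Replica.mu.proposals.
	proposals := map[string]bool{"cmd-1": true}

	apply := func(id string) {
		if !proposals[id] {
			// A duplicate of an already-forgotten proposal ends up here,
			// looking just like a command proposed by another node.
			fmt.Println("applying unknown command:", id)
			return
		}
		fmt.Println("applying local command:", id)
		delete(proposals, id) // forget on first application
	}

	apply("cmd-1") // applying local command: cmd-1
	apply("cmd-1") // duplicate, but nothing flags it
}
```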

erikgrinaker (Contributor, Author) commented:

Replica.mu.proposals also only tracks proposals that were submitted by the local node, so it won't catch double-application on followers. It also won't detect double-application via snapshot application.

pav-kv (Collaborator) commented Dec 13, 2023

The sort of things we rely on are along the lines of (see the sketch after this list):

  • there is always at most one appliable proposal in flight
  • we only repropose it with a new LAI if it came back from the application pipeline as a no-op (at which point there are exactly zero appliable proposals in flight).
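A toy model of those two invariants, purely illustrative; none of these types exist in kvserver:

```go
package kvsketch

import "errors"

type proposalState int

const (
	idle     proposalState = iota // no appliable attempt in flight
	inFlight                      // exactly one appliable attempt in flight
	applied                       // terminal: applied exactly once
)

// logicalProposal models one client proposal across all of its reproposals.
type logicalProposal struct{ state proposalState }

// propose asserts "at most one appliable proposal in flight".
func (p *logicalProposal) propose() error {
	if p.state != idle {
		return errors.New("invariant violation: appliable proposal already in flight")
	}
	p.state = inFlight
	return nil
}

// cameBackAsNoop records that the in-flight attempt came back as a no-op
// (e.g. an illegal LAI); only then may a reproposal with a new LAI be issued.
func (p *logicalProposal) cameBackAsNoop() { p.state = idle }

// apply asserts that the logical proposal is applied at most once.
func (p *logicalProposal) apply() error {
	if p.state == applied {
		return errors.New("invariant violation: proposal applied twice")
	}
	p.state = applied
	return nil
}
```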

pav-kv (Collaborator) commented Dec 13, 2023

> so it won't catch double-application on followers. It also won't detect double-application via snapshot application

Exactly. On the followers, detecting this is costly: we basically need to introduce a map keyed by some ID to detect dups. On the leaseholder this is somewhat free because we need this map for other purposes. But it isn't exactly the map that would detect all dups flawlessly, because we remove entries from it.

Snapshot application can happen on followers and (as I learned recently) on the leaseholder, so yeah, this additionally makes it impossible to observe all the applied IDs even on the leaseholder.

erikgrinaker (Contributor, Author) commented Dec 13, 2023

> asserting/detecting double-apply situations is a costly problem. It basically requires a global map of all the proposal IDs ever, and checking that an applied command is not yet in this map.

We could assume that double-apply is most likely to happen for recent operations, though, and keep a ring buffer of e.g. the last 100k command IDs. We could persist this ring buffer in the state machine so that it's propagated in snapshots.

Clearly, we could only do this in test builds, since it's likely going to be too costly otherwise. But it doesn't seem terribly complicated to implement something like this (famous last words).
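A minimal sketch of that idea, assuming a test-only build; the types, the eviction policy, and the buffer size are placeholders, and persisting the buffer in the state machine so it travels with snapshots is not shown:

```go
package main

import "fmt"

// recentApplies remembers the last N applied command IDs and complains if one
// of them shows up again. Test builds only; this is not production code.
type recentApplies struct {
	ring []string        // last N applied command IDs, oldest evicted first
	next int             // next slot to overwrite
	seen map[string]bool // membership index over the ring
}

func newRecentApplies(n int) *recentApplies {
	return &recentApplies{ring: make([]string, n), seen: make(map[string]bool, n)}
}

// record returns an error if cmdID was already applied within the last N
// applications.
func (r *recentApplies) record(cmdID string) error {
	if r.seen[cmdID] {
		return fmt.Errorf("command %s applied twice within the last %d applications", cmdID, len(r.ring))
	}
	if old := r.ring[r.next]; old != "" {
		delete(r.seen, old) // evict the oldest entry
	}
	r.ring[r.next] = cmdID
	r.seen[cmdID] = true
	r.next = (r.next + 1) % len(r.ring)
	return nil
}

func main() {
	ra := newRecentApplies(100000)
	fmt.Println(ra.record("abc")) // <nil>
	fmt.Println(ra.record("abc")) // error: command abc applied twice ...
}
```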

pav-kv (Collaborator) commented Dec 13, 2023

For test-only purposes (especially small/mid-size unit tests), we can just have a global map of all the proposals, track their lifetime transitions, and check all the invariants and semantic expectations. This is adjacent to #115759.

erikgrinaker (Contributor, Author) commented:

Sure, that covers the proposal lifecycle, but it doesn't fully address the double-apply problem, does it? Because of snapshots, lease transfers, etc.

pav-kv (Collaborator) commented Dec 13, 2023

In tests we can intercept all the apply events, both on leaseholders and followers. Every log index will be applied explicitly (not via a snapshot) at least once, so we can build a global picture of applied commands by joining the lifetimes across replicas.

erikgrinaker (Contributor, Author) commented:

I think we want something we can assert in roachtests and test clusters, which exercise a wider range of scenarios.

pav-kv (Collaborator) commented Dec 13, 2023

I guess the same technique can be used in a distributed environment. Periodically dump all the applied (index, command ID) pairs to a log/table/whatever. At the end of the test, do a global join of these and see if there are command ID dups.

pav-kv (Collaborator) commented Dec 13, 2023

If this "applied commands" is a SQL table, things could be detected timely by inserting batches of recently applied commands to this table, and having some clever key constraints that would detect dups for us.

erikgrinaker (Contributor, Author) commented Dec 13, 2023

Maybe we can build something like this into kvnemesis. It already does similar tracking and analysis for serializable transaction schedules.

Note that we need something here that can be built, merged, and backported in ~2 days, so let's not get too fancy.

pav-kv (Collaborator) commented Dec 13, 2023

Does kvnemesis use TestCluster? If so, we could inject a testing knob that dumps all the applied (command ID, raft index) pairs to us, which the test can then immediately put into a Go map[commandID]index to detect a commandID dup.

We basically need the TestingPostApplySideEffectsFilter interceptor from PR #114191 or some such (crucially, it has to be called on every replica that applies the command, not just the leaseholder; this guarantees that all applied commands get recorded irrespective of snapshots).
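A sketch of what the test-side check could look like, assuming some post-apply hook hands us (rangeID, commandID, raftIndex) on every replica that applies a command; the hook wiring and the names below are placeholders, not the actual knob from #114191:

```go
package kvtestsketch

import (
	"fmt"
	"sync"
)

type appliedKey struct {
	rangeID   int64
	commandID string
}

// applyTracker records the raft index at which each (rangeID, commandID) was
// applied. Every replica applying the same entry reports the same index, so a
// second, different index for the same key indicates a double-application.
type applyTracker struct {
	mu      sync.Mutex
	applied map[appliedKey]uint64
}

func newApplyTracker() *applyTracker {
	return &applyTracker{applied: map[appliedKey]uint64{}}
}

// onApply is what the hypothetical post-apply filter would call on every
// replica, leaseholder and followers alike.
func (t *applyTracker) onApply(rangeID int64, commandID string, raftIndex uint64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := appliedKey{rangeID: rangeID, commandID: commandID}
	if prev, ok := t.applied[key]; ok && prev != raftIndex {
		return fmt.Errorf("r%d: command %s applied at raft indexes %d and %d",
			rangeID, commandID, prev, raftIndex)
	}
	t.applied[key] = raftIndex
	return nil
}
```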

erikgrinaker (Contributor, Author) commented Dec 13, 2023

Yes, it uses TestCluster. I think we already have what we need in TestingPostApplyFilter, assuming we can get the log index from ReplicatedEvalResult.State.RaftAppliedIndex or LeaseAppliedIndex or some such.

We'll also want to track this across splits/merges.

pav-kv (Collaborator) commented Dec 13, 2023

Command IDs (or proposal IDs; I need to check what they are called) are likely globally unique, so the across splits/merges aspect should come automatically. Just track all proposals globally.

erikgrinaker (Contributor, Author) commented Dec 13, 2023

> likely globally unique

They're not; they're a random 8-byte sequence, so there is a non-zero collision probability.

> the across splits/merges aspect should come automatically

If we're going to tie them to log indexes, the log indexes will change with splits/merges. So we'll probably need a range/index tuple.

But I don't know if it even makes sense to track command IDs across ranges, since proposals, logs, and applications are range-local by definition. I don't see how a proposal/command could leak to a different Raft group.

erikgrinaker (Contributor, Author) commented:

We are probably fine to assume it won't happen at the unit test scale. But to be safer we can mix the range ID into the key, so that we find dups for (rangeID, commandID).

If it ever becomes a problem, we can inject a different command ID generator that's guaranteed to be unique in unit tests, e.g. UUIDv1.

pav-kv (Collaborator) commented Dec 13, 2023

Apparently GitHub can double-apply proposals too. I sent a comment and it appeared twice in the timeline. So I removed the second one. Now there is none :)

I was saying that proposals can be carried across ranges (they stay with the LHS during splits and are inherited from the LHS during merges), but they will be associated with the same rangeID because the LHS keeps it.

The proposal ID being a random int64 can become a problem if there are too many proposals being tracked at the same time. The probability of a collision reaches ~50% if we reach ~5 billion proposals (see "birthday paradox").
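For the record, the standard birthday-bound approximation backs up that figure: with $N = 2^{64}$ possible IDs and $n$ proposals tracked at once,

$$P(\text{collision}) \approx 1 - e^{-n^2/(2N)}, \qquad P = \tfrac{1}{2} \;\text{at}\; n \approx \sqrt{2N\ln 2} \approx 5.1 \times 10^9.$$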

> We are probably fine to assume it won't happen at the unit test scale. But to be safer we can mix the range ID into the key, so that we find dups for (rangeID, commandID).

I am slightly concerned that this could happen in prod, though. We detect it and panic at insertion time. But since we're removing from this map too, it's possible for two different proposals to end up with the same ID over time. We're probably fine, since it won't happen within a short period. But.

Relying on Go's random number generator for crucial correctness properties seems incorrect by design (if a collision happens, we might ack an incorrect proposal or something). It would be better to use a UUID, a cryptographically secure hash of the command content, or (best of all) to identify commands by a (rangeID, lease epoch, LAI) tuple or similar and process them sequentially (#116020). A random ID is best used for sanity checking rather than relied upon.
