kv: readOnlyCmdMu not acquired by executeWriteBatch #46329

Closed
nvanbenschoten opened this issue Mar 19, 2020 · 2 comments · Fixed by #64471
Labels
A-kv Anything in KV that doesn't belong in a more specific category.
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently.

Comments

@nvanbenschoten (Member) commented:

#46234 noticed that unlike on the read-path (executeReadOnlyBatch), we don't synchronize with r.readOnlyCmdMu here. Is that ok? What if the replica is destroyed concurrently with a write? We won't be able to successfully propose as the lease will presumably have changed, but what if we hit an error during evaluation (e.g. a ConditionFailedError)?
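
To make the asymmetry concrete, here is a minimal, self-contained Go sketch of the interlock the read path takes and the write path skips. The types and field names below (replica, destroyed, and the two execute functions) are hypothetical stand-ins for illustration, not the actual kvserver code:

// Hedged sketch: a stand-alone illustration of the pattern under discussion,
// not CockroachDB's actual types. The real code lives in pkg/kv/kvserver.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type replica struct {
	readOnlyCmdMu sync.RWMutex // held by reads; destruction takes the write lock
	mu            struct {
		sync.Mutex
		destroyed error // non-nil once the replica has been destroyed
	}
}

// executeReadOnlyBatch holds readOnlyCmdMu for the duration of evaluation, so
// a concurrent destroy (which takes the write lock) cannot remove the data out
// from under the read.
func (r *replica) executeReadOnlyBatch() error {
	r.readOnlyCmdMu.RLock()
	defer r.readOnlyCmdMu.RUnlock()
	r.mu.Lock()
	destroyed := r.mu.destroyed
	r.mu.Unlock()
	if destroyed != nil {
		return destroyed
	}
	return nil // evaluate the read against local state
}

// executeWriteBatch, per this issue, evaluates without any such interlock: if
// evaluation itself returns an error (e.g. a ConditionFailedError), nothing
// prevents it from having read state that a concurrent destruction already
// deleted.
func (r *replica) executeWriteBatch() error {
	return nil // evaluate and propose, with no readOnlyCmdMu synchronization
}

func main() {
	r := &replica{}
	r.mu.destroyed = errors.New("replica destroyed")
	fmt.Println(r.executeReadOnlyBatch(), r.executeWriteBatch())
}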

@tbg:

I don't have a good answer here. If everything went through Raft, I think this code:

for _, p := range r.mu.proposals {
    r.cleanupFailedProposalLocked(p)
    // NB: each proposal needs its own version of the error (i.e. don't try to
    // share the error across proposals).
    p.finishApplication(ctx, proposalResult{
        Err: roachpb.NewError(roachpb.NewAmbiguousResultError("removing replica")),
    })
}

would do it (assuming we check the destroy status under the same lock as we put new proposals in the map), but the failed CPut would skip the read. I think there's probably at least one bug here.

It's also striking how destroying the replica avoids the latch manager. You would assume that destroying the replica is something like

  1. set destroy status
  2. cancel all open proposals
  3. grab an "everything r/w" latch
  4. delete things

but this isn't how it works.

I look forward to when we unify the rw and ro paths, at which point we'll be forced to confront these head on.

@nvanbenschoten:

It's also striking how destroying the replica avoids the latch manager.

Yeah, I was wondering about this too. It would be very nice to flush out all requests using the latch manager, but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches. Maybe that's just FUD.

@tbg:

but I fear we might have some dependencies where a request can't make progress in raft until a replica is destroyed but the replica can't be destroyed (if it grabs latches) until the request releases its latches.

I think that's a legitimate concern, but we just have to handle it. Roughly, step 1 above sets a destroy status that also prevents handleRaftReady from carrying out any more cycles. Step 2 cancels all proposals, including the raft-owned ones (which now makes sense since we know the state machine won't transition any more). This starts looking a lot like the current code, but by additionally grabbing the latches before deleting we serialize cleanly with in-flight reads, which addresses exactly the open questions we have here.
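
A rough, self-contained sketch of that teardown ordering, with a plain RWMutex standing in for an "everything r/w" latch; all names are illustrative, not the real kvserver or latch-manager API:

// Hedged sketch of the destruction ordering proposed above.
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

type proposal struct{ done chan error }

type replica struct {
	latches sync.RWMutex // stand-in for an "everything r/w" latch
	mu      struct {
		sync.Mutex
		destroyed error
		proposals map[int]*proposal
	}
}

func (r *replica) destroy(ctx context.Context) {
	// 1. Set the destroy status. New proposals (and further handleRaftReady
	//    cycles) check this under r.mu and bail out.
	r.mu.Lock()
	r.mu.destroyed = errors.New("removing replica")
	// 2. Cancel all open proposals, including raft-owned ones; the state
	//    machine will not transition any more.
	for id, p := range r.mu.proposals {
		p.done <- r.mu.destroyed
		delete(r.mu.proposals, id)
	}
	r.mu.Unlock()
	// 3. Grab the "everything" latch so in-flight reads drain before we touch
	//    on-disk state.
	r.latches.Lock()
	defer r.latches.Unlock()
	// 4. Only now delete the replica's data.
	fmt.Println("deleting replica data")
}

func main() {
	r := &replica{}
	r.mu.proposals = map[int]*proposal{1: {done: make(chan error, 1)}}
	r.destroy(context.Background())
}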

@nvanbenschoten nvanbenschoten added C-investigation Further steps needed to qualify. C-label will change. A-kv Anything in KV that doesn't belong in a more specific category. labels Mar 19, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 13, 2020
This PR cleans up some handling around replica destruction that scared me
when working on cockroachdb#46329 and cockroachdb#55293. Specifically, there were cases during
merges where the destroy status on a replica would not be set until after
that replica's data was destroyed. This was true of merges applied through
entry application, although in those cases, no user data was removed so it
seems mostly harmless. More concerning is that it was true of merges applied
through snapshot application. This was extra concerning because snapshot
application can delete user data if a subsumed range is only partially
covered (see `clearSubsumedReplicaDiskData`). So we were hypothetically
risking user-visible correctness problems, especially now that we allow
follower reads through on subsumed ranges as of cockroachdb#51594.

This PR patches up these issues and then adds a few extra assertions that
enforce stricter preconditions for the functions at play during replica
destruction. Specifically, we now assert that the destroy status is set to
`destroyReasonRemoved` _before_ calling `preDestroyRaftMuLocked`. We also
assert that if `RemoveOptions.DestroyData` is false when passed to
`removeInitializedReplicaRaftMuLocked`, then the destroy status is also set.

This unification allows us to remove the `ignoreDestroyStatus` flag on
`RemoveOptions`, whose meaning is now exactly the inverse of `DestroyData`.
In hindsight, pushing on why this extra flag was needed and what new
information it was conveying to the function could have potentially caught
these issues a little earlier.

I think we'll want to backport this to v20.2, though probably in the first
patch release because there is some risk here (especially without sufficient
time to bake on master) and we aren't aware of seeing any correctness issues
from the combination of the bug fixed here and cockroachdb#51594.

Release note (bug fix): A hypothesized bug that could allow a follower
read to miss data on a range in the middle of being merged away into its
left-hand neighbor was fixed. The bug seems exceedingly unlikely to have
materialized in practice.
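
For illustration, a minimal sketch of the kind of precondition assertion the commit message describes; the types and the preDestroy helper below are simplified, hypothetical stand-ins, not the actual preDestroyRaftMuLocked signature:

// Hedged sketch: insist that the destroy status is set before any data is
// cleared, so the destroy status can never lag behind the data removal.
package main

import "log"

type destroyReason int

const (
	destroyReasonAlive destroyReason = iota
	destroyReasonRemoved
)

type replica struct {
	destroyStatus destroyReason
}

// preDestroy is a stand-in for preDestroyRaftMuLocked: the caller must have
// already marked the replica as removed before the data is destroyed.
func (r *replica) preDestroy() {
	if r.destroyStatus != destroyReasonRemoved {
		log.Fatalf("destroy status not set to removed before destroying data: %v", r.destroyStatus)
	}
	// ... clear range data ...
}

func main() {
	r := &replica{destroyStatus: destroyReasonRemoved}
	r.preDestroy()
}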
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 14, 2020
craig bot pushed a commit that referenced this issue Oct 15, 2020
55477: kv: set destroy status before destroying data on subsumed replicas r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 12, 2020
@erikgrinaker erikgrinaker self-assigned this Apr 28, 2021
@erikgrinaker (Contributor) commented:

Confirmed. I adapted the test case from #64324 to do a conditional put on an existing value, supplying the current value of the target key as the expected value, and it failed because the request read nil instead.
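
To illustrate the failure mode, a self-contained sketch in which a conditional put is evaluated against storage that has already been cleared out from under it; a toy map stands in for the real engine, and conditionFailed is a simplified stand-in for ConditionFailedError:

// Hedged sketch of the observed failure: the caller supplies the correct
// expected value, yet the condition fails because the evaluation reads nil.
package main

import (
	"bytes"
	"fmt"
)

// conditionFailed mimics the shape of a ConditionFailedError: it carries the
// value the evaluation actually read at the key.
type conditionFailed struct{ actual []byte }

func (e *conditionFailed) Error() string {
	return fmt.Sprintf("unexpected value: %q", e.actual)
}

// evalConditionalPut writes value at key only if the current value equals
// expected.
func evalConditionalPut(store map[string][]byte, key string, value, expected []byte) error {
	if actual := store[key]; !bytes.Equal(actual, expected) {
		return &conditionFailed{actual: actual}
	}
	store[key] = value
	return nil
}

func main() {
	store := map[string][]byte{"k": []byte("v1")}
	// If the replica is destroyed underneath the request, the read sees an
	// empty store even though "k" logically still holds "v1".
	delete(store, "k")
	// Fails with a nil actual value despite the correct expected value.
	fmt.Println(evalConditionalPut(store, "k", []byte("v2"), []byte("v1")))
}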

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-investigation Further steps needed to qualify. C-label will change. labels Apr 30, 2021
@knz knz added the S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. label Apr 30, 2021
@erikgrinaker erikgrinaker added S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently. and removed S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. labels May 1, 2021
@tbg tbg added the GA-blocker label May 3, 2021
blathers-crl bot commented May 3, 2021:

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.


@tbg tbg removed the GA-blocker label May 3, 2021