kvserver: document replica lifecycle #72745
Conversation
initializing it. In principle, since writes to the state machine do not need to
be made durable, it is conceivable that a split or snapshot could "unapply" due
to an ill-timed crash (though snapshots currently use SST ingestion, which the
storage engine performs durably). Similarly, entry application (which only
The split or snapshot "unapply" is something we are actively preventing at this point. So may be worth phrasing this more strongly.
I have to look into what exactly we do again. If you remember off the top of your head, mind spelling it out?
I know we're atomically ingesting the SSTs for the snapshot, so this won't unapply.
But splits? I don't remember us syncing those but we may be doing that somewhere.
This is great, thanks for writing this up! It might be useful to reference other code comments that detail the specifics of each lifecycle events.
Reviewed 1 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/store.go, line 486 at r1 (raw file):
But the descriptor is visible once the txn commits, not when the intent gets resolved, right? Assuming consistent MVCCful reads.
The MVCC read will see the provisional value, and resolve the intent, and then reread. I am assuming you mean that too.
Is another issue here that we are not trying to read the descriptor in an MVCC manner? That is, the latest descriptor should be visible to a read, even if the read is in the past, and that means if it sees a provisional descriptor and is not sure whether the txn making that change has committed or not, it has to try to resolve it. Which means a herd of unnecessary resolution attempts. While by ensuring that the provisional descriptor will be resolved by the txn commit/abort itself, everyone else can assume that the provisional value is still provisional?
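As a side note, here is a minimal toy model of the behavior being described (hypothetical code, not the actual kvserver read path; all names are made up for illustration): a consistent read that encounters a provisional descriptor has to determine the writing txn's fate, resolve the intent, and reread, which is where the herd of resolution attempts would come from if readers had to do this routinely.

package main

import "fmt"

// descriptorKey is a toy stand-in for the MVCC key holding the range
// descriptor: a committed value plus, possibly, an unresolved intent.
type descriptorKey struct {
	committed    string // latest committed descriptor
	intent       string // provisional descriptor written by an open txn ("" if none)
	txnCommitted bool   // fate of the txn that wrote the intent
}

// consistentRead models the read path sketched above: on seeing an intent it
// must resolve it (extra work in the real system) and then reread.
func consistentRead(k *descriptorKey) string {
	for k.intent != "" {
		// This is the "unnecessary resolution attempt" that every racing
		// reader pays unless the writing txn resolves its own intent at
		// commit/abort time.
		if k.txnCommitted {
			k.committed = k.intent
		}
		k.intent = ""
	}
	return k.committed
}

func main() {
	k := &descriptorKey{committed: "r1: [a, z)", intent: "r1: [a, m)", txnCommitted: true}
	fmt.Println(consistentRead(k)) // post-split descriptor, once the intent is resolved
}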
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I have to look into what exactly we do again. If you remember off the top of your head, mind spelling it out?
I know we're atomically ingesting the SSTs for the snapshot, so this won't unapply.
But splits? I don't remember us syncing those but we may be doing that somewhere.
I believe you are right that we are not syncing for splits (unless I am missing somewhere we set changeRemovesReplica=true).
So say raft entry 100 caused the split of R into R1 and R2.
- We don't sync, so the state machine state and HardState created for R2 in splitPreApply are not durable.
- R2 now has a state machine that has "applied" up to entry 10, and HardState.committed=10. For it to receive 11 and apply it, it has to update HardState.committed=11, which presumably forces a sync (though I just noticed raft.Ready.MustSync and don't understand the cases in which it can be false). So in this case R2 cannot move beyond its initial state without syncing. If the node crashes it will roll back to before the split point of R. We do have the sstable ingestion case (the non-snapshot one) which can cause R1's state machine to be inconsistent and past the split point, as we discussed in https://cockroachlabs.slack.com/archives/C02KHQMF2US/p1635859782038200?thread_ts=1635799562.036400&cid=C02KHQMF2US, but perhaps that is ok, and at least it only affects R1 (and it won't necessarily cause the RangeDescriptor for R1 to become durable).
- Say after R2 has a state machine with applied=10, it instead receives a snapshot for 20. It can ingest it, which will move R2 beyond the split point, even if the node crashes. We include the update to the HardState in the snapshot sst, so R2 should be consistent. If we crash and recover and look at the state of the engine we will see R and R2, which overlap. I suspect this won't happen since the split batch is in the same memtable, and since the HardState in the snapshot will overlap, it will cause the memtable to get flushed and become durable.
But my head hurts thinking about this! I'd really prefer if we synced when splitting -- it seems a premature optimization given all the syncs we generally need to do. Do we have thorough crash testing of all of this logic -- if not, this makes me even more keen on something like #72795 since we could test it thoroughly with restarts.
pkg/kv/kvserver/store.go, line 535 at r2 (raw file):
RaftTransport.SendAsync method. Raft uses message passing (not request-response), and outgoing messages will use a gRPC stream that differs from that used for incoming messages (which asymmetric partitions more likely).
"which asymmetric partitions more likely" is not clear to me.
This really is great. In one of my earliest conversations with @sumeerbhola, after a walk in Central Park, we discussed some of this stuff. In particular, replicas not processing commands when they weren't in the range. I had just paged this stuff in and worked to tighten the invariants ahead of the 19.2 release, and I remarked that I owed a tech note on this stuff. Sorry I never wrote it 😓, and thank you for this great contribution.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/store.go, line 455 at r1 (raw file):
Quoted 8 lines of code…
- start a transaction
- update the RangeDescriptor (for example, to reflect a split, or a change to the Replicas comprising the members of the Range)
- update the meta ranges (which form a search index used for request routing)
- commit with a roachpb.InternalCommitTrigger.
Consider giving this a read as a godoc and formatting accordingly. The dashes may look silly. What about unicode bullets for the lists? •
Alternatively, you could block quote them.
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
Say after R2 has a state machine with applied=10, it instead receives a snapshot for 20. It can ingest it, which will move R2 beyond the split point, even if the node crashes. We include the update to the HardState in the snapshot sst, so R2 should be consistent. If we crash and recover and look at the state of the engine we will see R and R2 which overlap.
This problem is avoided by not allowing the snapshot to be applied. We reject snapshots which overlap with existing ranges on the receiver before accepting the snapshot here:
cockroach/pkg/kv/kvserver/store_snapshot.go
Lines 490 to 495 in 9856ebf
// We have a key range [desc.StartKey,desc.EndKey) which we want to apply a
// snapshot for. Is there a conflicting existing placeholder or an
// overlapping range?
if err := s.checkSnapshotOverlapLocked(ctx, snapHeader); err != nil {
	return nil, err
}
I'd really prefer if we synced when splitting -- it seems a premature optimization given all the syncs we generally need to do.
Sure, there's certainly no performance reason not to.
The fact that we don't sync here doesn't come from a deliberate choice to optimize this case but rather from a legacy where no writes due to command application were synced at all. The history on not syncing application is very old (#13481). I added the ability to sync anything here much later when tightening invariants around the replica lifecycle. I convinced myself at the time that this was enough (I think the snapshot prevention was part of my thinking). More tightening, including more defensive tightening, could have occurred. If I'm being honest, it seems extremely reasonable to sync on all changes which touch the range descriptor in any way.
Do we have thorough crash testing of all of this logic
I'm sure it could be more thorough, but I do recall worrying about these cases when I was mucking around here. Hopefully the snapshot rejection helps alleviate some fears. The test we have isn't the most beautiful or maintainable, but it's something.
cockroach/pkg/kv/kvserver/client_raft_test.go
Lines 4813 to 4879 in 9856ebf
// TestProcessSplitAfterRightHandSideHasBeenRemoved tests cases where we have
// a follower replica which has received information about the RHS of a split
// before it has processed that split. The replica can't both have an
// initialized RHS and process the split but it can have (1) an uninitialized
// RHS with a higher replica ID than in the split and (2) a RHS with an unknown
// replica ID and a tombstone with a higher replica ID than in the split.
// It may learn about a newer replica ID than the split without ever hearing
// about the split replica. If it does not crash (3) it will know that the
// split replica is too old and will not initialize it. If the node does
// crash (4) it will forget it had learned about the higher replica ID and
// will initialize the RHS as the split replica.
//
// Starting in 19.2 if a replica discovers from a raft message that it is an
// old replica then it knows that it has been removed and re-added to the range.
// In this case the Replica eagerly destroys itself and its data.
//
// Given this behavior there are 4 troubling cases with regards to splits.
//
// * In all cases we begin with s1 processing a presplit snapshot for
// r20. After the split the store should have r21/3.
//
// In the first two cases the following occurs:
//
// * s1 receives a message for r21/3 prior to acquiring the split lock
// in r21. This will create an uninitialized r21/3 which may write
// HardState.
//
// * Before the r20 processes the split r21 is removed and re-added to
// s1 as r21/4. s1 receives a raft message destined for r21/4 and proceeds
// to destroy its uninitialized r21/3, laying down a tombstone at 4 in the
// process.
//
// (1) s1 processes the split and finds the RHS to be an uninitialized replica
// with a higher replica ID.
//
// (2) s1 crashes before processing the split, forgetting the replica ID of the
// RHS but retaining its tombstone.
//
// In both cases we know that the RHS could not have committed anything because
// it cannot have gotten a snapshot but we want to be sure to not synthesize a
// HardState for the RHS that contains a non-zero commit index if we know that
// the RHS will need another snapshot later.
//
// In the third and fourth cases:
//
// * s1 never receives a message for r21/3.
//
// * Before the r20 processes the split r21 is removed and re-added to
// s1 as r21/4. s1 receives a raft message destined for r21/4 and has never
// heard about r21/3.
//
// (3) s1 processes the split and finds the RHS to be an uninitialized replica
// with a higher replica ID (but without a tombstone). This case is very
// similar to (1)
//
// (4) s1 crashes still before processing the split, forgetting that it had
// known about r21/4. When it reboots r21/4 is totally partitioned and
// r20 becomes unpartitioned.
//
// * r20 processes the split successfully and initialized r21/3.
//
// In the 4th case we find that until we unpartition r21/4 (the RHS) and let it
// learn about its removal with a ReplicaTooOldError that it will be initialized
// with a CommitIndex at 10 as r21/3, the split's value. After r21/4 becomes
// unpartitioned it will learn it is removed by either catching up on its
// its log or receiving a ReplicaTooOldError which will lead to a tombstone.
//
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/store.go, line 486 at r1 (raw file):
Previously, sumeerbhola wrote…
But the descriptor is visible once the txn commits, not when the intent gets resolved, right? Assuming consistent MVCCful reads.
The MVCC read will see the provisional value, and resolve the intent, and then reread. I am assuming you mean that too.
Is another issue here that we are not trying to read the descriptor in an MVCC manner? That is, the latest descriptor should be visible to a read, even if the read is in the past, and that means if it sees a provisional descriptor and is not sure whether the txn making that change has committed or not, it has to try to resolve it. Which means a herd of unnecessary resolution attempts. While by ensuring that the provisional descriptor will be resolved by the txn commit/abort itself, everyone else can assume that the provisional value is still provisional?
We need to be clear what we are talking about here. The descriptor is "read" in a few ways:
- transactionally, by txns making changes to the range (splits, merge, replication, etc)
- by evaluation of some other commands; mostly they are trying to fix the descriptor for the time of their evaluation, but commands can in general access it if they declare the span, and they'll basically see the in-memory descriptor of the range (which corresponds to its newest visible value, i.e. skipping any intent). Here's an example:
cockroach/pkg/kv/kvserver/batcheval/cmd_revert_range.go
Lines 36 to 39 in a6e4b3e
// We look up the range descriptor key to check whether the span
// is equal to the entire range for fast stats updating.
latchSpans.AddNonMVCC(spanset.SpanReadOnly, roachpb.Span{Key: keys.RangeDescriptorKey(rs.GetStartKey())})
latchSpans.AddNonMVCC(spanset.SpanReadOnly, roachpb.Span{Key: keys.RangeGCThresholdKey(rs.GetRangeID())})
I don't see what would really go wrong here if we attached all of the triggers to ResolveIntent of the range descriptor (other than this being more challenging to implement that way). The replica would stay in the old configuration until it has resolved the intent. This isn't great, certainly we have to accept much worse lags here than we do with the other strategy, but in the "worst case" the lag is about the same (a follower may just not hear about the range desc change for days, theoretically, and miss any number of changes to it).
pkg/kv/kvserver/store.go, line 533 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Also: "unless when" reads a bit odd -- try either "unless" or "except when".
Done.
pkg/kv/kvserver/store.go, line 539 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Replica corresponds to a State Machine that has yet to apply any entries,
Done.
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
This problem is avoided by not allowing the snapshot to be applied. We reject snapshots which overlap with existing ranges on the receiver before accepting the snapshot here:
Are you sure? If I read the example correctly, Sumeer is talking about a scenario in which the range has split, then the right-hand side receives a snapshot, and we then crash and find that the left-hand side has "unshrunk" but the right-hand side is still there. He then goes on to say that this shouldn't happen because the snapshot will flush the memtable (since it contains non-durable writes for R2, which overlap the snapshot's bounds) and that this will then also flush the shrinking of R.
My mental model of the storage engine is that it will only ever "lose" a suffix of writes, i.e. when crashing after
1. Put(a)
2. Put(c)
3. Ingest([b,d))
we would see either none of the ops, only 1, only 1 and 2, or all three. It looks like Sumeer is worried that we might see 3 without also seeing 1? Note that in the actual example, 1 and 2 are the writes to the LHS and RHS handed to the storage engine in the same write batch.
Let's continue discussing over on #72865, I would like to land this PR sooner rather than later.
pkg/kv/kvserver/store.go, line 535 at r2 (raw file):
Previously, sumeerbhola wrote…
"which asymmetric partitions more likely" is not clear to me.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)
pkg/kv/kvserver/store.go, line 416 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Perhaps say separate/distinct/independent disk instead, because cloud and SANs.
Done.
pkg/kv/kvserver/store.go, line 421 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
INVARIANT: the set of all Ranges (as determined by, e.g. a transactionally
Done.
pkg/kv/kvserver/store.go, line 455 at r1 (raw file):
Previously, ajwerner wrote…
- start a transaction
- update the RangeDescriptor (for example, to reflect a split, or a change to the Replicas comprising the members of the Range)
- update the meta ranges (which form a search index used for request routing)
- commit with a roachpb.InternalCommitTrigger.
Consider giving this a read as a godoc and formatting accordingly. The dashes may look silly. What about unicode bullets for the lists? •
Alternatively, you could block quote them.
I have actually, which is why this looks so silly. I initially wrote this as
- foo bar baz ....
asdasdasd
but this actually renders the second line as a code block. If I remove the empty lines, it will render it all as a paragraph, i.e. remove the newlines completely.
I didn't quite have the courage to ignore godoc completely and just write a nice ascii comment so here we are with this trade-off.
I don't think it's really worth optimizing for godoc any more since it's time-consuming and painful and ultimately we barely ever use it as far as I can tell.
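For concreteness, here is roughly how the two options render (a hypothetical example package, not the actual store.go text): under the classic godoc rules, an indented continuation line becomes a preformatted block, while consecutive unindented lines are joined into a single paragraph.

// Package scratch is a hypothetical example of the godoc trade-off above.
//
// Option A: indent the wrapped line; godoc renders the indented line as a
// preformatted (code) block:
//
// - update the RangeDescriptor (for example, to reflect a split, or a change
//   to the Replicas comprising the members of the Range)
//
// Option B: don't indent and drop the empty lines; godoc joins consecutive
// lines into one paragraph, so the dash items all run together:
//
// - update the RangeDescriptor (for example, to reflect a split, or a change
// to the Replicas comprising the members of the Range)
// - update the meta ranges (which form a search index used for request routing)
package scratch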
Part of the research for cockroachdb#72374, which in turn was inspired by cockroachdb#38322. Release note: None
I'm going to run with Sumeer's early approval and merge this, but happy to address follow-up comments. Thanks for the discussions everyone!
bors r=sumeerbhola
Build succeeded:
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
My mental model of the storage engine is that it will only ever "lose" a suffix of writes
This is mine too. If the snapshot survives the crash, won't it imply we've sync'd the WAL that had the earlier writes? Is there something I'm missing because I don't have my separate applied state storage engine hat on?
File ingest doesn't touch the WAL. It directly mutates the LSM manifest and syncs the manifest. It does force a flush of memtables before it that overlap in the key space, but one has to be careful in reasoning about this. There seems to be at least one place where we advance to an inconsistent state https://cockroachlabs.slack.com/archives/C02KHQMF2US/p1635859782038200?thread_ts=1635799562.036400&cid=C02KHQMF2US.
Nice writeup! Apologies for the late comments — I was breathing pretty deeply this week.
Reviewed 1 of 3 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/replica_application_state_machine.go, line 473 at r3 (raw file):
// result was just cleared and this will be a no-op.
//
// TODO(tbg): can't this happen in splitPreApply which is called from
I think you're right about this. We shouldn't need this special logic. In fact, we shouldn't need maybeAcquireSplitMergeLock at all, because we're already branching on the Split and Merge fields in runPreApplyTriggersAfterStagingWriteBatch. So we can just include calls to acquireSplitLock and acquireMergeLock in the right places in that function.
That may even allow us to remove some of the assertions we have in splitPreApply and what is effectively the inlined version of mergePreApply (maybe we extract that?), which currently re-validate that the RHS range exists.
pkg/kv/kvserver/replica_application_state_machine.go, line 1271 at r3 (raw file):
// we do things where the state machine's view of the range descriptor always
// dictates the active replication config but it is much trickier to prove
// correct. See:
I didn't want to have to see that again. It hurts my brain.
pkg/kv/kvserver/replica_application_state_machine.go, line 1282 at r3 (raw file):
// index.
//
// INVARIANT: appending a config change to the log (at leader or follower)
Since we're here, do you mind reminding me how this invariant is enforced today? We don't sync the HardState when the only thing that changed is the commit index, and I don't see any special provision for the case where the commit index corresponds to a config change.
Is this invariant trivially enforced because we sync the HardState and the log atomically whenever a new log entry is appended and the leader will always send the follower as up-to-date a view of the commit index as possible with every new MsgApp?
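For reference, the relevant upstream logic is raft.MustSync; a rough paraphrase is below (a sketch from memory, not the vendored source; check the actual raft package), showing that a HardState update which only advances the commit index indeed does not require a sync on its own.

package main

import (
	"fmt"

	pb "go.etcd.io/etcd/raft/v3/raftpb" // import path depends on the vendored raft version
)

// mustSync paraphrases raft.MustSync: Raft's persistent state (term, vote,
// appended log entries) must be durable before responding, but the commit
// index can be recovered, so a commit-only HardState change does not by
// itself require a sync.
func mustSync(st, prevst pb.HardState, entsnum int) bool {
	return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}

func main() {
	prev := pb.HardState{Term: 5, Vote: 1, Commit: 10}
	next := pb.HardState{Term: 5, Vote: 1, Commit: 11} // only the commit index advanced
	fmt.Println(mustSync(next, prev, 0))               // false: no new entries, term/vote unchanged
}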
pkg/kv/kvserver/store.go, line 482 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's a weak argument. As my TODO below mentions, I thought there were stronger reasons than the ones presented here, though having to attach logic to intent resolution (the next bullet point) is probably enough to do it this way.
There are probably other arguments to be made about this simplifying splits and merges. It means that the order that intents are resolved is well-defined. For instance, it guarantees that the intents on both the LHS and RHS of a split are resolved atomically (because we don't paginate intent resolution when a commit trigger is present). It also guarantees that the intent on the LHS of a merge is resolved before the intent on the RHS of a merge. Without that, we could be left with a temporary gap in the keyspace.
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
I suspect this won't happen since the split batch is in the same memtable, and since the HardState in the snapshot will overlap, it will cause the memtable to get flushed and become durable.
But my head hurts thinking about this!
This is also what keeps me up at night. The semantics here would be easier to reason through if Pebble always flushed the memtable(s) on sst ingestion, regardless of physical overlap between the keys in the memtable and the sst bounds. It's difficult to reason through any logical-but-not-physical relationships we may have between keys, which could be ignored due to this "optimization".
I wonder if it would be worth seeing how valuable that optimization even is in practice.
pkg/kv/kvserver/store.go, line 460 at r3 (raw file):
to the Replicas comprising the members of the Range) - update the meta ranges (which form a search index used for request routing)
Do you think it would be worth saying more about the meta range copy (copies in some cases) of each range descriptor? Specifically, there's a good amount to say about their relationship with the authoritative RangeDescriptor and how they're kept in sync transactionally but resolved asynchronously and also queried non-transactionally.
pkg/kv/kvserver/store.go, line 480 at r3 (raw file):
INVARIANT: A Store never contains two Replicas from the same Range, nor do the key ranges for any two of its Replicas overlap. (Note that there is no requirement that these Replicas come from a consistent set of Ranges).
It might be worth expanding on how this is enforced even without the requirement that the replicas come from a consistent set of ranges. The range descriptor generation counters allow us to resolve overlapping key conflicts in a way that keeps this all monotonic.
pkg/kv/kvserver/store.go, line 518 at r3 (raw file):
A first phenomenon to understand is that of uninitialized Replicas. Such a Replica corresponds to a State Machine that has yet to apply any entries, and in
Does it make sense to say that an uninitialized replica "corresponds to a state machine ..."? Some of our recent discussions have helped me to understand them as replicas that are missing a replicated state machine and only contain some set of unreplicated metadata. In fact, I thought we were converging on that being the definition of an uninitialized replica.
pkg/kv/kvserver/store.go, line 531 at r3 (raw file):
relies on the concept of a Replica without state that can act as a Raft peer. - For practical reasons, we don't preserve the entire committed replicated log
Do replicas revert back to uninitialized in this case?
pkg/kv/kvserver/store.go, line 627 at r3 (raw file):
outdated; an uninitialized Replica could in principle leak "forever" if the Range quickly changes its members such that the triggers mentioned here don't apply A full scan of meta2 would be required as there is no RangeID-keyed index
s/apply A/apply. A/
pkg/kv/kvserver/store.go, line 657 at r3 (raw file):
INVARIANT: An initialized Replica's RangeDescriptor always includes it as a member. INVARIANT: A Replica's ReplicaID is constant.
As is its StartKey. We rely on this in a few places, such as associating a tenant ID with a range.
pkg/kv/kvserver/store.go, line 657 at r3 (raw file):
INVARIANT: An initialized Replica's RangeDescriptor always includes it as a member. INVARIANT: A Replica's ReplicaID is constant.
Consider mentioning some of the invariants around replica IDs on a given store. For instance, thanks to the range tombstone key, I believe we have an invariant that replica IDs for the same range but across replicas are monotonically increasing on a given store.
pkg/kv/kvserver/store.go, line 657 at r3 (raw file):
INVARIANT: An initialized Replica's RangeDescriptor always includes it as a member. INVARIANT: A Replica's ReplicaID is constant.
Consider mentioning some of the invariants around descriptor generation numbers. Do we have them? Does the replica ID invariant I just mentioned also mean that the generation of a range on a given store is monotonically increasing? And therefore, the generation of a given keyspace on a store is monotonically increasing?
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/store.go, line 482 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
There are probably other arguments to be made about this simplifying splits and merges. It means that the order that intents are resolved is well-defined. For instance, it guarantees that the intents on both the LHS and RHS of a split are resolved atomically (because we don't paginate intent resolution when a commit trigger is present). It also guarantees that the intent on the LHS of a merge is resolved before the intent on the RHS of a merge. Without that, we could be left with a temporary gap in the keyspace.
I realized a gap in my understanding of synchronous intent resolution during a split/merge:
- After the split txn commits, the code synchronously resolves intents on the same range as the txn record. That range is now split, but the narrower RangeDescriptor is provisional. Does this fact allow it to synchronously resolve both the provisional LHS and RHS descriptor? If not, is this a case where a range can have a provisional RangeDescriptor without a previously committed RangeDescriptor?
- Related to the previous I am guessing the merge txn does not resolve the RHS RangeDescriptor synchronously. This seems harmless since after the merge no one will bother reading it.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/store.go, line 482 at r1 (raw file):
Does this fact allow it to synchronously resolve both the provisional LHS and RHS descriptor?
Yes, it does. The split txn's EndTxn (with the split trigger) is committed on the pre-split range, so it is able to resolve both intents atomically with the application of the split trigger. The application of this split trigger is what finalizes the range split.
I am guessing the merge txn does not resolve the RHS RangeDescriptor synchronously. This seems harmless since after the merge no one will bother reading it.
This is correct. The merge txn does not resolve the RHS range descriptor synchronously.
This incorporates post-merge feedback on cockroachdb#72745. Release note: None
TFTR! #73792 is the follow-up PR.
Dismissed @erikgrinaker and @sumeerbhola from 8 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/replica_application_state_machine.go, line 1282 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Since we're here, do you mind reminding me how this invariant is enforced today? We don't sync the HardState when the only thing that changed is the commit index, and I don't see any special provision for the case where the commit index corresponds to a config change.
Is this invariant trivially enforced because we sync the HardState and the log atomically whenever a new log entry is appended and the leader will always send the follower as up-to-date a view of the commit index as possible with every new MsgApp?
Addressed in follow-up PR.
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I suspect this won't happen since the split batch is in the same memtable, and since the HardState in the snapshot will overlap, it will cause the memtable to get flushed and become durable.
But my head hurts thinking about this!
This is also what keeps me up at night. The semantics here would be easier to reason through if Pebble always flushed the memtable(s) on sst ingestion, regardless of physical overlap between the keys in the memtable and the sst bounds. It's difficult to reason through any logical-but-not-physical relationships we may have between keys, which could be ignored due to this "optimization".
I wonder if it would be worth seeing how valuable that optimization even is in practice.
I am going to resolve this but courtesy ping to @sumeerbhola to see if there's something that should be filed here.
pkg/kv/kvserver/store.go, line 460 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you think it would be worth saying more about the meta range copy (copies in some cases) of each range descriptor? Specifically, there's a good amount to say about their relationship with the authoritative RangeDescriptor and how they're kept in sync transactionally but resolved asynchronously and also queried non-transactionally.
Done in follow-up.
pkg/kv/kvserver/store.go, line 518 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does it make sense to say that an uninitialized replica "corresponds to a state machine ..."? Some of our recent discussions have helped me to understand them as replicas that are missing a replicated state machine and only contain some set of unreplicated metadata. In fact, I thought we were converging on that being the definition of an uninitialized replica.
We might have to discuss that some more. Uninitialized replicas conceptually are a state machine at log position zero representing the empty state. In CRDB we special-case this empty state and call it an uninitialized replica.
pkg/kv/kvserver/store.go, line 531 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do replicas revert back to uninitialized in this case?
This section wasn't super coherent, gave it a rewrite in the follow-up PR.
pkg/kv/kvserver/store.go, line 657 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider mentioning some of the invariants around descriptor generation numbers. Do we have them? Does the replica ID invariant I just mentioned also mean that the generation of a range on a given store is monotonically increasing? And therefore, the generation of a given keyspace on a store is monotonically increasing?
This all comes down to durability guarantees. We can apply a replication change (one that doesn't affect us, but it does bump the generation), crash, and unapply the entry. So we can violate that proposed invariant, and I don't think we have an incentive to uphold it. Within a running process, it should hold, but it's not entirely obvious to explain why. I also worry that by documenting too many things that are "true but don't matter", we may make it harder to grasp what we actually require (plus, since invariants that we don't need are often not adequately tested, we may say things that are plain wrong). Perhaps you have ideas for organizing this kind of second-class knowledge.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvserver/store.go, line 637 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I am going to resolve this but courtesy ping to @sumeerbhola to see if there's something that should be filed here.
I don't know if we need to file something here. We are probably accidentally correct.
- ReplicasStorage will properly ensure the invariant by syncing on the split.
- If we think that that code is not going to arrive fast enough we could add a replicaAppBatch.changeSplitsRange field and use that as an additional reason to sync the state machine.
This incorporates post-merge feedback on cockroachdb#72745. Release note: None
73792: kvserver: improve Store comment r=nvanbenschoten,sumeerbhola a=tbg
This incorporates post-merge feedback on #72745.
Release note: None
Co-authored-by: Tobias Grieger <[email protected]>
Part of the research for #72374, which in turn was inspired by #38322.
Release note: None