
kvserver: remove maybeMarkReplicaInitializedLockedReplLocked #94912

Open
tbg opened this issue Jan 9, 2023 · 4 comments
Labels: C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv KV Team

Comments

tbg (Member) commented Jan 9, 2023

Related to #94911.

We seem to have three paths that initialize a Replica. The first one is

```go
// loadRaftMuLockedReplicaMuLocked will load the state of the replica from disk.
// If desc is initialized, the Replica will be initialized when this method
// returns. An initialized Replica may not be reloaded. If this method is called
// with an uninitialized desc it may be called again later with an initialized
// desc.
//
// This method is called in three places:
//
// 1. newReplica - used when the store is initializing and during testing
// 2. tryGetOrCreateReplica - see newUnloadedReplica
// 3. splitPostApply - this call initializes a previously uninitialized Replica.
func (r *Replica) loadRaftMuLockedReplicaMuLocked(desc *roachpb.RangeDescriptor) error {
	ctx := r.AnnotateCtx(context.TODO())
	if r.mu.state.Desc != nil && r.IsInitialized() {
		log.Fatalf(ctx, "r%d: cannot reinitialize an initialized replica", desc.RangeID)
	} else if r.replicaID == 0 {
		// NB: This is just a defensive check as r.mu.replicaID should never be 0.
		log.Fatalf(ctx, "r%d: cannot initialize replica without a replicaID", desc.RangeID)
	}
	if desc.IsInitialized() {
		r.setStartKeyLocked(desc.StartKey)
	}
	// Clear the internal raft group in case we're being reset. Since we're
	// reloading the raft state below, it isn't safe to use the existing raft
	// group.
	r.mu.internalRaftGroup = nil
	var err error
	if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
		return err
	}
	r.mu.lastIndex, err = r.mu.stateLoader.LoadLastIndex(ctx, r.Engine())
	if err != nil {
		return err
	}
	r.mu.lastTerm = invalidLastTerm
	// Ensure that we're not trying to load a replica with a different ID than
	// was used to construct this Replica.
	replicaID := r.replicaID
	if replicaDesc, found := r.mu.state.Desc.GetReplicaDescriptor(r.StoreID()); found {
		replicaID = replicaDesc.ReplicaID
	} else if desc.IsInitialized() {
		log.Fatalf(ctx, "r%d: cannot initialize replica which is not in descriptor %v", desc.RangeID, desc)
	}
	if r.replicaID != replicaID {
		log.Fatalf(ctx, "attempting to initialize a replica which has ID %d with ID %d",
			r.replicaID, replicaID)
	}
	r.setDescLockedRaftMuLocked(ctx, desc)
	// Only do this if there was a previous lease. This shouldn't be important
	// to do but consider that the first lease which is obtained is back-dated
	// to a zero start timestamp (and this de-flakes some tests). If we set the
	// min proposed TS here, this lease could not be renewed (by the semantics
	// of minLeaseProposedTS); and since minLeaseProposedTS is copied on splits,
	// this problem would multiply to a number of replicas at cluster bootstrap.
	// Instead, we make the first lease special (which is OK) and the problem
	// disappears.
	if r.mu.state.Lease.Sequence > 0 {
		r.mu.minLeaseProposedTS = r.Clock().NowAsClockTimestamp()
	}
	ssBase := r.Engine().GetAuxiliaryDir()
	if r.raftMu.sideloaded, err = newDiskSideloadStorage(
		r.store.cfg.Settings,
		desc.RangeID,
		replicaID,
		ssBase,
		r.store.limiters.BulkIOWriteRate,
		r.store.engine,
	); err != nil {
		return errors.Wrap(err, "while initializing sideloaded storage")
	}
	r.assertStateRaftMuLockedReplicaMuRLocked(ctx, r.store.Engine())
	r.sideTransportClosedTimestamp.init(r.store.cfg.ClosedTimestampReceiver, desc.RangeID)
	return nil
}
```

and, as explained in its comment, this is the path that transitions most uninitialized replicas to initialized.

However, there is also

```go
// maybeMarkReplicaInitializedLocked should be called whenever a previously
// uninitialized replica has become initialized so that the store can update its
// internal bookkeeping. It requires that Store.mu and Replica.raftMu
// are locked.
func (s *Store) maybeMarkReplicaInitializedLockedReplLocked(
	ctx context.Context, lockedRepl *Replica,
) error {
	desc := lockedRepl.descRLocked()
	if !desc.IsInitialized() {
		return errors.Errorf("attempted to process uninitialized range %s", desc)
	}
	rangeID := lockedRepl.RangeID
	if _, ok := s.mu.uninitReplicas[rangeID]; !ok {
		// Do nothing if the range has already been initialized.
		return nil
	}
	delete(s.mu.uninitReplicas, rangeID)
	if it := s.getOverlappingKeyRangeLocked(desc); it.item != nil {
		return errors.AssertionFailedf("%s: cannot initialize replica; %s has overlapping range %s",
			s, desc, it.Desc())
	}
	// Copy of the start key needs to be set before inserting into replicasByKey.
	lockedRepl.setStartKeyLocked(desc.StartKey)
	if it := s.mu.replicasByKey.ReplaceOrInsertReplica(ctx, lockedRepl); it.item != nil {
		return errors.AssertionFailedf("range for key %v already exists in replicasByKey btree: %+v",
			it.item.key(), it)
	}
	// Unquiesce the replica. We don't allow uninitialized replicas to unquiesce,
	// but now that the replica has been initialized, we unquiesce it as soon as
	// possible. This replica was initialized in response to the reception of a
	// snapshot from another replica. This means that the other replica is not
	// quiesced, so we don't need to campaign or wake the leader. We just want
	// to start ticking.
	//
	// NOTE: The fact that this replica is being initialized in response to the
	// receipt of a snapshot means that its r.mu.internalRaftGroup must not be
	// nil.
	//
	// NOTE: Unquiescing the replica here is not strictly necessary. As of the
	// time of writing, this function is only ever called below handleRaftReady,
	// which will always unquiesce any eligible replicas before completing. So in
	// marking this replica as initialized, we have made it eligible to unquiesce.
	// However, there is still a benefit to unquiescing here instead of letting
	// handleRaftReady do it for us. The benefit is that handleRaftReady cannot
	// make assumptions about the state of the other replicas in the range when it
	// unquiesces a replica, so when it does so, it also instructs the replica to
	// campaign and to wake the leader (see maybeUnquiesceAndWakeLeaderLocked).
	// We have more information here (see "This means that the other replica ..."
	// above) and can make assumptions about the state of the other replicas in
	// the range, so we can unquiesce without campaigning or waking the leader.
	if !lockedRepl.maybeUnquiesceWithOptionsLocked(false /* campaignOnWake */) {
		return errors.AssertionFailedf("expected replica %s to unquiesce after initialization", desc)
	}
	// Add the range to metrics and maybe gossip on capacity change.
	s.metrics.ReplicaCount.Inc(1)
	s.maybeGossipOnCapacityChange(ctx, rangeAddEvent)
	return nil
}
```

which has as its sole caller the applySnapshot path.

We should unify them by replacing the latter use with the former.
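
As a rough illustration of the proposed shape, the snapshot path would call the shared replica-side initializer and then a slimmed-down store-side bookkeeping step. This is a minimal sketch with stand-in types: apart from `loadRaftMuLockedReplicaMuLocked`, the names are hypothetical and not the actual kvserver API.

```go
// Hypothetical sketch only: stand-in types, not the real kvserver code.
package sketch

import "context"

type rangeDescriptor struct{ RangeID int64 }

type replica struct{ desc *rangeDescriptor }

type store struct{ uninitReplicas map[int64]*replica }

// loadRaftMuLockedReplicaMuLocked stands in for the real replica-side
// initializer quoted above: the single place that flips a Replica from
// uninitialized to initialized.
func (r *replica) loadRaftMuLockedReplicaMuLocked(desc *rangeDescriptor) error {
	r.desc = desc
	return nil
}

// markReplicaInitialized is a hypothetical, slimmed-down replacement for
// maybeMarkReplicaInitializedLockedReplLocked that only does Store-side
// bookkeeping (uninitReplicas, replicasByKey, metrics) and no longer touches
// Replica state.
func (s *store) markReplicaInitialized(ctx context.Context, r *replica) error {
	delete(s.uninitReplicas, r.desc.RangeID)
	return nil
}

// applySnapshot sketches the proposed ordering: reuse the split path's
// initializer, then perform only the Store bookkeeping.
func (s *store) applySnapshot(ctx context.Context, r *replica, desc *rangeDescriptor) error {
	if err := r.loadRaftMuLockedReplicaMuLocked(desc); err != nil {
		return err
	}
	return s.markReplicaInitialized(ctx, r)
}
```

The point of the sketch is that the uninitialized-to-initialized transition of the Replica happens in exactly one place, and the store-side function no longer duplicates any of it.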

Jira issue: CRDB-23225
Epic: CRDB-220

tbg added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv-replication labels Jan 9, 2023
blathers-crl bot commented Jan 9, 2023

cc @cockroachdb/replication

tbg (Member, Author) commented Jan 9, 2023

Only the second one even removes the uninitialized replica from store.mu.uninitReplicas. I wondered why we weren't leaking entries in that map on splits (which don't call that method), but it turns out we sort of randomly remove them here (and move the right-hand side replica over to the initialized map a few lines down):

```go
s.mu.Lock()
defer s.mu.Unlock()
if exRng, ok := s.mu.uninitReplicas[rightDesc.RangeID]; rightReplOrNil != nil && ok {
	// If we have an uninitialized replica of the new range we require pointer
	// equivalence with rightRepl. See Store.splitTriggerPostApply().
	if exRng != rightReplOrNil {
		log.Fatalf(ctx, "found unexpected uninitialized replica: %s vs %s", exRng, rightReplOrNil)
	}
	// NB: We only remove from uninitReplicas here so that we don't leave open a
	// window where a replica is temporarily not present in Store.mu.replicas.
	delete(s.mu.uninitReplicas, rightDesc.RangeID)
}
```

There are probably reasons for that. We have already called loadRaftMuLocked[...] on the right-hand side at that point; we did that earlier in splitPostApply:

```go
// Finish initialization of the RHS.
err := rightRepl.loadRaftMuLockedReplicaMuLocked(&split.RightDesc)
if err != nil {
	log.Fatalf(ctx, "%v", err)
}
```

Also, it seems sketchy how we update the span of the LHS replica and that of the RHS replica in separate s.mu critical sections. For example, say we split r1 [a,c) into r1 [a,b) and r2 [b,c). Then in the code above this paragraph, we've already updated r2 to have span [b,c), but r1 still owns [a,c). We get away with it because at that point the store still believes that rhsRepl is an uninitialized replica, so it isn't yet inserted into the data structures that need the key -> replica map to be injective. But this seems really messy.

```go
s.mu.Lock()
defer s.mu.Unlock()
if exRng, ok := s.mu.uninitReplicas[rightDesc.RangeID]; rightReplOrNil != nil && ok {
	// If we have an uninitialized replica of the new range we require pointer
	// equivalence with rightRepl. See Store.splitTriggerPostApply().
	if exRng != rightReplOrNil {
		log.Fatalf(ctx, "found unexpected uninitialized replica: %s vs %s", exRng, rightReplOrNil)
	}
	// NB: We only remove from uninitReplicas here so that we don't leave open a
	// window where a replica is temporarily not present in Store.mu.replicas.
	delete(s.mu.uninitReplicas, rightDesc.RangeID)
}
leftRepl.setDescRaftMuLocked(ctx, newLeftDesc)
```
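
To make the concern above concrete, here is a toy sketch (not the kvserver code; all types are invented) of what keeping the key space consistent within a single Store.mu critical section could look like:

```go
// Toy illustration only (not the kvserver API): updating both descriptors
// under one Store.mu critical section so that the key space covered by
// initialized replicas never overlaps, even transiently.
package sketch

import "sync"

type span struct{ start, end string }

type toyReplica struct {
	id   int64
	span span
}

type toyStore struct {
	mu struct {
		sync.Mutex
		replicasByKey map[string]*toyReplica // start key -> initialized replica
	}
}

func newToyStore() *toyStore {
	s := &toyStore{}
	s.mu.replicasByKey = map[string]*toyReplica{}
	return s
}

// applySplit shrinks the LHS (e.g. r1 [a,c) -> [a,b)) and registers the RHS
// (r2 [b,c)) while holding Store.mu once, so no reader can observe r1 still
// owning [a,c) while r2 already claims [b,c).
func (s *toyStore) applySplit(lhs, rhs *toyReplica, newLHS, newRHS span) {
	s.mu.Lock()
	defer s.mu.Unlock()
	lhs.span = newLHS
	rhs.span = newRHS
	s.mu.replicasByKey[rhs.span.start] = rhs
}
```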

tbg (Member, Author) commented Jan 9, 2023

cc @pavelkalinnikov

pav-kv (Collaborator) commented Jan 11, 2023

There are two main aspects to initializing a Replica:

  1. Initializing the Replica itself :)
  2. Updating the Store state correspondingly.

Also, the two should probably be done with some synchronization.

The two code paths linked in the issue description reside in Replica and Store respectively, so this split makes some sense and we could take it as a basis. We could eventually arrive at one function if there is too much commonality in the orchestration / "glue" between these two parts across the different workflows (split or snapshot application).

Seemingly, it is the "glue" between these two things that is currently ad hoc / scattered all over the place. For example, the split trigger application path does (1) through a "good" function, but not so well on (2) - it's hidden in the split trigger code. The snapshot application does the opposite: for (2) it uses the "maybe mark" func, but does (1) in applySnapshot instead of using a shared helper.
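
One possible shape for that shared glue, purely as a sketch (the interfaces and names below are invented, and the real code would also have to deal with lock ordering between Store.mu, raftMu, and Replica.mu):

```go
// Sketch of the "glue" described above; all names are assumptions for
// illustration, not the real kvserver code.
package sketch

import "context"

type rangeDescriptor struct{ RangeID int64 }

// replicaInitializer is aspect (1): initializing the Replica itself.
type replicaInitializer interface {
	initFromDescriptor(ctx context.Context, desc *rangeDescriptor) error
}

// storeBookkeeper is aspect (2): updating the Store's in-memory state.
type storeBookkeeper interface {
	registerInitializedReplica(ctx context.Context, r replicaInitializer) error
}

// initializeReplica is the shared glue: both the split-trigger path and the
// snapshot-application path would call it (holding whatever locks the real
// implementation requires), so that (1) and (2) always happen together and in
// the same order.
func initializeReplica(
	ctx context.Context, s storeBookkeeper, r replicaInitializer, desc *rangeDescriptor,
) error {
	if err := r.initFromDescriptor(ctx, desc); err != nil {
		return err
	}
	return s.registerInitializedReplica(ctx, r)
}
```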

craig bot pushed a commit that referenced this issue Jan 16, 2023
94943: kvserver: don't load uninitialized replicas r=tbg a=pavelkalinnikov

This change stops using `loadRaftMuLockedReplicaMuLocked` when creating an uninitialized replica, to make it clearer what happens in this case. It turns out that what it did for an uninitialized desc is equivalent to setting an empty `ReplicaState` and asserting that the in-memory and on-disk state match.

Informs #94912, #93898
Epic: CRDB-220
Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Feb 9, 2023
96031: sql: add mixed version test for system.role_members user ids upgrade r=rafiss a=andyyang890

This patch adds a mixed version logictest that ensures that GRANT ROLE
continues to work properly in a cluster with both 22.2 and 23.1 nodes
(i.e. nodes that have run the system.role_members user ids upgrade).

Part of #92342

Release note: None

96127: kvserver: introduce cpu rebalancing r=nvanbenschoten a=kvoli

This patch allows the store rebalancer to use CPU in place of QPS when
balancing load on a cluster. This patch adds `cpu` as an option with the
cluster setting:

`kv.allocator.load_based_rebalancing.objective`

When set to `cpu` rather than `qps`, the store rebalancer performs a mostly
identical function; however, it targets balancing the sum of all replicas'
CPU time on each store rather than QPS. The default remains `qps`.

Similar to QPS, a rebalance threshold can be set to control how far above or
below the mean a store's CPU must be before it is considered imbalanced,
either overfull or underfull respectively:

`kv.allocator.cpu_rebalance_threshold`: 0.1

To handle mixed-version clusters during upgrade, as well as architectures
that do not support the CPU sampling method, a rebalance objective manager
is introduced in `rebalance_objective.go`. The manager mediates access to
the rebalance objective and overrides it in cases where the objective set
in the cluster setting cannot be supported.
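
A minimal sketch of the kind of manager described above (type, field, and function names are invented for illustration; only the cluster setting and the `cpu`/`qps` objectives come from this description):

```go
// Illustrative sketch only; names other than the cluster setting and the
// cpu/qps objectives are invented.
package sketch

import "sync"

type loadObjective int

const (
	objectiveQPS loadObjective = iota // the default
	objectiveCPU
)

// rebalanceObjectiveManager mediates access to the effective rebalance
// objective. The objective requested via the cluster setting
// kv.allocator.load_based_rebalancing.objective is overridden when it cannot
// be supported, e.g. during a mixed-version upgrade or on an architecture
// without CPU sampling support.
type rebalanceObjectiveManager struct {
	mu                   sync.Mutex
	requested            loadObjective // from the cluster setting
	cpuSamplingSupported bool
	allNodesUpgraded     bool
}

// Objective returns the objective the store rebalancer should actually use.
func (m *rebalanceObjectiveManager) Objective() loadObjective {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.requested == objectiveCPU && (!m.cpuSamplingSupported || !m.allNodesUpgraded) {
		// Fall back to QPS when CPU balancing cannot be supported.
		return objectiveQPS
	}
	return m.requested
}
```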

The results when using CPU in comparison to QPS can be found [here](https://docs.google.com/document/d/1QLhD20BTamjj3-dSG9F1gW7XMBy9miGPpJpmu2Dn3yo/edit#) (internal).


<details>
<summary>Results Summary</summary>

![image](https://user-images.githubusercontent.com/39606633/215580650-b12ff509-5cf5-4ffa-880d-8387e2ef0afa.png)

![image](https://user-images.githubusercontent.com/39606633/215580626-3d748ba1-e9a4-4abb-8acd-2c319203932e.png)

![image](https://user-images.githubusercontent.com/39606633/215580585-58e6000d-b6cf-430a-b4b7-d14a77eab3bd.png)

</details>


<details>
<summary>Detailed Allocbench Results</summary>

```
kv/r=0/access=skew
master
    median cost(gb):05.81 cpu(%):14.97 write(%):37.83
    stddev cost(gb):01.87 cpu(%):03.98 write(%):07.01
cpu rebalancing
    median cost(gb):08.76 cpu(%):14.42 write(%):36.61
    stddev cost(gb):02.66 cpu(%):01.85 write(%):04.80
kv/r=0/ops=skew
master
    median cost(gb):06.23 cpu(%):26.05 write(%):57.33
    stddev cost(gb):02.92 cpu(%):05.83 write(%):08.20
cpu rebalancing
    median cost(gb):04.28 cpu(%):11.45 write(%):31.28
    stddev cost(gb):02.25 cpu(%):02.51 write(%):06.68
kv/r=50/ops=skew
master
    median cost(gb):04.36 cpu(%):22.84 write(%):48.09
    stddev cost(gb):01.12 cpu(%):02.71 write(%):05.51
cpu rebalancing
    median cost(gb):04.64 cpu(%):13.49 write(%):43.05
    stddev cost(gb):01.07 cpu(%):01.26 write(%):08.58
kv/r=95/access=skew
master
    median cost(gb):00.00 cpu(%):09.51 write(%):01.24
    stddev cost(gb):00.00 cpu(%):01.74 write(%):00.27
cpu rebalancing
    median cost(gb):00.00 cpu(%):05.66 write(%):01.31
    stddev cost(gb):00.00 cpu(%):01.56 write(%):00.26
kv/r=95/ops=skew
master
    median cost(gb):0.00 cpu(%):47.29 write(%):00.93
    stddev cost(gb):0.09 cpu(%):04.30 write(%):00.17
cpu rebalancing
    median cost(gb):0.00 cpu(%):08.16 write(%):01.30
    stddev cost(gb):0.01 cpu(%):04.59 write(%):00.20
```


</details>

resolves: #95380

Release note (ops change): Added an option to balance CPU time (`cpu`)
instead of queries per second (`qps`) among stores in a cluster. This is
done by setting `kv.allocator.load_based_rebalancing.objective='cpu'`.
`kv.allocator.cpu_rebalance_threshold` is also added, similar to
`kv.allocator.qps_rebalance_threshold`, to control the target range for
store CPU above and below the cluster mean.

96440: ui: add execution insights to statement and transaction fingerprint details r=ericharmeling a=ericharmeling

This commit adds execution insights to the Statement Fingerprint and Transaction Fingerprint Details pages.

Part of #83780.

Loom: https://www.loom.com/share/98d2023b672e43fa8016829aa641a829

Note that the SQL queries against the `*_execution_insights` tables are updated to use `SELECT DISTINCT ON (*_fingerprint_id, problems, causes)` (equivalent to `GROUP BY (*_fingerprint_id, problems, causes)`) over the latest results in the tables, rather than `row_number() OVER ( PARTITION BY stmt_fingerprint_id, problem, causes ORDER BY end_time DESC ) AS rank... WHERE rank = 1`. Both patterns return the same result, but one uses aggregation and the other uses a window function. I find the `DISTINCT ON/GROUP BY` pattern easier to understand, I'm not seeing much difference in planning/execution time between the two over the same set of data, and `DISTINCT ON/GROUP BY` comes up as more performant in almost all the secondary sources I've encountered.

Release note (ui change): Added execution insights to the Statement Fingerprint Details and Transaction Fingerprint Details Pages.

96828: collatedstring: support default, C, and POSIX in expressions r=otan a=rafiss

fixes #50734
fixes #95667
informs #57255

---
### collatedstring: create new package 

Move the small amount of code from tree/collatedstring.go

---

### collatedstring: support C and POSIX in expressions

Release note (sql change): Expressions of the form `COLLATE "default"`,
`COLLATE "C"`, and `COLLATE "POSIX"` are now supported. Since the
default collation cannot be changed currently, these expressions are all
equivalent. The expressions are evaluated by treating the input as a
normal string, and ignoring the collation.

This means that comparisons between strings and collated strings that
use "default", "C", or "POSIX" are now supported.

Creating a column with the "C" or "POSIX" collations is still not
supported.

96870: kvserver: use replicasByKey addition func in snapshot path r=tbg a=pavelkalinnikov

This commit makes one step towards better code sharing between the `Replica` initialization paths: split trigger and snapshot application. It makes both use the same method to check and insert the initialized `Replica` into the `replicasByKey` map.

Touches #94912
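
A toy sketch of the kind of shared insertion step this describes (invented types; only the `replicasByKey` name and the overlap check come from the code quoted earlier in this issue):

```go
// Toy sketch; not the real kvserver implementation.
package sketch

import "fmt"

type keySpan struct{ start, end string }

// overlaps reports whether two half-open spans [start, end) intersect.
func (a keySpan) overlaps(b keySpan) bool {
	return a.start < b.end && b.start < a.end
}

type initializedReplica struct{ span keySpan }

type byKeyStore struct {
	replicasByKey []*initializedReplica // stand-in for the replicasByKey btree
}

// insertInitializedReplica is the shared "check for overlap, then insert" step
// that both the split-trigger and snapshot-application paths would use.
func (s *byKeyStore) insertInitializedReplica(r *initializedReplica) error {
	for _, other := range s.replicasByKey {
		if other.span.overlaps(r.span) {
			return fmt.Errorf("cannot insert %v: overlaps existing replica %v", r.span, other.span)
		}
	}
	s.replicasByKey = append(s.replicasByKey, r)
	return nil
}
```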

96874: roachtest: run scheduled backup only on clusters with enterprise license r=stevendanna a=msbutler

Epic: none

Release note: None

96883: go.mod: bump Pebble to 829675f94811 r=RaduBerinde a=RaduBerinde

829675f9 db: fix ObsoleteSize stat
2f086b74 db: refactor compaction splitting to reduce key comparisons

Release note: None
Epic: none

Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
craig bot pushed a commit that referenced this issue Feb 16, 2023
96943: kvserver: refactor Replica and Store updates on replica init r=tbg a=pavelkalinnikov

This PR further increases code sharing between the two replica initialization paths: split trigger and snapshot application. It also improves readability of these paths.

Touches #94912

Co-authored-by: Pavel Kalinnikov <[email protected]>
exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
github-project-automation bot moved this to Incoming in KV Aug 28, 2024