
kvserver: remove maybeMarkReplicaInitializedLockedReplLocked #94912

Open
tbg opened this issue Jan 9, 2023 · 4 comments
Labels: C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv KV Team

Comments

tbg (Member) commented Jan 9, 2023

Related to #94911.

We seem to have three paths that initialize a Replica. The first one is

```go
// loadRaftMuLockedReplicaMuLocked will load the state of the replica from disk.
// If desc is initialized, the Replica will be initialized when this method
// returns. An initialized Replica may not be reloaded. If this method is called
// with an uninitialized desc it may be called again later with an initialized
// desc.
//
// This method is called in three places:
//
// 1. newReplica - used when the store is initializing and during testing
// 2. tryGetOrCreateReplica - see newUnloadedReplica
// 3. splitPostApply - this call initializes a previously uninitialized Replica.
func (r *Replica) loadRaftMuLockedReplicaMuLocked(desc *roachpb.RangeDescriptor) error {
	ctx := r.AnnotateCtx(context.TODO())
	if r.mu.state.Desc != nil && r.IsInitialized() {
		log.Fatalf(ctx, "r%d: cannot reinitialize an initialized replica", desc.RangeID)
	} else if r.replicaID == 0 {
		// NB: This is just a defensive check as r.mu.replicaID should never be 0.
		log.Fatalf(ctx, "r%d: cannot initialize replica without a replicaID", desc.RangeID)
	}
	if desc.IsInitialized() {
		r.setStartKeyLocked(desc.StartKey)
	}
	// Clear the internal raft group in case we're being reset. Since we're
	// reloading the raft state below, it isn't safe to use the existing raft
	// group.
	r.mu.internalRaftGroup = nil
	var err error
	if r.mu.state, err = r.mu.stateLoader.Load(ctx, r.Engine(), desc); err != nil {
		return err
	}
	r.mu.lastIndex, err = r.mu.stateLoader.LoadLastIndex(ctx, r.Engine())
	if err != nil {
		return err
	}
	r.mu.lastTerm = invalidLastTerm
	// Ensure that we're not trying to load a replica with a different ID than
	// was used to construct this Replica.
	replicaID := r.replicaID
	if replicaDesc, found := r.mu.state.Desc.GetReplicaDescriptor(r.StoreID()); found {
		replicaID = replicaDesc.ReplicaID
	} else if desc.IsInitialized() {
		log.Fatalf(ctx, "r%d: cannot initialize replica which is not in descriptor %v", desc.RangeID, desc)
	}
	if r.replicaID != replicaID {
		log.Fatalf(ctx, "attempting to initialize a replica which has ID %d with ID %d",
			r.replicaID, replicaID)
	}
	r.setDescLockedRaftMuLocked(ctx, desc)
	// Only do this if there was a previous lease. This shouldn't be important
	// to do but consider that the first lease which is obtained is back-dated
	// to a zero start timestamp (and this de-flakes some tests). If we set the
	// min proposed TS here, this lease could not be renewed (by the semantics
	// of minLeaseProposedTS); and since minLeaseProposedTS is copied on splits,
	// this problem would multiply to a number of replicas at cluster bootstrap.
	// Instead, we make the first lease special (which is OK) and the problem
	// disappears.
	if r.mu.state.Lease.Sequence > 0 {
		r.mu.minLeaseProposedTS = r.Clock().NowAsClockTimestamp()
	}
	ssBase := r.Engine().GetAuxiliaryDir()
	if r.raftMu.sideloaded, err = newDiskSideloadStorage(
		r.store.cfg.Settings,
		desc.RangeID,
		replicaID,
		ssBase,
		r.store.limiters.BulkIOWriteRate,
		r.store.engine,
	); err != nil {
		return errors.Wrap(err, "while initializing sideloaded storage")
	}
	r.assertStateRaftMuLockedReplicaMuRLocked(ctx, r.store.Engine())
	r.sideTransportClosedTimestamp.init(r.store.cfg.ClosedTimestampReceiver, desc.RangeID)
	return nil
}
```

and, as explained in its comment, this is the path that transitions most uninitialized replicas to initialized.

However, there is also

```go
// maybeMarkReplicaInitializedLocked should be called whenever a previously
// uninitialized replica has become initialized so that the store can update its
// internal bookkeeping. It requires that Store.mu and Replica.raftMu
// are locked.
func (s *Store) maybeMarkReplicaInitializedLockedReplLocked(
	ctx context.Context, lockedRepl *Replica,
) error {
	desc := lockedRepl.descRLocked()
	if !desc.IsInitialized() {
		return errors.Errorf("attempted to process uninitialized range %s", desc)
	}
	rangeID := lockedRepl.RangeID
	if _, ok := s.mu.uninitReplicas[rangeID]; !ok {
		// Do nothing if the range has already been initialized.
		return nil
	}
	delete(s.mu.uninitReplicas, rangeID)
	if it := s.getOverlappingKeyRangeLocked(desc); it.item != nil {
		return errors.AssertionFailedf("%s: cannot initialize replica; %s has overlapping range %s",
			s, desc, it.Desc())
	}
	// Copy of the start key needs to be set before inserting into replicasByKey.
	lockedRepl.setStartKeyLocked(desc.StartKey)
	if it := s.mu.replicasByKey.ReplaceOrInsertReplica(ctx, lockedRepl); it.item != nil {
		return errors.AssertionFailedf("range for key %v already exists in replicasByKey btree: %+v",
			it.item.key(), it)
	}
	// Unquiesce the replica. We don't allow uninitialized replicas to unquiesce,
	// but now that the replica has been initialized, we unquiesce it as soon as
	// possible. This replica was initialized in response to the reception of a
	// snapshot from another replica. This means that the other replica is not
	// quiesced, so we don't need to campaign or wake the leader. We just want
	// to start ticking.
	//
	// NOTE: The fact that this replica is being initialized in response to the
	// receipt of a snapshot means that its r.mu.internalRaftGroup must not be
	// nil.
	//
	// NOTE: Unquiescing the replica here is not strictly necessary. As of the
	// time of writing, this function is only ever called below handleRaftReady,
	// which will always unquiesce any eligible replicas before completing. So in
	// marking this replica as initialized, we have made it eligible to unquiesce.
	// However, there is still a benefit to unquiescing here instead of letting
	// handleRaftReady do it for us. The benefit is that handleRaftReady cannot
	// make assumptions about the state of the other replicas in the range when it
	// unquiesces a replica, so when it does so, it also instructs the replica to
	// campaign and to wake the leader (see maybeUnquiesceAndWakeLeaderLocked).
	// We have more information here (see "This means that the other replica ..."
	// above) and can make assumptions about the state of the other replicas in
	// the range, so we can unquiesce without campaigning or waking the leader.
	if !lockedRepl.maybeUnquiesceWithOptionsLocked(false /* campaignOnWake */) {
		return errors.AssertionFailedf("expected replica %s to unquiesce after initialization", desc)
	}
	// Add the range to metrics and maybe gossip on capacity change.
	s.metrics.ReplicaCount.Inc(1)
	s.maybeGossipOnCapacityChange(ctx, rangeAddEvent)
	return nil
}
```

which has as its sole caller the applySnapshot path.

We should unify them by replacing the latter use with the former.
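
As a rough illustration of the proposed shape, the snapshot path would call the shared replica-side initializer and then a slimmed-down store-side bookkeeping step. This is a minimal sketch with stand-in types: apart from `loadRaftMuLockedReplicaMuLocked`, the names are hypothetical and not the actual kvserver API.

```go
// Hypothetical sketch only: stand-in types, not the real kvserver code.
package sketch

import "context"

type rangeDescriptor struct{ RangeID int64 }

type replica struct{ desc *rangeDescriptor }

type store struct{ uninitReplicas map[int64]*replica }

// loadRaftMuLockedReplicaMuLocked stands in for the real replica-side
// initializer quoted above: the single place that flips a Replica from
// uninitialized to initialized.
func (r *replica) loadRaftMuLockedReplicaMuLocked(desc *rangeDescriptor) error {
	r.desc = desc
	return nil
}

// markReplicaInitialized is a hypothetical, slimmed-down replacement for
// maybeMarkReplicaInitializedLockedReplLocked that only does Store-side
// bookkeeping (uninitReplicas, replicasByKey, metrics) and no longer touches
// Replica state.
func (s *store) markReplicaInitialized(ctx context.Context, r *replica) error {
	delete(s.uninitReplicas, r.desc.RangeID)
	return nil
}

// applySnapshot sketches the proposed ordering: reuse the split path's
// initializer, then perform only the Store bookkeeping.
func (s *store) applySnapshot(ctx context.Context, r *replica, desc *rangeDescriptor) error {
	if err := r.loadRaftMuLockedReplicaMuLocked(desc); err != nil {
		return err
	}
	return s.markReplicaInitialized(ctx, r)
}
```

The point of the sketch is that the uninitialized-to-initialized transition of the Replica happens in exactly one place, and the store-side function no longer duplicates any of it.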

Jira issue: CRDB-23225
Epic: CRDB-220

tbg added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv-replication labels Jan 9, 2023
blathers-crl bot commented Jan 9, 2023

cc @cockroachdb/replication

tbg (Member, Author) commented Jan 9, 2023

Only the second one even removes the uninitialized replica from store.mu.uninitReplicas. I wondered why we weren't leaking entries in that map on splits (which don't call that method), but it turns out we sort of randomly remove them here (and move the right-hand side replica over to the initialized map a few lines down):

```go
s.mu.Lock()
defer s.mu.Unlock()
if exRng, ok := s.mu.uninitReplicas[rightDesc.RangeID]; rightReplOrNil != nil && ok {
	// If we have an uninitialized replica of the new range we require pointer
	// equivalence with rightRepl. See Store.splitTriggerPostApply().
	if exRng != rightReplOrNil {
		log.Fatalf(ctx, "found unexpected uninitialized replica: %s vs %s", exRng, rightReplOrNil)
	}
	// NB: We only remove from uninitReplicas here so that we don't leave open a
	// window where a replica is temporarily not present in Store.mu.replicas.
	delete(s.mu.uninitReplicas, rightDesc.RangeID)
}
```

There are probably reasons for that. We have already called loadRaftMuLocked[...] on the right-hand side at that point; we did that earlier in splitPostApply:

```go
// Finish initialization of the RHS.
err := rightRepl.loadRaftMuLockedReplicaMuLocked(&split.RightDesc)
if err != nil {
	log.Fatalf(ctx, "%v", err)
}
```

Also, it seems sketchy how we update the span of the LHS replica and that of the RHS replica in separate s.mu critical sections. For example, say we split r1 [a,c) into r1 [a,b) and r2 [b,c). Then in the code above this paragraph, we've already updated r2 to have span [b,c), but r1 still owns [a,c). We get away with it because at that point the store still believes that rhsRepl is an uninitialized replica, so it isn't yet inserted into the data structures that need the key -> replica map to be injective. But this seems really messy.

```go
s.mu.Lock()
defer s.mu.Unlock()
if exRng, ok := s.mu.uninitReplicas[rightDesc.RangeID]; rightReplOrNil != nil && ok {
	// If we have an uninitialized replica of the new range we require pointer
	// equivalence with rightRepl. See Store.splitTriggerPostApply().
	if exRng != rightReplOrNil {
		log.Fatalf(ctx, "found unexpected uninitialized replica: %s vs %s", exRng, rightReplOrNil)
	}
	// NB: We only remove from uninitReplicas here so that we don't leave open a
	// window where a replica is temporarily not present in Store.mu.replicas.
	delete(s.mu.uninitReplicas, rightDesc.RangeID)
}
leftRepl.setDescRaftMuLocked(ctx, newLeftDesc)
```
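
To make the concern above concrete, here is a toy sketch (not the kvserver code; all types are invented) of what keeping the key space consistent within a single Store.mu critical section could look like:

```go
// Toy illustration only (not the kvserver API): updating both descriptors
// under one Store.mu critical section so that the key space covered by
// initialized replicas never overlaps, even transiently.
package sketch

import "sync"

type span struct{ start, end string }

type toyReplica struct {
	id   int64
	span span
}

type toyStore struct {
	mu struct {
		sync.Mutex
		replicasByKey map[string]*toyReplica // start key -> initialized replica
	}
}

func newToyStore() *toyStore {
	s := &toyStore{}
	s.mu.replicasByKey = map[string]*toyReplica{}
	return s
}

// applySplit shrinks the LHS (e.g. r1 [a,c) -> [a,b)) and registers the RHS
// (r2 [b,c)) while holding Store.mu once, so no reader can observe r1 still
// owning [a,c) while r2 already claims [b,c).
func (s *toyStore) applySplit(lhs, rhs *toyReplica, newLHS, newRHS span) {
	s.mu.Lock()
	defer s.mu.Unlock()
	lhs.span = newLHS
	rhs.span = newRHS
	s.mu.replicasByKey[rhs.span.start] = rhs
}
```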

tbg (Member, Author) commented Jan 9, 2023

cc @pavelkalinnikov

pav-kv (Collaborator) commented Jan 11, 2023

There are two main aspects to initializing a Replica:

  1. Initializing the Replica itself :)
  2. Updating the Store state correspondingly.

Also, the two should probably be done with some synchronization.

The two code paths linked in the issue description reside in Replica and Store respectively, so this split makes some sense and we could take it as a basis. We could eventually arrive at one function if there is too much commonality in the orchestration / "glue" between these two parts across the different workflows (split or snapshot application).

Seemingly, it is the "glue" between these two things that is currently ad hoc / scattered all over the place. For example, the split trigger application path does (1) through a "good" function, but not so well on (2) - it's hidden in the split trigger code. The snapshot application does the opposite: for (2) it uses the "maybe mark" func, but does (1) in applySnapshot instead of using a shared helper.
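
One possible shape for that shared glue, purely as a sketch (the interfaces and names below are invented, and the real code would also have to deal with lock ordering between Store.mu, raftMu, and Replica.mu):

```go
// Sketch of the "glue" described above; all names are assumptions for
// illustration, not the real kvserver code.
package sketch

import "context"

type rangeDescriptor struct{ RangeID int64 }

// replicaInitializer is aspect (1): initializing the Replica itself.
type replicaInitializer interface {
	initFromDescriptor(ctx context.Context, desc *rangeDescriptor) error
}

// storeBookkeeper is aspect (2): updating the Store's in-memory state.
type storeBookkeeper interface {
	registerInitializedReplica(ctx context.Context, r replicaInitializer) error
}

// initializeReplica is the shared glue: both the split-trigger path and the
// snapshot-application path would call it (holding whatever locks the real
// implementation requires), so that (1) and (2) always happen together and in
// the same order.
func initializeReplica(
	ctx context.Context, s storeBookkeeper, r replicaInitializer, desc *rangeDescriptor,
) error {
	if err := r.initFromDescriptor(ctx, desc); err != nil {
		return err
	}
	return s.registerInitializedReplica(ctx, r)
}
```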

craig bot pushed a commit that referenced this issue Jan 16, 2023
94943: kvserver: don't load uninitialized replicas r=tbg a=pavelkalinnikov

This change stops using `loadRaftMuLockedReplicaMuLocked` when creating an uninitialized replica, to make it clearer what happens in this case. It turns out that what it did for an uninitialized desc is equivalent to setting an empty `ReplicaState` and asserting that the in-memory and on-disk state match.

Informs #94912, #93898
Epic: CRDB-220
Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Feb 9, 2023
96031: sql: add mixed version test for system.role_members user ids upgrade r=rafiss a=andyyang890

This patch adds a mixed version logictest that ensures that GRANT ROLE
continues to work properly in a cluster with both 22.2 and 23.1 nodes
(i.e. nodes that have run the system.role_members user ids upgrade).

Part of #92342

Release note: None

96127: kvserver: introduce cpu rebalancing r=nvanbenschoten a=kvoli

This patch allows the store rebalancer to use CPU in place of QPS when
balancing load on a cluster. This patch adds `cpu` as an option with the
cluster setting:

`kv.allocator.load_based_rebalancing.objective`

When set to `cpu` rather than `qps`, the store rebalancer performs a mostly
identical function; however, it targets balancing the sum of all replicas'
CPU time on each store rather than QPS. The default remains `qps`.

Similar to QPS, a rebalance threshold can be set to control how far above or
below the mean a store's CPU must be before it is considered imbalanced,
either overfull or underfull respectively:

`kv.allocator.cpu_rebalance_threshold`: 0.1

To handle mixed-version clusters during upgrade, as well as architectures
that do not support the CPU sampling method, a rebalance objective manager
is introduced in `rebalance_objective.go`. The manager mediates access to
the rebalance objective and overrides it in cases where the objective set
in the cluster setting cannot be supported.
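
A minimal sketch of the kind of manager described above (type, field, and function names are invented for illustration; only the cluster setting and the `cpu`/`qps` objectives come from this description):

```go
// Illustrative sketch only; names other than the cluster setting and the
// cpu/qps objectives are invented.
package sketch

import "sync"

type loadObjective int

const (
	objectiveQPS loadObjective = iota // the default
	objectiveCPU
)

// rebalanceObjectiveManager mediates access to the effective rebalance
// objective. The objective requested via the cluster setting
// kv.allocator.load_based_rebalancing.objective is overridden when it cannot
// be supported, e.g. during a mixed-version upgrade or on an architecture
// without CPU sampling support.
type rebalanceObjectiveManager struct {
	mu                   sync.Mutex
	requested            loadObjective // from the cluster setting
	cpuSamplingSupported bool
	allNodesUpgraded     bool
}

// Objective returns the objective the store rebalancer should actually use.
func (m *rebalanceObjectiveManager) Objective() loadObjective {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.requested == objectiveCPU && (!m.cpuSamplingSupported || !m.allNodesUpgraded) {
		// Fall back to QPS when CPU balancing cannot be supported.
		return objectiveQPS
	}
	return m.requested
}
```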

The results when using CPU in comparison to QPS can be found [here](https://docs.google.com/document/d/1QLhD20BTamjj3-dSG9F1gW7XMBy9miGPpJpmu2Dn3yo/edit#) (internal).


<details>
<summary>Results Summary</summary>

![image](https://user-images.githubusercontent.com/39606633/215580650-b12ff509-5cf5-4ffa-880d-8387e2ef0afa.png)

![image](https://user-images.githubusercontent.com/39606633/215580626-3d748ba1-e9a4-4abb-8acd-2c319203932e.png)

![image](https://user-images.githubusercontent.com/39606633/215580585-58e6000d-b6cf-430a-b4b7-d14a77eab3bd.png)

</details>


<details>
<summary>Detailed Allocbench Results</summary>

```
kv/r=0/access=skew
master
    median cost(gb):05.81 cpu(%):14.97 write(%):37.83
    stddev cost(gb):01.87 cpu(%):03.98 write(%):07.01
cpu rebalancing
    median cost(gb):08.76 cpu(%):14.42 write(%):36.61
    stddev cost(gb):02.66 cpu(%):01.85 write(%):04.80
kv/r=0/ops=skew
master
    median cost(gb):06.23 cpu(%):26.05 write(%):57.33
    stddev cost(gb):02.92 cpu(%):05.83 write(%):08.20
cpu rebalancing
    median cost(gb):04.28 cpu(%):11.45 write(%):31.28
    stddev cost(gb):02.25 cpu(%):02.51 write(%):06.68
kv/r=50/ops=skew
master
    median cost(gb):04.36 cpu(%):22.84 write(%):48.09
    stddev cost(gb):01.12 cpu(%):02.71 write(%):05.51
cpu rebalancing
    median cost(gb):04.64 cpu(%):13.49 write(%):43.05
    stddev cost(gb):01.07 cpu(%):01.26 write(%):08.58
kv/r=95/access=skew
master
    median cost(gb):00.00 cpu(%):09.51 write(%):01.24
    stddev cost(gb):00.00 cpu(%):01.74 write(%):00.27
cpu rebalancing
    median cost(gb):00.00 cpu(%):05.66 write(%):01.31
    stddev cost(gb):00.00 cpu(%):01.56 write(%):00.26
kv/r=95/ops=skew
master
    median cost(gb):0.00 cpu(%):47.29 write(%):00.93
    stddev cost(gb):0.09 cpu(%):04.30 write(%):00.17
cpu rebalancing
    median cost(gb):0.00 cpu(%):08.16 write(%):01.30
    stddev cost(gb):0.01 cpu(%):04.59 write(%):00.20
```


</details>

resolves: #95380

Release note (ops change): Added an option to balance CPU time (`cpu`)
instead of queries per second (`qps`) among stores in a cluster. This is
done by setting `kv.allocator.load_based_rebalancing.objective='cpu'`.
`kv.allocator.cpu_rebalance_threshold` is also added, similar to
`kv.allocator.qps_rebalance_threshold`, to control the target range for
store CPU above and below the cluster mean.

96440: ui: add execution insights to statement and transaction fingerprint details r=ericharmeling a=ericharmeling

This commit adds execution insights to the Statement Fingerprint and Transaction Fingerprint Details pages.

Part of #83780.

Loom: https://www.loom.com/share/98d2023b672e43fa8016829aa641a829

Note that the SQL queries against the `*_execution_insights` tables are updated to use `SELECT DISTINCT ON (*_fingerprint_id, problems, causes)` (equivalent to `GROUP BY (*_fingerprint_id, problems, causes)`) over the latest results in the tables, rather than `row_number() OVER ( PARTITION BY stmt_fingerprint_id, problem, causes ORDER BY end_time DESC ) AS rank... WHERE rank = 1`. Both patterns return the same result, but one uses aggregation and the other uses a window function. I find the `DISTINCT ON/GROUP BY` pattern easier to understand, I'm not seeing much difference in planning/execution time between the two over the same set of data, and `DISTINCT ON/GROUP BY` comes up as more performant in almost all the secondary sources I've encountered.

Release note (ui change): Added execution insights to the Statement Fingerprint Details and Transaction Fingerprint Details Pages.

96828: collatedstring: support default, C, and POSIX in expressions r=otan a=rafiss

fixes #50734
fixes #95667
informs #57255

---
### collatedstring: create new package 

Move the small amount of code from tree/collatedstring.go

---

### collatedstring: support C and POSIX in expressions

Release note (sql change): Expressions of the form `COLLATE "default"`,
`COLLATE "C"`, and `COLLATE "POSIX"` are now supported. Since the
default collation cannot be changed currently, these expressions are all
equivalent. The expressions are evaluated by treating the input as a
normal string, and ignoring the collation.

This means that comparisons between strings and collated strings that
use "default", "C", or "POSIX" are now supported.

Creating a column with the "C" or "POSIX" collations is still not
supported.

96870: kvserver: use replicasByKey addition func in snapshot path r=tbg a=pavelkalinnikov

This commit makes one step towards better code sharing between the `Replica` initialization paths: split trigger and snapshot application. It makes both use the same method to check and insert the initialized `Replica` into the `replicasByKey` map.

Touches #94912
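
A toy sketch of the kind of shared insertion step this describes (invented types; only the `replicasByKey` name and the overlap check come from the code quoted earlier in this issue):

```go
// Toy sketch; not the real kvserver implementation.
package sketch

import "fmt"

type keySpan struct{ start, end string }

// overlaps reports whether two half-open spans [start, end) intersect.
func (a keySpan) overlaps(b keySpan) bool {
	return a.start < b.end && b.start < a.end
}

type initializedReplica struct{ span keySpan }

type byKeyStore struct {
	replicasByKey []*initializedReplica // stand-in for the replicasByKey btree
}

// insertInitializedReplica is the shared "check for overlap, then insert" step
// that both the split-trigger and snapshot-application paths would use.
func (s *byKeyStore) insertInitializedReplica(r *initializedReplica) error {
	for _, other := range s.replicasByKey {
		if other.span.overlaps(r.span) {
			return fmt.Errorf("cannot insert %v: overlaps existing replica %v", r.span, other.span)
		}
	}
	s.replicasByKey = append(s.replicasByKey, r)
	return nil
}
```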

96874: roachtest: run scheduled backup only on clusters with enterprise license r=stevendanna a=msbutler

Epic: none

Release note: None

96883: go.mod: bump Pebble to 829675f94811 r=RaduBerinde a=RaduBerinde

829675f9 db: fix ObsoleteSize stat
2f086b74 db: refactor compaction splitting to reduce key comparisons

Release note: None
Epic: none

Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
craig bot pushed a commit that referenced this issue Feb 16, 2023
96943: kvserver: refactor Replica and Store updates on replica init r=tbg a=pavelkalinnikov

This PR further increases code sharing between the two replica initialization paths: split trigger and snapshot application. It also improves readability of these paths.

Touches #94912

Co-authored-by: Pavel Kalinnikov <[email protected]>
exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
github-project-automation bot moved this to Incoming in KV Aug 28, 2024