Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rac2: admitted vector handling #129508

Closed
Tracked by #123509
sumeerbhola opened this issue Aug 22, 2024 · 2 comments
Closed
Tracked by #123509

rac2: admitted vector handling #129508

sumeerbhola opened this issue Aug 22, 2024 · 2 comments
Assignees
Labels
A-replication-admission-control-v2 Related to introduction of replication AC v2 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Aug 22, 2024

This needs to be done outside Raft, since it (a) allows for more efficient attachment of the admitted vector when sending a MsgAppResp, (b) probably more maintainable, since it will sit inside the RACv2 code, etc.

Prototype is in #129447


Key changes:

(see full history of changes below)


Jira issue: CRDB-41569
Epic: CRDB-37515

@sumeerbhola sumeerbhola added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 22, 2024
@sumeerbhola sumeerbhola added the A-replication-admission-control-v2 Related to introduction of replication AC v2 label Aug 23, 2024
craig bot pushed a commit that referenced this issue Sep 3, 2024
129751: rac2: introduce log storage tracker r=sumeerbhola a=pav-kv

This PR introduces the data structure that observes all raft log storage writes, syncs and logical admissions, and tracks the stable/admitted log state correspondingly.

Log writes are ordered by leader term, then by index. Log writes have no gaps, except when a snapshot clears the log. All writes/syncs must be registered. Admissions can be registered selectively, for entries that are subject to RACv2.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 3, 2024
The AdmittedTracker provides the latest AdmittedVector, and will be
implemented by processorImpl.

Informs cockroachdb#129508

Epic: CRDB-37515

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 4, 2024
The AdmittedTracker provides the latest AdmittedVector, and will be
implemented by processorImpl.

Informs cockroachdb#129508

Epic: CRDB-37515

Release note: None
craig bot pushed a commit that referenced this issue Sep 4, 2024
130076: replica_rac2: use LogMark to convey semantics r=sumeerbhola a=pav-kv

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 4, 2024
The AdmittedTracker provides the latest AdmittedVector, and will be
implemented by processorImpl.

Informs cockroachdb#129508

Epic: CRDB-37515

Release note: None
craig bot pushed a commit that referenced this issue Sep 4, 2024
130074: replica_rac2: add admitted state protocol r=sumeerbhola a=pav-kv

This PR introduces `AdmittedState` and `PiggybackedAdmittedState` protos, and converts replica- and leader-side code (such as `AdmittedPiggybacker` and raft transport) to use them.

The protos are not populated at the moment, and the leader-side code is a bunch of TODOs. More conversion code will follow, to replace the `AdmittedResponseForRange` flows.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 4, 2024
129758: parser,mt: remove CREATE TENANT ... LIKE r=dt a=dt

The problem -- needing to apply a large number of configuration knobs -- solved by the 'template tenant' functionality has since been solved instead by having named service modes for each distinct common set of configurations, for example 'shared' which implies all capabilities in addition to in-process service. The template functionality was never pushed on PCR UA users, and is not used in cloud, and thus has no users and is accordingly removed here.

Release note: none.
Epic: none.

130054: rac2: add AdmittedTracker interface for use by RangeController r=pav-kv,kvoli a=sumeerbhola

The AdmittedTracker provides the latest AdmittedVector, and will be implemented by processorImpl.

Informs #129508

Epic: CRDB-37515

Release note: None

130096: release: update_versions: continue, even if one repo attempt fails r=celiala a=celiala

Follow-up to #130063: after merging 130063, update_versions failed to complete[1], due to encountering another / different error.

This commit adjusts update_versions so that it continues & attempts to create PRs for all the remaining repos, despite encountering errors along the way.

[1] Latest "Create version update PRs" job: https://teamcity.cockroachdb.com/buildConfiguration/Internal_Cockroach_Release_Publish_CreateVersionUpdatePRs/16752534?buildTab=log&focusLine=3552&logView=flowAware&expandAll=true

Release note: None
Epic: None
Release justification: release-infra only change.

130111: execinfrapb: move MockDistSQLServer to flowinfra r=dt a=dt

pb packages are intended to be leaf packages with few or no dependencies. The mock server in the test utils was pulling in dependencies on packages that encode business logic, including logic that wants to depend on pb packages such as pkg/rpc's auth logic that depends on request types.

Release note: none.
Epic: none.

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Celia La <[email protected]>
craig bot pushed a commit that referenced this issue Sep 6, 2024
130200: rac2: indicate admitted updates from LogTracker r=sumeerbhola a=pav-kv

This PR makes `LogTracker` return a bit indicating when the `Admitted` vector has changed. The admitted vector is considered "changed" when its leader term goes up (and the admitted indices either stay intact or regress), or the term stays intact but the admitted index at any priority goes up.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 9, 2024
129727: kvserver,rac2: initialize RaftNode early r=sumeerbhola a=pav-kv

This PR moves the `RaftNode` interface initialization to the point when `raft.RawNode` is created. This will be needed for correct initialization of `LogTracker` from the initial raft state.

The `RaftNode` interface is moved to `replica_rac2` integration package, to reduce cluttering of `kvserver`.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 10, 2024
130171: replica_rac2: integrate LogTracker r=sumeerbhola a=pav-kv

This PR integrates the `LogTracker` into RACv2 `Processor`. The design sketch is below.

- `LogTracker` is created when `raft.RawNode` is initialized. It reads the initial stable state from `RawNode.LogMark()`, and initializes the stable and admitted indices to match this mark.
- `LogTracker` observes all log storage appends in `handleRaftReady`, and all log and snapshot syncs in `OnLogSync` and `OnSnapSync` handlers. This guarantees that the stable mark in `LogTracker` is always accurate.
- `LogTracker` observes all entries subject to admission control, and their corresponding admissions. This allows updating the admitted vector accurately.

The admitted vector is sent to the leader from two places:

- In `sendRaftMessage`, any successful `MsgAppResp` message is intercepted, and the corresponding `RaftMessageRequest` is annotated with the admitted vector if it is in the coordinate system of the receiver, i.e. has the same leader term. This flow supports the fast path when logical admission happens without delays: by the time the `MsgAppResp` is sent, the entries are already admitted, and the admitted vector has advanced.
- In `handleRaftReady`, the admitted vector is sent to the `Piggybacker`, which then attaches them to any `RaftMessageRequestBatch` going to the same node as the receiver replica. This serves cases when admission is lagging the log syncs, and there might be no `MsgAppResp` to attach the vector to. Such admissions are batched into one `Ready` cycle for efficiency reasons.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 12, 2024
130427: replica_rac2: integrate admitted tracking on leader r=sumeerbhola,kvoli a=pav-kv

This PR plumbs the admitted vectors into the RACv2 `Processor`. Admitted vectors are applied to `replicaSendStream` and cause releasing the tokens held by the leader.

The admitted vectors are plumbed to `replicaSendStream` via 3 paths:
- The leader's own admitted vector is applied from `HandleRaftReadyRaftMuLocked`, calling into `RangeController.AdmitRaftMuLocked` directly.
- The followers' admitted vectors in most cases are received via annotated `RaftMessageRequest.AdmittedState`, which is dispatched from `stepRaftGroupRaftMuLocked` into the `Processor` via `Processor.AdmitRaftMuLocked` method.
- The followers' piggybacked admitted vectors from `RaftMessageRequestBatch` are queued on the `Processor` via `EnqueuePiggybackedAdmittedAtLeader` method, and later applied from `HandleRaftReadyRaftMuLocked`.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 12, 2024
130459: replica_rac2: remove processorImpl mutex r=sumeerbhola a=pav-kv

This PR removes the `processorImpl` mutex. Almost all `processorImpl` fields are used under `raftMu` (only by methods containing `RaftMuLocked` infix). Exceptions:

- `logTracker`. It has an internal mutex for interaction with log syncs/admissions out of band.
- `leader.rc` has `rcReferenceUpdateMu` for high-contention interaction with out-of-band `AdmitForEval`.
- `leader.pendingAdmittedMu` has a narrow mutex for quick interaction with appends from `RaftTransport`.
- `enabledWhenLeader` is an atomic and doesn't need a mutex.

Related to #129508

130590: util: fix race in cidr startup r=jaylim-crl a=andrewbaptist

Previously if the `server.cidr_mapping_url` was set and a node restarted, there was a race condition where `SetOnChange` for the setting could be called before the `Start` was called. This could result in it blocking while attempting to submit to the channel.

Fixes: #130589

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
craig bot pushed a commit that referenced this issue Sep 12, 2024
130522: changefeedccl: add roachtesting for network metrics r=rharding6373 a=asg0451

Add a roachtest helper that checks that the new
cdc network metrics are being emitted. For now it
simply checks that they are > 0.

Part of: #130097

Release note: None

130585: replica_rac2: rm Ready scheduling on piggybacked r=sumeerbhola a=pav-kv

Since the piggybacked admitted vectors are applied immediately from within the method that clears the queued updates, we no longer need to schedule a Ready cycle. Previously the token release would happen subsequently in Ready handler.

Related to #129508

Co-authored-by: Miles Frankel <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 13, 2024
…130695

130178: roachtest: update multitenant-multiregion to use roachprod API r=herkolategan,fqazi a=renatolabs

This updates the `multitenant-multiregion` test to use the "official"
multitenant roachprod API instead of the deprecated functions in
`multitenant_utils.go`.

Fixes: #124029

Release note: None

130461: replica_rac2: clean up admitted vector sending r=sumeerbhola a=pav-kv

This PR makes `processorImpl` track the raft node `term` unconditionally. It then factors out the admitted vector sending into a well-documented function, and uses the new `term` field to filter out no-op admitted vectors that don't match the leader's term.

Part of #129508

130667: opt: remove volatile histogram benchmarks r=mgartner a=mgartner

Benchmarks for `(*Histogram).ValuesCount`, `(*Histogram).MaxFrequency`,
and `(*Histogram).ApplySelectivity` have been removed. These methods are
compiled into very few instructions and are very fast. This makes them
susceptible to "phantom regressions" where unlucky interactions with the
CPU cache or other resources can degrade performance between commits
even when the methods compile to the same set of instructions. See
https://cockroachlabs.slack.com/archives/C4X2J0RH6/p1725978728168389
for more details.

Epic: None

Release note: None


130670: schemachanger: block zone config changes if schema_locked is set r=rafiss a=annrpom

Fixes: #130668
Release note (bug fix): Fixed a bug where zone configuration changes issued by
the declarative schema changer were not blocked if a table had `schema_locked`
set. For more information about the declarative schema changer, see:
https://www.cockroachlabs.com/docs/stable/online-schema-changes#declarative-schema-changer/.

130684: rac2: return tracked deductions on close/removal r=sumeerbhola a=kvoli

When a replica was removed from a range, or the range controller closed, any tracked deductions did not return tokens to the respective stream token counts.

Update `CloseRaftMuLocked` and `SetReplicasRaftMuLocked` to close any replica send streams for replicas which are no longer part of the replica set, closing all when the range controller closes.

Also immediately close the send stream when a replica transitions to StateSnapshot.

Resolves: #130683
Release note: None

130685: cluster-ui: copy some base components from cloud r=xinhaoz a=xinhaoz

This commit ports over some basic components from cloud. They are
placed in a new directory, `sharedFromCloud` and will be deleted
once the new package is ready. This is simply a temporary solution.

Components ported over, with their dependencies:
- Search
- PageHeader
- PageCount (for displaying table pagination)

Epic: CRDB-37558

Release note: None

130692: tablemetadatacache: use store ids from response when updating cache r=xinhaoz a=xinhaoz

Now that #129060 is done we can use the StoreIDs from the span stats response when updating systme.table_metadata.

Epic: CRDB-37558
Release note: None

130694: go.mod: bump Pebble to d5fd9927fa4d r=sumeerbhola a=jbowens

Changes:

 * [`d5fd9927`](cockroachdb/pebble@d5fd9927) sstable: introduce RawColumnWriter
 * [`0665a3e1`](cockroachdb/pebble@0665a3e1) .github: use upload-artifact@v4
 * [`e0afd551`](cockroachdb/pebble@e0afd551) base: consolidate internal key parsing functions
 * [`117aa7d9`](cockroachdb/pebble@117aa7d9) colblk: allow finishing data, index blocks without last row
 * [`2cc212bb`](cockroachdb/pebble@2cc212bb) binfmt: add relative offsets
 * [`7de517e6`](cockroachdb/pebble@7de517e6) colblk: allow UintBuilder.Get into sparse, default rows
 * [`065a044c`](cockroachdb/pebble@065a044c) colblk: fix up zero-row structures
 * [`934a2235`](cockroachdb/pebble@934a2235) colblk: expose block reader Describe methods
 * [`376d455b`](cockroachdb/pebble@376d455b) sstable: adapt EstimateDiskUsage to be agnostic to index block impl
 * [`da5055ad`](cockroachdb/pebble@da5055ad) sstable: remove RawWriter.IsStrictObsolete
 * [`fc6eab06`](cockroachdb/pebble@fc6eab06) manual: add cgo memory accounting
 * [`c8c3806e`](cockroachdb/pebble@c8c3806e) sstable/block: fix compression roundtrip test
 * [`b3a81bd6`](cockroachdb/pebble@b3a81bd6) sstable: refactor sstable layout facilities, add decodeLayout
 * [`729b6f72`](cockroachdb/pebble@729b6f72) treesteps: add depth limits
 * [`90953236`](cockroachdb/pebble@90953236) sstable: remove unnecessary loop while identifying filter block
 * [`1c5e99fc`](cockroachdb/pebble@1c5e99fc) update crlib and use crstrings.Lines in a few tests
 * [`0c4e2c6c`](cockroachdb/pebble@0c4e2c6c) treesteps: infrastructure for visualization of operations on trees
 * [`d3ef91b8`](cockroachdb/pebble@d3ef91b8) sstable: return interface type from NewRawWriter
 * [`a0f6b9ae`](cockroachdb/pebble@a0f6b9ae) cache: use a specific type for the cache ID
 * [`19e3a699`](cockroachdb/pebble@19e3a699) cache: remove entry.refcnt

Release note: none.
Epic: none.

130695: schemachanger: refactor deprecated call in mustRetrieveIndexNameElem r=fqazi a=annrpom

This patch refactors `mustRetrieveIndexNameElem` to be more aligned with how we get elems in the DSC today; we also move this in the `helpers.go` file as zone config partitions code uses it to support `ALTER PARTITION ... OF TABLE`.

Informs: #129889
Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Annie Pompa <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
craig bot pushed a commit that referenced this issue Sep 16, 2024
130688: replica_rac2: move raft write registering r=sumeerbhola a=pav-kv

This commit moves registering all raft log/snapshot writes for RACv2 into `HandleRaftReadyRaftMuLocked`, so that `maybeSendAdmittedRaftMuLocked` call always happens after registering the latest write.

This way, the admitted vector in `logTracker` carries the latest leader term we have observed from raft, and `maybeSendAdmittedRaftMuLocked` won't skip it when the term changes.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 17, 2024
130854: logstore: fix bug in getting LogMark from message r=kvoli a=pav-kv

The `Entries` field is not used in `MsgStorageAppendResp`. Instead, a non-zero `Index` field indicates that there has been at least one entry in the corresponding `MsgStorageAppend` message.

Related to #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
angles-n-daemons pushed a commit to angles-n-daemons/cockroach that referenced this issue Sep 17, 2024
130688: replica_rac2: move raft write registering r=sumeerbhola a=pav-kv

This commit moves registering all raft log/snapshot writes for RACv2 into `HandleRaftReadyRaftMuLocked`, so that `maybeSendAdmittedRaftMuLocked` call always happens after registering the latest write.

This way, the admitted vector in `logTracker` carries the latest leader term we have observed from raft, and `maybeSendAdmittedRaftMuLocked` won't skip it when the term changes.

Part of cockroachdb#129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 19, 2024
130382: replica_rac2: rm unused waitingForAdmissionState r=sumeerbhola a=pav-kv

Also copy the comment explaining why we chose to track individual log indices rather then compressing them into a single "waiting" interval.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Sep 23, 2024
130619: kvserver: enable admitted vector protocol r=sumeerbhola,kvoli a=pav-kv

This commit enables `AdmittedState` annotations in `RaftMessageRequest` and `RaftMessageRequestBatch` protocol. The leader starts receiving admitted vector updates from all replicas.

Part of #129508

131172: cloud/amazon: test kms against mock server r=dt a=dt

Release note: none.
Epic: none.

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@kvoli
Copy link
Collaborator

kvoli commented Sep 24, 2024

@pav-kv is all that is left #129508 to complete this issue?

@pav-kv
Copy link
Collaborator

pav-kv commented Sep 24, 2024

Only need to merge #130444 after some fixups. Will do today/tomorrow.

craig bot pushed a commit that referenced this issue Sep 25, 2024
130444: rac2: add MsgApp pings to tickle admitted vectors r=sumeerbhola a=pav-kv

This commit adds another condition for sending `MsgApp` pings to peers, on top of existing `MsgApp` pings that raft leader sends if its knowledge about the follower's state (stable log index and commit index) hasn't converged to its own.

The additional condition is that the leader holds some RACv2 eval/flow tokens, meaning that the follower's storage hasn't admitted all the pending writes. Our knowledge may be lagging due to various reasons, e.g. the messages are dropped. We want to learn about admissions in a timely manner, so we send `MsgApp` pings that trigger `MsgAppResp` flow, which delivers the follower's current state to the leader. Eventually, the leader learns the follower's state and quiesces the pings.

We integrate with `MsgApp` pings for a couple of reasons:
 - simplicity
 - we don't want to send excessive amount of pings. If raft sends pings already (because its knowledge is lagging), RACv2 pings will not be sent in addition to that.

Part of #129508

Co-authored-by: Pavel Kalinnikov <[email protected]>
@pav-kv pav-kv closed this as completed Sep 25, 2024
craig bot pushed a commit that referenced this issue Oct 14, 2024
132338: pgwire: reduce allocations when writing CHAR(N) datums r=mgartner a=mgartner

`ResolveBlankPaddedCharBytes` has been replaced with a more efficient
method of padding `CHAR(N)` datums with trailing spaces.

Epic: None

Release note: None


132501: opt: remove outdated  TODO r=mgartner a=mgartner

A TODO that was addressed in #129007 has been removed.

Release note: None

132563: kvserver: skip empty RACv1 dispatches r=kvoli a=pav-kv

This commit skips creating a `RaftMessageRequest` in the fallback RAC admission dispatch code if there are no pending RACv1 dispatches. Previously, it would send an empty `RaftMessageRequest` which would cause an error in the receiver's `handleRaftRequest`, [message drops](https://github.com/cockroachdb/cockroach/blob/9e12f67ff4ad860651c40dcef489a1556d1d11b7/pkg/kv/kvserver/raft_transport.go#L451-L456), and the receiver's message queue to restart after the following warning:

```
unable to accept Raft message from (n0,s0):?: no handler registered for (n0,s0):?
```

Related to #129508

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-replication-admission-control-v2 Related to introduction of replication AC v2 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

3 participants