kvstorage: complete RaftReplicaID migration #95513

Merged
merged 2 commits into from
Feb 6, 2023

@tbg tbg commented Jan 19, 2023

As of v22.1 [1], we always write the RaftReplicaID when creating a
Replica or updating a snapshot. However, since this is persisted state
that could have originated in older versions and not been rewritten
since, we could not yet rely on a persisted ReplicaID being present.

This commit adds code to the (*Store).Start boot sequence that

  • persists a RaftReplicaID for all initialized replicas (using the
    ReplicaID from the descriptor)
  • deletes all uninitialized replicas lacking RaftReplicaID (since we don't know their
    ReplicaID at this point).
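
The two rules above can be illustrated with a minimal sketch. The `replicaState` type and the `migrate` function are hypothetical names made up for this example and are not the kvstorage API; a zero `raftReplicaID` is assumed to mean that no RaftReplicaID was persisted.

```go
package main

import "fmt"

// replicaState is a hypothetical, simplified view of what a store finds on
// disk for one range at startup. It is not the actual kvstorage type.
type replicaState struct {
	rangeID       int
	initialized   bool // true if a range descriptor is present
	descReplicaID int  // ReplicaID from the descriptor (initialized replicas only)
	raftReplicaID int  // persisted RaftReplicaID; 0 is assumed to mean "missing"
}

// migrate applies the two rules from the description and returns the
// replicas that survive the boot-time pass.
func migrate(replicas []replicaState) []replicaState {
	var kept []replicaState
	for _, r := range replicas {
		if r.raftReplicaID == 0 {
			if !r.initialized {
				// No descriptor and no RaftReplicaID: the ReplicaID is
				// unknown, so the replica is dropped.
				continue
			}
			// Initialized replica: backfill from the descriptor.
			r.raftReplicaID = r.descReplicaID
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	in := []replicaState{
		{rangeID: 1, initialized: true, descReplicaID: 3},  // gets backfilled
		{rangeID: 2, initialized: false},                   // gets deleted
		{rangeID: 3, initialized: false, raftReplicaID: 7}, // kept as-is
	}
	for _, r := range migrate(in) {
		fmt.Printf("r%d: RaftReplicaID=%d\n", r.rangeID, r.raftReplicaID)
	}
}
```

Replicas that already carry a RaftReplicaID pass through untouched, so in this sketch running the pass a second time changes nothing.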

The second item in theory violates Raft invariants, as uninitialized
Replicas are allowed to vote (though they then cannot accept log
entries). So in theory:

  • an uninitialized replica casts a decisive vote for a leader
  • it restarts
  • code in this commit removes the uninited replica (and its vote)
  • a delayed MsgVote from another candidate arrives
  • it casts another vote in the same term, this time for the dueling candidate
  • now there are two leaders in the same term.

In addition, the scenario above presupposes that the two leaders cannot
communicate with each other. Even if that is the case, the two leaders
cannot append to the uninitialized replica (it doesn't accept entries),
so additional voters would also need to return at exactly the right
time.

Since an uninitialized replica without a RaftReplicaID is necessarily
at least one release old, this is exceedingly unlikely and we will
live with this theoretical risk.
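
The double-vote sequence described above can be sketched as a toy model. The `hardState` and `voter` types are hypothetical and model only the "at most one vote per term" rule; this is not the etcd/raft implementation, and deleting the uninitialized replica is represented simply by zeroing the persisted state.

```go
package main

import "fmt"

// hardState is a hypothetical stand-in for the persisted Raft state
// (term and vote) of a single replica.
type hardState struct {
	term uint64
	vote uint64 // candidate voted for in term; 0 means no vote yet
}

// voter models only the "at most one vote per term" rule.
type voter struct{ hs hardState }

// handleVote grants the vote if the request is for a newer term, or if no
// conflicting vote is recorded for the current term.
func (v *voter) handleVote(term, candidate uint64) bool {
	if term < v.hs.term {
		return false // stale request
	}
	if term > v.hs.term {
		v.hs = hardState{term: term} // new term, no vote cast yet
	}
	if v.hs.vote == 0 || v.hs.vote == candidate {
		v.hs.vote = candidate
		return true
	}
	return false
}

func main() {
	v := &voter{}
	fmt.Println(v.handleVote(5, 1)) // votes for candidate 1 in term 5
	fmt.Println(v.handleVote(5, 2)) // rejected: already voted in term 5

	// Restart plus deletion of the uninitialized replica: the persisted
	// term and vote are gone.
	v.hs = hardState{}

	fmt.Println(v.handleVote(5, 2)) // granted: a second vote in term 5
}
```

The third call succeeds only because the recorded vote was lost, which is exactly the invariant violation the description accepts as a theoretical risk.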

This PR also adds a first stab at a datadriven test harness for
kvstorage which is likely to be of use for #93247.

Epic: CRDB-220
Release note: None

Footnotes

  1. https://github.com/cockroachdb/cockroach/pull/75761

@tbg tbg force-pushed the kvstorage-datadriven branch from 6d30d5d to 16d1a33 Compare January 20, 2023 15:26
@tbg tbg changed the title from "kvstorage: backfill RaftReplicaID" to "kvstorage: complete RaftReplicaID migration" Jan 20, 2023
@tbg tbg force-pushed the kvstorage-datadriven branch 3 times, most recently from d3e9028 to d2c0515 Compare January 20, 2023 15:49
@tbg tbg marked this pull request as ready for review January 20, 2023 15:49
@tbg tbg requested a review from a team as a code owner January 20, 2023 15:49
@tbg tbg requested a review from a team January 20, 2023 15:49
@tbg tbg requested review from a team as code owners January 20, 2023 15:49
@tbg tbg requested review from jbowens and pav-kv and removed request for a team and jbowens January 20, 2023 15:49

@tbg tbg force-pushed the kvstorage-datadriven branch 2 times, most recently from 26f4edc to 18a7ff7 Compare January 23, 2023 16:37

tbg commented Jan 23, 2023

Now it's good to review.

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)


-- commits line 36 at r4:
wasn't it v22.1?

There is also a similar comment in

// Compatibility:
// - v21.2 and v22.1: v22.1 unilaterally introduces RaftReplicaID (an
// unreplicated range-id local key). If a v22.1 binary is rolled back at
// a node, the fact that RaftReplicaID was written is harmless to a
// v21.2 node since it does not read it. When a v21.2 drops an
// initialized range, the RaftReplicaID will also be deleted because the
// whole range-ID local key space is deleted.
//
// - v22.2: we will start relying on the presence of RaftReplicaID, and
// remove any unitialized replicas that have a HardState but no
// RaftReplicaID. This removal will happen in ReplicasStorage.Init and
// allow us to tighten invariants. Additionally, knowing the ReplicaID
// for an unitialized range could allow a node to somehow contact the
// raft group (say by broadcasting to all nodes in the cluster), and if
// the ReplicaID is stale, would allow the node to remove the HardState
// and RaftReplicaID. See
// https://github.com/cockroachdb/cockroach/issues/75740.
//
// There is a concern that there could be some replica that survived
// from v21.2 to v22.1 to v22.2 in unitialized state and will be
// incorrectly removed in ReplicasStorage.Init causing the loss of the
// HardState.{Term,Vote} and lead to a "split-brain" wrt leader
// election.
//
// Even though this seems theoretically possible, it is considered
// practically impossible, and not just because a replica's vote is
// unlikely to stay relevant across 2 upgrades. For one, we're always
// going through learners and don't promote until caught up, so
// uninitialized replicas generally never get to vote. Second, even if
// their vote somehow mattered (perhaps we sent a learner a snap which
// was not durably persisted - which we also know is impossible, but
// let's assume it - and then promoted the node and it immediately
// power-cycled, losing the snapshot) the fire-and-forget way in which
// raft votes are requested (in the same raft cycle) makes it extremely
// unlikely that the restarted node would then receive it.
which could use some adjustment in light of this PR doing this Replica cleanup.

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @sumeerbhola)


-- commits line 36 at r4:

wasn't it v22.1?

Yep, thanks.

Updated the comment.

@tbg tbg force-pushed the kvstorage-datadriven branch from 18a7ff7 to 9731f35 Compare January 24, 2023 08:07
@tbg tbg force-pushed the kvstorage-datadriven branch 2 times, most recently from 5327822 to 749969b Compare February 2, 2023 20:44

tbg commented Feb 2, 2023

Ok, RFAL.

tbg commented Feb 3, 2023

https://teamcity.cockroachdb.com/repository/download/Cockroach_Ci_Tests_Testrace/8563775:id/bazel-testlogs/pkg/storage/storage_test/test.log

panic: concurrent write operations detected on file [recovered]
	panic: concurrent write operations detected on file

goroutine 12121 [running]:
github.com/cockroachdb/pebble.(*DB).runCompaction.func1()
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2258 +0x23d
panic({0x27ac820, 0x31d3a20})
	GOROOT/src/runtime/panic.go:890 +0x262
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc000584910, 0x2, 0xc000988558)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:262 +0x1a5
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).Sync(0xc000584910)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:219 +0x69
github.com/cockroachdb/pebble/vfs.(*enospcFile).Sync(0xc000138648)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:391 +0x6d
github.com/cockroachdb/pebble.(*DB).runCompaction(0xc00090a500, 0x7d, 0xc000539200)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2836 +0x312f
github.com/cockroachdb/pebble.(*DB).flush1(0xc00090a500)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1685 +0x61d
github.com/cockroachdb/pebble.(*DB).flush.func1({0x31f15b8, 0xc000a0f2c0})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1624 +0x137
runtime/pprof.Do({0x31f1548, 0xc0000600a8}, {{0xc0000acf60?, 0xc00090a500?, 0x0?}}, 0xc00073a7a0)
	GOROOT/src/runtime/pprof/runtime.go:40 +0x123
github.com/cockroachdb/pebble.(*DB).flush(0xc00090a500)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1617 +0x92
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleFlush
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1533 +0x186
I230202 22:16:32.090098 1 (gostd) testmain.go:626  [T1] 1  Test //pkg/storage:storage_test exited with error code 2

knz commented Feb 3, 2023

@RaduBerinde requested that we file these concurrency issues inside pebble as issues on the pebble repo.

tbg commented Feb 3, 2023

Filed cockroachdb/pebble#2301

See cockroachdb#93310. This is also the beginning of cockroachdb#93247.

Epic: CRDB-220
Release note: None
@tbg tbg force-pushed the kvstorage-datadriven branch from 749969b to 20685eb Compare February 3, 2023 09:13

@pav-kv pav-kv left a comment

Looking much clearer now, thanks for implementing the suggestion! LGTM in principle, but left some tips/nits.

tbg commented Feb 3, 2023

Still have to push the updates, please ignore my comment pings until I request.

@tbg tbg force-pushed the kvstorage-datadriven branch from 20685eb to b0cf469 Compare February 3, 2023 16:20

@pav-kv pav-kv left a comment

Some final bits and bobs.

As of v22.1[^1], we always write the RaftReplicaID when creating a
Replica or updating a snapshot. However, since this is persisted state
that could have originated in older versions and not been rewritten
since, we could not yet rely on a persisted ReplicaID being present.

This commit adds code to the `(*Store).Start` boot sequence that

- persists a RaftReplicaID for all initialized replicas (using the
  ReplicaID from the descriptor)
- deletes all uninitialized replicas that don't have a RaftReplicaID
  (since we don't know their ReplicaID at this point).

The second item in theory violates Raft invariants, as uninitialized
Replicas are allowed to vote (though they then cannot accept log
entries). So in theory:

- an uninitialized replica casts a decisive vote for a leader
- it restarts
- code in this commit removes the uninited replica (and its vote)
- a delayed MsgVote from another candidate arrives
- it casts another vote in the same term, this time for the dueling candidate
- now there are two leaders in the same term.

In addition, the scenario above presupposes that the two leaders cannot
communicate with each other. Even if that is the case, the two leaders
cannot append to the uninitialized replica (it doesn't accept entries),
so additional voters would also need to return at exactly the right
time.

Since an uninitialized replica without a RaftReplicaID is necessarily
at least one release old, this is exceedingly unlikely and we will
live with this theoretical risk.

This commit also introduces a few assertions that make sure that
we don't have overlapping initialized replicas (which would be
detected at Store.Start time otherwise while inserting in the
btree, but it's nice to catch this earlier) or duplicate
RangeIDs.
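
As a rough illustration of what such assertions might look like, here is a sketch under stated assumptions. The `replica` type and `checkInvariants` function are hypothetical, keys are plain strings, and spans are half-open `[start, end)` with `start < end`; the actual checks in kvstorage may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// replica is a hypothetical, simplified replica: a RangeID and, if
// initialized, a half-open key span [start, end).
type replica struct {
	rangeID     int
	initialized bool
	start, end  string
}

// checkInvariants reports duplicate RangeIDs and overlapping spans among
// initialized replicas. Uninitialized replicas have no span to compare.
func checkInvariants(replicas []replica) error {
	seen := map[int]bool{}
	var inited []replica
	for _, r := range replicas {
		if seen[r.rangeID] {
			return fmt.Errorf("duplicate RangeID r%d", r.rangeID)
		}
		seen[r.rangeID] = true
		if r.initialized {
			inited = append(inited, r)
		}
	}
	// After sorting by start key, any overlap shows up between neighbors.
	sort.Slice(inited, func(i, j int) bool { return inited[i].start < inited[j].start })
	for i := 1; i < len(inited); i++ {
		prev, cur := inited[i-1], inited[i]
		if cur.start < prev.end {
			return fmt.Errorf("r%d [%s,%s) overlaps r%d [%s,%s)",
				prev.rangeID, prev.start, prev.end, cur.rangeID, cur.start, cur.end)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkInvariants([]replica{
		{rangeID: 1, initialized: true, start: "a", end: "c"},
		{rangeID: 2, initialized: true, start: "b", end: "f"},
	}))
}
```

Sorting first keeps the overlap check to a single pass over neighboring spans rather than comparing every pair.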

[^1]: cockroachdb#75761

Epic: CRDB-220
Release note: None
@tbg tbg force-pushed the kvstorage-datadriven branch from 402bc02 to 672e8b1 Compare February 6, 2023 07:15

tbg commented Feb 6, 2023

bors r=pavelkalinnikov

craig bot commented Feb 6, 2023

Build succeeded:

@craig craig bot merged commit 622956b into cockroachdb:master Feb 6, 2023
@tbg tbg deleted the kvstorage-datadriven branch February 6, 2023 11:05
tbg added a commit to tbg/cockroach that referenced this pull request Feb 6, 2023
It now has to be there, so turn this into an assertion failure.
See cockroachdb#95513.

Epic: CRDB-220
Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Feb 8, 2023
It now has to be there, so turn this into an assertion failure.
See cockroachdb#95513.

Epic: CRDB-220
Release note: None
craig bot pushed a commit that referenced this pull request Jan 4, 2024
115884: kvserver: remove ReplicaID migration r=erikgrinaker a=erikgrinaker

This migration ensured every replica had a persisted replica ID. It is no longer needed after `MinSupportedVersion` >= 23.1, since the migration has been applied on every finalized 23.1 node. Instead, we assert during startup that all replicas have a replica ID.
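
A startup assertion of this kind could look roughly like the sketch below. The `persistedReplica` type and `assertReplicaIDs` function are hypothetical names for illustration, and a zero `raftReplicaID` is assumed to mean that nothing was persisted.

```go
package main

import "fmt"

// persistedReplica is a hypothetical stand-in for the on-disk state of one
// replica as seen during store startup.
type persistedReplica struct {
	rangeID       int
	raftReplicaID int // 0 is assumed to mean no RaftReplicaID was persisted
}

// assertReplicaIDs returns an error for the first replica that lacks a
// persisted replica ID, instead of backfilling or deleting it.
func assertReplicaIDs(replicas []persistedReplica) error {
	for _, r := range replicas {
		if r.raftReplicaID == 0 {
			return fmt.Errorf("r%d: no persisted RaftReplicaID", r.rangeID)
		}
	}
	return nil
}

func main() {
	err := assertReplicaIDs([]persistedReplica{
		{rangeID: 1, raftReplicaID: 3},
		{rangeID: 2}, // missing ID trips the assertion
	})
	fmt.Println(err)
}
```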

Resolves #115869.
Touches #95513.
Epic: none
Release note: None

116443: roachtest: port multitenant/shared-process/basic to new APIs r=srosenberg a=herkolategan

Converts multitenant/shared-process/basic to use the new roachprod multitenant APIs.

Fixes: #115868

Epic: CRDB-31933
Release Note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Herko Lategan <[email protected]>