Merge #72795 #73858
72795: storage: interface for ReplicasStorage r=tbg a=sumeerbhola

ReplicasStorage provides an interface to manage the persistent state that
includes the lifecycle of a range replica, its raft log, and the state
machine state. The implementation(s) are expected to be stateless wrappers
around persistent state in the underlying engine(s): any state they
maintain in memory would simply be a performance optimization and would
always be in sync with the persistent state.

We consider the following distinct kinds of persistent state:
- State machine state: It contains all replicated keys: replicated range-id
  local keys, range local keys, range lock keys, lock table keys, global
  keys. This includes the RangeAppliedState and the RangeDescriptor.

- Raft and replica life-cycle state: This includes all the unreplicated
  range-ID local key names prefixed by Raft, and the RangeTombstoneKey.
  We will loosely refer to all of these as "raft state".

The interface requires that any mutation (batch or sst) only touch one of
these kinds of state. This discipline will allow us to eventually separate
the engines containing these two kinds of state. This interface is not
relevant for store local keys though they will be in the latter engine. The
interface does not allow the caller to specify whether to sync a mutation
to the raft log or state machine state -- that decision is left to the
implementation of ReplicasStorage. The hope is that even if we don't
separate the state machine and raft engines, this abstraction will force us
to reason more carefully about the effects of crashes and about when to
sync, and will allow us to test more thoroughly (including "crash" testing
using strict-mem FS).
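
As a rough illustration of the "one kind of state per mutation" discipline
described above, here is a minimal Go sketch. `StateKind`, `Write`, and
`checkSingleKind` are hypothetical names invented for this example (they are
not part of the interface in this PR), and the string keys are placeholders
rather than real encoded keys.

```go
package main

import (
	"errors"
	"fmt"
)

// StateKind is a hypothetical tag for the two kinds of persistent state.
type StateKind int

const (
	StateMachineState StateKind = iota // replicated keys, e.g. RangeAppliedState
	RaftState                          // unreplicated Raft-prefixed keys, RangeTombstoneKey
)

// Write is one key mutation, tagged with the kind of state it touches.
type Write struct {
	Key  string
	Kind StateKind
}

var (
	errEmptyBatch = errors.New("empty batch")
	errMixedKinds = errors.New("batch touches both kinds of state")
)

// checkSingleKind returns the single kind of state a batch touches, or an
// error if the batch is empty or mixes the two kinds.
func checkSingleKind(batch []Write) (StateKind, error) {
	if len(batch) == 0 {
		return 0, errEmptyBatch
	}
	kind := batch[0].Kind
	for _, w := range batch[1:] {
		if w.Kind != kind {
			return 0, errMixedKinds
		}
	}
	return kind, nil
}

func main() {
	ok := []Write{
		{Key: "RangeAppliedState", Kind: StateMachineState},
		{Key: "user-key", Kind: StateMachineState},
	}
	mixed := []Write{
		{Key: "RaftHardState", Kind: RaftState},
		{Key: "user-key", Kind: StateMachineState},
	}
	_, errOK := checkSingleKind(ok)
	_, errMixed := checkSingleKind(mixed)
	fmt.Println("single-kind batch accepted:", errOK == nil)
	fmt.Println("mixed batch rejected:", errMixed != nil)
}
```

A check of this shape is what would let the two kinds of state later be routed
to separate engines without changing callers.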

ReplicasStorage does not interpret most of the data in the state machine.
It expects mutations to that state to be provided as an opaque batch, or a
set of files to be ingested. There are a few exceptions where it can read
state machine state, mainly when recovering from a crash, so as to make
changes to get to a consistent state.
- RangeAppliedStateKey: needs to read this in order to truncate the log,
  both as part of regular log truncation and on crash recovery.
- RangeDescriptorKey: needs to read this to discover ranges whose state
  machine state needs to be discarded on crash recovery.
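
To make the RangeAppliedStateKey dependency concrete, here is a minimal Go
sketch of log truncation bounded by the applied index. It assumes (this is
not stated above) that truncation must never discard entries beyond the
applied index; `Entry`, `safeTruncateIndex`, and `truncate` are hypothetical
names for illustration only.

```go
package main

import "fmt"

// Entry is a hypothetical raft log entry; only the index matters here.
type Entry struct {
	Index uint64
}

// safeTruncateIndex caps a requested truncation index at the applied index,
// so entries that have not been applied are never discarded (an assumption
// of this sketch).
func safeTruncateIndex(requested, applied uint64) uint64 {
	if requested > applied {
		return applied
	}
	return requested
}

// truncate returns the entries whose index is strictly greater than upTo.
func truncate(log []Entry, upTo uint64) []Entry {
	var kept []Entry
	for _, e := range log {
		if e.Index > upTo {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	log := []Entry{{5}, {6}, {7}, {8}, {9}}
	applied := uint64(7) // stands in for the index read from RangeAppliedState
	kept := truncate(log, safeTruncateIndex(8, applied))
	fmt.Println("first retained index:", kept[0].Index)
}
```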

A corollary to this lack of interpretation is that reads of the state
machine are not handled by this interface, though it does expose some
metadata in case the reader wants to be sure that the range it is trying to
read actually exists in storage. ReplicasStorage also does not offer an
interface to construct changes to the state machine state. It simply
applies changes, and requires the caller to obey some simple invariants to
not cause inconsistencies. It is aware of the keyspace occupied by a range
and the difference between rangeID keys and range keys -- it needs this
awareness to restore internal consistency when initializing (say after a
crash), by clearing the state machine state for replicas that should no
longer exist.
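
The initialization-time cleanup described above could look roughly like the
following Go sketch. It assumes that "should no longer exist" means "has a
RangeDescriptor in the state machine but no live raft state"; that criterion,
and the `Store` type with its fields and methods, are hypothetical and only
meant to illustrate the idea.

```go
package main

import (
	"fmt"
	"sort"
)

type RangeID int64

// Store is a toy stand-in for the underlying engine(s).
type Store struct {
	Descriptors  map[RangeID]bool     // ranges with a RangeDescriptor in the state machine
	LiveReplicas map[RangeID]bool     // ranges with live raft state (assumed criterion)
	Data         map[RangeID][]string // state machine keys per range
}

// orphanedRanges returns, in ascending order, the ranges that have state
// machine state but no live replica.
func (s *Store) orphanedRanges() []RangeID {
	var orphans []RangeID
	for id := range s.Descriptors {
		if !s.LiveReplicas[id] {
			orphans = append(orphans, id)
		}
	}
	sort.Slice(orphans, func(i, j int) bool { return orphans[i] < orphans[j] })
	return orphans
}

// clearOrphans discards the state machine state of orphaned ranges and
// returns the IDs it cleared.
func (s *Store) clearOrphans() []RangeID {
	orphans := s.orphanedRanges()
	for _, id := range orphans {
		delete(s.Descriptors, id)
		delete(s.Data, id)
	}
	return orphans
}

func main() {
	s := &Store{
		Descriptors:  map[RangeID]bool{1: true, 2: true, 3: true},
		LiveReplicas: map[RangeID]bool{1: true, 3: true},
		Data:         map[RangeID][]string{1: {"a"}, 2: {"b", "c"}, 3: {"d"}},
	}
	fmt.Println("cleared:", s.clearOrphans())
}
```

Running the cleanup a second time finds nothing to clear, which is the kind of
idempotence one would want from a recovery step that may itself be interrupted.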

ReplicasStorage does interpret the raft state (all the unreplicated
range-ID local key names prefixed by Raft), and the RangeTombstoneKey. This
is necessary for it to be able to maintain invariants spanning the raft log
and the state machine (related to raft log truncation, replica lifetime,
etc.), including reapplying raft log entries to the state machine on
restart. All accesses (read or write) to the raft log and RangeTombstoneKey
must happen via ReplicasStorage.
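
Reapplying raft log entries on restart can be sketched as follows. This is a
simplified Go model, not the actual implementation: `Entry`, `StateMachine`,
and `replay` are hypothetical names, the log is assumed to be sorted by index,
and the committed index is assumed to be known from the raft state.

```go
package main

import "fmt"

// Entry is a hypothetical raft log entry.
type Entry struct {
	Index uint64
	Cmd   string
}

// StateMachine is a toy state machine. Applied stands in for the index
// stored in RangeAppliedState, which lives in state machine state.
type StateMachine struct {
	Applied uint64
	Cmds    []string
}

// replay applies, in log order, every entry that is committed but not yet
// applied, and returns how many entries it applied. The log is assumed to
// be sorted by Index.
func replay(sm *StateMachine, log []Entry, committed uint64) int {
	n := 0
	for _, e := range log {
		if e.Index <= sm.Applied || e.Index > committed {
			continue // already applied, or not yet committed
		}
		sm.Cmds = append(sm.Cmds, e.Cmd)
		sm.Applied = e.Index // advanced together with the mutation
		n++
	}
	return n
}

func main() {
	sm := &StateMachine{Applied: 2}
	log := []Entry{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}, {5, "e"}}
	n := replay(sm, log, 4)
	fmt.Println("replayed:", n, "applied index:", sm.Applied)
}
```

Because the applied index advances with each applied entry, replaying the same
log again applies nothing.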

Since this abstraction is mutating the same underlying engine state that
was previously mutated via lower-level interfaces, and is not a
data structure in the usual sense, we should be able to migrate callers
incrementally to use this interface. That is, callers that use this
interface, and those that use the lower-level engine interfaces could
co-exist correctly.

Informs #38322

Release note: None

73858: rttanalysisccl: adjust alter primary region bench r=RichardJCai a=rafiss

fixes #73775

This is flaky, and less is better anyway.

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
3 people committed Dec 15, 2021
3 parents 5e70321 + 5699d37 + 861d5f9 commit ef789b1
Showing 3 changed files with 848 additions and 1 deletion.
@@ -1,6 +1,6 @@
 exp,benchmark
 21,AlterPrimaryRegion/alter_empty_database_alter_primary_region
-26,AlterPrimaryRegion/alter_empty_database_set_initial_primary_region
+25-26,AlterPrimaryRegion/alter_empty_database_set_initial_primary_region
 21,AlterPrimaryRegion/alter_populated_database_alter_primary_region
 27,AlterPrimaryRegion/alter_populated_database_set_initial_primary_region
 20,AlterRegions/alter_empty_database_add_region
2 changes: 2 additions & 0 deletions pkg/storage/BUILD.bazel
@@ -27,6 +27,7 @@ go_library(
 "pebble_iterator.go",
 "pebble_merge.go",
 "pebble_mvcc_scanner.go",
+"replicas_storage.go",
 "resource_limiter.go",
 "row_counter.go",
 "slice.go",
@@ -82,6 +83,7 @@ go_library(
 "@com_github_dustin_go_humanize//:go-humanize",
 "@com_github_elastic_gosigar//:gosigar",
 "@com_github_gogo_protobuf//proto",
+"@io_etcd_go_etcd_raft_v3//raftpb",
 ],
 )
