storage: interface for ReplicasStorage
ReplicasStorage provides an interface to manage the persistent state that includes the lifecycle of a range replica, its raft log, and the state machine state. The implementation(s) are expected to be a stateless wrapper around persistent state in the underlying engine(s) (any state they maintain in-memory would be simply a performance optimization and always be in-sync with the persistent state).

We consider the following distinct kinds of persistent state:
- State machine state: It contains all replicated keys: replicated range-id local keys, range local keys, range lock keys, lock table keys, global keys. This includes the RangeAppliedState and the RangeDescriptor.
- Raft and replica life-cycle state: This includes all the unreplicated range-ID local key names prefixed by Raft, and the RangeTombstoneKey. We will loosely refer to all of these as "raft state".

The interface requires that any mutation (batch or sst) only touch one of these kinds of state. This discipline will allow us to eventually separate the engines containing these two kinds of state. This interface is not relevant for store local keys though they will be in the latter engine.

The interface does not allow the caller to specify whether to sync a mutation to the raft log or state machine state -- that decision is left to the implementation of ReplicasStorage. So the hope is that even when we don't separate the state machine and raft engines, this abstraction will force us to reason more carefully about effects of crashes, and when to sync, and allow us to test more thoroughly (including "crash" testing using strict-mem FS).

ReplicasStorage does not interpret most of the data in the state machine. It expects mutations to that state to be provided as an opaque batch, or a set of files to be ingested. There are a few exceptions where it can read state machine state, mainly when recovering from a crash, so as to make changes to get to a consistent state.
- RangeAppliedStateKey: needs to read this in order to truncate the log, both as part of regular log truncation and on crash recovery.
- RangeDescriptorKey: needs to read this to discover ranges whose state machine state needs to be discarded on crash recovery.

A corollary to this lack of interpretation is that reads of the state machine are not handled by this interface, though it does expose some metadata in case the reader wants to be sure that the range it is trying to read actually exists in storage.

ReplicasStorage also does not offer an interface to construct changes to the state machine state. It simply applies changes, and requires the caller to obey some simple invariants to not cause inconsistencies. It is aware of the keyspace occupied by a range and the difference between rangeID keys and range keys -- it needs this awareness to restore internal consistency when initializing (say after a crash), by clearing the state machine state for replicas that should no longer exist.

ReplicasStorage does interpret the raft state (all the unreplicated range-ID local key names prefixed by Raft), and the RangeTombstoneKey. This is necessary for it to be able to maintain invariants spanning the raft log and the state machine (related to raft log truncation, replica lifetime etc.), including reapplying raft log entries on restart to the state machine. All accesses (read or write) to the raft log and RangeTombstoneKey must happen via ReplicasStorage.

Since this abstraction is mutating the same underlying engine state that was previously mutated via lower-level interfaces, and is not a data-structure in the usual sense, we should be able to migrate callers incrementally to use this interface. That is, callers that use this interface, and those that use the lower-level engine interfaces could co-exist correctly.

Informs #38322

Release note: None