raft: separate log and state storage logically #132030

Open
pav-kv opened this issue Oct 5, 2024 · 0 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)


In service of the separate raft log [#16624] and witness projects, the best results can likely be achieved with support from the raft package. While the log and state storage in CRDB physically sit in the same storage engine, it is possible to start separating them logically in raft, and to introduce the assumption that the two can operate asynchronously.

When the two storages are separated, the model becomes:

  • The log storage supports all the functions of an "acceptor" (voting+log). It requires regular fsyncs so that entries accepted into it can be committed at the "proposer"/"learner" level (when a sufficient quorum is collected).
  • The state storage backs the "learner" concept, and is updated with committed entries and snapshots. It does not require eager fsyncs, except in some limited circumstances (TBD; might include things like config changes). It still has to fsync somewhat regularly so that the RawNode can report the durable applied state, which feeds into log compaction decisions and prevents long apply catch-ups upon restart.
  • Due to the physical separation and fsync asynchrony between the two storages, they may be out of sync (one "happens before" the other, or vice versa) when the RawNode restarts. It must be possible to reconstruct a correct in-memory raft state from the initial state read from both storages.
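
To make the restart case in the last bullet concrete, here is a minimal sketch of how the durable state read from the two storages might be classified. None of these names exist in the raft package today: `DurableLog`, `DurableState`, `Relation`, and `Reconcile` are hypothetical, and the handling suggested in the comments is an assumption, not a settled design.

```go
package main

// DurableLog is a hypothetical summary of what the log storage (the
// "acceptor" side) reports as durable after a restart.
type DurableLog struct {
	LastIndex uint64 // index of the last durable log entry
	Commit    uint64 // highest index durably known to be committed
}

// DurableState is a hypothetical summary of what the state storage (the
// "learner" side) reports as durable after a restart.
type DurableState struct {
	Applied uint64 // index of the last durably applied entry or snapshot
}

// Relation describes which storage is ahead of the other.
type Relation int

const (
	InSync     Relation = iota // nothing to replay
	LogAhead                   // committed entries are not yet durably applied
	StateAhead                 // applied state is past the end of the durable log
)

// Plan is the outcome of reconciling the two storages.
type Plan struct {
	Relation   Relation
	Commit     uint64 // commit index to start the in-memory state with
	ReplayFrom uint64 // first entry to re-apply (only set for LogAhead)
	ReplayTo   uint64 // last entry to re-apply (only set for LogAhead)
}

// Reconcile compares the durable state of both storages. Applied entries
// are committed by definition, so the starting commit index is never
// below Applied.
func Reconcile(l DurableLog, s DurableState) Plan {
	commit := l.Commit
	if s.Applied > commit {
		commit = s.Applied
	}
	switch {
	case s.Applied > l.LastIndex:
		// Assumption: this can happen if, for example, a snapshot became
		// durable in the state storage before the matching log write did.
		return Plan{Relation: StateAhead, Commit: commit}
	case s.Applied < l.Commit:
		// The state storage lags; entries (Applied, Commit] can be re-applied.
		return Plan{Relation: LogAhead, Commit: commit,
			ReplayFrom: s.Applied + 1, ReplayTo: l.Commit}
	default:
		return Plan{Relation: InSync, Commit: commit}
	}
}
```

For example, with a durable log ending at index 10, a durable commit index of 8, and a durable applied index of 5, this sketch reports `LogAhead` and asks for entries 6 through 8 to be re-applied.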

To support the above, both storages have to provide logical clocks (HardState-style) that make it possible to compare the two states. Initially, both can be sourced from the unified HardState that we have today (which means they will always be in sync), and eventually they can become asynchronous at the physical level.
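
As an illustration of the comparison, the sketch below assumes a logical clock is a (term, index) pair ordered lexicographically. That shape is an assumption; the issue only says the clocks are "HardState-style". `Clock`, `Compare`, and `ClocksFromHardState` are hypothetical names, and `HardState` here is a simplified local stand-in for the real message.

```go
package main

// HardState is a simplified local stand-in for raft's HardState.
type HardState struct {
	Term   uint64
	Vote   uint64
	Commit uint64
}

// Clock is a hypothetical logical clock attached to a storage.
// Assumption: it is a (term, index) pair ordered lexicographically.
type Clock struct {
	Term  uint64
	Index uint64
}

// Compare returns -1 if c is behind o, +1 if c is ahead of o, and 0 if the
// two clocks are equal.
func (c Clock) Compare(o Clock) int {
	switch {
	case c.Term < o.Term:
		return -1
	case c.Term > o.Term:
		return 1
	case c.Index < o.Index:
		return -1
	case c.Index > o.Index:
		return 1
	default:
		return 0
	}
}

// ClocksFromHardState models the initial step: both clocks are sourced
// from the single unified HardState, so they always compare as equal.
func ClocksFromHardState(hs HardState) (logClock, stateClock Clock) {
	c := Clock{Term: hs.Term, Index: hs.Commit}
	return c, c
}
```

While both clocks come from the same `HardState`, `Compare` always returns 0. A non-zero result only becomes possible once the two storages persist their clocks independently.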

Doing the logical separation first makes it possible to test it extensively in raft datadriven tests long before the actual physical separation happens. The CRDB-specific aspects of the physical separation would be added on top.

Jira issue: CRDB-42792

@pav-kv pav-kv added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels Oct 5, 2024