raft: clean unstable log early #122438
Comments
cc @cockroachdb/replication
@sumeerbhola With RACv2, this issue is still applicable, but to a lesser degree. We admit / return tokens when the entry is both synced and admitted, so the "admission" happens at step 4 of the flow, or later. After this, token returns to the leader are racing with step 5. If step 5 is delayed (e.g. the node is overloaded or has a scheduling latency issue), token returns can outpace the `unstable` cleanup. This is probably a rare case.
Agreed. I don't think we should bother doing something here unless we actually see a problem. AC (modulo deficiencies) is supposed to keep goroutine scheduling latency within some acceptable bound. And if the raft scheduler is clogged up, that is a broader problem than this one.
Background

The `unstable` log structure in `pkg/raft` holds log entries until they have been written to storage and fsync-ed. After the introduction of async log writes, the flow of entries from memory to `Storage` is:

1. Entries are appended to `unstable`.
2. In `handleRaftReady`, the `unstable` entries are extracted and paired with a `MsgStorageAppend` message.
3. The entries are written to Pebble.
4. The writes are synced, and `MsgStorageAppendResp` responses are delivered back to the raft instance.
5. Raft clears the acknowledged entries from `unstable`.[^1]
Improvement
In this flow, there is a period of time (between steps 3-5) when an entry has already been written to Pebble and sits in memtables, but still resides in the `unstable` struct. When async writes are enabled, this can last for multiple `Ready` iterations. Holding these entries in `unstable` is not strictly necessary, because they are already readable from the log `Storage`. We should clear them in step (3). This will, effectively, become a "transfer" of entries from `unstable` to `Storage`.

In Replication AC, entry tokens are admitted and returned to the leader in step (3), too. Clearing the `unstable` entries at this point effectively includes them in the replication token "lifetime", and protects the node from OOMs caused by `unstable` build-ups.

The modification will be along the lines of having a new method/message to raft saying that some/all entries in `unstable` have been (non-durably) written, so raft can clear them. There can be some complications in the interaction with the async writes protocol.

Alternatively, we can go all in on the "transfer" semantics, and remove entries from `unstable` when `Ready` returns them. We would still need to deliver "acks" to raft when entries are synced.

Jira issue: CRDB-37890
Epic: CRDB-37515
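A minimal sketch of what the "new method/message" variant could look like. The `logWritten` name and its shape are assumptions for illustration, not the actual protocol; the sketch also ignores the in-flight-overwrite caveat from the footnote, which a real implementation must handle.

```go
package main

import "fmt"

// Entry is a simplified raft log entry.
type Entry struct {
	Index uint64
	Data  string
}

// unstable holds entries not yet handed off to Storage.
type unstable struct {
	entries []Entry
}

// logWritten is the hypothetical new notification: entries up to index have
// been (non-durably) written to Pebble and are readable from Storage, so raft
// may drop them from unstable immediately, before the sync completes. This is
// the "transfer" of ownership from unstable to Storage described above.
func (u *unstable) logWritten(index uint64) {
	kept := u.entries[:0]
	for _, e := range u.entries {
		// Keep only entries not yet covered by the written prefix.
		if e.Index > index {
			kept = append(kept, e)
		}
	}
	u.entries = kept
}

func main() {
	u := &unstable{entries: []Entry{{1, "a"}, {2, "b"}, {3, "c"}}}

	// Step 3: entries up to index 2 were written to Pebble (not yet synced).
	// Clearing them here shrinks unstable multiple Ready iterations earlier
	// than waiting for the MsgStorageAppendResp in step 4/5 would.
	u.logWritten(2)

	fmt.Println(len(u.entries)) // only index 3 remains
}
```

The sync acks would still flow back to raft as before; they would just no longer be what gates the memory release.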
Footnotes

[^1]: Some entries may have already been cleared from `unstable` by this time, e.g. if the leadership changed and the new leader has overwritten some entries. We only remove the entries that are guaranteed to be matched by storage and have no in-flight appends overwriting them. See this comment for some details.