raft: clean unstable log early #122438

Open

pav-kv opened this issue Apr 16, 2024 · 4 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

pav-kv (Collaborator) commented Apr 16, 2024

Background

The unstable log structure in pkg/raft holds log entries until they have been written to storage and fsync-ed.

After the introduction of async log writes, the flow of entries from memory to Storage is as follows (a simplified sketch follows the list):

  1. Entries are appended to unstable.
  2. On handleRaftReady, the unstable entries are extracted and paired with a MsgStorageAppend message.
  3. The batch of entries is written to Pebble.
    • If async log writes are enabled, and the batch qualifies for an async write, the batch is written to Pebble, but not synced.
    • If the write doesn't qualify for async write, the entries are written and synced.
  4. When the entries have been synced, Pebble invokes a completion callback that sends a MsgStorageAppendResp response back to the raft instance.
  5. When handling the append response, raft removes the entries from unstable.[1]
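
For illustration, here is a minimal, self-contained Go model of this flow. It is only a sketch: it does not use the real pkg/raft or Pebble APIs, and the entry, unstableLog, and storage types below are hypothetical stand-ins for the unstable struct and the log Storage.

```go
package main

import "fmt"

type entry struct {
	index uint64
	data  string
}

// unstableLog is a toy stand-in for the unstable struct in pkg/raft.
type unstableLog struct {
	entries []entry
}

// storage is a toy stand-in for the durable log Storage (Pebble).
type storage struct {
	entries []entry
	synced  uint64 // highest index known to be fsync-ed
}

func main() {
	u := &unstableLog{}
	s := &storage{}

	// Step 1: entries are appended to unstable.
	u.entries = append(u.entries, entry{1, "a"}, entry{2, "b"})

	// Step 2: on handleRaftReady, the unstable entries are extracted and
	// paired with a MsgStorageAppend-style request.
	toAppend := append([]entry(nil), u.entries...)

	// Step 3: the batch is written to Pebble (memtable); with async log
	// writes it is not synced yet. The entries are now readable from
	// storage, but today they also remain in unstable.
	s.entries = append(s.entries, toAppend...)

	// Step 4: the sync completes later; a callback delivers a
	// MsgStorageAppendResp-style ack back to the raft instance.
	s.synced = toAppend[len(toAppend)-1].index

	// Step 5: only now does raft truncate unstable up to the acked index.
	var kept []entry
	for _, e := range u.entries {
		if e.index > s.synced {
			kept = append(kept, e)
		}
	}
	u.entries = kept

	fmt.Printf("unstable=%v storage=%v synced=%d\n", u.entries, s.entries, s.synced)
}
```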

Improvement

In this flow, there is a period of time (between steps 3 and 5) when an entry has already been written to Pebble and sits in the memtables, but still resides in the unstable struct. When async writes are enabled, this can last for multiple Ready iterations. Holding these entries in unstable is not strictly necessary, because they are already readable from the log Storage. We should clear them in step (3). This would effectively become a "transfer" of entries from unstable to Storage.
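
Continuing the toy model above (not real pkg/raft code), the proposed change roughly amounts to truncating unstable right after the unsynced write in step 3, instead of waiting for the sync ack in step 5. The method name clearWritten is hypothetical.

```go
// clearWritten is a hypothetical variant of step 3 from the sketch above:
// once the batch has been written to Pebble, the same entries are readable
// from Storage even before the sync completes, so unstable can drop them
// immediately. The sync ack in steps 4-5 is then only needed for raft's
// durability bookkeeping, not for freeing memory.
func (u *unstableLog) clearWritten(upTo uint64) {
	var kept []entry
	for _, e := range u.entries {
		if e.index > upTo {
			kept = append(kept, e)
		}
	}
	u.entries = kept
}
```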

In Replication AC, entries are admitted and their tokens returned to the leader at step (3) as well. Clearing the unstable entries at this point effectively includes them in the replication token "lifetime", and protects the node from OOMs caused by a build-up in unstable.

The modification would be along the lines of a new method/message to raft saying that some or all entries in unstable have been (non-durably) written, so that raft can clear them. There may be some complications in the interaction with the async writes protocol.
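
One possible shape for such an API, purely as a sketch on top of the toy model above; LogMark, rawNode, and LogWritten are hypothetical names, not existing pkg/raft API:

```go
// LogMark is a hypothetical (term, index) pair identifying how far the log
// has been written under a given leader term, so that a write that has since
// been superseded by a newer leader's entries is not mistaken for current.
type LogMark struct {
	Term  uint64
	Index uint64
}

// rawNode is a toy stand-in for raft's RawNode.
type rawNode struct {
	unstable *unstableLog
}

// LogWritten is a hypothetical new method (or, equivalently, a new local
// message) telling raft that all entries up to mark have been written to
// Storage, possibly without a sync. A real implementation would have to
// verify the mark against the async-writes protocol (e.g. that no in-flight
// append will overwrite these indices) before clearing unstable.
func (rn *rawNode) LogWritten(mark LogMark) {
	rn.unstable.clearWritten(mark.Index)
}
```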

Alternatively, we could go all in on the "transfer" semantics and remove entries from unstable as soon as Ready returns them. We would still need to deliver "acks" to raft when entries are synced.
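
Under this alternative, step 2 of the toy model would itself move the entries out of unstable, with the sync ack still delivered separately. Again a hypothetical sketch, not an existing API:

```go
// extractForAppend is a hypothetical "transfer" variant of step 2: when Ready
// hands the entries to the storage append, unstable drops them at the same
// time. Raft would still need the sync ack from step 4 to learn that the
// entries are durable, even though they are no longer held in memory by
// unstable.
func (u *unstableLog) extractForAppend() []entry {
	out := u.entries
	u.entries = nil
	return out
}
```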

Jira issue: CRDB-37890

Epic CRDB-37515

Footnotes

  [1] Some entries may have already been cleared from unstable by this time, e.g. if the leadership changed and the new leader has overwritten some entries. We only remove the entries that are guaranteed to be matched by storage and that have no in-flight appends overwriting them. See this comment for some details.

blathers-crl bot commented Apr 16, 2024

cc @cockroachdb/replication

pav-kv (Collaborator, author) commented Oct 2, 2024

@sumeerbhola With RACv2, this issue is still applicable, but to a lesser degree.

With RACv2 we return tokens when the entry is both synced and admitted, so the "admission" happens at step 4 of the flow, or later. After this, token returns to the leader race with step 5. If step 5 is delayed (e.g. the node is overloaded, has scheduling latency issues, or is slow at Ready handling), the token return can outpace step 5, and a subsequent MsgApp arrives before the RawNode has cleared the synced entries from unstable. More entries then pile up in unstable.

This is probably a rare case.

sumeerbhola (Collaborator) commented

> @sumeerbhola With RACv2, this issue is still applicable, but to a lesser degree.

Agreed. I don't think we should bother doing something here unless we actually see a problem. AC (modulo deficiencies) is supposed to keep goroutine scheduling latency within some acceptable bound. And if the raft scheduler is clogged up because Ready handling is slow, say due to slow storage reads, that will also affect other ranges, so we would need to find a solution for that anyway.
