Enhancement: persist commit index in LogStore to accelerate recovery #549

Open

banks opened this issue Mar 15, 2023 · 7 comments

banks (Member) commented Mar 15, 2023

Today I realised my mental model of how our Raft library works is subtly wrong.

I assumed that when a node restarts, it loads the most recent snapshot into the FSM and replays committed logs from disk into the FSM before rejoining the cluster.

In fact, we only load from the snapshot, since we currently don't persist any information about which logs are known to be committed, which means that replaying them all might apply a few uncommitted ones from the end of the log.

In practice this works OK in most cases: if the node is elected leader, it writes a new log entry which, as soon as it's committed, triggers applying all the other committed logs into the FSM too. If it's a follower, it doesn't catch up its FSM until the first AppendEntries RPC from the leader tells it the current commit index in the cluster.

As we consider reducing snapshot frequency now that the WAL LogStore can store many more logs performantly, this will become more of an issue, because followers will be much more "stale" when they start serving requests.

To fix this correctly, we'd need to persist the last known commit index so that we know which prefix of the logs is safe to apply on startup. We could periodically persist that to the StableStore, but that increases the number of disk transactions and trades write volume against how recent the index stays. Currently StableStore performance is not an issue because it's only updated infrequently during elections; this would make it much more performance-critical.
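
For reference, the StableStore interface this would be written to looks roughly like this in the library (key names are chosen by the caller):

    type StableStore interface {
        Set(key []byte, val []byte) error
        Get(key []byte) ([]byte, error)
        SetUint64(key []byte, val uint64) error
        GetUint64(key []byte) (uint64, error)
    }

So the simple approach is a periodic SetUint64 of the commit index, at the cost of one extra disk transaction per update.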

An alternative that would be more efficient, but a little more complicated, would be the following (a rough code sketch follows the list):

  1. Introduce a new optional interface called something like:
    type CommitTrackingLogStore interface {
        SetCommitIndex(idx uint64) error
        ReadCommitIndex() (uint64, error)
    }
  2. Have Raft detect whether the LogStore also implements this just before calling StoreLogs (during replication or the leader's dispatchLogs); if it does, call SetCommitIndex first with the most recent commitIndex we know.
  3. Implementations may store that index in memory until the next StoreLogs call and then write it out to disk along with the log data.
  4. When the LogStore recovers its state on open, it would recover the last persisted commit index and load it into memory.
  5. When Raft starts up it can restore the snapshot as it does now, then...
  6. Check whether the LogStore implements CommitTrackingLogStore. If it does, load the commit index from the store and replay all logs from lastSnapshotIndex up to and including the persisted commit index, before completing startup.
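
To make the detection and startup replay concrete, here is a rough sketch of steps 2 and 6. This is illustrative only: the CommitTrackingLogStore interface does not exist in the library today, and the helper names (storeLogsWithCommitIndex, replayCommitted) are made up.

    // Illustrative sketch only; none of this exists in hashicorp/raft today.
    package sketch

    import "github.com/hashicorp/raft"

    // CommitTrackingLogStore is the proposed optional interface from step 1.
    type CommitTrackingLogStore interface {
        raft.LogStore
        SetCommitIndex(idx uint64) error
        ReadCommitIndex() (uint64, error)
    }

    // Step 2: just before StoreLogs, hand the store the latest known commit
    // index so implementations can persist it alongside the next batch of logs.
    func storeLogsWithCommitIndex(s raft.LogStore, commitIndex uint64, logs []*raft.Log) error {
        if cs, ok := s.(CommitTrackingLogStore); ok {
            if err := cs.SetCommitIndex(commitIndex); err != nil {
                return err
            }
        }
        return s.StoreLogs(logs)
    }

    // Step 6: after restoring the snapshot on startup, replay only the prefix
    // of the log that is known to be committed.
    func replayCommitted(s raft.LogStore, lastSnapshotIndex uint64, apply func(*raft.Log) error) error {
        cs, ok := s.(CommitTrackingLogStore)
        if !ok {
            return nil // store doesn't track the commit index; keep current behaviour
        }
        commitIndex, err := cs.ReadCommitIndex()
        if err != nil {
            return err
        }
        for idx := lastSnapshotIndex + 1; idx <= commitIndex; idx++ {
            var l raft.Log
            if err := s.GetLog(idx, &l); err != nil {
                return err
            }
            if err := apply(&l); err != nil {
                return err
            }
        }
        return nil
    }
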
otoolep (Contributor) commented Jun 22, 2023

This would be very useful, as rqlite (which uses this library) can also encounter long start-up times. Knowing which logs not contained in the latest snapshot are actually committed would solve this problem: I could optimize replay at start-up.

Is there any way to detect which are committed at this time? I may modify my own fork of Raft if needed, until this repo makes your suggested improvement (or something like it). It doesn't have to be perfect; even knowing, say, 75% of which logs were definitely committed would help. So calling SetCommitIndex even every so often would be helpful.

otoolep (Contributor) commented Jun 22, 2023

Though I might just add a goroutine to my code which polls raft.AppliedIndex() every few seconds and writes it to disk. It doesn't have to be up to date, just recent enough to make a difference to start-up times, but primarily to make sure I don't go too far in the Raft log during replay and apply uncommitted entries.
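
A minimal sketch of that polling approach, assuming a *raft.Raft handle and any StableStore-style sink (the five-second interval and key name are arbitrary; this is not rqlite's actual code):

    package sketch

    import (
        "context"
        "log"
        "time"

        "github.com/hashicorp/raft"
    )

    // persistAppliedIndex periodically writes raft.AppliedIndex() to stable
    // storage. On restart, every log entry at or below the stored index is
    // known to be committed, so it is safe to replay up to that point.
    func persistAppliedIndex(ctx context.Context, r *raft.Raft, store raft.StableStore) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                if err := store.SetUint64([]byte("last_applied_index"), r.AppliedIndex()); err != nil {
                    log.Printf("failed to persist applied index: %v", err)
                }
            }
        }
    }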

banks (Member, Author) commented Nov 20, 2023

@otoolep sorry I didn't see your comments here until now.

Yeah, what you proposed would work too, either inside the library or from the outside. It would be easy enough to trade off how stale things might be on startup against how frequently you write to disk; even writing to disk once per second would have minimal impact on modern SSDs while still bounding the staleness after startup to just one second.

If you do try this, please update here so we can hear about your experience. I'd still like the library to handle this eventually, and it's not that difficult to do, but I have no idea when we'd be able to prioritize that.

otoolep (Contributor) commented Nov 21, 2023

Yes, that's what I did for a while -- added a goroutine that wrote the applied index to BoltDB every few seconds. Then, at start-up, rqlite would put its SQLite database in a "fast-update" mode while applying logs I knew for a fact were applied (using the index retrieved from BoltDB at startup). Once that was done, rqlite would switch its SQLite instance back to a more conservative update strategy (which results in slower writes, however).

However, I've since backed out the change, and rqlite now just runs SQLite in the "fast update" mode (PRAGMA synchronous=OFF) all the time. I did this because the SQLite docs state that the worst that could happen if rqlite were to crash with SQLite in this mode is a corrupt SQLite database. But since rqlite deletes and rebuilds the entire DB from scratch on restart, this potential for corruption is a non-issue. So I've got fast start-up times now in all circumstances.

TBH I should have run SQLite with PRAGMA synchronous=OFF from the very start of my project's development, but I wasn't aware of it at the time.
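
For illustration, switching SQLite into that mode from Go might look like this; the mattn/go-sqlite3 driver and file name are assumptions for the example, not what rqlite actually uses:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3" // assumed driver, for illustration only
    )

    func main() {
        db, err := sql.Open("sqlite3", "state.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Trade durability for write speed: a crash can leave the file corrupt,
        // which is acceptable when the whole DB is rebuilt from the Raft
        // snapshot and log on restart anyway.
        if _, err := db.Exec("PRAGMA synchronous=OFF"); err != nil {
            log.Fatal(err)
        }
    }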

lalalalatt commented Aug 24, 2024

@banks Hi, I'm interested in this and can help with it.

  1. Introduce a new optional interface called something like:
    type CommitTrackingLogStore interface {
        SetCommitIndex(idx uint64) error
        ReadCommitIndex() (uint64, error)
    }
  2. Have Raft detect whether the LogStore also implements this just before calling StoreLogs (during replication or the leader's dispatchLogs); if it does, call SetCommitIndex first with the most recent commitIndex we know.
  3. Implementations may store that index in memory until the next StoreLogs call and then write it out to disk along with the log data.
  4. When the LogStore recovers its state on open, it would recover the last persisted commit index and load it into memory.
  5. When Raft starts up it can restore the snapshot as it does now, then...
  6. Check whether the LogStore implements CommitTrackingLogStore. If it does, load the commit index from the store and replay all logs from lastSnapshotIndex up to and including the persisted commit index, before completing startup.

Maybe we can add a feature flag called --fast-recover (with a warning about its impact); then, in steps 2 and 6, the conditions would also check whether the flag is enabled, and only proceed if both checks pass.
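
Something along these lines could gate the behaviour. This is purely hypothetical: neither the option nor the helper exists in the library, and the final shape would more likely be a Config field that applications expose behind their own --fast-recover flag.

    package sketch

    import "github.com/hashicorp/raft"

    // commitTracker mirrors the optional interface proposed above.
    type commitTracker interface {
        SetCommitIndex(idx uint64) error
        ReadCommitIndex() (uint64, error)
    }

    // options holds the hypothetical opt-in knob.
    type options struct {
        FastRecovery bool
    }

    // useCommitTracking gates steps 2 and 6: commit tracking is only active when
    // the flag is enabled AND the LogStore implements the optional interface.
    func useCommitTracking(opts options, s raft.LogStore) bool {
        if !opts.FastRecovery {
            return false
        }
        _, ok := s.(commitTracker)
        return ok
    }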

banks (Member, Author) commented Aug 27, 2024

@lalalalatt great! Yep, a feature flag (or optional config in this library that apps can enable with a flag) would be a great step toward making it safer to roll out a larger change like this.

If you'd like to work on it, that's awesome; I'd like to see it done. I do want to be transparent that I'm not sure how quickly we could commit to reviewing or merging. Large changes to raft are always full of risk and need close attention even when they are conceptually simple, and it can be hard to prioritize a relatively large amount of review and testing on a PR that isn't directly impacting users right now. That's not to say we're not interested, but I'm hoping to be up-front rather than let you do a lot of work and not be able to get it merged any time soon!

If you would still like to work on this, I suggest breaking it down into pieces and proposing (ideally before spending lots of time writing) a few different PRs that we can review individually against a branch - you'll need to PR the changes to the LogStore implementations separately anyway.

lalalalatt commented
I think we could contribute to https://github.com/hashicorp/raft-boltdb after finishing all milestones in this issue.
