kvserver: allow committing entries not in leader's stable storage #88699

tbg opened this issue Sep 26, 2022 · 9 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments


tbg commented Sep 26, 2022

Is your feature request related to a problem? Please describe.

The entire point of quorum replication is to provide high availability for
writes. As long as a quorum of voters can write to disk & communicate, forward
progress ought to be possible.

Unfortunately in practice, etcd/raft forces the leader to append to its local
log before disseminating new log entries.

A leader with write-degraded storage thus renders the range inoperable to
varying degrees.1 In other words, the blast radius is larger than necessary.

Generally, since users are sensitive even to short blips in latency, avoiding
the high tail latencies of synchronous leader-side disk writes is a good idea.
Treating the leader's disk more symmetrically with the followers' disks helps
with that.

Describe the solution you'd like

Improve the interface of RawNode such that it decouples the operations that
move data to stable storage (voting, log appends, persisting the occasional
commit index) from the other operations (in particular, sending
messages). This requires, or at least strongly suggests, also doing #17500,
since a leader that cannot apply entries that committed without being durable
on its own disk will not release latches, and thus continues to propagate the
effects of its poor disk health to foreground traffic.
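To make the shape of this concrete, here is a minimal sketch of what a decoupled Ready loop could look like. All of the helper names (persist, send, apply, onDurable, appendDone) are hypothetical, and the sketch deliberately glosses over the real Ready/Advance contract, which today requires the local append before the leader may act on the entries:

```go
import (
	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// Hypothetical sketch, not the actual kvserver or etcd/raft API: log appends
// are handed to a dedicated goroutine, while outgoing messages and application
// of already-committed entries proceed without waiting for local durability.
// Handling of incoming messages/proposals that produce new Ready work is elided.
func readyLoop(
	rn *raft.RawNode,
	persist func([]raftpb.Entry, raftpb.HardState), // async; reports on appendDone when durable
	send func([]raftpb.Message),
	apply func([]raftpb.Entry),
	onDurable func(index uint64), // e.g. feed a local "append done" notification back into raft
	appendDone <-chan uint64,
) {
	for {
		for rn.HasReady() {
			rd := rn.Ready()
			// Hand entries (and HardState) to the append goroutine; do not block on fsync.
			persist(rd.Entries, rd.HardState)
			// Send MsgApp etc. immediately; a quorum of followers can commit without us.
			send(rd.Messages)
			// Apply entries committed via other voters even if our own append is in flight.
			apply(rd.CommittedEntries)
			rn.Advance(rd)
		}
		// Block until a local append becomes durable, then let raft know.
		onDurable(<-appendDone)
	}
}
```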

Describe alternatives you've considered

We could more aggressively fail over away from leaders that are not performing
as expected. In a sense, this would be the more "sturdy" alternative, and
possibly a better ROI if the main scenario we're trying to address is
persistent degradations.

Decoupling the various Ready tasks should noticeably smooth out kinks for which
a failover would be too heavy-handed a solution.
It also allows us to markedly improve performance in the steady state.

So we should really do both anyway.

Additional context

#88442
#88596
#17500

Jira issue: CRDB-19840

Epic CRDB-40197

Footnotes

  1. until the disk degrades to the point where the leader fails to heartbeat
    followers at a sufficient frequency, at which point a new leader can be
    elected

tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication labels Sep 26, 2022

blathers-crl bot commented Sep 26, 2022

cc @cockroachdb/replication

@nvanbenschoten

+1 to the premise of this issue, thanks for writing this up.

An interesting observation that came out of our earlier discussion about this is that quorum replication can minimize the effect of fsync latency on writes (since raft commit and state machine application would see the K-th lowest fsync latency where K is the quorum size). However, a slow disk on the leaseholder will not be hidden from a reader attempting to read from that disk (after a block/page cache miss). For that, some form of hedging seems necessary. Strong consistency, even with consistent follower reads, complicates our ability to hedge reads. I don't understand the internals of cloud block storage well enough to know whether variable write latency is more of a problem than variable read latency.
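As an illustration of read hedging in its simplest form, the sketch below fires the read at the local replica and, if it hasn't come back within a (made-up) hedge delay, also fires it at a follower and takes whichever answer arrives first. The Replica interface and the 5ms delay are assumptions for the example, and real follower reads have consistency machinery (closed timestamps) that this ignores:

```go
import (
	"context"
	"time"
)

// Replica is a stand-in interface for the example; not a real CockroachDB type.
type Replica interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// hedgedRead sends the read locally and, after a hedge delay, also to a follower.
// Simplification: the first response wins, even if it is an error.
func hedgedRead(ctx context.Context, key string, local, follower Replica) ([]byte, error) {
	type result struct {
		val []byte
		err error
	}
	resCh := make(chan result, 2) // buffered so neither goroutine leaks blocked
	read := func(r Replica) {
		v, err := r.Get(ctx, key)
		resCh <- result{v, err}
	}
	go read(local)
	select {
	case res := <-resCh:
		return res.val, res.err
	case <-time.After(5 * time.Millisecond): // hypothetical hedge delay
		go read(follower)
	}
	res := <-resCh
	return res.val, res.err
}
```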

It's also worth pointing out in this issue that (as of #38954) we don't wait for Raft entry application before acknowledging a write to the client. However, we do have to wait for the application before releasing that write's latches. As a result, any latency incurred by log application will not be paid by the writer itself, but will be paid by any reader or writer that contends with the writer.


tbg commented Sep 29, 2022

However, a slow disk on the leaseholder will not be hidden from a reader attempting to read from that disk (after a block/page cache miss)

Yes, that's a concern, too. I would suspect that write throughput suffers more frequently than read throughput (since the latter is easier to scale), but both scenarios exist.

I'm not aware of great hedges here. There is always the option of transferring the lease (when there is a suitable target) but that is a very binary option and it's hard to determine the right moment to do so.

It's also worth pointing out in this issue that (as of #38954) we don't wait for Raft entry application before acknowledging a write to the client. However, we do have to wait for the application before releasing that write's latches. As a result, any latency incurred by log application will not be paid by the writer itself, but will be paid by any reader or writer that contends with the writer.

Just to spell this out for myself, in the ideal case, if we have a leader with a slow disk, and we've done all of the work in #17500 with an eye on this, a writer could come in, the write would commit with two followers, the append to the leader would be stuck, but the leader would learn that this entry committed and this would be enough to signal the waiting writer.

What's more, the leader doesn't have to wait for entries to be durable locally before it applies them (let's ignore conf changes and the like, where we can be stricter), assuming they were committed using other voters. The only gotcha is that if the leader crashes, and we have separate engines (= no ordering constraint between log write and state machine apply), we might come up with an AppliedIndex that's ahead of what's actually in the log. We need to prevent this from tripping us up (today we never have such a gap), but the solution is to simply throw away the log, i.e. truncate it to the AppliedIndex. This wouldn't even cause snapshots, since we'll be a follower now, and the more up-to-date log is elsewhere and can catch us up just fine.
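Spelled out, the recovery step would be a startup check along these lines (hypothetical helper names; today the gap cannot occur, so no such check exists):

```go
// Hypothetical sketch: if the applied state got ahead of the durable raft log
// (possible once we apply entries that aren't locally durable), discard the
// log up to the applied index. As a follower, the replica is then caught up
// from the more up-to-date log elsewhere, without needing a snapshot.
func reconcileLogWithAppliedState(
	appliedIndex, lastLogIndex uint64,
	truncateLogTo func(index uint64) error, // assumed helper
) error {
	if appliedIndex <= lastLogIndex {
		return nil // today's invariant: first index ≤ applied index ≤ last index
	}
	// The log is missing entries we have already applied; drop it, i.e.
	// truncate through appliedIndex.
	return truncateLogTo(appliedIndex)
}
```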

One practical issue might be that in order to apply an entry, you have to have the entry. If it's not in the log, it has to be in memory somewhere, and there are limits to that. The raft entry cache is already well suited to this sort of thing, though it isn't clear how to interface with etcd/raft. It needs to tell us about new committed indexes, but today it will only ever communicate indexes for which it also surfaces the entries (i.e. it will never tell us 100 is committed, even if it is, unless it also gives us 100 to apply). So we'd likely want to relax that, i.e. have etcd/raft always surface the highest known commit index, even if that index isn't even in the stable or unstable log. In effect, we've got to take another pair of batteries out of raft (the unstable log).

@nvanbenschoten

So we'd likely want to relax that, i.e. have etcd/raft always surface the highest known commit index, even if that index isn't even in the stable or unstable log. In effect, we've got to take another pair of batteries out of raft (the unstable log).

Could we talk more about this? I understand the problem, but came to the conclusion that the unstable log in etcd/raft is exactly the thing that will help us here, as it means that etcd/raft will have a way to find the entries without needing to rely on CRDB being able to locate them somewhere in the append pipeline before the append completes. It also means that we don't need to open the can of worms that is pinning entries in the raft entry cache and then promising that some subset of the cache will not be evicted.

For context, the way I've adjusted the unstable log in etcd/raft to support #17500 is actually to not change its meaning (though I had to come full circle on this). Entries in the unstable log remain in the unstable log until the caller informs etcd/raft that the entries are stable and can be reliably retrieved from Storage. Here's the comment I added to the struct, if it's helpful (it was to me):

// unstable contains "unstable" log entries and snapshot state that has
// not yet been written to Storage. The type serves two roles. First, it
// holds on to new log entries and an optional snapshot until they are
// handed to a Ready struct for persistence. Second, it continues to
// hold on to this state after it has been handed off to provide raftLog
// with a view of the in-progress log entries and snapshot until their
// writes have been stabilized and are guaranteed to be reflected in
// queries of Storage. After this point, the corresponding log entries
// and/or snapshot can be cleared from unstable.


tbg commented Oct 14, 2022

I understand the problem

Apologies, bringing up the raft entry cache here was premature. My main concern is being able to control memory use across a possibly large number of raft instances in a given process. Anything that lives "inside" raft typically gets a configuration parameter that limits memory usage for that raft instance alone, but this is insufficient for our usage1.

Taking a step back, our memory control issues already start way above raft, when messages enter the raft receive queue. From there they get stepped into RawNode, where they now linger until they've been fully processed and can get GC'ed.
At the same time, RawNode can pull entries from the log into memory, via raft.Storage. And the allocs still get referenced from the outgoing msg queue.

Ideally we can curb and prioritize all of these in a coherent way that additionally allows for prioritization between ranges (system ranges > user ranges).

When a raft message arrives from "outside" (i.e. MsgApp), we could acquire from a mem budget2. The message then enters raft and likely the unstable log, though this is not guaranteed. Finding out when that entry goes out of scope seems really tricky and yet that's what ultimately we'd like to be able to do. Essentially it's out of scope when raft no longer holds a reference to it (i.e. it has left the unstable log, if it even entered it in the first place), and we don't hold a reference to it (it has been made durable, we've sent MsgApp to all followers, and it's been evicted from the raft entry cache). Of course we have the outgoing raft transport which has its own queue, further complicating the lifecycle.

Conceptually, it would be nice to live in one of two worlds, where either raft owns all memory or no memory. If it owns all memory, rather than having a "raft receive queue", we synchronously pass every incoming MsgApp to raft which will either accept it (accounting for it) or force us to drop it. Similarly, we ack every message sent to raft once it has actually been sent so that it can update its accounting. Or raft owns no memory (but we need to synchronize with it quite a bit, which doesn't seem great). So instead of an unstable log, it holds on to some abstraction that we own, and we keep doing raft receive queues, etc, and the abstractions are fallible and may force raft to defer work it was planning on doing. Of course there is a part of allocations that can never be "inside" of raft, namely the creation of new proposals.

This all seems fairly involved and deciding what to do about it isn't in scope for this issue.

But async appends will hold on to memory for longer, so we at least need to make sure that there is some mechanism by which we degrade to sync appends above a certain in-flight size. This could happen outside of raft, by waiting for the append to become durable (thus emptying out the unstable log) before continuing.
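One way to get that fallback outside of raft is a simple byte budget on appends that have been handed off but are not yet durable, along the lines of the sketch below (hypothetical, not an existing kvserver mechanism):

```go
import "sync"

// appendBudget is a hypothetical in-flight append budget: once the limit is
// exceeded, callers block until earlier appends become durable, which in
// effect degrades async appends back to sync appends.
type appendBudget struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inflight int64 // bytes handed off to the append pipeline but not yet durable
	limit    int64
}

func newAppendBudget(limit int64) *appendBudget {
	b := &appendBudget{limit: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks while the budget is exhausted (unless nothing is in flight,
// to let over-sized single appends through rather than deadlocking).
func (b *appendBudget) acquire(bytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.inflight > 0 && b.inflight+bytes > b.limit {
		b.cond.Wait()
	}
	b.inflight += bytes
}

// release is called once an append of the given size has become durable.
func (b *appendBudget) release(bytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.inflight -= bytes
	b.cond.Broadcast()
}
```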

The second comment which was sort of jumbled up with my other one in the post above is that raft "hides" the true commit index from peers; it will only tell them about the one they can actually pull up the entry for (from stable storage). I thought this would have to change but I no longer think so because the committed index is sent on the MsgApp. So we just have to remember it somewhere and it becomes effective only when the append has become durable. So nothing surprising happens here.

Footnotes

  1. not something we do today but we're sort of close.

  2. (if we can't, we need to react; dropping the message isn't particularly helpful, but let's talk about that another time).

tbg added a commit to tbg/cockroach that referenced this issue Mar 17, 2023
This introduces a `vfs.FS` implementation that can be toggled to block
on any write activity.

Add two tests:

1. check that even though a durable commit is inflight, we can still
   write non-durably. I.e. at the pebble level there is no immediate
   reason why we couldn't keep *applying* the raft log even though log
   *appends* are stuck.
2. end-to-end test where we set up a three node cluster, send unique
   writes and stall the leaseholder's disk, hoping for the writes to
   keep going through. This fails, as is expected, because raft does
   not even put up entries for application if they haven't been durably
   appended to the leader's log (which they can't be due to the stall
   we're injecting).

If we can get 2) to pass, we have reason to believe that we can
address cockroachdb#88699 (though we'd need to verify with "real" disk stalls
and higher write volumes/stall durations and would certainly need to
configure pebble to allow much more data to accumulate in memory).

The additional benefit test 2) provides is that it prints who is
blocked. At the time of writing, it's encouraging - we see just
the LogWriter blocked on a call to `Write` (via `flushPending`).

The next step would be to hack in the ability to apply entries
that aren't durable on the leader and to see what we get.

Touches cockroachdb#88699.

Release note: None
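The core of such a toggleable filesystem is just a gate that the write paths wait on. A stripped-down sketch of that idea follows; it uses a made-up minimal writeFile interface instead of the full pebble vfs.FS/vfs.File surface, so it is not the implementation from this commit:

```go
import "sync"

// writeFile is a stand-in for the write-related part of a vfs file.
type writeFile interface {
	Write(p []byte) (int, error)
	Sync() error
}

// writeGate can be toggled to stall all write activity, mimicking a disk stall.
type writeGate struct {
	mu      sync.Mutex
	blocked chan struct{} // non-nil while writes are stalled
}

func (g *writeGate) SetBlocked(blocked bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	switch {
	case blocked && g.blocked == nil:
		g.blocked = make(chan struct{})
	case !blocked && g.blocked != nil:
		close(g.blocked)
		g.blocked = nil
	}
}

func (g *writeGate) wait() {
	g.mu.Lock()
	ch := g.blocked
	g.mu.Unlock()
	if ch != nil {
		<-ch // stall until unblocked
	}
}

// blockableFile wraps a file so that writes and syncs wait on the gate.
type blockableFile struct {
	writeFile
	gate *writeGate
}

func (f *blockableFile) Write(p []byte) (int, error) { f.gate.wait(); return f.writeFile.Write(p) }
func (f *blockableFile) Sync() error                 { f.gate.wait(); return f.writeFile.Sync() }
```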

tbg commented Mar 20, 2023

There is a prototype in #98852 that allows a small number of writes to go through with a stalled disk at the leader. They currently stall because of logSyncQSem, and @jbowens anticipates that the next two roadblocks would be:

once the logSyncQSem semaphore is out of the way, i expect the next two limits are: a) the log writer's 512kb buffers and b) memtable rotation, which can't proceed on a stalled disk.

So it looks like we have a good idea of what needs to be done here.

I'll note that #98852 injects the "disk" stall artificially via a custom vfs.FS, so before we declare that we know how to fix this issue we should verify with a real disk stall. Concretely, any read that hits the disk (i.e. doesn't get served by pebble block cache or OS page cache) might prove fatal. In particular, if we find that a leaseholder cannot evaluate a single command on account of such a read, then it's effectively stalled since it won't produce any new log entries to apply.


tbg commented May 26, 2023

Another gotcha that we would need to address (from this thread): if we're applying entries that are not stable locally, the applied state might actually become durable before the log entries do, and so after a restart we can end up in a new situation where the applied index and the log are not reachable from each other (today, first index ≤ applied index ≤ last index).

jbowens added a commit to cockroachdb/pebble that referenced this issue Jul 24, 2023
Previously, a LogWriter would allocate up to 16 blocks of 32 KiB for buffering
WAL writes. If all 16 blocks had been allocated and no free blocks were
available, a batch writing to the WAL would queue until the flushing goroutine
freed blocks. In testing of write-heavy workloads, especially with larger value
sizes, we've seen queueing at the LogWriter. This queueing blocks the commit
pipeline, preventing any batches from committing regardless of priority and
whether they require waiting for fsync.

This commit modifies LogWriter to allow the queueing of an unbounded number of
blocks. In practice, for the current WAL, the memtable size serves as an upper
bound. With a 64 MiB memtable, at most 64 MiB / 32 KiB = 2,048 blocks may
queue. This is not an unreasonable amount of additional memory overhead for a
write-heavy workload.

Beyond improving throughput for write-heavy workloads, removing this hard bound
improves tolerance of momentary disk stalls.

Informs cockroachdb/cockroach#88699.
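Conceptually, the change swaps "block the commit pipeline when a fixed pool of 16 blocks is exhausted" for "allocate another block and let the flusher drain the queue". A simplified sketch of that shape (not pebble's actual LogWriter code) follows:

```go
import "sync"

const blockSize = 32 << 10 // 32 KiB, matching the description above

type block struct {
	buf [blockSize]byte
	n   int
}

// blockQueue sketches the new buffering behavior: writers never block waiting
// for a free block; they reuse one if available and otherwise allocate. The
// memtable size still bounds how much can accumulate in practice.
type blockQueue struct {
	mu      sync.Mutex
	pending []*block // blocks waiting to be flushed to the WAL (unbounded)
	free    []*block // blocks recycled by the flusher goroutine
}

func (q *blockQueue) getBlock() *block {
	q.mu.Lock()
	defer q.mu.Unlock()
	if n := len(q.free); n > 0 {
		b := q.free[n-1]
		q.free = q.free[:n-1]
		b.n = 0
		return b
	}
	return &block{} // previously: block here until the flusher freed one of 16 blocks
}

func (q *blockQueue) enqueue(b *block) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, b)
}

func (q *blockQueue) recycle(b *block) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.free = append(q.free, b)
}
```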

erikgrinaker commented Aug 23, 2023

some form of hedging seems necessary

Seems generally useful:

@nvanbenschoten

Another gotcha that we would need to address (from this thread): if we're applying entries that are not stable locally, the applied state might actually become durable before the log entries do, and so after a restart we can end up in a new situation where the applied index and the log are not reachable from each other (today, first index ≤ applied index ≤ last index).

In a discussion with @sumeerbhola, we discovered that this is not the problem it first appears to be in today's code. That's because as of #94165 and until we do #94853, raft log writes are synchronously sequenced with state machine writes in the same pebble engine and on the same raft scheduler goroutine. The durability of these raft log writes is async from the perspective of the raft state machine, just like the durability of state machine writes, but pebble ensures that the raft log writes will make it to stable storage before the state machine writes. In other words, even with async log writes and async state machine writes, state machine application will never outpace log writes to stable storage for a given log entry. So first index ≤ applied index ≤ last index will be preserved.

What's more, the synchronous call to CommitNoSyncWait() with each batch of log entries provides immediate read-your-writes visibility, even before the corresponding SyncWait() has returned, so Cockroach should not observe issues with self-referential entry application such as log truncation, which reaches back around and manipulates the raft log. As long as these writes also go through pebble, they don't actually need to assume durability of the log entries that they manipulate. One concern would be self-referential entry application that reaches around pebble, but @pav-kv just took care of the one case I'm aware of with #114191.
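To spell out the pattern being relied on here, a sketch using the batch methods named above (signatures are my best understanding of the pebble API; the surrounding kvserver plumbing is elided):

```go
import "github.com/cockroachdb/pebble"

// appendNoSync sequences a raft log write into the engine without waiting for
// the WAL fsync. The write is immediately visible to reads (read-your-writes);
// the returned closure is invoked later, once durability actually matters
// (e.g. before reporting the append as stable to raft). Hedged sketch only.
func appendNoSync(db *pebble.DB, logKey, entry []byte) (waitDurable func() error, _ error) {
	b := db.NewBatch()
	if err := b.Set(logKey, entry, nil); err != nil {
		return nil, err
	}
	// Commit into the engine: visible to subsequent reads, fsync still pending.
	if err := b.CommitNoSyncWait(); err != nil {
		return nil, err
	}
	return func() error {
		defer b.Close()
		return b.SyncWait() // block until the WAL write is durable
	}, nil
}
```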

let's ignore conf changes and the like

This has been a cause for concern with this issue, but it's not clear to me now that anything in the apply-time config change protocol would break. The visibility property above is pretty general. It means that if a leader were to apply a conf change before the corresponding log entry was durable locally, it wouldn't notice unless it crashed. And if it crashed, leadership must fail over to one of the followers in the quorum that voted for the conf change. The new leader would then need to apply the conf change before appending any new conf changes to its log. From the perspective of the followers who comprise the quorum that committed the conf change, the old leader not having the entry durably in its log before it crashed doesn't make a difference.


One edge case we haven't discussed yet in this issue is split leader/leaseholder scenarios. For kv write availability, it's important that the leaseholder applies and acks entries, not the leader. So this issue would be better titled "allow committing entries not in leaseholder's stable storage". Regardless, this situation is interesting because raft leaders do not always (ever?) communicate commit indexes to followers in excess of the follower's raft log tail. This is important because

// The leader MUST NOT forward the follower's commit to an unmatched index.

I'm not sure whether that's an implementation limitation (tests panic without this check) or whether it's more fundamental. For instance, it may be unsafe to tell a follower about a commit index if the leader isn't certain that the follower doesn't have a divergent log which has yet to be reconciled.

We would need to change this in order for a follower/leaseholder replica to apply and acknowledge writes that are not durable in its log. This case might have other availability issues as well. However, it may be rare enough to ignore for the initial version of the work.


The TL;DR of this is that things mostly "may just work", at least at the raft level. Many of the gotchas fade away because we perform the non-durable portion of raft log writes synchronously and pebble provides read-your-writes on these non-durable log writes up to the point where the leader crashes.

If this all is safe, then the question becomes: how much of a disk stall can we buffer over in pebble? To avoid impact to the workload, we at least need to be able to buffer over enough writes to give us time to transfer away or shed leases. Even if we can't buffer over a 30s stall, if we can buffer over a 6s stall and then add logic to not renew leases during a local disk stall, we should be able to survive.

github-project-automation bot moved this to Incoming in KV Aug 28, 2024