kvserver: allow committing entries not in leader's stable storage #88699
Comments
cc @cockroachdb/replication
+1 to the premise of this issue, thanks for writing this up. An interesting observation that came out of our earlier discussion about this is that quorum replication can minimize the effect of fsync latency on writes (since raft commit and state machine application would see the K-th lowest fsync latency, where K is the quorum size). However, a slow disk on the leaseholder will not be hidden from a reader attempting to read from that disk (after a block/page cache miss). For that, some form of hedging seems necessary. Strong consistency, even with consistent follower reads, complicates our ability to hedge reads. I don't understand the internals of cloud block storage well enough to know whether variable write latency is more of a problem than variable read latency.

It's also worth pointing out in this issue that (as of #38954) we don't wait for Raft entry application before acknowledging a write to the client. However, we do have to wait for application before releasing that write's latches. As a result, any latency incurred by log application will not be paid by the writer itself, but will be paid by any reader or writer that contends with the writer.
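To make the quorum observation concrete, here is a toy illustration (not CockroachDB code; the function and numbers are invented): the write path waits for the K fastest fsyncs, so a single stalled disk among three replicas does not affect commit latency.

```go
// Illustrative only: the latency at which a write commits under quorum
// replication, given per-replica fsync latencies. The commit waits for the
// K fastest replicas, so it observes the K-th lowest latency.
package main

import (
	"fmt"
	"sort"
	"time"
)

func quorumCommitLatency(fsyncLatencies []time.Duration, quorum int) time.Duration {
	sorted := append([]time.Duration(nil), fsyncLatencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[quorum-1] // the K-th lowest latency gates the commit
}

func main() {
	// A 3-replica range (quorum = 2) where one disk is stalling.
	latencies := []time.Duration{2 * time.Millisecond, 3 * time.Millisecond, 5 * time.Second}
	fmt.Println(quorumCommitLatency(latencies, 2)) // 3ms: the stalled disk is hidden
}
```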
Yes, that's a concern, too. I would suspect that write throughput suffers more frequently than read throughput (since the latter is easier to scale), but both scenarios exist. I'm not aware of great hedges here. There is always the option of transferring the lease (when there is a suitable target) but that is a very binary option and it's hard to determine the right moment to do so.
Just to spell this out for myself: in the ideal case, if we have a leader with a slow disk, and we've done all of the work in #17500 with an eye on this, a writer could come in, the write would commit with two followers, the append to the leader would be stuck, but the leader would learn that this entry committed and this would be enough to signal the waiting writer. What's more, the leader doesn't have to wait for entries to be durable locally before it applies them (let's ignore conf changes and the like, where we can be stricter), assuming they got committed using other voters. The only gotcha is that if the leader crashes, and we have separate engines (= no ordering constraint between log write and state machine apply), we might come up with an applied state that is ahead of the durable log.

One practical issue might be that in order to apply an entry, you have to have the entry. If it's not in the log, it has to be in memory somewhere, and there are limits to that. The raft entry cache is already well suited to this sort of thing, though it isn't clear how raft would interface with it.
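To spell out the commit rule this relies on, here is a toy sketch (invented helper, not etcd/raft's implementation): the commit index only needs a quorum of durable appends, so the leader's own stuck append simply counts as one laggard among the voters.

```go
// Hypothetical sketch: the commit index is the highest index that a quorum
// of voters has durably appended. The leader's durable index counts like any
// other voter's, so a stuck local append does not hold back the commit.
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the largest index durable on a majority of voters.
func commitIndex(durableIndexes []uint64) uint64 {
	sorted := append([]uint64(nil), durableIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Leader (index 0) is stuck at 10 while both followers have appended 15.
	fmt.Println(commitIndex([]uint64{10, 15, 15})) // 15: the entry commits anyway
}
```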
Could we talk more about this? I understand the problem, but came to the conclusion that the unstable log in etcd/raft already serves this role. For context, the way I've adjusted the unstable log's documentation is:

```go
// unstable contains "unstable" log entries and snapshot state that has
// not yet been written to Storage. The type serves two roles. First, it
// holds on to new log entries and an optional snapshot until they are
// handed to a Ready struct for persistence. Second, it continues to
// hold on to this state after it has been handed off to provide raftLog
// with a view of the in-progress log entries and snapshot until their
// writes have been stabilized and are guaranteed to be reflected in
// queries of Storage. After this point, the corresponding log entries
// and/or snapshot can be cleared from unstable.
```
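For orientation, the shape of that type is roughly as follows (a simplified sketch with illustrative field names, not a verbatim copy of etcd/raft's `unstable`; the real type also tracks snapshots and checks terms before releasing entries):

```go
// Approximate sketch of an unstable log: entries stay here from the moment
// they are proposed/received until their storage write is confirmed durable,
// at which point they can be dropped in favor of Storage.
package raftlog

type Entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

type unstable struct {
	entries []Entry // entries[i] has index offset+i
	offset  uint64  // index of the first entry in entries

	// offsetInProgress marks the prefix of entries that has already been
	// handed to a Ready for persistence but is not yet confirmed durable.
	offsetInProgress uint64
}

// stableTo is called once the storage write up to index is durable; the
// corresponding prefix can then be served from Storage instead.
func (u *unstable) stableTo(index uint64) {
	if index < u.offset || index >= u.offset+uint64(len(u.entries)) {
		return // nothing to release, or index out of range
	}
	u.entries = u.entries[index+1-u.offset:]
	u.offset = index + 1
}
```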
Apologies, bringing up the raft entry cache here was premature. My main concern is being able to control memory use across a possibly large number of raft instances in a given process. Anything that lives "inside" raft typically gets a configuration parameter that limits memory usage for that raft instance alone, but this is insufficient for our usage¹.

Taking a step back, our memory control issues already start way above raft, when messages enter the raft receive queue. From there they get stepped into RawNode, where they linger until they've been fully processed and can be GC'ed. Ideally we can curb and prioritize all of these in a coherent way that additionally allows for prioritization between ranges (system ranges > user ranges).

When a raft message arrives from "outside" (i.e. MsgApp), we could acquire from a memory budget². The message then enters raft and likely the unstable log, though this is not guaranteed. Finding out when that entry goes out of scope seems really tricky, and yet that's ultimately what we'd like to be able to do. Essentially, it's out of scope when raft no longer holds a reference to it (i.e. it has left the unstable log, if it even entered it in the first place), and we don't hold a reference to it (it has been made durable, we've sent MsgApp to all followers, and it's been evicted from the raft entry cache). Of course we also have the outgoing raft transport, which has its own queue, further complicating the lifecycle.

Conceptually, it would be nice to live in one of two worlds, where either raft owns all memory or no memory. If it owns all memory, rather than having a "raft receive queue", we synchronously pass every incoming MsgApp to raft, which will either accept it (accounting for it) or force us to drop it. Similarly, we ack every message sent by raft once it has actually been sent so that it can update its accounting. Or raft owns no memory (but we need to synchronize with it quite a bit, which doesn't seem great): instead of an unstable log, it holds on to some abstraction that we own, we keep doing raft receive queues, etc., and the abstractions are fallible and may force raft to defer work it was planning on doing. Of course there is a part of allocations that can never be "inside" of raft, namely the creation of new proposals.

This all seems fairly involved, and deciding what to do about it isn't in scope for this issue. But async append will hold on to memory for longer, so we at least need to make sure that there is some mechanism by which we degrade to sync appends after a certain inflight size. This could happen outside of raft, by waiting for the append to become durable (thus emptying out the unstable log) before continuing.

The second comment, which was sort of jumbled up with my other one in the post above, is that raft "hides" the true commit index from peers; it will only tell them about the one they can actually pull up the entry for (from stable storage). I thought this would have to change, but I no longer think so, because the committed index is sent on the MsgApp. So we just have to remember it somewhere, and it becomes effective only when the append has become durable. Nothing surprising happens here.
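A minimal sketch of what such an external budget could look like (all names invented, not CockroachDB or etcd/raft code): incoming messages acquire from the budget, and the bytes are released only once no reference remains.

```go
// Hypothetical "acquire on receive, release when out of scope" accounting.
package membudget

import (
	"errors"
	"sync"
)

type memBudget struct {
	mu    sync.Mutex
	used  int64
	limit int64
}

var errBudgetExhausted = errors.New("raft receive memory budget exhausted")

// acquire reserves n bytes for an incoming message, or reports that the
// caller should drop the message (or fall back to synchronous handling).
func (b *memBudget) acquire(n int64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used+n > b.limit {
		return errBudgetExhausted
	}
	b.used += n
	return nil
}

// release is called once no reference remains: the entry is durable, has
// been sent to all followers, and has been evicted from the entry cache.
func (b *memBudget) release(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
}
```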
This introduces a `vfs.FS` implementation that can be toggled to block on any write activity. Add two tests:

1. Check that even though a durable commit is inflight, we can still write non-durably. I.e., at the pebble level there is no immediate reason why we couldn't keep *applying* the raft log even though log *appends* are stuck.
2. An end-to-end test where we set up a three-node cluster, send unique writes, and stall the leaseholder's disk, hoping for the writes to keep going through. This fails, as is expected, because raft does not even put up entries for application if they haven't been durably appended to the leader's log (which they can't be, due to the stall we're injecting).

If we can get 2) to pass, we have reason to believe that we can address cockroachdb#88699 (though we'd need to verify with "real" disk stalls and higher write volumes/stall durations, and would certainly need to configure pebble to allow much more data to accumulate in memory). The additional benefit test 2) provides is that it prints who is blocked. At the time of writing, it's encouraging: we see just the LogWriter blocked on a call to `Write` (via `flushPending`).

The next step would be to hack in the ability to apply entries that aren't durable on the leader and to see what we get.

Touches cockroachdb#88699.

Release note: None
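For illustration, here is a rough analogue of that test hook (the prototype wraps pebble's `vfs.FS`; this standalone sketch wraps a plain `io.Writer` instead, and all names are invented): writes block while a "stall" flag is set and resume when it is cleared.

```go
// Toggleable write stall, simulating a stuck disk for a single writer.
package stallfs

import (
	"io"
	"sync"
)

type stallableWriter struct {
	mu      sync.Mutex
	stalled *sync.Cond
	stall   bool
	w       io.Writer
}

func newStallableWriter(w io.Writer) *stallableWriter {
	s := &stallableWriter{w: w}
	s.stalled = sync.NewCond(&s.mu)
	return s
}

// SetStalled toggles whether writes block. Lifting the stall wakes all
// writers that are currently waiting.
func (s *stallableWriter) SetStalled(v bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stall = v
	if !v {
		s.stalled.Broadcast()
	}
}

func (s *stallableWriter) Write(p []byte) (int, error) {
	s.mu.Lock()
	for s.stall {
		s.stalled.Wait() // block the write while the "disk" is stalled
	}
	s.mu.Unlock()
	return s.w.Write(p)
}
```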
There is a prototype in #98852 which allows a small number of writes to go through with a stalled disk at the leader. They currently stall because of
So it looks like we have a good idea of what needs to be done here. I'll note that #98852 injects the "disk" stall artificially via a custom `vfs.FS`.
Another gotcha that we would need to address (from this thread): if we're applying entries that are not stable locally, the state machine application of these entries might actually become durable before the log writes, and so after a restart we can end up in a new situation where the log and the applied index are not reachable from each other via the log (today, first index ≤ applied index ≤ last index).
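A hypothetical sketch of what handling this at recovery time could look like (invented names, not CockroachDB code): after a crash, if the durable applied index exceeds the durable log tail, replay can't bridge the gap from the log, so the applied state has to be treated like a snapshot.

```go
// Hypothetical recovery-time handling for the new post-restart situation.
package recovery

type recoveredState struct {
	firstIndex   uint64 // first durable raft log index
	lastIndex    uint64 // last durable raft log index
	appliedIndex uint64 // durable applied index of the state machine
}

// reconcile decides how replay should proceed after a crash.
func reconcile(s recoveredState) {
	switch {
	case s.appliedIndex >= s.firstIndex && s.appliedIndex <= s.lastIndex:
		// Classic invariant (first index <= applied index <= last index):
		// replay the durable log from appliedIndex+1 as usual.
	case s.appliedIndex > s.lastIndex:
		// New case with async appends: the applied state outran the durable
		// log. The log suffix needed to bridge the gap does not exist, so the
		// applied state must be treated like a snapshot at appliedIndex and
		// the stale log discarded, much as if a real snapshot had been applied.
	default:
		// appliedIndex below the first durable index would violate the
		// existing invariant as well; treat it as an error in practice.
	}
}
```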
Previously, a LogWriter would allocate up to 16 blocks of 32 KiB for buffering WAL writes. If all 16 blocks had been allocated and no free blocks were available, a batch writing to the WAL would queue until the flushing goroutine freed blocks. In testing of write-heavy workloads, especially with larger value sizes, we've seen queueing at the LogWriter. This queueing blocks the commit pipeline, preventing any batches from committing, regardless of priority and whether they require waiting for fsync.

This commit modifies LogWriter to allow the queueing of an unbounded number of blocks. In practice, for the current WAL, the memtable size serves as an upper bound. With a 64 MiB memtable, at most 64 MiB / 32 KiB = 2,048 blocks may queue. This is not an unreasonable amount of additional memory overhead for a write-heavy workload.

Beyond improving throughput for write-heavy workloads, removing this hard bound improves tolerance of momentary fsync stalls.

Informs cockroachdb/cockroach#88699.
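For reference, the bound mentioned above works out as follows (a trivial arithmetic check, not pebble code):

```go
// Back-of-the-envelope check of the memtable-derived block bound.
package main

import "fmt"

func main() {
	const (
		memtableSize = 64 << 20 // 64 MiB memtable caps un-flushed WAL data
		blockSize    = 32 << 10 // 32 KiB LogWriter blocks
	)
	fmt.Println(memtableSize / blockSize) // 2048 blocks, ~64 MiB worst case
}
```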
Seems generally useful:
In a discussion with @sumeerbhola, we discovered that this is not the problem that it first appears to be in today's code. That's because, as of #94165 and until we do #94853, raft log writes are synchronously sequenced with state machine writes in the same pebble engine and on the same raft scheduler goroutine. The durability of these raft log writes is async from the perspective of the raft state machine, just like the durability of state machine writes, but pebble ensures that the raft log writes will make it to stable storage before the state machine writes. In other words, even with async log writes and async state machine writes, state machine application will never outpace log writes for a given log entry to stable storage, so the post-restart divergence described above cannot arise today. More so, the synchronous call to
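A minimal sketch of the ordering property relied on above, using pebble's public Go API directly (this is not CockroachDB's logstore code; keys and values are made up): both writes go to the same engine with `NoSync`, and the shared WAL guarantees the log entry is never less durable than the applied state.

```go
// Sketch: WAL ordering within a single pebble engine.
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 1. Append the raft log entry without waiting for durability.
	if err := db.Set([]byte("raft-log/100"), []byte("entry"), pebble.NoSync); err != nil {
		log.Fatal(err)
	}
	// 2. Apply the entry to the state machine, also without fsync. Because
	//    both writes share one WAL, (2) can never be durable before (1).
	if err := db.Set([]byte("applied-index"), []byte("100"), pebble.NoSync); err != nil {
		log.Fatal(err)
	}
}
```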
This has been a cause for concern with this issue, but it's not clear to me now that anything in the apply-time config change protocol would break. The visibility property above is pretty general. It means that if a leader were to apply a conf change before the corresponding log entry was durable locally, it wouldn't notice unless it crashed. And if it crashed, leadership must fail over to one of the followers in the quorum that voted for the conf change. The new leader would then need to apply the conf change before appending any new conf changes to its log. From the perspective of the followers who comprise the quorum that committed the conf change, the old leader not having the entry durably in its log before it crashed doesn't make a difference.

One edge case we haven't discussed yet in this issue is split leader/leaseholder scenarios. For kv write availability, it's important that the leaseholder applies and acks entries, not the leader. So this issue would be better titled "allow committing entries not in leaseholder's stable storage". Regardless, this situation is interesting because raft leaders do not always (ever?) communicate commit indexes to followers in excess of the follower's raft log tail. This is important because
I'm not sure whether that's an implementation limitation (tests panic without this check) or whether it's more fundamental. For instance, it may be unsafe to tell a follower about a commit index if the leader isn't certain that the follower doesn't have a divergent log which has yet to be reconciled. We would need to change this in order for a follower/leaseholder replica to apply and acknowledge writes that are not durable in its log. This case might have other availability issues as well. However, it may be rare enough to ignore for the initial version of the work.

The TL;DR of this is that things mostly "may just work", at least at the raft level. Many of the gotchas fade away because we perform the non-durable portion of raft log writes synchronously, and pebble provides read-your-writes on these non-durable log writes up to the point where the leader crashes.

If this is all safe, then the question becomes: how much of a disk stall can we buffer over in pebble? To avoid impacting the workload, we at least need to be able to buffer over enough writes to give us time to transfer away or shed leases. Even if we can't buffer over a 30s stall, if we can buffer over a 6s stall and then add logic to not renew leases during a local disk stall, we should be able to survive.
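To make the buffering question concrete, here is a rough back-of-the-envelope sketch (the buffer size and write rate are invented numbers, not measurements or CockroachDB defaults): the stall we can absorb is simply the bytes we allow to accumulate unflushed divided by the incoming write rate.

```go
// Rough headroom estimate: how long a disk stall could be absorbed in memory
// before backpressure kicks in.
package main

import (
	"fmt"
	"time"
)

func stallHeadroom(bufferBytes, writeBytesPerSec int64) time.Duration {
	return time.Duration(float64(bufferBytes) / float64(writeBytesPerSec) * float64(time.Second))
}

func main() {
	// E.g. 256 MiB of allowed unflushed data at 20 MiB/s of WAL writes buys
	// roughly 12-13s, enough to skip a lease renewal and let the lease move.
	fmt.Println(stallHeadroom(256<<20, 20<<20))
}
```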
Is your feature request related to a problem? Please describe.
The entire point of quorum replication is to provide high availability for
writes. As long as a quorum of voters can write to disk & communicate, forward
progress ought to be possible.
Unfortunately in practice,
etcd/raft
forces the leader to append to its local log before disseminating new log entries.
A leader with write-degraded storage thus renders the range inoperable to
varying degrees.¹ In other words, the blast radius is larger than necessary.
Generally, since users are sensitive even to short blips in latency, avoiding
the high tail latencies of synchronous writes is a good idea. More symmetry
helps with that.
Describe the solution you'd like
Improve the interface of
RawNode
such that it decouples the operations that move data to stable storage (voting, log appends, persisting the occasional
commit index) from the other operations (in particular, sending
messages). This requires or at least strongly suggests also doing #17500, since
a leader that cannot apply entries that were committed without being locally
durable will not release latches and thus continues to propagate the effects of
its poor disk health to foreground traffic.
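A hypothetical sketch of what such a decoupled Ready loop could look like; all types and helpers here are invented for illustration and are not etcd/raft's API (etcd/raft's later AsyncStorageWrites mode is one concrete realization of this idea). Messages that don't require local durability (e.g. MsgApp to followers) are sent immediately, while the leader's own durability-dependent acknowledgements are produced only once the queued append has synced.

```go
// Hypothetical shape of decoupled Ready handling: storage-bound work
// (appends, vote/commit-index persistence) is queued to a dedicated
// goroutine, so a slow local disk no longer gates dissemination of entries.
package asyncready

type appendWork struct {
	entries   [][]byte // raft log entries to append
	hardState []byte   // vote / term / commit index to persist
	onDurable func()   // ack back into raft once the fsync completes
}

type readyLoop struct {
	appendQ chan appendWork     // consumed by a dedicated storage goroutine
	send    func(msgs [][]byte) // network sends; no local disk dependency
}

func (l *readyLoop) handleReady(entries [][]byte, hardState []byte, msgs [][]byte, onDurable func()) {
	// Send immediately: followers can append and vote to commit the entry
	// even while the local append below is still in flight.
	l.send(msgs)

	// Hand the durable work to the append goroutine; raft learns of local
	// durability via onDurable (e.g. a self-addressed ack message), and only
	// then emits responses that imply durability, such as its own append ack.
	l.appendQ <- appendWork{entries: entries, hardState: hardState, onDurable: onDurable}
}
```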
Describe alternatives you've considered
We could more aggressively fail over away from leaders that are not performing
as expected. In a sense, this would be the more "sturdy" alternative and
possibly a better ROI if the main scenario we're trying to address is
persistent degradations.
Decoupling the various Ready tasks should noticeably smooth out kinks for which
a fail-over would be too heavy-handed a solution.
It also allows us to markedly improve the performance in the steady state.
So we should really do both anyway.
Additional context
#88442
#88596
#17500
Jira issue: CRDB-19840
Epic CRDB-40197
Footnotes
¹ until the disk degrades to the point where the leader fails to heartbeat
followers at a sufficient frequency, at which point a new leader can be
elected