-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: track per-replica and aggregate to-apply entry bytes #97044
Comments
cc @cockroachdb/replication |
X-ref (+cc @sumeerbhola): the review discussions and |
I don't quite understand these experiment results, hence some questions:
Is this showing the activity on the range at the leaseholder, and not the appends as received by the catching up node? |
Fortunately the graphs from this experiment are still available here.
Quorum is achieved, yes.
I have a naive question. In the graphs below there's a clear point at which we stop receiving MsgApps. We continue applying commands long past that, from those earlier MsgApps. Does the MsgApp receive point correspond to log append? No, right? We could very well have received a MsgApp and not scheduled the raft processor to append (+apply) it? I'm confused since in the graphs above, @tbg annotated that point where we stopped receiving MsgApps as the point where we were "done appending".
See graphs below. It would've kicked in almost instantly, while we were still issuing catchup MsgApps.
The middle graph above is showing the rate of MsgApps received. See my naive question above, asking whether that maps 1:1 to log appends on the catching up node. |
The delay should be very short, since an MsgApp goes into the raft receive queue and this queue is fully emptied into the raft group on the next handle loop, which will emit them to storage in the subsequent one. So ~seconds on an overloaded node. However, log application is bounded, we do at most 64mb per ready, so if we manage to stuff in more than 64mb per ready of log appends, then we would build up a backlog. 64mb per ready is quite a lot so I'm not sure this can really happen - it was our best guess as to what's happening here. |
Random idea: can we optimize application of a large number of Raft log entries? I believe log entries are applied via individual (unindexed?) Pebble batches. There would be some benefit to using fewer Pebble batches. An even bigger benefit if we could figure out how to convert the log entry application into an sstable to be ingested. That feels hard because log entry application might be overwriting keys written by earlier log entries. There might be something doable here by providing a way to transform a Pebble batch into an sstable, omitting overwrites. The idea here might be a non-starter. Feels worthwhile to loop in some Storage folks to think about the possibilities here. |
Wrote up #98363 with a few comments. |
I am not sure if there is anything to do for AC here given the "it was our best guess as to what's happening here" comment in #97044 (comment). If log append and state machine application happen near each other (< 15s apart) the existing AC flow control mechanisms should suffice. And for node restarts there is an existing issue #98710. So I am removing the AC label. |
Is your feature request related to a problem? Please describe.
In recent experiments @andrewbaptist, found1 that because log entry append is sequential whereas log entry application can cause random writes to the LSM, a node that catches up on a log of raft log (for example after restarting following a period of downtime) can end up with an inverted LSM that results from applying a large amount of rapidly appended log entries.
This can be seen in the screenshot below. The top graph shows the inverted LSM - higher is more inverted. The middle graph shows log appends - they level off way before the LSM fully inverts. (As a result, replica pausing in the bottom graph comes too late, as that only delays appending which at this point in time has already completed).
Similar to how writes to the LSM don't account for compaction debt, appends to the raft log are even a step further back, because they don't even account for the writes to the LSM (which may have very different characteristics than the appends themselves, as we see here).
It's desirable to be able to change something about this behavior, but this issue does not attempt to propose a solution. Instead, we note that it would be helpful to at least have a quantity that can detect this debt before it inverts the LSM, as this will likely play a role in both short-term and long-term mitigations to this kind of overload.
Describe the solution you'd like
We could track the number of unapplied entry bytes per replica and also maintain a store-global aggregate. Per Replica, we would have a gauge
unappliedEntryBytes
which tracks all bytes between theAppliedIndex
and theLastIndex
2. Any change to this gauge would be reflected in a store-aggregate gauge, which could be read off quickly without a need to visit all replicas.The gauge would need to be updated on raft log appends (taking care to do this properly on sideloading and also when the log tail is replaced), and on snapshots. Thanks to the local copy of the gauge, the global one can be "partially reset" accordingly.
Describe alternatives you've considered
Additional context
Jira issue: CRDB-24481
Epic CRDB-39898
Footnotes
https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1675457831537829?thread_ts=1675371760.534349&cid=G01G8LK77DK ↩
it's likely unimportant whether we use the durable or inflight last-index, the inflight one likely makes more sense. ↩
The text was updated successfully, but these errors were encountered: