storage: divergent entry application can lead to request stalls on Range splits #31330
It's surprising that this can happen, though I can't think of anything that currently prevents it. Do you have any theory as to why the followers aren't keeping up with applying Raft commands? The follower still has the commands in its log. Are we not properly draining the log on followers? etcd-io/etcd#9982 added pagination to committed entries. Are we properly re-enqueuing the replica on the Raft scheduler when we only process a portion of the Raft log? Cc @bdarnell
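For context, a minimal sketch of the knob involved, assuming the etcd raft package (the import path and the `exampleConfig` helper are illustrative, not CockroachDB code): after etcd-io/etcd#9982, `MaxSizePerMsg` also caps how many bytes of committed entries a single Ready hands back, so a busy group can need several Ready iterations to drain entries that are already committed.

```go
package example

import "github.com/coreos/etcd/raft" // import path varies by etcd version

// exampleConfig highlights the relevant knob: after etcd-io/etcd#9982,
// MaxSizePerMsg also bounds the bytes of CommittedEntries returned by a
// single Ready, effectively paginating entry application.
func exampleConfig(storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   16 << 10, // 16 KB; also paginates Ready.CommittedEntries
		MaxInflightMsgs: 256,
	}
}
```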
I know that TPCC sees the pagination fire at least on the districts table. I remember that from #28918.
I explored that change when looking into this. My theory was that leaders and followers were repeatedly hitting a different code path in [...]. I'm going to re-run this with etcd-io/etcd@7a8ab37 and see if I can still reproduce.
I tested again with etcd-io/etcd@7a8ab37. This time, things made more sense. I still saw the divergent Raft applications between the leader and the followers, but no longer saw the divergent commit indexes that I had before. My next step is to crank up the [...].
That's a great question. I suspect that we're not, especially with the added knowledge that we're likely running into the [...]. FWIW, we can detect this case with etcd-io/etcd@7a8ab37 merged by comparing [...].
That did the trick! Nice call @petermattis. To summarize: when a Range is under a large amount of continuous load, it's possible that every Raft ready iteration it performs hits the pagination limit introduced in etcd-io/etcd#9982. If we don't detect these truncated Raft Ready structs and proactively requeue the Range to be processed again immediately, then we fall into a situation where whichever peer runs through the most Raft ready iterations ends up applying the most entries. Naturally, the leader is the one who processes Raft the most because it's getting messages from both followers, so it pulls ahead. I have two take-away questions from this, both for @bdarnell: [...]
How are you detecting that you have a truncated ready? I was thinking that [...]
There is probably a cleaner way to do this, but you get the idea.
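For concreteness, a hedged sketch of one way such a check could look, assuming the etcd raft `RawNode` API; the `hasMoreCommittedEntries` name is illustrative and not taken from the original snippet:

```go
package example

import "github.com/coreos/etcd/raft" // import path varies by etcd version

// hasMoreCommittedEntries reports whether the Ready we just handled was
// cut short by the pagination limit: entries beyond the last one we were
// handed are already known to be committed, so another pass is needed.
func hasMoreCommittedEntries(rn *raft.RawNode, rd raft.Ready) bool {
	n := len(rd.CommittedEntries)
	if n == 0 {
		return false
	}
	lastHanded := rd.CommittedEntries[n-1].Index
	// Status().Commit is the group's current commit index; if it is ahead
	// of what this Ready let us apply, the Ready was truncated.
	return rn.Status().Commit > lastHanded
}
```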
I was comparing [...]. My plan was just to re-enqueue the Range update check again directly with the scheduler in these cases. That will be a little more expensive than looping directly above [...].
So the change is as simple as:
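A hedged sketch of that kind of change, not the actual CockroachDB patch; the `handleRaftReady` helper and the `enqueue` callback are illustrative stand-ins for the real scheduler plumbing:

```go
package example

import "github.com/coreos/etcd/raft" // import path varies by etcd version

// handleRaftReady drains one Ready and, if pagination left committed
// entries behind, immediately re-enqueues the range with the scheduler
// rather than waiting for the next message or tick to trigger another pass.
func handleRaftReady(rn *raft.RawNode, rangeID int64, enqueue func(rangeID int64)) {
	if !rn.HasReady() {
		return
	}
	rd := rn.Ready()
	// ... persist rd.HardState and rd.Entries, send rd.Messages,
	// and apply rd.CommittedEntries to the state machine ...
	rn.Advance(rd)

	// Re-enqueue if this Ready was truncated by the pagination limit.
	if n := len(rd.CommittedEntries); n > 0 &&
		rn.Status().Commit > rd.CommittedEntries[n-1].Index {
		enqueue(rangeID)
	}
}
```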
When I made the raft change I was thinking MaxSizePerMsg was effectively "the amount of data we're willing to load into memory at once", and it would be nice to have one setting for that instead of multiplying them. But MaxSizePerMsg was changed in #10929 from 1MB to 16KB (I think in an attempt to reduce the time spent blocking to process an individual message; these values seem awfully low to me today, and maybe we should revisit them). I think we should split things up to give CommittedEntries a separate limit (64MB?).
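A hedged sketch of what the split could look like; the `MaxCommittedSizePerReady` field shown here is a separate committed-entry limit of the kind being proposed (later etcd raft versions expose a field with this name, but treat its availability as an assumption rather than a given):

```go
package example

import "github.com/coreos/etcd/raft" // import path varies by etcd version

// splitLimitsConfig keeps MaxSizePerMsg small to bound individual
// outgoing MsgApp messages while giving committed-entry application a
// much larger, separate budget per Ready.
func splitLimitsConfig(storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:                       1,
		ElectionTick:             10,
		HeartbeatTick:            1,
		Storage:                  storage,
		MaxSizePerMsg:            16 << 10, // 16 KB per outgoing message
		MaxCommittedSizePerReady: 64 << 20, // 64 MB of committed entries per Ready
		MaxInflightMsgs:          256,
	}
}
```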
Yeah, I think we need to dramatically increase the CommittedEntries size limit. 64MB sounds good. Or we can rip it out after we have etcd-io/etcd#10167, which will limit the uncommitted entry size to 2MB. Once we do that, I think the patch in #31330 (comment) should be sufficient. I tested this out on a single-node cluster on my Mac using the [...].
etcd-io/etcd#10167 limits the portion of the log that is not committed anywhere to 2MB. We still need separate pagination for CommittedEntries for followers who may be catching up (the original motivation was a follower that had a GB of logs to catch up on, even though those logs had been committed and applied on the other two nodes).
That makes sense. For now, I'll push to get the fix proposed in this issue into 2.1.1. Adding a new limit to [...].
Good to finally see this coming to a close. BTW, were there any metrics that caught this?
I don't think [...].
Ah, right. Should there be a new metric for this? Or a one-off check somewhere cheap that at least logs a warning when we're obviously farther behind than we'd like? This seems like one of the problems that will just go away when you fix it, but still.
Yeah, I'm not sure how useful that would be. Once we fix this, we should get back to the previous state of always applying all committed entries immediately on all followers, at which point a new metric wouldn't show anything useful. Of course, I hope we can move to a world where it would be absolutely essential - #17500.
31568: storage: re-enqueue Raft groups on paginated application r=nvanbenschoten a=nvanbenschoten Fixes #31330. This change re-enqueues Raft groups for processing immediately if they still have more to do after a Raft ready iteration. This comes up in practice when a Range has sufficient load to force Raft application pagination. See #31330 for a discussion on the symptoms this can cause. Release note (bug fix): Fix bug where Raft followers could fall behind leaders with entry application, causing stalls during splits. Co-authored-by: Nathan VanBenschoten <[email protected]>
Most of this is taken from #30064 (comment). See that comment and its responses for a discussion on what's happening and why. The summary is that all followers in a Range can fall behind their leader with regard to the entries that they are applying at a given point in time. This separation can grow to the order of minutes. When this happens, splits will be applied on the leader minutes before they are applied on the followers, resulting in new Ranges that can't achieve quorum for minutes. Requests that are sent to these new Ranges stall for this period.
To reproduce:
workload tpcc init --warehouses=1000
against a 3-node cluster without `nobarrier`