stability: entire delta cluster stuck, not serving any SQL traffic #10602
Comments
The couple of profiles I took didn't have anything incredibly helpful. There were decent chunks of time spent in Go's GC, in raft heartbeats, and in RocksDB, but none of that seems unreasonable. The logs are still interesting, though. I'm not totally sure how to read our Grafana graphs about the GC queue, but there's clearly something going wrong in GC queue processing, and the fact that the graph for GC processing time is around 20 seconds seems pretty bad. Is that per-replica? |
Initial guess is that we have 1 or more ranges that are below quorum. All nodes are up, right? |
Yes, all nodes are up. |
I haven't used it, but the |
It's a lot of ranges. Everything has effectively ground to a halt. The number of replica leaseholders in the cluster has dropped incredibly low and range allocator traffic has mostly stopped except during that 3 hour time period of activity this morning. |
/debug/requests shows (*Replica).Send operations having to retry effectively forever on "retry proposal 48e68cb508d9f571: reasonTicks" errors |
Looking at the relevant range status for a range being operated on by one of those (*Replica).Send traces, the raft state of the range is "StateDormant". |
That's an indication of the proposal never being committed, presumably because the range is below quorum. |
I silenced delta for 24h, it was spamming alerts quite a bit. |
It may just be a coincidence, but the number of dormant ranges is very close to the number of ranges experiencing all those deadline exceeded GC queue errors: |
There are some problems with inconsistent data in the cluster. There are a bunch of consistency check errors, like:
And the replicas with consistency errors appear to be correlated with the context deadline exceeded errors. We don't know how the data became inconsistent, but we did find a bug in the consistency checking code in the process. Once we find a consistency check issue, we try collecting the details asynchronously with a context that gets cancelled shortly after kicking off the async task. We'll fix that issue and see if we can get a diff of what the inconsistency is. |
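For illustration, here is a minimal sketch of that bug pattern in Go. The names (`compareChecksums`, `collectDiff`, `checkConsistency`) are hypothetical stand-ins, not the actual CockroachDB code: the detail collection is launched on a context whose cancellation is tied to the synchronous caller, so it gets cut short almost immediately.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// compareChecksums and collectDiff are stand-ins for the real
// consistency-check machinery; they exist only for this sketch.
func compareChecksums(ctx context.Context) bool { return true }

func collectDiff(ctx context.Context) {
	select {
	case <-time.After(2 * time.Second): // pretend the diff takes a while to gather
		fmt.Println("diff collected")
	case <-ctx.Done():
		fmt.Println("diff collection aborted:", ctx.Err())
	}
}

// checkConsistency shows the bug pattern: the async task inherits a context
// that is cancelled as soon as the synchronous part returns.
func checkConsistency(ctx context.Context) {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel() // fires when checkConsistency returns

	if compareChecksums(ctx) {
		// BUG: ctx is about to be cancelled, so the expensive diff
		// collection dies with "context canceled" instead of finishing.
		// Giving the goroutine its own lifetime (e.g. a fresh context)
		// is one way to fix this.
		go collectDiff(ctx)
	}
}

func main() {
	checkConsistency(context.Background())
	time.Sleep(3 * time.Second) // give the goroutine time to report
}
```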
The log spam is a result of #10487; previously all of these deadline-exceeded errors were getting into the command queue and causing our memory spikes on delta (#10427). So this confirms that the problem is in fact in the GC queue. We can fix the spam by checking the context in the GC queue's loop and bailing out early. But this sounds like a symptom rather than the cause. |
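A minimal sketch of that mitigation, assuming hypothetical names (`gcKeys`, `processKey`) rather than the real GC queue code: the loop checks the context on every iteration and returns as soon as it is cancelled or past its deadline, instead of attempting (and logging an error for) every remaining key.

```go
package main

import (
	"context"
	"fmt"
)

// processKey stands in for the per-key GC work; hypothetical for this sketch.
func processKey(ctx context.Context, key string) error {
	fmt.Println("gc'ing", key)
	return nil
}

// gcKeys checks the context once per iteration and bails out early, so an
// expired context produces one error instead of per-key error spam.
func gcKeys(ctx context.Context, keys []string) error {
	for _, key := range keys {
		if err := ctx.Err(); err != nil {
			return err // cancelled or deadline exceeded: stop here
		}
		if err := processKey(ctx, key); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate an already-expired context
	if err := gcKeys(ctx, []string{"a", "b", "c"}); err != nil {
		fmt.Println("gc aborted:", err)
	}
}
```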
Yeah, the CPU profiles at least showed that the log spam and the processing of old contexts weren't using up too much of the server's resources, so I don't see how it could be anything more than a symptom. I'm going to deploy #10625 to the cluster so that hopefully we can get some more detail on the consistency problems and work from there on what the cause is. |
And we have our first diff: |
A couple more diffs. I think there were a couple more before I shut the cluster back down, but this should be enough. |
If I'm parsing that correctly, the follower has a key ( I take it we don't have logs from that time period. It is fairly unfortunate that we weren't crashing earlier. This is a very serious bug. I'm doubtful there is anything we can do to recover the cluster at this point. I think we should back up all of the logs and state (e.g. @bdarnell Thoughts? |
To be specific, the version I brought up was built at 3e0f63d |
Ah, I think I've been misunderstanding the stop-the-world process. I brought all nodes down before bringing any new ones up, but didn't explicitly |
But #10420 says that an explicit |
Stop the world means just taking all of the nodes down and restarting. |
The other consistency failures at keys with timestamps of |
Without having read through the details of #10420, my hunch would be that it (or something else in the new binary) did require a freeze. |
Did you mean #10420? |
Yeah. Typing is hard |
I doubt that #10420 required a freeze... Even if it did, it would've required a freeze for the purposes of the lease records themselves, but that's not what's being flagged by the consistency checker. Is there a way to know in Grafana what version a cluster was running at any time? I see a "Built timestamp" pane in Grafana that seems to contain metrics that have a beta version in their name, but I'm not sure how to use that to actually see what was running on a machine and when. |
The timing is so suspicious (and the consequences so severe) that we should probably withdraw beta-20161110 until we figure this out. I re-read #10420 and it still looks safe to me. We should really have better records of what was running before, but by doing |
@andreimatei Do you mind trying to reproduce the consistency failure by upgrading from earlier SHAs to master? Per Ben's note, testing upgrades from 9ce8833 to master seems like the next logical step. I'm going to try as well. Hopefully one of us will hit pay dirt. @bdarnell Have we withdrawn a beta before? Do we have a process for that? |
Ok, managed to reproduce consistency check failures going from SHA 9ce8833 to 2257842. The specific sequence I performed was:
The consistency check failures look like:
I didn't get diffs for some reason, but given the reproducibility this shouldn't be difficult to track down now. I've also got the data directory from running on 9ce8833 so I can bisect to find out which SHA is incompatible. I've got a variety of errands and non-work work to do today, so any help tracking this down would be appreciated. |
Note that before @a-robinson's recent fix, |
Definitely not kosher to add non-nullable fields to |
Looking at one of the consistency failures, I see:
The
Sounds like #10327 changed how timestamps are assigned when applying commands. That should narrow down the search. |
The one time we've withdrawn a beta we had the fix ready immediately so we just pushed out a new binary with the same filename (this is not what we should be doing, but it was soon enough after the release that it hopefully wasn't too harmful). We've never yanked a beta without something to replace it with. I think the procedure would be to remove the links from the release notes (but leave the page up), change the "current version" constant to point back to 1103, make a post on the forum about it, and probably delete the tarballs from s3. |
This reverts commit 586325e. See cockroachdb#10602
Beta has been yanked: cockroachdb/docs#853, #10664, and https://forum.cockroachlabs.com/t/release-notes-for-beta-20161110/345 |
I think I've found it. In The problem is that post-raft, without propEvalKV, we call |
I'm really glad y'all have found this! Since this reproduces so easily with roachdemo (so, I guess, the point is that it's so easy to have in-flight commands that were applied by some nodes before a shutdown and some after), I think we should have an acceptance test that tests upgrades from the last beta to the last PR. I can volunteer to work on a more deterministic way to wait for a consistency check and to build such a test. |
I haven't had a chance to look at this today. Is there a straightforward fix?
I'd like to see that test, though it might be mildly tricky to write. |
When I said "false positives" for lease checks, I meant that commands might succeed even though they don't have a valid lease (they show up with a zero timestamp, which is always before the lease's expiration). But it's complicated and I don't know all the ways in which this might fail (it looks like we're also seeing the leases themselves diverge). To fix this convincingly, I think we're going to need a refactoring of the propEvalKV code so that evaluateProposal is only called once (instead of once before and once after raft). I don't think it will be too hard, but maybe I just haven't hit the roadblocks that led to this structure in the first place. The root cause here relates to some proto changes that were put in place to minimize code duplication between the propEvalKV and legacy modes, so the fix may have to introduce some duplication. |
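To make the "false positive" concrete, here is a toy version of the check described above, using simplified stand-in types rather than the real ones: a zero timestamp compares as earlier than any expiration, so a command whose timestamp was never assigned slips past the lease check.

```go
package main

import "fmt"

// Timestamp and Lease are simplified stand-ins for the real types.
type Timestamp struct {
	WallTime int64
}

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

type Lease struct {
	Expiration Timestamp
}

// covered reports whether a command at timestamp ts falls under the lease.
// A command whose timestamp was never populated has the zero value, which is
// "less than" any expiration, so the check passes vacuously: the false
// positive described above.
func (l Lease) covered(ts Timestamp) bool {
	return ts.Less(l.Expiration)
}

func main() {
	lease := Lease{Expiration: Timestamp{WallTime: 100}}
	var unset Timestamp // a command whose timestamp was never assigned
	fmt.Println(lease.covered(unset)) // true, despite knowing nothing about the command
}
```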
Nice work figuring this out! If the fix is particularly onerous, couldn't we just require a |
There's another big problem somewhere in our replica processing which might
This is very repeatable via the Jepsen duplicates test. It looks like |
This method was previously called both before and after raft, and the after-raft logic relied on fields that were set in the ReplicatedProposalData by the before-raft logic. This caused inconsistencies when cockroachdb#10327 was deployed without freeze-cluster. Now, evaluateProposal is called only once, either before or after raft, and the post-raft logic handles the command in whichever fashion is appropriate for the version that proposed it. Fixes cockroachdb#10602
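A rough sketch of the shape that commit describes, using invented types and names (`proposalData`, `applyCommand`) rather than the real API: the apply path decides, per command, whether it was already evaluated before raft or needs to be evaluated now, so evaluation happens exactly once either way and nothing post-raft depends on state computed pre-raft.

```go
package main

import "fmt"

// batch and proposalData are simplified stand-ins for the real request and
// ReplicatedProposalData types; the field layout here is invented for
// illustration only.
type batch struct{ request string }

type proposalData struct {
	// Populated before raft by new-style (propEvalKV) proposers.
	writeBatch []byte
	// Carried through raft by legacy proposers and evaluated at apply time.
	legacyRequest *batch
}

// evaluateProposal stands in for command evaluation; in this sketch it just
// renders the request into its "effects".
func evaluateProposal(b *batch) []byte {
	return []byte("effects of " + b.request)
}

// applyCommand handles a committed command in whichever fashion matches the
// version that proposed it: pre-evaluated effects are applied directly, and
// legacy requests are evaluated exactly once, here.
func applyCommand(p proposalData) {
	switch {
	case p.writeBatch != nil:
		fmt.Println("applying pre-evaluated batch:", string(p.writeBatch))
	case p.legacyRequest != nil:
		fmt.Println("evaluating and applying legacy command:", string(evaluateProposal(p.legacyRequest)))
	default:
		fmt.Println("empty command")
	}
}

func main() {
	applyCommand(proposalData{legacyRequest: &batch{request: "put k=v"}})
	applyCommand(proposalData{writeBatch: []byte("effects of put k2=v2")})
}
```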
OK, after discussing with @andreimatei, I think there are two issues here. One I know how to solve and the other needs some deeper digging. |
That doesn't quite make sense to me. I believe you're saying that some commands failed to be applied by nodes that tried to apply them before the restart, and are applied by nodes that try to apply them after the restart. But how exactly can we get this situation? Why didn't the nodes that tried to apply them before the restart succeed in applying them, given that the lease is also checked before proposing? This also reminds me that I think Tobi had plans to put the lease under which a command has been proposed into the raft command. Although I don't think that's actually necessary given that we don't wipe |
@andreimatei: @spencerkimball Please file a separate issue for the things you're looking into. |
Good news. More logging has revealed that the Jepsen duplicates issue is not also creating inconsistencies. The missing write to the original leader is accounted for by a snapshot. |
Delta has been failing to serve requests for most of the last 19 hours, having only 3 good hours from 8-11 UTC this morning.
The logs are full of "context deadline exceeded" errors.
There are a ton (> 1000) of repeated logs like this in a row, all for the same range/replica, spammed such that each came less than a hundred microseconds after the last:
Those are followed by a ton of errors about an inability to push a transaction, with the context being the same range/replica. These are spammed even faster, coming 10s of microseconds apart:
In the one case I looked at most closely, there was another different error mixed in the middle every couple hundred lines:
Once that stops, there are a bunch of "transferring raft leadership" messages about different ranges before the pattern starts over again for a different range/replica.
I'll check out a profile of the node next.