
stability: entire delta cluster stuck, not serving any SQL traffic #10602

Closed
a-robinson opened this issue Nov 10, 2016 · 51 comments
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@a-robinson
Contributor

Delta has been failing to serve requests for most of the last 19 hours, with only 3 good hours from 8-11 UTC this morning.

The logs are full of "context deadline exceeded" errors.

There are a ton (> 1000) of repeated logs like this in a row, all for the same range/replica, spammed such that each comes less than a hundred microseconds after the last:

W161110 16:39:04.701610 506 storage/gc_queue.go:218  [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] unable to resolve intents of committed txn on gc: context deadline exceeded

Those are followed by a ton of errors about an inability to push a transaction, with the context being the same range/replica. These are spammed even faster, coming 10s of microseconds apart:

W161110 16:39:04.727940 4935162 storage/gc_queue.go:628  [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn id=cf175f6e key=/Table/55/1/2699716940960131706/"bd72518a-c1ae-4f98-a1d3-8c40d4f6fe43"/7751851/0 rw=false pri=0.00868472 iso=SERIALIZABLE stat=PENDING epo=0 ts=1478513462.086455407,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false failed: context deadline exceeded

In the one case I looked at most closely, there was a different error mixed in every couple hundred lines:

W161110 16:39:04.791403 4935416 storage/gc_queue.go:628  [n10,gc,s19,r5444/7:/Table/55/1/871{86020…-91657…}] push of txn "sql/executor.go:546 sql txn implicit" id=be657d16 key=/Table/55/1/8718809091224052977/"8a16506f-3592-4c78-a50d-b1b83f015480"/6148471/0 rw=true pri=0.01430796 iso=SERIALIZABLE stat=PENDING epo=0 ts=1478241838.382343181,0 orig=1478241838.382343181,0 max=1478241838.456109681,0 wto=false rop=false failed: context deadline exceeded

Once that stops, there are a bunch of "transferring raft leadership" messages about different ranges before the pattern starts over again for a different range/replica.

I'll check out a profile of the node next.

@a-robinson a-robinson added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Nov 10, 2016
@a-robinson
Contributor Author

a-robinson commented Nov 10, 2016

The couple of profiles I took didn't show anything particularly helpful. There were decent chunks of time spent in Go's GC, in Raft heartbeats, and in RocksDB, but none of that seems unreasonable.

The logs are still interesting, though. I'm not totally sure how to read our Grafana graphs about the GC queue, but there's clearly something going wrong in GC queue processing, and the fact that the graph for GC processing time is around 20 seconds seems pretty bad. Is that per-replica?

@petermattis
Collaborator

Initial guess is that we have 1 or more ranges that are below quorum. All nodes are up, right?

@a-robinson
Contributor Author

Yes, all nodes are up.

@petermattis
Collaborator

I haven't used it, but the /debug/raft page might be able to give more details about that range (r5444). Are there a lot of ranges experiencing these errors, or only a handful?

@a-robinson
Contributor Author

It's a lot of ranges.

Everything has effectively ground to a halt. The number of replica leaseholders in the cluster has dropped incredibly low and range allocator traffic has mostly stopped except during that 3 hour time period of activity this morning.

@a-robinson
Contributor Author

/debug/requests shows (*Replica).Send operations having to retry effectively forever on "retry proposal 48e68cb508d9f571: reasonTicks" errors

@a-robinson
Contributor Author

Looking at the relevant range status for a range being operated on by one of those (*Replica).Send traces, the raft state of the range is "StateDormant".

@petermattis
Collaborator

That's an indication of the proposal never being committed, presumably because the range is below quorum.

@petermattis
Collaborator

StateDormant indicates there is no Raft group. That's odd. There should definitely be a Raft group, but perhaps not a leader.

@mberhault
Contributor

I silenced delta for 24h; it was spamming alerts quite a bit.

@a-robinson
Contributor Author

It may just be a coincidence, but the number of dormant ranges is very close to the number of ranges experiencing all those deadline exceeded GC queue errors:

$ grep Dormant ranges.txt | wc -l
44
$ cat logs/cockroach.stderr* | grep "unable to resolve intents of committed txn on gc: context deadline exceeded" | cut -d' ' -f 6 | sort -u | uniq | wc -l
37

@a-robinson
Contributor Author

There are some problems with inconsistent data in the cluster. There are a bunch of consistency check errors, like:

logs/cockroach.stderr.13:E161110 14:05:13.568640 4201911 storage/stores_server.go:86  consistency check failed on range ID 3186: expected checksum c86b786f961e7df72b6f4f8408fefeab5141623cff902ecc46f4593b9d32b423262e23b31abc6b0ab24b49d832ec5a39d4a47e227da133d016051e7dba77927a, got 3293e8a81d782aeda88b37275fb361f3511389534f921cb10da1e1c7a4744308c975b3a849178a0e964770515faf86e351f9c3cb8eddf5d7d85e2094f8655948
logs/cockroach.stderr.13:E161110 14:06:00.186700 4201652 storage/replica_command.go:1870  [n10,replica consistency checker,s20,r3885/2:/Table/55/1/809{36830…-42445…}] could not CollectChecksum from replica {5 9 10}: rpc error: code = 4 desc = context deadline exceeded
logs/cockroach.stderr.13:E161110 14:07:10.187396 4206949 storage/replica_command.go:1870  [n10,replica consistency checker,s20,r17182/2:/Table/55/1/3316{1057…-6647…}] could not CollectChecksum from replica {5 9 5}: rpc error: code = 4 desc = context deadline exceeded
logs/cockroach.stderr.13:E161110 14:07:10.733325 4211455 storage/replica_command.go:1887  [n10,replica consistency checker,s19,r5064/9:/Table/55/1/8962{1500…-7178…}]
logs/cockroach.stderr.13:E161110 14:07:10.733458 4211456 storage/replica_command.go:1907  [n10,replica consistency checker,s19,r5064/9:/Table/55/1/8962{1500…-7178…}] consistency check failed with 1 inconsistent replicas; fetching details
logs/cockroach.stderr.13:E161110 14:07:10.733587 4211456 storage/replica_command.go:1913  [n10,replica consistency checker,s19,r5064/9:/Table/55/1/8962{1500…-7178…}] could not rerun consistency check: context canceled

And the replicas with consistency errors appear to be correlated with the context deadline exceeded errors. We don't know how the data became inconsistent, but we did find a bug in the consistency checking code in the process. When we find a consistency check failure, we try to collect the details asynchronously, but with a context that gets cancelled shortly after kicking off the async task. We'll fix that issue and see if we can get a diff of what the inconsistency is.
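The shape of that bug, as a stand-alone Go sketch (the names here are illustrative, not the actual consistency-checker code):

package main

import (
	"context"
	"fmt"
	"time"
)

// collectDiffDetails stands in for the async detail collection; in the real
// code this fans out CollectChecksum requests to the other replicas.
func collectDiffDetails(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend the collection takes a while
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := collectDiffDetails(ctx); err != nil {
			fmt.Println("could not rerun consistency check:", err)
		}
	}()
	// The bug: the caller cancels its context right after kicking off the
	// async task, so the task observes "context canceled" instead of a diff.
	cancel()
	<-done
}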

@bdarnell
Contributor

The log spam is a result of #10487; previously all of these deadline-exceeded errors were getting into the command queue and causing our memory spikes on delta (#10427). So this confirms that the problem is in fact in the GC queue. We can fix the spam by checking the context in the GC queue's loop and bailing out early. But this sounds like a symptom rather than the cause.
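A minimal sketch of that early bail-out (stand-in types and helpers, not the actual gc_queue.go code):

package main

import (
	"context"
	"fmt"
	"time"
)

// intent and pushTxn are hypothetical stand-ins for the real types and
// helpers used by the GC queue.
type intent struct{ txnID string }

func pushTxn(ctx context.Context, in intent) error { return ctx.Err() }

// resolveIntents checks the context at the top of the loop, so an expired
// deadline yields one returned error instead of a warning per remaining intent.
func resolveIntents(ctx context.Context, intents []intent) error {
	for _, in := range intents {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := pushTxn(ctx, in); err != nil {
			fmt.Printf("push of txn %s failed: %v\n", in.txnID, err)
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	err := resolveIntents(ctx, []intent{{txnID: "cf175f6e"}, {txnID: "be657d16"}})
	fmt.Println("bailed out with:", err) // one error, not thousands of log lines
}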

@a-robinson
Contributor Author

Yeah, the CPU profiles at least showed that the log spam and processing of old contexts weren't using up too much of the server's resources, so I don't see how it could be anything more than a symptom.

I'm going to deploy #10625 to the cluster so that hopefully we can get some more detail on the consistency problems and work from there on what the cause is.

@a-robinson
Contributor Author

And we have our first diff:
consistency_failure.txt

@a-robinson
Contributor Author

A couple more diffs. I think there were a couple more before I shut the cluster back down, but this should be enough.
consistency_failure2.txt
consistency_failure3.txt

@petermattis
Collaborator

If I'm parsing that correctly, the follower has a key (/Table/55/1/596266645108967248/"9e85fbc6-364b-4aec-b38c-7e7cbe51b2ed"/8027849/0) that isn't present on the leaseholder. This key was written at 2016-11-07 15:05:07.63308972. The difference in the range stats key would be expected if one replica has a key the other one doesn't.

I take it we don't have logs from that time period. It is fairly unfortunate that we weren't crashing earlier. This is a very serious bug. I'm doubtful there is anything we can do to recover the cluster at this point. I think we should back up all of the logs and state (e.g. /mnt/data{1,2}) to GCS and wipe the cluster in the hopes of catching this again.

@bdarnell Thoughts?

@a-robinson
Contributor Author

Hm, that timing very suspiciously aligns with when I deployed a new version to the cluster that required a stop-the-world upgrade.

As far as I can tell, I did indeed stop the world before bringing anything back up again, but I doubt the timing is a coincidence.

[screenshot: screen shot 2016-11-11 at 11 50 22 am]

@a-robinson
Contributor Author

To be specific, the version I brought up was built at 3e0f63d

@a-robinson
Contributor Author

Ah, I think I've been misunderstanding the stop-the-world process. I brought all nodes down before bringing any new ones up, but didn't explicitly freeze them, which I expect could explain this?

@a-robinson
Contributor Author

But #10420 says that an explicit freeze isn't necessary...

@petermattis
Collaborator

Stop the world just means taking all of the nodes down and restarting; freeze is different.

@a-robinson
Contributor Author

@andreimatei

@petermattis
Collaborator

The other consistency failures are at keys with timestamps of 2016-11-07 15:02:54.814791457 and 2016-11-07 15:02:56.692711143. It's very suspicious that these are clustered.

@a-robinson
Contributor Author

a-robinson commented Nov 11, 2016

Without having read through the details of #10420, my hunch would be that it (or something else in the new binary) did require a freeze.

@petermattis
Collaborator

Did you mean #10420?

@a-robinson
Contributor Author

Yeah. Typing is hard

@andreimatei
Contributor

I doubt that #10420 required a freeze... Even if it did, it would've required a freeze for the purposes of the lease records themselves, but that's not what's being flagged by the consistency checker.
The consistency checker seems to say that some random sql row is in one replica but not the other, so that would imply that one replica applied some command and the other didn't. But #10420 didn't introduce changes on the application path, only on the proposal path (or, at least, that was the idea).
I'm gonna try spinning up a cluster at just before #10420 and upgrade it to master, to see if I can repro. If I can't, then we can try going back in time further to earlier commits, and upgrading from those to master.

Is there a way to know in Grafana what version a cluster was running at any given time? I see a "Built timestamp" pane in Grafana that seems to contain metrics that have a beta version in their name, but I'm not sure how to use that to actually see what was running on a machine and when.

@bdarnell
Contributor

The timing is so suspicious (and the consequences so severe) that we should probably withdraw beta-20161110 until we figure this out.

I re-read #10420 and it still looks safe to me. We should really have better records of what was running before, but by doing ls -l cockroach.* on a delta node we see that the build prior to 3e0f63d was 9ce8833. That range includes these 94 commits (but not #10420, so unless I'm misreading something this deploy was prior to that PR being merged). The most obviously suspicious change there is #10327.

@petermattis
Collaborator

@andreimatei Do you mind trying to reproduce the consistency failure by upgrading from earlier SHAs to master? Per Ben's note, testing upgrades from 9ce8833 to master seems like the next logical step. I'm going to try as well. Hopefully one of us will hit pay dirt.

@bdarnell Have we withdrawn a beta before? Do we have a process for that?

@petermattis
Collaborator

No luck so far. I've tried running zerosum on a 4-node cluster on 9ce8833 and then again on master (2257842). zerosum performs consistency checks every 20s, which is part of the reason I tried it first. I'm going to try again with block_writer -batch 10 load like we're using on delta.

@petermattis
Collaborator

Ok, managed to reproduce consistency check failures going from SHA 9ce8833 to 2257842. The specific sequence I performed was:

~ rm -fr cockroach-data
~ ln -sf cockroach.9ce8833 cockroach
~ COCKROACH_CONSISTENCY_CHECK_INTERVAL=20s roachdemo -n 3
...
~ echo "range_max_bytes: $[4 * 1024 * 1024]" | ./cockroach zone set .default --file=-
~ block_writer -batch 10 -tolerate-errors
# run for 3 minutes, then kill node 3, then kill the cluster
~ ln -sf cockroach.2257842 cockroach
~ COCKROACH_CONSISTENCY_CHECK_INTERVAL=20s COCKROACH_CONSISTENCY_CHECK_PANIC_ON_FAILURE=true roachdemo -n 3
...
# crash after a few seconds

The consistency check failures look like:

F161112 14:09:49.197127 307 storage/replica_command.go:1903  [n1,s1,r6/1:/Table/5{0-1}] consistency check failed with 1 inconsistent replicas

I didn't get diffs for some reason, but given the reproducibility this shouldn't be difficult to track down now. I've also got the data directory from running on 9ce8833 so I can bisect to find out which SHA is incompatible. I've got a variety of errands and non-work work to do today, so any help tracking this down would be appreciated.

@petermattis
Collaborator

Note that before @a-robinson's recent fix, COCKROACH_CONSISTENCY_CHECK_PANIC_ON_FAILURE=true doesn't work and you have to manually grep for consistency check failures.

@petermattis
Collaborator

Bisection points the finger at 6c9190a (a.k.a. #10327). Now we need to figure out what exactly is causing the problem. My initial guess is the addition of non-nullable fields to ReplicatedProposalData. I'm pretty sure that isn't kosher.

@petermattis
Collaborator

Definitely not kosher to add non-nullable fields to ReplicatedProposalData, but that proto is only encoded upstream of Raft. I realized this after trying to prove those changes were at fault by applying the proto changes to the previous commit. So something more subtle about #10327 is causing consistency checker issues. Presumably on the apply path.

@petermattis
Collaborator

Looking at one of the consistency failures I see:

-  raw_key:/Local/RangeID/62/r/RangeLease raw_value:120408001000180020002800323718b6da75030a0c08f19098b2bfbf98c3141001120c08e8f8d380b7dd9ac31410001a06080310031802220c08e893b989b6dd9ac3141000
+  raw_key:/Local/RangeID/62/r/RangeLease raw_value:1204080010001800200028003237a429c3b9030a0c08829a91c8a4bf98c3141001120c08e8f8d380b7dd9ac31410001a06080310031802220c08e893b989b6dd9ac3141000

The raw_value is a hex-encoded MVCCMetadata proto, where MVCCMetadata.RawBytes is a roachpb.Value.RawBytes field that encodes a roachpb.Lease. Layers! Anyway, decoding the two values and diffing the two structures shows (using pretty.Diff):

[Start.WallTime: 1478977262434257009 != 1478977255232523522]

Sounds like #10327 changed how timestamps are assigned when applying commands. That should narrow down the search.
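For reference, the decoding described above looks roughly like this; the import paths and helpers are my best guess at the tree around that time, so treat it as a sketch:

package main

import (
	"encoding/hex"
	"fmt"
	"log"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
	"github.com/gogo/protobuf/proto"
	"github.com/kr/pretty"
)

// decodeLease peels the layers: hex -> MVCCMetadata -> roachpb.Value -> roachpb.Lease.
func decodeLease(hexVal string) (roachpb.Lease, error) {
	var lease roachpb.Lease
	b, err := hex.DecodeString(hexVal)
	if err != nil {
		return lease, err
	}
	var meta enginepb.MVCCMetadata
	if err := proto.Unmarshal(b, &meta); err != nil {
		return lease, err
	}
	v := roachpb.Value{RawBytes: meta.RawBytes}
	err = v.GetProto(&lease)
	return lease, err
}

func main() {
	// The two raw_value hex strings from the diff above.
	const valA = "120408001000180020002800323718b6da75030a0c08f19098b2bfbf98c3141001120c08e8f8d380b7dd9ac31410001a06080310031802220c08e893b989b6dd9ac3141000"
	const valB = "1204080010001800200028003237a429c3b9030a0c08829a91c8a4bf98c3141001120c08e8f8d380b7dd9ac31410001a06080310031802220c08e893b989b6dd9ac3141000"
	leaseA, err := decodeLease(valA)
	if err != nil {
		log.Fatal(err)
	}
	leaseB, err := decodeLease(valB)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(pretty.Diff(leaseA, leaseB)) // e.g. Start.WallTime differs
}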

@bdarnell
Contributor

The one time we've withdrawn a beta we had the fix ready immediately so we just pushed out a new binary with the same filename (this is not what we should be doing, but it was soon enough after the release that it hopefully wasn't too harmful). We've never yanked a beta without something to replace it with.

I think the procedure would be to remove the links from the release notes (but leave the page up), change the "current version" constant to point back to 1103, make a post on the forum about it, and probably delete the tarballs from s3.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 13, 2016
@bdarnell
Contributor

bdarnell commented Nov 13, 2016

I think I've found it. In evaluateProposal, we initialize certain fields of ReplicatedProposalData, and in processRaftCommand we use those fields.

The problem is that post-raft, without propEvalKV, we call processRaftCommand and then evaluateProposal, so if the command was not proposed under a build with this PR, those fields will be empty at the time they are needed and our lease checks will have false positives.
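Schematically, the hazard is an apply-time check that reads fields which, for commands proposed by the old binary, were never populated (a deliberately simplified, stand-in model; a zero timestamp trivially precedes any lease expiration, as elaborated further down):

package main

import "fmt"

// replicatedProposalData models only the field relevant here; the real proto
// has many more. Commands proposed by an older binary arrive with it zeroed.
type replicatedProposalData struct {
	Timestamp int64 // nanos; zero if the proposer predates the change
}

// leaseStillValid models the apply-time lease check: a zero command timestamp
// compares as "before" any lease expiration, so the check passes vacuously;
// that is the false positive described above.
func leaseStillValid(cmdTimestamp, leaseExpiration int64) bool {
	return cmdTimestamp < leaseExpiration
}

func main() {
	var rpd replicatedProposalData // never populated for old-binary commands
	fmt.Println(leaseStillValid(rpd.Timestamp, 1478977262434257009)) // true
}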

@bdarnell bdarnell self-assigned this Nov 13, 2016
@andreimatei
Contributor

I'm really glad y'all have found this!
Ben, "lease checks will have false positives" in your context here means that processing a command will consider a lease to not be valid when in fact it should be considered valid, correct?

Since this reproduces so easily with roachdemo (so, I guess, the point is that it's so easy to have in-flight commands that were applied by some nodes before a shutdown and some after), I think we should have an acceptance test that tests upgrades from the last beta to the last PR. I can volunteer to work on a more deterministic way to wait for a consistency check and to build such a test.

@petermattis
Collaborator

The problem is that post-raft, without propEvalKV, we call processRaftCommand and then evaluateProposal, so if the command was not proposed under a build with this PR, those fields will be empty at the time they are needed and our lease checks will have false positives.

I haven't had a chance to look at this today. Is there a straightforward fix?

I think we should have an acceptance test that tests upgrades from the last beta to the last PR. I can volunteer to work on a more deterministic way to wait for a consistency check and to build such a test.

I'd like to see that test, though it might be mildly tricky to write.

@bdarnell
Contributor

When I said "false positives" for lease checks, I meant that commands might succeed even though they don't have a valid lease (they show up with a zero timestamp, which is always before the lease's expiration). But it's complicated, and I don't know all the ways in which this might fail (it looks like we're also seeing the leases themselves diverge).

To fix this convincingly, I think we're going to need a refactoring of the propEvalKV code so that evaluateProposal is only called once (instead of once before and once after raft). I don't think it will be too hard, but maybe I just haven't hit the roadblocks that led to this structure in the first place. The root cause here relates to some proto changes that were put in place to minimize code duplication between the propEvalKV and legacy modes, so the fix may have to introduce some duplication.

@a-robinson
Contributor Author

Nice work figuring this out!

If the fix is particularly onerous, couldn't we just require a freeze before upgrades that cross the relevant commit boundary? I realize that kind of sucks, but we're already requiring a stop-the-world process for these upgrades anyway.

@bdarnell
Contributor

freeze-cluster is not reliable (#7038 and #7238), so we'll need to fix that before we do anything that relies on it. We should probably prioritize that, because until propEvalKV goes in it's easy to accidentally make a change that is only safe with freeze-cluster. (On the other hand, I think with the way propEvalKV is now structured (behind a flag, with both versions supported in one binary) it might actually be possible to switch without freeze-cluster, so we may never actually need freeze-cluster.)

@spencerkimball
Member

There's another big problem somewhere in our replica processing which might be the cause of the original inconsistency (although I agree this lease problem with ProposalData could also be the cause). The problem I'm seeing is what's causing the Jepsen duplicates failures. Here's the basic outline of events:

  • Leader proposes command to Raft
  • Network goes very flaky and nothing gets through to followers
  • 15s later the leader reproposes the command (via Replica.refreshProposalsLocked)
    • This causes Replica.tryAddWriteCmd to retry
    • On the retry, the lease is no longer held and Replica.tryAddWriteCmd returns a NotLeaseHolderError
  • Despite reproposal, the two original followers apply the original Raft command, leaving inconsistent replicas
    • The two original followers have the command applied
    • But the original leader doesn't

This is very repeatable via the Jepsen duplicates test. It looks like re-proposing Raft commands in Replica.refreshProposalsLocked is not safe.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 14, 2016
This method was previously called both before and after raft, and the
after-raft logic relied on fields that were set in the
ReplicatedProposalData by the before-raft logic. This caused
inconsistencies when cockroachdb#10327 was deployed without freeze-cluster.

Now, evaluateProposal is called only once, either before or after raft,
and the post-raft logic handles the command in whichever fashion is
appropriate for the version that proposed it.

Fixes cockroachdb#10602
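Roughly the shape of that fix, as a stand-alone sketch with stand-in types (not the actual CockroachDB code):

package main

import "fmt"

// Stand-in types; the real ones are roachpb.BatchRequest,
// ReplicatedProposalData, the Raft command proto, and so on.
type writeBatch struct{ descr string }

type raftCommand struct {
	preEvaluated bool        // true iff the proposer ran propEvalKV
	batch        *writeBatch // populated pre-raft when preEvaluated
	request      string      // the original request, for legacy commands
}

func evaluateProposal(request string) *writeBatch {
	return &writeBatch{descr: "effects of " + request}
}

func applyWriteBatch(b *writeBatch) { fmt.Println("applying", b.descr) }

// processRaftCommand evaluates exactly once: either the proposer already did
// it (propEvalKV path) or it happens here, for commands from older proposers.
func processRaftCommand(cmd raftCommand) {
	b := cmd.batch
	if !cmd.preEvaluated {
		b = evaluateProposal(cmd.request) // legacy path: evaluate post-raft
	}
	applyWriteBatch(b)
}

func main() {
	processRaftCommand(raftCommand{request: "a legacy-proposed write"})
}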
@spencerkimball
Member

OK, after discussing with @andreimatei, I think there are two issues here. One I know how to solve and the other needs some deeper digging.

  • Raft re-proposals, if they end up returning an error, need to escalate that error to AmbiguousResultError instead. This is because we don't know whether the command was applied or not in some cases. For example, in this case where we've lost the range lease but there's a command possibly still in the Raft black box (see the sketch after this list).
  • Need to look more closely at the Raft results, because I was seeing only two replicas getting the first of the duplicated inserts. Hopefully I was missing something and there isn't an inconsistency issue that might also be present. I'm going to put in some more precise logging.
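A sketch of the escalation in the first bullet above (stand-in types only, not the real roachpb error plumbing):

package main

import (
	"errors"
	"fmt"
)

// AmbiguousResultError is a stand-in for roachpb.AmbiguousResultError: it
// tells the client "the outcome is unknown" rather than "this failed".
type AmbiguousResultError struct{ Wrapped error }

func (e *AmbiguousResultError) Error() string {
	return fmt.Sprintf("result is ambiguous: %v", e.Wrapped)
}

// maybeEscalate applies the rule above: once a command has been reproposed,
// an error from the retry can no longer be reported as a clean failure,
// because the earlier proposal may still apply inside Raft.
func maybeEscalate(err error, reproposals int) error {
	if err != nil && reproposals > 0 {
		return &AmbiguousResultError{Wrapped: err}
	}
	return err
}

func main() {
	notLeaseHolder := errors.New("NotLeaseHolderError")
	fmt.Println(maybeEscalate(notLeaseHolder, 1))
}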

@andreimatei
Contributor

@bdarnell:

When I said "false positives" for lease checks, I meant that commands might succeed even though they don't have a valid lease (they show up with a zero timestamp, which is always before the lease's expiration). But it's complicated and I don't know all the ways in which this might fail (it looks like we're also seeing the leases themselves diverge)

That doesn't quite make sense to me. I believe you're saying that some commands failed to be applied by nodes that tried to apply them before the restart, and are applied by nodes that try to apply them after the restart. But how exactly can we get this situation? Why didn't the nodes that tried to apply them before the restart succeed in applying them, given that the lease is also checked before proposing?
It seems to me like the more likely explanation for the replica divergence is the reverse of what you say - commands are applied by replicas before the restart and fail to be applied after the restart, because of this check: !l.OwnedBy(origin.StoreID) (given that origin.StoreID is uninitialized). So it's about the StoreID not being initialized, not about the raftCmd.Timestamp.
No?

This also reminds me that I think Tobi had plans to put the lease under which a command has been proposed into the raft command. Although I don't think that's actually necessary given that we don't wipe r.mu.state.lease during a restart.

@bdarnell
Contributor

@andreimatei: OriginReplica is being set because that predates the propEvalKV work, so that's not it. I now think that the most likely cause of the inconsistency is the ReplicatedProposalData.IsLeaseRequest flag: This is what allows LeaseRequests to be executed even though they are not proposed by the leaseholder. So the lease could be granted on replicas that applied it before the restart, but rejected on replicas that applied it afterwards.

@spencerkimball Please file a separate issue for the things you're looking into.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 15, 2016
@spencerkimball
Member

Good news. More logging has revealed that the Jepsen duplicates issue is not also creating inconsistencies. The missing write to the original leader is accounted for by a snapshot.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Nov 16, 2016