-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: scrub/all-checks/tpcc/w=1000 failed #35986
Comments
This was another OOM. The heap profile that got written looks extremely similar to the last one (#35790):
|
SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/5267932f6fec0405b31328c1ad43711b0bb013e5 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1220238&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/b88a6ce86bfe507e14e1e80fdaefd219b5f0b046 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1222804&tab=buildLog
|
@andreimatei is dusting off (well, hasn't really had much time to gather dust) the roachtest stresser. It looks like we need to treat this as a release blocker until we have a reason to believe that 19.1 is unaffected. I'm adding it to the list. cc @bdarnell |
cockroach/pkg/workload/tpcc/checks.go Lines 160 to 162 in d2ebf14
This check (and a few others in this file) are imprecise: If the server crashes while results are being streamed back (or otherwise returns an error), it will report this as a data inconsistency instead of the error that occurred. We should be sure to report There were no server crashes during this run, and I don't see a way that these read-only check statements could return a late error, but when we do manage to reproduce this we should make sure it's not a false positive. |
Actually I'm not a good owner for this. I'm haven't gotten to testing anything today. I don't know if I'll do anything over the weekend, and then on Tuesday I'm off for a few days. Another victim might be better :) |
More sketchiness in the check: it does two separate read-only transactions, without syncing them using AOST. So if it's run on a non-idle cluster, it could have false positives. I think there's some reason to think that might be at least a possibility here (although I'm not convinced). The test runs tpcc for 2 hours, and during that time it runs SCRUB three times, and expects them to complete in less than that time. A comment indicates that SCRUB is expected to take 35m. In this instance, the three SCRUBs took 33m, 45m, and 40m, adding up to more than 2h (with a 10m sleep at startup). The last scrub finishes after tpcc has stopped:
"Stopping" TPCC in this case means cancelling all the workers. If anything is in a cancellation-insensitive retry loop on the server side, it could keep going. The check that failed started at 17:11:24, 33s after the last SCRUB finished. But p90 latency was 90s, indicating that the scrub test is either badly overloaded or experiencing high contention. So if there was enough stuff queued up at the end of the test that something went on to get its timestamp pushed by 30+ seconds and still commit, we could explain the failure. The scrub test runs tpcc-1000 on a 4-node, 4 vcpu/node cluster. That's not enough to keep up with the workload. Is it bad enough to queue up enough work (perhaps exacerbated by long-running scrub queries to push timestamps) that something could still be running by the time the check starts? It seems unlikely, but maybe? |
Aha, tpcc specifically detaches its transactions from the context that gets cancelled: cockroach/pkg/workload/tpcc/worker.go Lines 195 to 201 in 1a5eaba
That increases the odds that the last round of contending transactions fight it out for 30+ seconds considerably. |
Interesting. So the theory is that the check is buggy and the trigger for that isn't rare? It is true that the scrub tests are vastly underprovisioned. There seems to be enough evidence that this might've been the problem. While we scrutinize and fix the checks, we should also improve them so that they at least log the numbers they're comparing. In an ideal world, they would also run a verbose query that would return the inconsistency found, though if there is one we're going to want to save the data directory anyway. We could add an S3 bucket to which we can upload the store directories in such a case. This would also be used by the replica consistency checks in roachtests. cc @bobvawter |
The checks are clearly buggy and should fail easily if run concurrently with a live workload (at least with The speculative part is that maybe the overload condition is leaving some activity running even after the TPCC run has reached its end. The context leak in #35986 (comment) is less of a smoking gun than it initially appeared, since the checks are run from a separate So for this to be the problem, we need some transactions to be queued up in such a way that in the absence of client activity, they will push their timestamps out by a significant amount but still be able to have server-side retries and commit. That seems implausible to me on the time scales we're talking about. There's a ton of this in the logs, all at the same time when the test ends:
So we can see that there were a lot of pending transactions that apparently got canceled when the client went away, but because we weren't able to process all the cancellations in time maybe something kept going. Normally I'd say this is a sign that we need to replace RunAsyncTask with RunLimitedAsyncTask to provide backpressure, but there's nothing to backpressure in this case. It could still be a win if it delays the start of each async rollback's timeout, though. The goroutine count on n1 was 11k before the |
The check queries run for over three minutes. This makes them eligible for follower reads for the latter half of their run. The overload conditions in this test may stress certain aspects of follower reads, such as how they work when nodes (both leader and follower) are far behind on their raft logs, or intents resolving with their timestamps pushed significantly ahead of their original timestamps. |
SHA: https://github.com/cockroachdb/cockroach/commits/6da68d7fe2c9a29b85e2ec0c7e545a0d6bdc4c5c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1226521&tab=buildLog
|
I haven't reproduced the check failure yet, but after watching some normal runs we don't seem to be getting very large raft logs. I've just spot-checked, but every data range I've looked at has a last log index of no more than 30. I also haven't seen any mismatched applied indexes; all the followers are up to date with the leader on the ranges I've looked at. What we see instead is a lot of invalid leases. During the test run, one or more nodes misses a heartbeat, invalidating a lot of leases. These leases are reestablished very slowly while the checks are running, and then at the end of the checks all the remaining ranges get leases very quickly. I'm not sure whether that's a sign that there's a bottleneck somewhere preventing leases from being acquired quickly or if there's something about the access patterns of the checks that touches a lot of formerly-idle ranges at the end. So during the check that failed, our read RPCs are likely to hit a lot of ranges without a valid lease. For fresh reads, that would force a lease acquisition. For follower reads, it depends on how the closed timestamp has advanced. My current working hypothesis:
|
No, that's not how it works. We check for this. Maybe there were two separate quiescence events:
|
What makes you think that this has to do with the check queries and not queries in the TPC-C workload? It's worth noting that a large number of TPC-C queries take well over 45 seconds (see the P95 values during the test). My intuition is the stale read for one of the order updating queries is more likely to be the culprit for problems than anything else. Do we anticipate that the scrub would interfere with the check?
I'd like to understand how this happens. The replica which has failed to heartbeat its liveness record should still be able to serve reads below the old closed timestamp, but in theory shouldn't those reads be safe? Also, and this now is pushing on things I don't fully understand, when the node becomes live again, shouldn't the epoch have changed? If we're generally worried about correctness of follower reads in read-write transactions we could take a more dramatic step to prevent follower reads in all read-write transactions. We could make clients explicitly opt in to them by setting a header flag and enforcing that we only set such a header for read only transactions in the conn executor. |
A fair question. When/if we manage to reproduce this we can easily distinguish between a problem in the persisted data and a problem in the check. But in the meantime, I believe it's more likely to be a problem in the check. The failing check verifies that the
I can't think of any reason it would.
Yes. The (potential) problem is that there is special logic for advancing the closed timestamp of a quiesced range. (there was a lot of discussion about this on the RFC but now I'm just paging all this stuff back in and trying to figure out where we left things).
Not necessarily. The epoch is bumped when one node steals another's leases. A node can go from live to dead and back again and reuse its old epoch. (also, in this scenario it's the follower that becomes non-live, but we usually only care about epochs on leaders/leaseholders)
Didn't we already do that? #35969 |
Quiescence looks like a red herring. An early version of the original follower reads RFC had special treatment for quiescence, but this was dropped in #26362. |
Cool, I was about to remark that I don't see much in the way of special logic around quiesced ranges. We always seem to read the
Ack. If I understand the state of the discussion, we're back to not really having a plausible theory for how this might have happened. It seems to me that either (a) the data actually inconsistent at check time or (b) the data was not inconsistent at check time but the check transaction failed to observe a consistent snapshot The argument against (a) is that:
(b) also seems somewhat unlikely given that we think that all writing transactions from workload run had concluded before the workload check began. If we were to observe a stale read from a follower during the check transaction, it seems like it would have to be quite stale. The other thing going on here is that there's a concurrent SCRUB but as I understand it that scrub should not be writing in a way that might interfere with reads. Is there anything I'm missing? |
Yup, except that there is no "check transaction". The check runs two non-transactional statements at a time while the database is believed to be idle, and expects them to both see consistent data.
That's an argument against a follower read causing us to insert inconsistent data, not a complete argument against (a). It could also be caused by something like the issue fixed in #32166, for example.
Right. That's why I was looking at quiescence issues, since a quiescent range could be arbitrarily far behind.
Note that the scrub is concurrent with the main tpcc workload, not with the checks. By the time the checks start they're supposed to be the only thing running. But scrub is just a normal read and I don't see any reason to suspect it. Instead of the scrub, I believe the reason this test saw an issue that none of our other TPCC tests did is that it uses a very under-provisioned cluster. |
Continuing to iterate on the theory from #35986 (comment) Incrementing the epoch causes leases to be non-equivalent, which makes it impossible for a new epoch's lease to make it into
At 04:11:51, 04:11:55, and 04:12:00, we see three separate heartbeats that take over 4 seconds. The first one times out (at 4.5s), so when the second one starts the first is still pending in the raft log. The second one has a CPut failure, which gets logged as "heartbeat failed on epoch increment" even though that's not necessarily what happens. cockroach/pkg/storage/node_liveness.go Lines 569 to 577 in 23e418e
A CPut failure is turned into either errNodeAlreadyLive or ErrEpochIncremented, but there is also a third case: a previous heartbeat succeeded, invalidating our CPut expected value, but is no longer live by the time we apply. All of n4's leases, including the one for r5542, were considered expired and many of them entered this cockroach/pkg/storage/replica_range_lease.go Line 291 in 23e418e
When the third heartbeat succeeds at 04:12:00, all the leases are unblocked.
(note that despite the log spam, there is some batching; we're not hammering the liveness range with redundant heartbeats here) At this point, we've successfully heartbeat our existing epoch, so our old "expired" lease has been revived! The code in The new proposed lease is identical to the old one except for the |
I poked around a bit too but can't find anything going wrong here neither. In the end I looked into what happens with closed timestamps when the epoch becomes stale (and later non-stale). The answer is, nothing interesting: cockroach/pkg/storage/closedts/provider/provider.go Lines 151 to 161 in 7037b54
We just don't close out a timestamp and leave the tracker alone. When we become live again, it is emitted, as you'd expect to. Time for a new repro. |
SHA: https://github.com/cockroachdb/cockroach/commits/9e6ae3cc37e7691147bb6f5d1a156ebe4c5cf7f9 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1245443&tab=buildLog
|
I don't have another repro, but I have a theory. The key is the fact that the node lost and renewed its liveness several times, combined with weak input validation in
I think the fix is that whenever we perform a liveness action in the course of requesting a lease, we have to adjust our start time to a value after the last known change to the liveness record. |
In 1-3, shouldn't the call to cockroach/pkg/storage/replica_range_lease.go Lines 305 to 307 in 29554aa
and the cockroach/pkg/storage/replica_range_lease.go Line 878 in 29554aa
This means, or rather, should mean, that the I also looked at cockroach/pkg/storage/node_liveness.go Lines 712 to 732 in a134caa
I didn't see anything in I generally agree that a bug in this area still makes sense and hope that I'm missing something in the above "refutation". |
There's this: cockroach/pkg/storage/node_liveness.go Lines 723 to 725 in a134caa
If our CPut fails because of an expiration mismatch, we then check the So we need to add a third actor to this drama, replacing step 4 from #35986 (comment) : 4.1: After n4's second missed heartbeat, another node (call it n2) sends another epoch increment at t4. This increment succeeds. |
I think this is a bug. Is it the bug that happens here? Parts of it sound far-fetched, especially the fact that n1's IncrementEpoch starts at step 2 but doesn't complete until step 4.2 after both a heartbeat from n4, a second liveness expiration, and an IncrementEpoch from n2. However, there are factors in the node liveness implementation that make this more plausible. A semaphore in the liveness module insures that only one IncrementEpoch is in flight at a time. Since these are performed with a background context, there is no timeout that will fire while they're in the queue (there is a timeout that starts once the RPC begins). So when a node fails liveness, many different ranges may independently reach the IncrementEpoch path and queue up their own redundant increments. Many of these will hit the fast path at the start of updateLivenessAttempt, but we could still queue up a bunch of them while things are bogged down (as we can see from the log spam when the jam is cleared). So I think with some things backlogged on the liveness semaphore, and others waiting on the KV machinery on the liveness range's leaseholder, we could see such extreme reorderings like that. The real question is how we can test this hypothesis. I don't see a good way to induce this kind of timing in a roachtest (maybe by hacking a sleep into IncrementEpoch?). I could probably manually reproduce this timing in a unittest, but it would be very fiddly and doesn't tell us whether it's what's really happening. |
IncrementEpoch was failing to distinguish between the request that caused the increment and another caller making the same increment. This is incorrect since a successful increment tells you *when* the node was confirmed dead, while relying on another node's increment leaves this uncertain. In rare cases involving a badly overloaded cluster, this could result in inconsistencies (non-serializable transactions) due to incorrect timestamp cache management. Fixes cockroachdb#35986 Release note (bug fix): Fix a rare inconsistency that could occur on badly overloaded clusters.
This comment has been minimized.
This comment has been minimized.
^- no actionable artifacts because #36562 hasn't landed. @andreimatei do you have any cycles to help me push that over the finish line? |
IncrementEpoch was failing to distinguish between the request that caused the increment and another caller making the same increment. This is incorrect since a successful increment tells you *when* the node was confirmed dead, while relying on another node's increment leaves this uncertain. In rare cases involving a badly overloaded cluster, this could result in inconsistencies (non-serializable transactions) due to incorrect timestamp cache management. Fixes cockroachdb#35986 Release note (bug fix): Fix a rare inconsistency that could occur on badly overloaded clusters.
36942: storage: Don't swallow ErrEpochAlreadyIncremented r=tbg a=bdarnell IncrementEpoch was failing to distinguish between the request that caused the increment and another caller making the same increment. This is incorrect since a successful increment tells you *when* the node was confirmed dead, while relying on another node's increment leaves this uncertain. In rare cases involving a badly overloaded cluster, this could result in inconsistencies (non-serializable transactions) due to incorrect timestamp cache management. Fixes #35986 Release note (bug fix): Fix a rare inconsistency that could occur on badly overloaded clusters. Co-authored-by: Ben Darnell <[email protected]>
IncrementEpoch was failing to distinguish between the request that caused the increment and another caller making the same increment. This is incorrect since a successful increment tells you *when* the node was confirmed dead, while relying on another node's increment leaves this uncertain. In rare cases involving a badly overloaded cluster, this could result in inconsistencies (non-serializable transactions) due to incorrect timestamp cache management. Fixes cockroachdb#35986 Release note (bug fix): Fix a rare inconsistency that could occur on badly overloaded clusters.
IncrementEpoch was failing to distinguish between the request that caused the increment and another caller making the same increment. This is incorrect since a successful increment tells you *when* the node was confirmed dead, while relying on another node's increment leaves this uncertain. In rare cases involving a badly overloaded cluster, this could result in inconsistencies (non-serializable transactions) due to incorrect timestamp cache management. Fixes cockroachdb#35986 Release note (bug fix): Fix a rare inconsistency that could occur on badly overloaded clusters.
I had to study this code for cockroachdb#35986, so I took the opportunity to update the documentation while it's fresh in my mind. Release note: None
The scrub roachtest was previously running tpcc-1000 on a cluster of 12 total vcpus, which is not enough (it needs ~double that). This exposed a lot of interesting issues like cockroachdb#35986, but it's only incidental to the main purpose of this test (and it's also flaky due to uninteresting problems associated with overloading). Switch the test to tpcc-100 so it can be stable; we'll reintroduce a test dedicated to overload conditions in the future (when we can make it stable). Fixes cockroachdb#35985 Fixes cockroachdb#37017 Release note: None
37046: roachtest: Shrink scrub workloads r=lucy-zhang a=bdarnell The scrub roachtest was previously running tpcc-1000 on a cluster of 12 total vcpus, which is not enough (it needs ~double that). This exposed a lot of interesting issues like #35986, but it's only incidental to the main purpose of this test (and it's also flaky due to uninteresting problems associated with overloading). Switch the test to tpcc-100 so it can be stable; we'll reintroduce a test dedicated to overload conditions in the future (when we can make it stable). Fixes #35985 Fixes #37017 Release note: None Co-authored-by: Ben Darnell <[email protected]>
I had to study this code for cockroachdb#35986, so I took the opportunity to update the documentation while it's fresh in my mind. Release note: None
36984: storage: Add comments about leases and liveness r=tbg a=bdarnell I had to study this code for #35986, so I took the opportunity to update the documentation while it's fresh in my mind. Release note: None 37100: roachtest: Hopefully deflake election-after-restart r=tbg a=bdarnell Logs suggest that we're seeing failures when nodes aren't caught up on all the split processing before the restart, so give this a chance to finish. Updates #35047 Release note: None Co-authored-by: Ben Darnell <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog
The text was updated successfully, but these errors were encountered: