roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [congmgr assertion failed] #53540
Comments
or, unredacted:
|
@nvanbenschoten can you take a look at this on Monday? This is blocking 20.2.0-beta.1. |
@knz while here, can we discuss how log redaction should work in this case? We have |
the new error redaction code should have taken care of this already without any changes. I am surprised it didn't. Or maybe this failure isn't from running off |
oh wait I see an oddity specific to |
I've filed #53700 which I believe is restricted to I manually ran Are you able to reproduce this error output at will? if so, can you alter the |
Should help track down cockroachdb#53540.
I'm able to hit this pretty easily (though only after 4-6 hours) on eaa939c, but haven't been able to hit it yet on 95a13a8. That indicates that the issue was introduced somewhere in the ~200 commits between these two SHAs. I haven't had much luck digging in with some of the improved logging in #53836 because it takes so long to hit this, so I'm planning on kicking off a long-running git bisect to hopefully narrow the search space down to something a little more manageable. There were two bumps to Pebble in this period: d15daba and 8b9391d. The first of these, in particular, is a little scary because it includes these two optimizations to iterator seeking: cockroachdb/pebble@76afcca and cockroachdb/pebble@a24bc6c. |
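For a rough sense of what a bisect over that range costs, here is a minimal Go sketch of the binary search that git bisect performs. It is illustrative only: `bisectSteps` and `firstBad` are hypothetical helper names, the commit range is modeled as indices 0..n-1, and the sketch assumes a reliable good/bad verdict per commit. That last assumption does not really hold here, since a "good" verdict only means the failure did not show up within one run of several hours.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bisectSteps returns the worst-case number of commits that must be tested
// to isolate the first bad commit among n candidates, i.e. ceil(log2(n)).
func bisectSteps(n int) int {
	if n <= 1 {
		return 0
	}
	return bits.Len(uint(n - 1))
}

// firstBad returns the index of the first bad commit among n commits
// (n >= 1), ordered oldest to newest. It assumes the last commit is bad
// and that isBad is monotone: once a commit is bad, all later ones are too.
func firstBad(n int, isBad func(i int) bool) int {
	lo, hi := 0, n-1
	for lo < hi {
		mid := lo + (hi-lo)/2
		if isBad(mid) {
			hi = mid // first bad commit is at mid or earlier
		} else {
			lo = mid + 1 // first bad commit is after mid
		}
	}
	return lo
}

func main() {
	const commits = 200 // approximate size of the range from the comment above
	fmt.Println("worst-case commits to test:", bisectSteps(commits))

	// Hypothetical regression at index 137, used only to exercise the search.
	culprit := firstBad(commits, func(i int) bool { return i >= 137 })
	fmt.Println("first bad commit index:", culprit)
}
```

With roughly 200 commits that is about 8 test rounds, and each round has to run long enough to be reasonably confident that a commit is actually good, which is why the bisect is long-running.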
@sumeerbhola can speak to the safety of those Pebble iterator optimizations. I believe he thought they were fairly safe. Usually if we add a bug to iterators the Pebble metamorphic test catches it. |
Eeek! @nvanbenschoten is this error more likely due to an iterator not returning a key when it should, or returning an old version of a key? |
FWIW, I've been stressing the Pebble metamorphic test for 30min now without failure. I'll leave it running as it sounds like this is a very rare problem. |
I think it's more likely due to an iterator returning an old version of a key. For instance, this would be the symptom if an older version of an MVCCMetadata key was revived. But let's wait to see what this last round of testing shows. I'd like to be very sure about the bisect before pointing the finger in Pebble's direction. |
@petermattis @sumeerbhola This just finished, again with no instance of this error. So this failure does appear to be a result of that change. |
I'm unclear on the commit history based on the output of Perhaps the cause is actually an earlier commit to |
They're back-to-back in https://github.com/cockroachdb/cockroach/commits/d15daba173ac5f7601c70ca6dad57a5a21c2c77b. Am I confusing myself with merge commits? I was comparing d15daba vs. 95a13a8 (its parent). |
Misread the code -- there isn't a bug there. Unlike |
@sumeerbhola try running a bunch of them concurrently, see the |
Oh, looks like you got a "dead node" from your description? It should be in the log files, |
There are logs in both
|
There was a collection of patches that I needed to run on top of d15daba to avoid other recently fixed failures. It looks like they were 3c09757 and 5430170. 0a9ee34 is probably also a good one to have. If you're looking to reproduce this, I'd suggest running |
@sumeerbhola do you remember what happened here? We tracked this down to d15daba and then what? I'm assuming the issue was resolved, so I'm hoping we can point to the fix and close this issue. |
The Pebble optimization bugs were reproduced in Pebble tests, and the fixed optimizations were re-introduced in a series of Pebble PRs: cockroachdb/pebble#947, cockroachdb/pebble#949, and cockroachdb/pebble#951. The corresponding Pebble bump in CockroachDB was merged on Oct 23 in #55895. |
Thanks! So it sounds like we can close this then. |
(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on provisional_202008261913_v20.2.0-beta.1@eaa939ce6548a54a23970814ff00f30ad87680ac:
More
Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:
#53443 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure O-roachtest O-robot branch-provisional_202008241848_v20.1.5 release-blocker]
#53414 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure O-roachtest O-robot branch-release-19.1 release-blocker]
#52337 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure O-roachtest O-robot branch-provisional_202008031850_v20.2.0-alpha.3 release-blocker]
#52206 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure O-roachtest O-robot branch-release-20.1 release-blocker]
#50698 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure O-roachtest O-robot branch-release-19.2 release-blocker]
#48255 roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [C-test-failure O-roachtest O-robot branch-master release-blocker]
See this test on roachdash
powered by pkg/cmd/internal/issues