jepsen: Bank test failed #17815
Comments
Are you able to reproduce this easily? If so, could you try with a patch to
Yes. It's not quick (it takes several hours at least), but it's easy to try.
Could you kick off a test using my branch from #17939?
I'm running a test with #17939 now. (I had tried to run it this morning, but forgot that the jepsen test scripts don't automatically rebuild the binary, so I was testing the wrong build.) We'll know tonight whether this fixes it, or at least have some supporting evidence: it appears to reproduce about every 3 hours, but I don't have many data points to support that estimate. We see a lot of "context canceled while in command queue" messages in the logs (often affecting PushTxn), so I think it's likely to be the same issue.
With #17939, it's passed six hours of test runs (before, it hadn't passed more than three hours in a row). I'll keep running it but so far it's looking good.
@bdarnell that's a very good sign. Did this run overnight?
No, I've just been doing a series of manual 1-hour runs. Longer runs can produce tons of logs and require more analysis time at the end of the run.
Ok, I just merged #17939 so if we see this issue again we'll know that the CommandQueue fix didn't fully resolve this.
OK, after a few more passing test runs I'm going to call this fixed by #17939.
The failure in cockroachdb#17815 took several hours of testing to show up. We run 30 different configurations (combinations of test and nemesis) so we don't want to run everything for too long every night, but we can do more than the current 3m per configuration.
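As a rough illustration of the trade-off described above, here is a minimal sketch of the nightly time budget. It assumes the 30 configurations run one after another, which the thread does not state; the 300-minute budget and the function names are hypothetical.

```python
NUM_CONFIGS = 30                 # test/nemesis combinations, from the text above
CURRENT_MINUTES_PER_CONFIG = 3   # the current "3m per configuration"


def nightly_minutes(per_config_minutes, num_configs=NUM_CONFIGS):
    """Total wall-clock minutes if every configuration runs sequentially (assumption)."""
    return per_config_minutes * num_configs


def max_minutes_per_config(budget_minutes, num_configs=NUM_CONFIGS):
    """Longest whole-minute run per configuration that fits in a nightly budget."""
    return budget_minutes // num_configs


current_total = nightly_minutes(CURRENT_MINUTES_PER_CONFIG)
# 300 minutes is a made-up budget, used only to show the arithmetic.
longer_runs = max_minutes_per_config(300)
print(current_total, longer_runs)
```

Under these assumptions the current setting costs 90 minutes per night, and a 300-minute budget would allow 10 minutes per configuration.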
The jepsen bank test run on Aug 19 failed with a bunch of wrong-total errors:
(The errors continue; it looks like one error changed the total from 50 to 48, and every subsequent read then reported the error again.)
The bank test has not had a recent history of failures like this, although it has a history of flakiness for other reasons so it's hard to tell when this rare failure may have been introduced.
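For readers unfamiliar with the check behind wrong-total, here is a minimal sketch of the invariant: transfers move money between accounts, so every read of all balances should sum to the same total. This is not the Jepsen checker itself. The function name, the five-account layout, and the sample reads are hypothetical; only the totals 50 and 48 come from the report above.

```python
EXPECTED_TOTAL = 50  # total from the report; the account layout below is made up


def find_wrong_totals(reads, expected_total=EXPECTED_TOTAL):
    """Return (read_index, observed_total) for each read that breaks the invariant."""
    errors = []
    for i, balances in enumerate(reads):
        total = sum(balances.values())
        if total != expected_total:
            errors.append((i, total))
    return errors


# Hypothetical history: each read maps account id -> balance.
reads = [
    {0: 10, 1: 10, 2: 10, 3: 10, 4: 10},  # initial state, total 50
    {0: 8, 1: 12, 2: 10, 3: 10, 4: 10},   # valid transfer of 2, total 50
    {0: 8, 1: 10, 2: 10, 3: 10, 4: 10},   # 2 units lost, total 48
    {0: 8, 1: 10, 2: 7, 3: 13, 4: 10},    # later valid transfer, still 48
]

errors = find_wrong_totals(reads)
print(errors)  # [(2, 48), (3, 48)]
```

The sample mirrors the pattern in the report: a single lost update changes the total once, and every later read is flagged because the missing amount never comes back.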