roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58049
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@46919380225dba7122130c338744b561d7eb6c56:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@a782ee8d93a23fc53eedada3c51893c19f7bb41e:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@75efa1fe6f2096adc9db474026a2b7235e53c388:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
The last few crashes have looked like:
cc. @lucy-zhang |
I have to investigate further, but if this is indeed NULL, then we have a broken invariant in the jobs system: all running jobs must have a non-NULL claim_session_id.
It is an AUTO CREATE STATS job. Let me look into how these jobs are created.
I wonder if this recent change to |
This invariant confuses me. The way we detect a job as available to be claimed is by setting the That being said, if that were the case, a session would have had to expire. Let me go through the logs. |
Yeah, that's exactly what happened. This cluster was pretty toast, everything was timing out and the node failed to heartbeat its sqlliveness session. |
At least it's an easy fix, will need to get backported. |
Actually, ignore the backport, this logic only exists on master. |
58161: jobs: fix panic when a job has lost its claim during update r=spaskob a=ajwerner

Jobs can lose their claims. We detect when the claim doesn't match a new claim but we don't detect when the claim has been set to NULL (which is effectively equivalent). This is important because the way we clear claims is to set them to `NULL` and then we adopt jobs which have a `NULL` claim.

This bug was introduced in #55120 and has been released only in the 21.1 alpha.

Fixes #58049.

Release note (bug fix): Fixed a bug from the previous alpha where the binary could crash if a running node lost its claim to a job while updating.

Co-authored-by: Andrew Werner <[email protected]>
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@eda9189cecbbc279f1857f6e6b992bdfd363397e:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:
See this test on roachdash