-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #70410
Comments
Looks similar to #61708. |
If you download the debug zip, you'll see that the district table job has been marked successful. You'll also notice that if you decode the descriptors, that the table is not marked public.
{
"id": 60,
"name": "district",
"state": "OFFLINE"
}
|
This change pulls the logic related to inserting table statistic stubs for new tables created during the import, out of the txn closure that sets the tables to `public` from `offline`. This was motivated by the failure mode observed in cockroachdb#70410 where we failed to write the stats stub for a particular table and ended up with a descriptor published in the offline state. Fixes: cockroachdb#70410 Release note: None
What we know so far: This code seems like the likely culprit cockroach/pkg/ccl/importccl/import_stmt.go Lines 2403 to 2432 in a454599
it's reading from the old descriptors stored on the job, in the same DescsTxn as the one in which we publish descriptors to the public state. We do see the writing of statistic stubs failing for the
We swallow the above error when writing table stats stubs, and so the transaction should not rollback? It does indeed appear that |
The code you pointed to does seem suspect. We need to propagate errors up to the transaction retry loop for retryable errors to be handled correctly. This sounds like #43067, which is related to #22615. The behavior that we would see if we failed to propagate the error is that any side effect of the transaction up to the point of the error would roll back, but the transaction would then continue in a new epoch past that point. So any side effect from after this point could still be committed. Is that consistent with the symptoms here? |
Yep, that's exactly what we're observing. Thanks for this! |
Thank you for the fix @adityamaru and everyone else for the analysis. I promise to propagate retryable errors from now on. 😊 |
70130: changefeedccl: check that rangefeeds are enabled earlier r=miretskiy a=stevendanna Pushing this check earlier means that for core changefeeds, the check happens before the backfill. Previously an error would only be seen after the backfill when we attempted to start the rangefeed, which some users found confusing. Release note: None 70326: [CC-4712] ui: fix drag-to-timerange specific window issue r=Santamaura a=Santamaura This PR fixes the issue where the user drags-to-timerange on a metrics page graph where the window is exactly 24 hours or 1 hour and the timewindow changes to the past day or past hour. This was due to a time scale function that determines the closes time scale returning a standard option instead of a custom option because the calculation only considers duration. Some additional logic was added to the function to handle this specific edge case. Release note (ui change): fix drag-to-timerange for specific window issue https://user-images.githubusercontent.com/17861665/133663386-8284d5c7-ca96-4d0b-8fbf-d62d4fd4a7e3.mp4 70447: importccl: move stub stats logic outside publish txn r=ajwerner,michae2 a=adityamaru This change pulls the logic related to inserting table statistic stubs for new tables created during the import, out of the txn closure that sets the tables to `public` from `offline`. This was motivated by the failure mode observed in #70410 where we failed to write the stats stub for a particular table and ended up with a descriptor published in the offline state. Fixes: #70410 Release note: None 70467: colexecop: harden CloseAndLogOnErr against panics r=yuzefovich a=yuzefovich We have seen a few cases where closing the `Closer`s would lead to a panic which wasn't caught because we're outside of the panic-catcher scope. This commit adds a panic-catcher to `CloseAndLogOnErr` (which is the "root" way of closing things) in order to increase the stability of CRDB in face of edge case bugs. Addresses: #70000. Addresses: #70438. Release note: None 70484: workload: log histogram write/encode failures, close output file r=erikgrinaker a=stevendanna We are currently observing incomplete histograms being output during nightly roachperf tpccbench runs. I don't think the changes here are likely to address the cause, as I would expect write failures to affect a broader range of roachperf output. But, it is still good to log any failures we do encounter. Further, we now sync and close the file explicitly. Informs #70313 Release note: None 70491: roachtest: retry java install r=miretskiy a=stevendanna Occasionally, the apt mirrors in GCP return 503 Service Unavailable. Here, we retry the install attempt 3 times with some backoff between attempts. I've also added the --no-install-recommends flag, although it does very little in this case. Release note: None Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Santamaura <[email protected]> Co-authored-by: Aditya Maru <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ d9278939d4be4d1109e9ed0a84c6e9cd2a8705bb:
Reproduce
See: roachtest README
This test on roachdash | Improve this report!
Epic CRDB-7779
The text was updated successfully, but these errors were encountered: