-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
opt: FoldAssignmentCast ignores transaction errors #85677
Comments
@fqazi Do you have reproduction steps for this? |
Yahor says it's difficult to reproduce, but it's possible to fake an error to see the problems this creates. |
The transaction retry is hard to repro standalone if you take the ResolveOID* and inject a transaction retry error or generate a real one with a paused thread. If we need to validate the fix the random-load variant of the schema changer workload is pretty easy to run, and we can see it in 10 minutes at most. |
cockroach/pkg/sql/opt/norm/fold_constants_funcs.go Lines 397 to 400 in 31ffb42
|
Great find! This is a bad one. I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as |
I remain meh on this take. I suspect that folding |
Alright in light of #74470, I agree that immutable is not fitting. Stable is fitting and folding these stable values may be important. |
I thought we had a linter for not returning errors (returnerrcheck)? But it looks like this problem (i.e. not returning an error that could have caused the txn to abort) has been happening in a few places. |
The |
Yes, I believe we do know that this is a cause of #80764 (#80764 (comment)). |
@mgartner regarding your earlier comment:
does that work you're describing to audit all the casts need to happen before making the change to stop swallowing the error at cockroach/pkg/sql/opt/norm/fold_constants_funcs.go Lines 397 to 400 in 31ffb42
|
No, I can fix that now, I think. |
I'll point out that the pattern of swallowing an evaluation error is used in other cases: cockroach/pkg/sql/opt/norm/fold_constants_funcs.go Lines 469 to 472 in 31ffb42
cockroach/pkg/sql/opt/norm/fold_constants_funcs.go Lines 618 to 621 in 31ffb42
The reason we can do this is because we only try to evaluate immutable or stable expressions. I am hesitant to change all the places where we swallow errors - folding expressions is best-effort, and if it fails for any reason, we'd prefer to continue with optimization and leave the expression as-is. So I think changing this assignment cast case is a temporary solution - ideally we can fix the volatility of the cast and then revert to swallowing errors. |
An approach we've taken elsewhere is to classify the errors which can be swallowed safely. We don't have a unified way to do this presently. In the cases I'm thinking of, we literally passed a boolean with the error. That wasn't great. |
This is very flakey. Some of it is due to cockroachdb#87672. Some of it was due to cockroachdb#85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking cockroachdb#78478 as a GA blocker so we do actually fix it. Release note: None
80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout` that can be used to change the default per-chunk retry timeout that GCS imposes when chunking of file upload is enabled. The default value is set to 60 seconds, which is double of the default google sdk value of 30s. This change was motivated by sporadic occurrences of a 503 service unavailable error during backups. On its own this change is not expected to solve the resiliency issues of backup when the upload service is unavailable, but it is nice to have configurable setting nonetheless. Release note (sql change): `cloudstorage.gs.chunking.retry_timeout` is a cluster setting that can be used to configure the per-chunk retry timeout of files to Google Cloud Storage. The default value is 60 seconds. 86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner First commit is #86690 If the descriptor was deleted, the GC job should exit gracefully. Fixes #86340 Release justification: bug fix for backport Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed. 87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg Also makes it resilient to per-Range errors, which now no longer tank the entire operation. ```sql -- avoid buffering in cli \set display_format=csv; -- avoid rows getting buffered at server set avoid_buffering=true; -- compute as fast as possible SET CLUSTER SETTING server.consistency_check.max_rate = '1tb'; SELECT * FROM crdb_internal.check_consistency(false, '', ''); ``` Release justification: improvement for a debugging-related feature Release note: None 87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart This defaults to the full `cockroach` executable which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies. Release note: None 87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking #78478 as a GA blocker so we do actually fix it. Release note: None Co-authored-by: Aditya Maru <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing. May deal with #85677 relates to #80764 Release note: None 87646: backupccl: deflake backup tests r=adityamaru a=adityamaru See individual commits. Release note: None 87667: server: clean up some logging r=ajwerner a=ajwerner For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly. Release note: None 87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg Closes #87581. Release note: None 87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart Closes #87700. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: adityamaru <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
Looks like this was fixed by #87614. |
Describe the problem
Please describe the issue you observed, and any steps we can take to reproduce it:
Currently, if a transaction error is hit inside
FoldAssignmentCast
when callingPerformAssignmentCast
the errors are silently swallowed. This can happen when dealing with OIDs which will potentially involve running SQL using the internal executor. When this scenario is encountered we continue execution like normal with a potentially aborted transaction causing failures in other code paths.To Reproduce
The original issue was reproduced on the randomized schema changer workload, when inserts with OID casts are executed. However, this issue can be easily faked by forcing a transaction to abort when an OID involved in an insert is being resolved.
Jira issue: CRDB-18380
The text was updated successfully, but these errors were encountered: