opt: FoldAssignmentCast ignores transaction errors #85677

fqazi · 2022-08-05T18:10:04Z

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:
Currently, if a transaction error is hit inside FoldAssignmentCast when calling PerformAssignmentCast the errors are silently swallowed. This can happen when dealing with OIDs which will potentially involve running SQL using the internal executor. When this scenario is encountered we continue execution like normal with a potentially aborted transaction causing failures in other code paths.

To Reproduce

The original issue was reproduced on the randomized schema changer workload, when inserts with OID casts are executed. However, this issue can be easily faked by forcing a transaction to abort when an OID involved in an insert is being resolved.

Jira issue: CRDB-18380

The text was updated successfully, but these errors were encountered:

mgartner · 2022-08-09T18:55:25Z

@fqazi Do you have reproduction steps for this?

mgartner · 2022-08-09T18:56:13Z

Yahor says it's difficult to reproduce, but it's possible to fake an error to see the problems this creates.

fqazi · 2022-08-09T19:11:48Z

The transaction retry is hard to repro standalone if you take the ResolveOID* and inject a transaction retry error or generate a real one with a paused thread. If we need to validate the fix the random-load variant of the schema changer workload is pretty easy to run, and we can see it in 10 minutes at most.

ajwerner · 2022-08-10T20:18:55Z

cockroach/pkg/sql/opt/norm/fold_constants_funcs.go

Lines 397 to 400 in 31ffb42

    
           result, err := eval.PerformAssignmentCast(c.f.evalCtx, datum, typ) 
        
           if err != nil { 
        
           	return nil, false 
        
           }

mgartner · 2022-08-11T00:05:16Z

Great find! This is a bad one. I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases. This relates to #74470.

ajwerner · 2022-08-11T00:11:11Z

any cast that has to read some mutable state should not be folded during normalization

I remain meh on this take. I suspect that folding regclass casts is probably important for a bunch of queries. I think that for all intents and purposes, these casts are stable and for the usage of the memo, are even immutable. Ideally we'd just propagate errors which relate to concurrency control. I know that's a bit tricky.

ajwerner · 2022-08-11T00:13:08Z

Alright in light of #74470, I agree that immutable is not fitting. Stable is fitting and folding these stable values may be important.

rafiss · 2022-08-11T03:25:45Z

I thought we had a linter for not returning errors (returnerrcheck)? But it looks like this problem (i.e. not returning an error that could have caused the txn to abort) has been happening in a few places.

ajwerner · 2022-08-11T03:36:23Z

The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.

rafiss · 2022-08-11T16:13:33Z

The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.

I see, thanks.

So do we know if this issue is another cause of #82034 and #80764?

ajwerner · 2022-08-11T16:20:04Z

So do we know if this issue is another cause of #82034 and #80764?

Yes, I believe we do know that this is a cause of #80764 (#80764 (comment)).

rafiss · 2022-09-01T20:42:47Z

@mgartner regarding your earlier comment:

I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases.

does that work you're describing to audit all the casts need to happen before making the change to stop swallowing the error at

cockroach/pkg/sql/opt/norm/fold_constants_funcs.go

Lines 397 to 400 in 31ffb42

    
           result, err := eval.PerformAssignmentCast(c.f.evalCtx, datum, typ) 
        
           if err != nil { 
        
           	return nil, false 
        
           }

mgartner · 2022-09-01T22:41:20Z

No, I can fix that now, I think.

mgartner · 2022-09-01T22:54:21Z

I'll point out that the pattern of swallowing an evaluation error is used in other cases:

cockroach/pkg/sql/opt/norm/fold_constants_funcs.go

Lines 469 to 472 in 31ffb42

    
           result, err := eval.BinaryOp(c.f.evalCtx, o.EvalOp, lDatum, rDatum) 
        
           if err != nil { 
        
           	return nil, false 
        
           }

cockroach/pkg/sql/opt/norm/fold_constants_funcs.go

Line 514 in 31ffb42

result, err := eval.Expr(c.f.evalCtx, texpr)

cockroach/pkg/sql/opt/norm/fold_constants_funcs.go

Lines 618 to 621 in 31ffb42

    
           result, err := eval.Expr(c.f.evalCtx, fn) 
        
           if err != nil { 
        
           	return nil, false 
        
           }

The reason we can do this is because we only try to evaluate immutable or stable expressions. I am hesitant to change all the places where we swallow errors - folding expressions is best-effort, and if it fails for any reason, we'd prefer to continue with optimization and leave the expression as-is. So I think changing this assignment cast case is a temporary solution - ideally we can fix the volatility of the cast and then revert to swallowing errors.

ajwerner · 2022-09-02T01:58:15Z

An approach we've taken elsewhere is to classify the errors which can be swallowed safely. We don't have a unified way to do this presently. In the cases I'm thinking of, we literally passed a boolean with the error. That wasn't great.

This is very flakey. Some of it is due to cockroachdb#87672. Some of it was due to cockroachdb#85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking cockroachdb#78478 as a GA blocker so we do actually fix it. Release note: None

80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout` that can be used to change the default per-chunk retry timeout that GCS imposes when chunking of file upload is enabled. The default value is set to 60 seconds, which is double of the default google sdk value of 30s. This change was motivated by sporadic occurrences of a 503 service unavailable error during backups. On its own this change is not expected to solve the resiliency issues of backup when the upload service is unavailable, but it is nice to have configurable setting nonetheless. Release note (sql change): `cloudstorage.gs.chunking.retry_timeout` is a cluster setting that can be used to configure the per-chunk retry timeout of files to Google Cloud Storage. The default value is 60 seconds. 86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner First commit is #86690 If the descriptor was deleted, the GC job should exit gracefully. Fixes #86340 Release justification: bug fix for backport Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed. 87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg Also makes it resilient to per-Range errors, which now no longer tank the entire operation. ```sql -- avoid buffering in cli \set display_format=csv; -- avoid rows getting buffered at server set avoid_buffering=true; -- compute as fast as possible SET CLUSTER SETTING server.consistency_check.max_rate = '1tb'; SELECT * FROM crdb_internal.check_consistency(false, '', ''); ``` Release justification: improvement for a debugging-related feature Release note: None 87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart This defaults to the full `cockroach` executable which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies. Release note: None 87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking #78478 as a GA blocker so we do actually fix it. Release note: None Co-authored-by: Aditya Maru <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>

This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking #78478 as a GA blocker so we do actually fix it. Release note: None

87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing. May deal with #85677 relates to #80764 Release note: None 87646: backupccl: deflake backup tests r=adityamaru a=adityamaru See individual commits. Release note: None 87667: server: clean up some logging r=ajwerner a=ajwerner For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly. Release note: None 87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg Closes #87581. Release note: None 87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart Closes #87700. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: adityamaru <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>

mgartner · 2023-06-08T16:23:29Z

Looks like this was fixed by #87614.

fqazi added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Aug 5, 2022

blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 5, 2022

This was referenced Aug 5, 2022

roachtest: schemachange/random-load failed #82133

Closed

sql: aborted transactions can end up being used for prepare #80764

Closed

mgartner self-assigned this Aug 9, 2022

ajwerner mentioned this issue Aug 11, 2022

kvclient: require explicit acknowledgement from client on restart #85980

Closed

fqazi mentioned this issue Aug 23, 2022

pkg/ccl/testccl/workload/schemachange/schemachange_test: TestWorkload failed #78478

Closed

ajwerner mentioned this issue Sep 8, 2022

sql/opt/norm: propagate kv errors from cast folding #87614

Merged

ajwerner mentioned this issue Sep 9, 2022

testccl/workload/schemachange: skip random schema test #87701

Merged

blathers-crl bot mentioned this issue Sep 9, 2022

release-22.2: testccl/workload/schemachange: skip random schema test #87719

Merged

This was referenced Sep 9, 2022

release-22.1: sql/opt/norm: propagate kv errors from cast folding #87729

Merged

release-22.2: sql/opt/norm: propagate kv errors from cast folding #87730

Merged

ajwerner mentioned this issue Oct 10, 2022

sql: avoid some overhead due to error construction on common paths #89691

Merged

mgartner closed this as completed Jun 8, 2023

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: FoldAssignmentCast ignores transaction errors #85677

opt: FoldAssignmentCast ignores transaction errors #85677

fqazi commented Aug 5, 2022 •

edited by cockroach-jira-scripts

Loading

mgartner commented Aug 9, 2022

mgartner commented Aug 9, 2022

fqazi commented Aug 9, 2022

ajwerner commented Aug 10, 2022

mgartner commented Aug 11, 2022

ajwerner commented Aug 11, 2022

ajwerner commented Aug 11, 2022

rafiss commented Aug 11, 2022

ajwerner commented Aug 11, 2022

rafiss commented Aug 11, 2022

ajwerner commented Aug 11, 2022

rafiss commented Sep 1, 2022

mgartner commented Sep 1, 2022

mgartner commented Sep 1, 2022

ajwerner commented Sep 2, 2022

mgartner commented Jun 8, 2023

opt: FoldAssignmentCast ignores transaction errors #85677

opt: FoldAssignmentCast ignores transaction errors #85677

Comments

fqazi commented Aug 5, 2022 • edited by cockroach-jira-scripts Loading

mgartner commented Aug 9, 2022

mgartner commented Aug 9, 2022

fqazi commented Aug 9, 2022

ajwerner commented Aug 10, 2022

mgartner commented Aug 11, 2022

ajwerner commented Aug 11, 2022

ajwerner commented Aug 11, 2022

rafiss commented Aug 11, 2022

ajwerner commented Aug 11, 2022

rafiss commented Aug 11, 2022

ajwerner commented Aug 11, 2022

rafiss commented Sep 1, 2022

mgartner commented Sep 1, 2022

mgartner commented Sep 1, 2022

ajwerner commented Sep 2, 2022

mgartner commented Jun 8, 2023

fqazi commented Aug 5, 2022 •

edited by cockroach-jira-scripts

Loading