Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: FoldAssignmentCast ignores transaction errors #85677

Closed
fqazi opened this issue Aug 5, 2022 · 16 comments
Closed

opt: FoldAssignmentCast ignores transaction errors #85677

fqazi opened this issue Aug 5, 2022 · 16 comments
Assignees
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@fqazi
Copy link
Collaborator

fqazi commented Aug 5, 2022

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:
Currently, if a transaction error is hit inside FoldAssignmentCast when calling PerformAssignmentCast the errors are silently swallowed. This can happen when dealing with OIDs which will potentially involve running SQL using the internal executor. When this scenario is encountered we continue execution like normal with a potentially aborted transaction causing failures in other code paths.

To Reproduce

The original issue was reproduced on the randomized schema changer workload, when inserts with OID casts are executed. However, this issue can be easily faked by forcing a transaction to abort when an OID involved in an insert is being resolved.

Jira issue: CRDB-18380

@fqazi fqazi added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Aug 5, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 5, 2022
@mgartner mgartner self-assigned this Aug 9, 2022
@mgartner
Copy link
Collaborator

mgartner commented Aug 9, 2022

@fqazi Do you have reproduction steps for this?

@mgartner
Copy link
Collaborator

mgartner commented Aug 9, 2022

Yahor says it's difficult to reproduce, but it's possible to fake an error to see the problems this creates.

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 9, 2022

The transaction retry is hard to repro standalone if you take the ResolveOID* and inject a transaction retry error or generate a real one with a paused thread. If we need to validate the fix the random-load variant of the schema changer workload is pretty easy to run, and we can see it in 10 minutes at most.

@ajwerner
Copy link
Contributor

result, err := eval.PerformAssignmentCast(c.f.evalCtx, datum, typ)
if err != nil {
return nil, false
}

@mgartner
Copy link
Collaborator

Great find! This is a bad one. I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases. This relates to #74470.

@ajwerner
Copy link
Contributor

any cast that has to read some mutable state should not be folded during normalization

I remain meh on this take. I suspect that folding regclass casts is probably important for a bunch of queries. I think that for all intents and purposes, these casts are stable and for the usage of the memo, are even immutable. Ideally we'd just propagate errors which relate to concurrency control. I know that's a bit tricky.

@ajwerner
Copy link
Contributor

Alright in light of #74470, I agree that immutable is not fitting. Stable is fitting and folding these stable values may be important.

@rafiss
Copy link
Collaborator

rafiss commented Aug 11, 2022

I thought we had a linter for not returning errors (returnerrcheck)? But it looks like this problem (i.e. not returning an error that could have caused the txn to abort) has been happening in a few places.

@ajwerner
Copy link
Contributor

The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.

@rafiss
Copy link
Collaborator

rafiss commented Aug 11, 2022

The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.

I see, thanks.


So do we know if this issue is another cause of #82034 and #80764?

@ajwerner
Copy link
Contributor

So do we know if this issue is another cause of #82034 and #80764?

Yes, I believe we do know that this is a cause of #80764 (#80764 (comment)).

@rafiss
Copy link
Collaborator

rafiss commented Sep 1, 2022

@mgartner regarding your earlier comment:

I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases.

does that work you're describing to audit all the casts need to happen before making the change to stop swallowing the error at

result, err := eval.PerformAssignmentCast(c.f.evalCtx, datum, typ)
if err != nil {
return nil, false
}

@mgartner
Copy link
Collaborator

mgartner commented Sep 1, 2022

No, I can fix that now, I think.

@mgartner
Copy link
Collaborator

mgartner commented Sep 1, 2022

I'll point out that the pattern of swallowing an evaluation error is used in other cases:

result, err := eval.BinaryOp(c.f.evalCtx, o.EvalOp, lDatum, rDatum)
if err != nil {
return nil, false
}

result, err := eval.Expr(c.f.evalCtx, texpr)

result, err := eval.Expr(c.f.evalCtx, fn)
if err != nil {
return nil, false
}

The reason we can do this is because we only try to evaluate immutable or stable expressions. I am hesitant to change all the places where we swallow errors - folding expressions is best-effort, and if it fails for any reason, we'd prefer to continue with optimization and leave the expression as-is. So I think changing this assignment cast case is a temporary solution - ideally we can fix the volatility of the cast and then revert to swallowing errors.

@ajwerner
Copy link
Contributor

ajwerner commented Sep 2, 2022

An approach we've taken elsewhere is to classify the errors which can be swallowed safely. We don't have a unified way to do this presently. In the cases I'm thinking of, we literally passed a boolean with the error. That wasn't great.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 9, 2022
This is very flakey. Some of it is due to cockroachdb#87672. Some of it was due to cockroachdb#85677.
There are some issues with inserts which need to be fixed. Until this
stabilizes, it's causing problems.

Along the way, I'm marking cockroachdb#78478 as a GA blocker so we do actually fix it.

Release note: None
craig bot pushed a commit that referenced this issue Sep 9, 2022
80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru

This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout`
that can be used to change the default per-chunk retry timeout
that GCS imposes when chunking of file upload is enabled. The default
value is set to 60 seconds, which is double of the default google sdk
value of 30s.

This change was motivated by sporadic occurrences of a 503 service unavailable
error during backups. On its own this change is not expected to solve the
resiliency issues of backup when the upload service is unavailable, but it
is nice to have configurable setting nonetheless.

Release note (sql change): `cloudstorage.gs.chunking.retry_timeout`
is a cluster setting that can be used to configure the per-chunk retry
timeout of files to Google Cloud Storage. The default value is 60 seconds.

86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner

First commit is #86690

If the descriptor was deleted, the GC job should exit gracefully.

Fixes #86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.

87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg

Also makes it resilient to per-Range errors, which now no longer
tank the entire operation.

```sql
-- avoid buffering in cli
\set display_format=csv;
-- avoid rows getting buffered at server
set avoid_buffering=true;
-- compute as fast as possible
SET CLUSTER SETTING server.consistency_check.max_rate = '1tb';

SELECT * FROM crdb_internal.check_consistency(false, '', '');
```

Release justification: improvement for a debugging-related feature
Release note: None


87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart

This defaults to the full `cockroach` executable which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies.

Release note: None

87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner

This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems.

Along the way, I'm marking #78478 as a GA blocker so we do actually fix it.

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Sep 9, 2022
This is very flakey. Some of it is due to #87672. Some of it was due to #85677.
There are some issues with inserts which need to be fixed. Until this
stabilizes, it's causing problems.

Along the way, I'm marking #78478 as a GA blocker so we do actually fix it.

Release note: None
craig bot pushed a commit that referenced this issue Sep 9, 2022
87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner

Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing.

May deal with #85677
relates to #80764

Release note: None

87646: backupccl: deflake backup tests r=adityamaru a=adityamaru

See individual commits.

Release note: None

87667: server: clean up some logging r=ajwerner a=ajwerner

For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly.

Release note: None

87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg

Closes #87581.

Release note: None


87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart

Closes #87700.
Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@mgartner
Copy link
Collaborator

mgartner commented Jun 8, 2023

Looks like this was fixed by #87614.

@mgartner mgartner closed this as completed Jun 8, 2023
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

No branches or pull requests

4 participants