-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: aborted transactions can end up being used for prepare #80764
Comments
It's strange to me that this is a relied upon behaviour - do we want PREPARE to work in a failed txn? in PG, we definitely do not allow this to work:
I'm not sure what the "error" that surprises you is, can you elaborate I'm reading the commit message from dda2fa9, and to me it seems "deliberate" we no longer start an implicit txn. i'm kind of confused about why we wanted this to work in the first place. overall, i'm not sure if i believe this is an issue - @fqazi can you elaborate based on the questions above? cheers! |
That error is acceptable, I noticed a case where we decided to proceed with a closed transaction and hit an assertion planning (checking authorization). Unfortunately, I haven't had luck reproducing, since it was in a randomized workload. |
are we sure it is related to that commit? if so, did it come down to a bisect?
is there a stack trace or anything more specific (e.g. precise error message) you could provide? |
i'm also tempted to remove this as a release blocker if it's difficult to reproduce. it sounds like the assertion failure is a "safe case" to hit (i.e. won't cause data corruption or unavailability of the database as a whole), unless i am mistaken. |
We weren't able to confirm the commit, and even on randomized, I haven't hit it again. I'm okay with removing the release blocker from this, since it's not easy to hit (and probably some weird scenario). I only have a stale stack trace that is incomplete from when it was hit:
|
hmm, that stack trace doesn't seem related to the prepare path. are we sure that's related? |
It's partial, the bottom part was in Prepare. Pretty much it was a schema changer workload query with SHOW REGIONS being prepared |
I ran into this again, and the full-stack below, on a experimental branch with the schemachanger workload with simple selects enabled (there is an application bug here too) :
|
At least one of the causes is #85677, where we are swallowing transaction errors, so not really SQL's fault. I'll see what else shakes out with debug code |
Nice @fqazi. Can you please share your method for how you found that? This bug has been hard to track down, and the same problem could exist in more places, so it would be good if it was more widely understood how we can find these other places. |
We added tracing inside the workload, which gets dumped when any failure gets hit. This allows us to see where the transaction retry error was hit using heuristics (i.e. last query on the connection before the retry is hit). It's not perfect and specific to the workload, but maybe some lightweight tracing of internal executor uses that's always collected might be worth it for a similar family of errors. |
Are your workload tracing changes merged in, or custom? Can you please provide full instructions so I can do this myself? |
They aren't merged in yet, I'll get a PR for them but they are specific for the randomized schema changer workload (i.e. the client side code for the workload does this). |
You can just post a |
So, patch the workload with:
You can execute the workload using: |
Great, thanks. I have a theory that #85996 will actually give us more helpful error messages. I'll try using your repro to find out. |
85996: sql: don't panic on auth check with unopen txn r=ajwerner,RichardJCai a=rafiss Relates to #80764 and #82034 Ever since c00ea84 was merged, the KV layer has disallowed use of an aborted txn. Therefore, the checks here are no longer necessary. This should actually help with debugging, since now if the aborted txn is used, we should get back an error that has the reason for the abort (or restart), instead of a panic that does not have that info. Release note: None Co-authored-by: Rafi Shamim <[email protected]>
87614: sql/opt/norm: propagate kv errors from cast folding r=ajwerner a=ajwerner Swallowing KV errors here leads to incorrect results. Writes can be missed and serializability can be silently violated. This comes up in the context of the randomized schema change testing. May deal with #85677 relates to #80764 Release note: None 87646: backupccl: deflake backup tests r=adityamaru a=adityamaru See individual commits. Release note: None 87667: server: clean up some logging r=ajwerner a=ajwerner For one, this fixes a format directive issue in the first line. It also stops rendering some arguments to strings directly so that redaction works correctly. Release note: None 87702: kvserver: explain our use of (*raftGroup).ReportSnapshot r=andrewbaptist a=tbg Closes #87581. Release note: None 87713: ci: skip "integration"-style tests in testrace r=knz a=rickystewart Closes #87700. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: adityamaru <[email protected]> Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
I'm going to close this as per the previous comment |
Describe the problem
Please describe the issue you observed, and any steps we can take to reproduce it:
An assertion or other weird behaviour can be encountered because the Prepare command no longer starts a new implicit transaction if the current transaction is in an aborted state. As a result, users may notice regressions in certain scenarios, such as the example below.
To Reproduce
This was originally observed on the randomized schema change workload, and the exact sequence isn't known. A synthetic repro can be done with the following Go executable and modifying the code to intentionally close the transaction inside: execStmtInAbortedState (though I'm not sure if that is fully valid).
Jira issue: CRDB-15526
The text was updated successfully, but these errors were encountered: