roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] #74892
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 5ad21e3896ee809e9c3ebc28bb22166f1275acca:
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 4b41789120e019ab015e6dbb924df763897ebadb:
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 912964e02ddd951c77d4f71981ae18b3894e9084:
It looks like a transaction retry error is somehow bubbling up to here: cockroach/pkg/workload/tpcc/worker.go Lines 231 to 234 in 79a4d4a
The "last good" run before the failing streak is https://teamcity.cockroachdb.com/viewLog.html?buildId=4115910 ( d6b99e9) and the first failure in the streak 7841945.
Starting 3x runs of b3877b8 here: https://teamcity.cockroachdb.com/viewLog.html?buildId=4163457&buildTypeId=Cockroach_Nightlies_RoachtestStress&tab=buildResultsDiv&branch_Cockroach_Nightlies=%3Cdefault%3E
If this passes, then it's likely a SQL/colexec change that's to blame for this change in behavior. cc @yuzefovich in case you have an immediate idea of what could have changed in the propagation of txn retry errors.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ da01e4c0545f191a0573e1d097ff0366769e0d6b:
I think it's most likely because of the streamer work (#68430), where we now use leaf txns to issue concurrent requests for index joins in some cases. Notably, I haven't yet implemented the transparent refresh mechanism there, so an increase in retryable errors is expected with that PR. I guess if we disable the streamer by default until that mechanism is implemented, these errors should go away.
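To illustrate the mechanism being described, here is a purely conceptual toy model in Go; the types, fields, and methods are invented for this sketch and do not resemble CockroachDB's actual RootTxn/LeafTxn interfaces in pkg/kv. The point is only that a root coordinator can transparently refresh while a leaf serving concurrent requests cannot, so the retryable error escapes to the client:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model of root vs. leaf txns; invented for illustration only.
type toyTxn struct {
	isRoot bool
	readTS int
}

// read simulates a request whose timestamp got pushed to pushedTS.
func (t *toyTxn) read(pushedTS int) error {
	if pushedTS <= t.readTS {
		return nil // no conflict
	}
	if t.isRoot {
		// The root coordinator can refresh transparently: bump the read
		// timestamp (and, in the real system, re-validate already-read
		// spans). The client never sees an error.
		t.readTS = pushedTS
		return nil
	}
	// A leaf issuing concurrent requests can't coordinate a refresh,
	// so the retryable error escapes all the way to the client.
	return errors.New("restart transaction: TransactionRetryWithProtoRefreshError: retry txn")
}

func main() {
	root := &toyTxn{isRoot: true, readTS: 10}
	leaf := &toyTxn{readTS: 10}
	fmt.Println(root.read(15)) // <nil>: refreshed transparently
	fmt.Println(leaf.read(15)) // retryable error reaches the caller
}
```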
Would you mind making that change? I think the streamer needs to be off by default if it can't properly propagate refresh errors. We're going to catch this in most workloads.
Just to make sure I understand things correctly: generally speaking, propagating a txn retryable error to the client is acceptable because the app must have some kind of retry loop; however, in most of our roachtests we don't tolerate retryable errors and treat them as a failure of the test. Does this sound right?
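For reference, this is the kind of client-side retry loop being assumed, sketched here with database/sql and lib/pq; the function names and package are illustrative, not the workload's actual helper:

```go
package tpccretry // hypothetical package, not part of the workload

import (
	"context"
	"database/sql"
	"errors"

	"github.com/lib/pq"
)

// runWithRetries runs fn inside a transaction and retries from scratch
// whenever the server reports a retryable error (SQLSTATE 40001).
func runWithRetries(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			_ = tx.Rollback()
		}
		if !isRetryable(err) {
			return err
		}
		// Retryable: fall through and run fn again in a fresh txn.
	}
}

// isRetryable reports whether err carries SQLSTATE 40001. This only
// works if no layer in between flattened the error.
func isRetryable(err error) bool {
	var pqErr *pq.Error
	return errors.As(err, &pqErr) && pqErr.Code == "40001"
}
```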
The workload here handles retry errors (unless I'm misreading something about where the error occurs). I think what is happening here is that a retry error bubbles up as a regular error, i.e. it must not have had the proper type. Or at least that's what I think we're seeing. The error is returned from this method: cockroach/pkg/workload/tpcc/new_order.go Lines 133 to 438 in 1c66c95
You can see by inspection that this implies that an error is returned from this block: cockroach/pkg/workload/tpcc/new_order.go Line 215 in 1c66c95

and that will certainly do proper retries? So my reading was that something in the code is doing some (probably less obviously wrong) version of:

```go
err := something() // retryable err
err = errors.Errorf("oops messing it up %s", err)
return err
```
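To see why formatting like that destroys retry handling, here is a minimal, self-contained sketch; the retryErr type is invented for the demo (in the real stack the typed error would be something like a *pq.Error carrying SQLSTATE 40001):

```go
package main

import (
	"errors"
	"fmt"
)

// retryErr stands in for a typed retryable error; invented for this demo.
type retryErr struct{ msg string }

func (e *retryErr) Error() string { return e.msg }

func main() {
	cause := &retryErr{msg: "restart transaction: retry txn"}

	// %s builds a brand-new flat error: the typed cause is gone.
	broken := fmt.Errorf("oops messing it up %s", cause)

	// %w keeps the cause in the chain.
	wrapped := fmt.Errorf("error seen by the worker: %w", cause)

	var re *retryErr
	fmt.Println(errors.As(broken, &re))  // false: retry detection goes blind
	fmt.Println(errors.As(wrapped, &re)) // true: retry loop still fires
}
```

Any retry loop keyed on the error's type (or its SQLSTATE) goes blind the moment some layer rebuilds the error with %s or %v.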
Hm, I'm confused. The …
No wrapping / error modification is done on the newly-introduced … Trying to deconstruct the error message:

- the outer wrapping is cockroach/pkg/workload/tpcc/worker.go Line 233 in 79a4d4a
- then ERROR is likely because of pgerror.DefaultSeverity being set in …
- then "restart transaction" is …
- then "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn" probably is …
- then, because …
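Putting those fragments together, a small sketch of how the layers would plausibly compose into the one flat string the workload reports; the ordering and exact separators are assumptions, and only the quoted fragments come from the message being deconstructed above:

```go
package main

import "fmt"

func main() {
	// Innermost: the KV-level retryable error quoted above.
	kv := "TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn"
	// The client-visible retry prefix.
	withPrefix := fmt.Sprintf("restart transaction: %s", kv)
	// Severity in front (pgerror.DefaultSeverity), SQLSTATE at the end.
	fmt.Printf("ERROR: %s (SQLSTATE 40001)\n", withPrefix)
}
```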
It does say "(SQLSTATE 40001)" in the error from …
Yeah, that's what puzzles me too.
I'll kick off this roachtest with the streamer disabled on #75257.
If we're looking for crackpot theories, could it be that we're getting the retry error on a BEGIN? |
Lol, I hope not.
Hm, all 5 builds failed. I think I kicked them off correctly (from the https://github.com/cockroachdb/cockroach/tree/disable-streamer branch), so maybe the streamer work isn't to blame after all.
That looks correct. Ugh, another bisection. |
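For anyone rerunning this without a custom branch, a sketch of toggling the streamer through a cluster setting instead; the setting name sql.distsql.use_streamer.enabled is an assumption here and should be confirmed against the build under test (e.g. via SHOW ALL CLUSTER SETTINGS):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Assumed setting name; not verified against this build.
	if _, err := db.Exec(
		"SET CLUSTER SETTING sql.distsql.use_streamer.enabled = false",
	); err != nil {
		log.Fatal(err)
	}
}
```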
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 58ceac139a7e83052171121b28026a7366f16f7e:
FWIW it failed on b3877b8, to my surprise.
cc @cockroachdb/bulk-io |
Is there any chance this is related to #76230? I don't see an OOM there, but then again I don't see much of anything there.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on release-21.2 @ 31f167ca5bbe404abcb215f80524770ddc8c0163:
This is a very old issue on a branch that is EOL. |
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 78419450178335b31f542bd1b14fefdf4ecee0e8:
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash
Jira issue: CRDB-12308