roachtest: tpccbench/nodes=3/cpu=16/mt-shared-process failed [replication span assertion; needs #107521] #107242
Comments
Basic analysis:
Over on n2:
Not sure if this is more on @cockroachdb/kv or @cockroachdb/multi-tenant. |
@renatolabs could you help by including more context from the stack trace on n2? It will help us diagnose and route. |
It's on Repl. Reassigning. |
cc @cockroachdb/replication |
Concretely, a log entry comes up for application. It's local, i.e. we have an entry in:
cockroach/pkg/kv/kvserver/replica_application_decoder.go, lines 147 to 158 at f580049
cockroach/pkg/kv/kvserver/replica_raft.go, lines 316 to 331 at b2ba2e5
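For orientation, here is a minimal Go sketch of what "local" means in this step. This is not CockroachDB's actual decoder code; the names (`cmdID`, `localProposal`, `retrieveLocalProposal`, the `proposals` map) are hypothetical stand-ins for the idea that an entry coming up for application is local when its command ID matches an in-flight proposal on this replica, whose state (including any tracing span) the apply pipeline then takes over.

```go
package main

import "fmt"

// cmdID identifies a raft command; in-flight local proposals are keyed by it.
type cmdID string

// localProposal is a hypothetical stand-in for a proposal that originated on
// this replica and may carry tracing state the apply pipeline must take over.
type localProposal struct {
	id cmdID
}

// replica holds the hypothetical map of in-flight local proposals.
type replica struct {
	proposals map[cmdID]*localProposal
}

// retrieveLocalProposal models the locality check: if the entry's command ID
// is present in the map, the entry is "local"; the proposal is removed from
// the map and handed to the command being applied.
func (r *replica) retrieveLocalProposal(id cmdID) (*localProposal, bool) {
	p, ok := r.proposals[id]
	if ok {
		delete(r.proposals, id)
	}
	return p, ok
}

func main() {
	r := &replica{proposals: map[cmdID]*localProposal{
		"abc123": {id: "abc123"},
	}}
	if p, ok := r.retrieveLocalProposal("abc123"); ok {
		fmt.Printf("entry %s is local; the apply pipeline now owns its proposal\n", p.id)
	}
}
```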
upd by @pavelkalinnikov: The above paragraph/assumption is false, and it's actually the problem.

With the new reproposals #105625 code, the

The other case, however, does have a weakness. Async consensus creates an extra span that must be managed by the apply pipeline:

Long story short, I think the problem here is that there are cases where

cockroach/pkg/kv/kvserver/replica_proposal.go, lines 225 to 232 at b2ba2e5
This is called, for example, here:
cockroach/pkg/kv/kvserver/replica_raft.go, lines 1397 to 1405 at b2ba2e5
but there are many other callers. The point is, that method doesn't replace

edit: realizing that the caller to |
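To make the suspected failure mode concrete, here is a minimal, self-contained Go sketch of the span-lifecycle problem described above. It does not use CockroachDB's real types; `span`, `proposal`, `applyAndFinish`, and `cleanupFailedProposal` are hypothetical stand-ins. The point it illustrates: under async consensus the proposal's extra span must be finished exactly once, normally by the apply pipeline, so a failure-path cleanup that also finishes it (or that neither finishes nor hands it off) yields exactly the kind of use-after-finish/leaked-span assertion seen in this failure.

```go
package main

import (
	"fmt"
	"sync"
)

// span is a stand-in for a tracing span. Finishing it twice is an error, and
// never finishing it is a leak; the "span assertion" in this issue is a check
// against exactly these misuses.
type span struct {
	mu       sync.Mutex
	name     string
	finished bool
}

func (s *span) finish() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.finished {
		return fmt.Errorf("span %q finished twice (use-after-finish)", s.name)
	}
	s.finished = true
	return nil
}

// proposal is a hypothetical stand-in for a raft proposal. Under async
// consensus it owns an extra span whose lifetime is supposed to be handed to
// the apply pipeline, which finishes it once the entry has been applied.
type proposal struct {
	span *span
}

// applyAndFinish models the apply pipeline: apply the command, then finish
// the proposal's span. This path owns the span once the proposal is in raft.
func applyAndFinish(p *proposal) error {
	// ... apply the command ...
	return p.span.finish()
}

// cleanupFailedProposal models a failure-path cleanup that also finishes the
// span. If it runs for a proposal that is still (or later) handed to the
// apply pipeline, the span is finished twice; if neither path runs, it leaks.
func cleanupFailedProposal(p *proposal) error {
	return p.span.finish()
}

func main() {
	p := &proposal{span: &span{name: "async-consensus"}}

	// The bug class: both the failure path and the apply path believe they
	// own the span; whichever runs second trips the assertion.
	if err := cleanupFailedProposal(p); err != nil {
		fmt.Println("cleanup:", err)
	}
	if err := applyAndFinish(p); err != nil {
		fmt.Println("apply:", err) // prints the use-after-finish error
	}
}
```

Running the sketch prints the double-finish error from whichever path runs second, mirroring the assertion in the test failure.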
Tracking the work needed to resolve this kind of issue in #107521. We can close this failure once that work is done (and we have found at least a theory of how this test failed). Adding X-noreuse since the next time this test fails it will more likely than not be a different failure mode. |
There were lots of failed reproposals (found 297 in logs) prior to the panic, all on range
Similar 53 errors for ranges

Upd: this is probably a coincidence or a symptom. In one of the reproduction runs, I saw none of the "failed to repropose" errors. |
Might be related to #101721 (comment). |
This one as well: #107853. |
Tracked as release blocker by #107521 |
roachtest.tpccbench/nodes=3/cpu=16/mt-shared-process failed with artifacts on master @ c6f4c0ed6e39ec4795755b8b477e6cac0abf818f:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=aws
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-29957