Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver/batcheval: TestLeaseTransferWithPipelinedWrite failed #104716

Closed
cockroach-teamcity opened this issue Jun 12, 2023 · 11 comments · Fixed by #106502
Closed

kv/kvserver/batcheval: TestLeaseTransferWithPipelinedWrite failed #104716

cockroach-teamcity opened this issue Jun 12, 2023 · 11 comments · Fixed by #106502
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 12, 2023

kv/kvserver/batcheval.TestLeaseTransferWithPipelinedWrite failed with artifacts on release-23.1 @ 8cfaf3b4731273cedaa9e81600a7b1778700b702:

=== RUN   TestLeaseTransferWithPipelinedWrite
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/e1ab32a24518a46d10d3b8d78552a36a/logTestLeaseTransferWithPipelinedWrite2061500666
    test_log_scope.go:79: use -show-logs to present logs inline
    cmd_lease_test.go:117: worker failed: pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE - missing intent on: /Table/121/1/0/1/0): "sql txn" meta={id=2daa4158 key=/Table/121/1/0/1/0 pri=0.00400017 epo=0 ts=1686562309.260845275,0 min=1686562309.260845275,0 seq=3} lock=true stat=PENDING rts=1686562309.260845275,0 wto=false gul=1686562309.760845275,0
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e1ab32a24518a46d10d3b8d78552a36a/logTestLeaseTransferWithPipelinedWrite2061500666
--- FAIL: TestLeaseTransferWithPipelinedWrite (43.09s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-28689

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team labels Jun 12, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 12, 2023
@pav-kv
Copy link
Collaborator

pav-kv commented Jun 13, 2023

This test has a history of failing with the TransactionRetryWithProtoRefreshError error:

ABORT_REASON_NEW_LEASE_PREVENTS_TXN has been allowed in #85600, but we could still hit RETRY_ASYNC_WRITE_FAILURE as in #86935. It manifested briefly again in #96849.

@pav-kv
Copy link
Collaborator

pav-kv commented Jun 13, 2023

96756 runs so far, 0 failures, over 1h31m20s

@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jun 16, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 16, 2023

cc @cockroachdb/replication

@pav-kv
Copy link
Collaborator

pav-kv commented Jun 16, 2023

@tbg Any recent changes in the reproposals area that could affect this? See the historical links above, could be related.

@cockroach-teamcity
Copy link
Member Author

kv/kvserver/batcheval.TestLeaseTransferWithPipelinedWrite failed with artifacts on release-23.1 @ 3532484aa08c3acf6eb5770f1469af09f4ba704d:

=== RUN   TestLeaseTransferWithPipelinedWrite
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/e1ab32a24518a46d10d3b8d78552a36a/logTestLeaseTransferWithPipelinedWrite3846098567
    test_log_scope.go:79: use -show-logs to present logs inline
    cmd_lease_test.go:117: worker failed: pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE - missing intent on: /Table/181/1/0/1/0): "sql txn" meta={id=6c541677 key=/Table/181/1/0/1/0 pri=0.01566117 epo=0 ts=1687421038.443001517,0 min=1687421038.443001517,0 seq=3} lock=true stat=PENDING rts=1687421038.443001517,0 wto=false gul=1687421038.943001517,0
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e1ab32a24518a46d10d3b8d78552a36a/logTestLeaseTransferWithPipelinedWrite3846098567
--- FAIL: TestLeaseTransferWithPipelinedWrite (158.57s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@tbg
Copy link
Member

tbg commented Jun 22, 2023

@tbg Any recent changes in the reproposals area that could affect this? See the historical links above, could be related.

No, none have merged.

@aliher1911 aliher1911 self-assigned this Jun 23, 2023
@aliher1911
Copy link
Contributor

I can see
I230622 08:04:01.498396 92394 kv/kvserver/replica_application_result.go:201 ⋮ [T1,n1,s1,r138/1:‹/{Table/178-Max}›,raft] 2963 failed to repropose 0b9a11f4bef0e3d3 at idx 16 with new lease index: [NotLeaseHolderError] ‹reproposal failed due to closed timestamp›; r138: replica (n1,s1):1 not lease holder; current lease is repl=(n1,s1):1 seq=1 start=0,0 exp=1687421046.453882411,0 pro=1687421040.453882411,0
in the line preceeding quiesce of TC which happens when we detect the failure. So it is caused by reproposal failure.

@aliher1911
Copy link
Contributor

aliher1911 commented Jun 27, 2023

I tried stressing on linux box and I can't repro in 24h. I now added a panic at the point where we turn missing intent into retryable error to see if we hit the actual code we are trying to test and so far I have 0 panics. I'll give it some time to grind...
Upd - good news it paniced, so we engage retries after all.

@aliher1911
Copy link
Contributor

Reproed on 16 cpu machine in 8 hours with:

./dev test pkg/kv/kvserver/batcheval --stress --ignore-cache --filter=TestLeaseTransferWithPipelinedWrite --cpus 32

@tbg tbg assigned tbg and unassigned aliher1911 Jul 6, 2023
@aliher1911 aliher1911 self-assigned this Jul 7, 2023
@aliher1911
Copy link
Contributor

Hmm... I changed cput to skip actual command execution once when it sees this particular value inserted into a table. And it always fails with missing intent error. We seem to properly handle missing intent error in txn_interceptor_pipeliner and turn it into retryable transaction error, but we don't attempt any retry. In my test it should succeed as it only ever skips cput once in the test.

@aliher1911
Copy link
Contributor

Added some logging to raft and what I see during pipelined commit:

[NotLeaseHolderError] ‹stale proposal: command was proposed under lease #2 but is being applied under lease: repl=(n3,s3):2 seq=3 start=1688767596.765460545,0 epo=1 pro=1688767596.768243822,0›; r63: replica not lease holder; current lease is repl=(n3,s3):2 seq=3 start=1688767596.765460545,0 epo=1 pro=1688767596.768243822,0

I don't see it being retried in any way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants