-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: [dnm] add RACv2 failing split/merge/migrate repros #135339
kvserver: [dnm] add RACv2 failing split/merge/migrate repros #135339
Conversation
90abd9c
to
2b4d469
Compare
792b620
to
dd5456f
Compare
dd5456f
to
193c319
Compare
Adding @pav-kv and @sumeerbhola as reviewers. Not intending to merge, added for notification / tracking only. |
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
15e37ea
to
561ef87
Compare
This test currently fails, while also churning snapshots towards the RHS range post-split. It appears that the RHS post-split replica on the store which has no available tokens (from the leader, n1's perspective) is effectively wedged. ```bash dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f TestFlowControlSendQueueRangeSplitSnap --show-logs --timeout=120s ``` Epic: none Release note: None
This adds a new RACv2 integration test, `TestFlowControlSendQueueRangeMerge`. The test exhausts the tokens on s3 (s1,s2,s3) by issuing 5 MiB / 4 MiB of writes to the RHS and blocking admission on n3. The test currently fails on the merge with: ``` testcluster.go:760: merging at /Table/Max: /Table/Max: merge unexpected error: kv/kvserver/replica_command.go:962: merge failed: waiting for all right-hand replicas to catch up: operation "waiting for merge application" timed out after 5.001s (given timeout 5s): grpc: context deadline exceeded [code 4/DeadlineExceeded] (1) attached stack trace -- stack trace: | github.com/cockroachdb/cockroach/pkg/server.(*testServer).MergeRanges | testserver.go:2039 | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).MergeRanges | pkg/testutils/testcluster/testcluster.go:750 | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).MergeRangesOrFatal | pkg/testutils/testcluster/testcluster.go:758 | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestFlowControlSendQueueRangeMerge | pkg/kv/kvserver/flow_control_integration_test.go:5163 | testing.tRunner | GOROOT/src/testing/testing.go:1689 | runtime.goexit | src/runtime/asm_arm64.s:1222 Wraps: (2) /Table/Max: merge unexpected error: kv/kvserver/replica_command.go:962: merge failed: waiting for all right-hand replicas to catch up: operation "waiting for merge application" timed out after 5.001s (given timeout 5s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ``` Run with the following for more logging: ``` dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f TestFlowControlSendQueueRangeMerge --show-logs --timeout=30s ``` Epic: none Release note: None
Add a new RACv2 integration test, `TestFlowControlSendQueueRangeMigrate`. This test fails, but not by timing out (it will with the below args due to the timeout flag specified), instead s1 and s3 on the scratch range appear to ping pong messages back and forth like: ``` -- Test timed out at 2024-11-16 00:39:36 UTC -- I241116 00:39:37.221610 4048 kv/kvserver_test/client_raft_helpers_test.go:104 [T1,Vsystem,n3] 725 [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27 I241116 00:39:37.222264 4068 kv/kvserver_test/client_raft_helpers_test.go:104 [T1,Vsystem,n1] 726 [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25 I241116 00:39:38.221271 4048 kv/kvserver_test/client_raft_helpers_test.go:104 [T1,Vsystem,n3] 727 [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27 I241116 00:39:38.221876 4068 kv/kvserver_test/client_raft_helpers_test.go:104 [T1,Vsystem,n1] 728 [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25 I241116 00:39:38.971603 4048 kv/kvserver_test/client_raft_helpers_test.go:104 [T1,Vsystem,n3] 729 [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27 I241116 00:39:38.972018 4068 kv/kvserver_test/client_raft_helpers_test.go:104 [T1,Vsystem,n1] 730 [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25 ``` The timeout is on these lines, calling `kvserver.waitForApplication(..)` here: - https://github.com/kvoli/cockroach/blob/dd5456fea47451d5b47e10ca40991808d7e47780/pkg/kv/kvserver/replica_write.go#L289-L292 Run with the following for more logging: ```bash dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f TestFlowControlSendQueueRangeMigrate --show-logs --timeout=60s ``` Epic: none Release note: None
561ef87
to
9b3a6c0
Compare
Splitting off the relocate test variant, as that works as expected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/flow_control_integration_test.go
line 5120 at r2 (raw file):
require.NoError(t, err) h.enableVerboseRaftMsgLoggingForRange(desc.RangeID) h.enableVerboseRaftMsgLoggingForRange(desc.RangeID + 1)
Is the scratch range guaranteed to have the highest rangeID at this point in the test?
pkg/kv/kvserver/flow_control_integration_test.go
line 4979 at r1 (raw file):
} tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{
nit: do we need 5 nodes given that we only use 3?
I'm going to close this PR out and fork the branch to extract the tests, on-top of the draft fix #136077. |
TestFlowControlSendQueueRangeSplitSnap
This test currently fails, while also churning snapshots towards the
RHS range post-split. It appears that the RHS post-split replica on the
store which has no available tokens (from the leader, n1's perspective)
is effectively wedged.
This adds a new RACv2 integration test,
TestFlowControlSendQueueRangeMerge
.The test exhausts the tokens on s3 (s1,s2,s3) by issuing 5 MiB / 4 MiB
of writes to the RHS and blocking admission on n3.
The test currently fails on the merge with:
Run with the following for more logging:
Add a new RACv2 integration test,
TestFlowControlSendQueueRangeMigrate
.This test fails, but not by timing out (it will with the below args due
to the timeout flag specified), instead s1 and s3 on the scratch range
appear to ping pong messages back and forth like:
The timeout is on these lines, calling
kvserver.waitForApplication(..)
here:
Run with the following for more logging:
Epic: none
Release note: None