Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: [dnm] add RACv2 failing split/merge/migrate repros #135339

Closed

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Nov 15, 2024

TestFlowControlSendQueueRangeSplitSnap

This test currently fails, while also churning snapshots towards the
RHS range post-split. It appears that the RHS post-split replica on the
store which has no available tokens (from the leader, n1's perspective)
is effectively wedged.

dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeSplitSnap --show-logs --timeout=120s

This adds a new RACv2 integration test,
TestFlowControlSendQueueRangeMerge.

The test exhausts the tokens on s3 (s1,s2,s3) by issuing 5 MiB / 4 MiB
of writes to the RHS and blocking admission on n3.

The test currently fails on the merge with:

testcluster.go:760: merging at /Table/Max: /Table/Max: merge unexpected error: kv/kvserver/replica_command.go:962: merge failed: waiting for all right-hand replicas to catch up: operation "waiting for merge application" timed out after 5.001s (given timeout 5s): grpc: context deadline exceeded [code 4/DeadlineExceeded]
        (1) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/server.(*testServer).MergeRanges
          |     testserver.go:2039
          | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).MergeRanges
          |     pkg/testutils/testcluster/testcluster.go:750
          | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).MergeRangesOrFatal
          |     pkg/testutils/testcluster/testcluster.go:758
          | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestFlowControlSendQueueRangeMerge
          |     pkg/kv/kvserver/flow_control_integration_test.go:5163
          | testing.tRunner
          |     GOROOT/src/testing/testing.go:1689
          | runtime.goexit
          |     src/runtime/asm_arm64.s:1222
        Wraps: (2) /Table/Max: merge unexpected error: kv/kvserver/replica_command.go:962: merge failed: waiting for all right-hand replicas to catch up: operation "waiting for merge application" timed out after 5.001s (given timeout 5s): grpc: context deadline exceeded [code 4/DeadlineExceeded]

Run with the following for more logging:

dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeMerge --show-logs --timeout=30s

Add a new RACv2 integration test, TestFlowControlSendQueueRangeMigrate.

This test fails, but not by timing out (it will with the below args due
to the timeout flag specified), instead s1 and s3 on the scratch range
appear to ping pong messages back and forth like:

-- Test timed out at 2024-11-16 00:39:36 UTC --
I241116 00:39:37.221610 4048 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n3] 725   [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27
I241116 00:39:37.222264 4068 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n1] 726   [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25
I241116 00:39:38.221271 4048 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n3] 727   [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27
I241116 00:39:38.221876 4068 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n1] 728   [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25
I241116 00:39:38.971603 4048 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n3] 729   [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27
I241116 00:39:38.972018 4068 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n1] 730   [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25

The timeout is on these lines, calling kvserver.waitForApplication(..)
here:

Run with the following for more logging:

dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeMigrate --show-logs --timeout=60s

Epic: none
Release note: None

@kvoli kvoli self-assigned this Nov 15, 2024
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli force-pushed the 241115.rac-repro-send-queue-split-snap branch from 90abd9c to 2b4d469 Compare November 15, 2024 22:09
@kvoli kvoli changed the title kvserver: [dnm] add TestFlowControlSendQueueRangeSplitSnap test kvserver: [dnm] add RACv2 failing split/merge repros Nov 15, 2024
@kvoli kvoli force-pushed the 241115.rac-repro-send-queue-split-snap branch 2 times, most recently from 792b620 to dd5456f Compare November 16, 2024 00:43
@kvoli kvoli changed the title kvserver: [dnm] add RACv2 failing split/merge repros kvserver: [dnm] add RACv2 failing split/merge/migrate repros Nov 16, 2024
@kvoli kvoli force-pushed the 241115.rac-repro-send-queue-split-snap branch from dd5456f to 193c319 Compare November 16, 2024 00:49
@kvoli
Copy link
Collaborator Author

kvoli commented Nov 16, 2024

Adding @pav-kv and @sumeerbhola as reviewers. Not intending to merge, added for notification / tracking only.

Copy link

blathers-crl bot commented Nov 16, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli kvoli force-pushed the 241115.rac-repro-send-queue-split-snap branch from 15e37ea to 561ef87 Compare November 16, 2024 04:38
This test currently fails, while also churning snapshots towards the
RHS range post-split. It appears that the RHS post-split replica on the
store which has no available tokens (from the leader, n1's perspective)
is effectively wedged.

```bash
dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeSplitSnap --show-logs --timeout=120s
```

Epic: none
Release note: None
This adds a new RACv2 integration test,
`TestFlowControlSendQueueRangeMerge`.

The test exhausts the tokens on s3 (s1,s2,s3) by issuing 5 MiB / 4 MiB
of writes to the RHS and blocking admission on n3.

The test currently fails on the merge with:

```
testcluster.go:760: merging at /Table/Max: /Table/Max: merge unexpected error: kv/kvserver/replica_command.go:962: merge failed: waiting for all right-hand replicas to catch up: operation "waiting for merge application" timed out after 5.001s (given timeout 5s): grpc: context deadline exceeded [code 4/DeadlineExceeded]
        (1) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/server.(*testServer).MergeRanges
          |     testserver.go:2039
          | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).MergeRanges
          |     pkg/testutils/testcluster/testcluster.go:750
          | github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).MergeRangesOrFatal
          |     pkg/testutils/testcluster/testcluster.go:758
          | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestFlowControlSendQueueRangeMerge
          |     pkg/kv/kvserver/flow_control_integration_test.go:5163
          | testing.tRunner
          |     GOROOT/src/testing/testing.go:1689
          | runtime.goexit
          |     src/runtime/asm_arm64.s:1222
        Wraps: (2) /Table/Max: merge unexpected error: kv/kvserver/replica_command.go:962: merge failed: waiting for all right-hand replicas to catch up: operation "waiting for merge application" timed out after 5.001s (given timeout 5s): grpc: context deadline exceeded [code 4/DeadlineExceeded]
```

Run with the following for more logging:

```
dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeMerge --show-logs --timeout=30s
```

Epic: none
Release note: None
Add a new RACv2 integration test, `TestFlowControlSendQueueRangeMigrate`.

This test fails, but not by timing out (it will with the below args due
to the timeout flag specified), instead s1 and s3 on the scratch range
appear to ping pong messages back and forth like:

```
-- Test timed out at 2024-11-16 00:39:36 UTC --
I241116 00:39:37.221610 4048 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n3] 725   [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27
I241116 00:39:37.222264 4068 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n1] 726   [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25
I241116 00:39:38.221271 4048 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n3] 727   [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27
I241116 00:39:38.221876 4068 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n1] 728   [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25
I241116 00:39:38.971603 4048 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n3] 729   [raft] r69 Raft message 1->3 MsgApp Term:6 Log:6/25 Commit:27
I241116 00:39:38.972018 4068 kv/kvserver_test/client_raft_helpers_test.go:104  [T1,Vsystem,n1] 730   [raft] r69 Raft message 3->1 MsgAppResp Term:6 Log:0/25 Commit:25
```

The timeout is on these lines, calling `kvserver.waitForApplication(..)`
here:
- https://github.com/kvoli/cockroach/blob/dd5456fea47451d5b47e10ca40991808d7e47780/pkg/kv/kvserver/replica_write.go#L289-L292

Run with the following for more logging:

```bash
dev test pkg/kv/kvserver -v --vmodule='replica_raft=1,kvflowcontroller=2,replica_proposal_buf=1,raft_transport=2,kvflowdispatch=1,kvadmission=1,kvflowhandle=1,work_queue=1,replica_flow_control=1,tracker=1,client_raft_helpers_test=1,raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlSendQueueRangeMigrate --show-logs --timeout=60s
```

Epic: none
Release note: None
@kvoli kvoli force-pushed the 241115.rac-repro-send-queue-split-snap branch from 561ef87 to 9b3a6c0 Compare November 18, 2024 15:27
@kvoli
Copy link
Collaborator Author

kvoli commented Nov 18, 2024

Splitting off the relocate test variant, as that works as expected:

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/flow_control_integration_test.go line 5120 at r2 (raw file):

	require.NoError(t, err)
	h.enableVerboseRaftMsgLoggingForRange(desc.RangeID)
	h.enableVerboseRaftMsgLoggingForRange(desc.RangeID + 1)

Is the scratch range guaranteed to have the highest rangeID at this point in the test?


pkg/kv/kvserver/flow_control_integration_test.go line 4979 at r1 (raw file):

	}

	tc := testcluster.StartTestCluster(t, 5, base.TestClusterArgs{

nit: do we need 5 nodes given that we only use 3?

@kvoli
Copy link
Collaborator Author

kvoli commented Nov 26, 2024

I'm going to close this PR out and fork the branch to extract the tests, on-top of the draft fix #136077.

@kvoli kvoli closed this Nov 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants