Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: fix flaky flow token return tests #132345

Merged
merged 1 commit into from
Oct 11, 2024

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Oct 10, 2024

Part of #129581

@pav-kv pav-kv requested review from kvoli and sumeerbhola October 10, 2024 20:47
@pav-kv pav-kv requested a review from a team as a code owner October 10, 2024 20:47
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 10, 2024

TODO:

  • rename the datadriven test files, according to the test renames

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/testdata/flow_control_integration_v2/quiesced_range_v1_encoding line 48 at r1 (raw file):


-- (Allow below-raft admission to proceed, and enable piggybacking. All tokens
-- are returned via the piggybacking mechanism.)

RaftMessageRequest.AdmittedState was a mechanism not available to RACv1. Is that not returning tokens since by the time of the MsgAppResp the work is not admitted?

And why were we not able to disable piggybacking the way the v1 tests do? The code in raftTransport that looks at t.knobs.DisablePiggyBackedFlowTokenDispatch also applies to v2.

Copy link
Collaborator Author

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)


pkg/kv/kvserver/testdata/flow_control_integration_v2/quiesced_range_v1_encoding line 48 at r1 (raw file):
This is a V2 test. One version of it tests that returns via AdmittedState are reliable (because MsgApp pings always happen if there are unadmitted deductions). Another version of the test sets AdmittedState to empty in all messages and makes sure the piggybacking channel + delivers the thing. The second test might be not reliable though, now that I think of it. Is the piggybacker's guarantee that the admitted vector is only going to be sent once?

And why were we not able to disable piggybacking the way the v1 tests do?

Not sure what you mean. We were able to, see TestFlowControlTokenReturnsPiggybackedV2 which uses the same knobs as v1.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/testdata/flow_control_integration_v2/quiesced_range_v1_encoding line 48 at r1 (raw file):
Ack for the rest.

Is the piggybacker's guarantee that the admitted vector is only going to be sent once?

The piggybacker tries once since it has a queue and it pops from the queue and sends. The message could get dropped.
The MsgApp pinging is supposed to give us liveness.

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: -- modulo the extended CI failure for both:

TestFlowControlTokenReturnsPiggybackedV2/v2_enabled_when_leader_level=2
TestFlowControlTokenReturnsPiggybackedV2/v2_enabled_when_leader_level=1

Lets discuss these tomorrow when we sync.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 10, 2024

@kvoli I won't be able to sync tomorrow, but I'm still trying to fix these flakes, and hopefully will by tomorrow.

@kvoli
Copy link
Collaborator

kvoli commented Oct 10, 2024

@kvoli I won't be able to sync tomorrow, but I'm still trying to fix these flakes, and hopefully will by tomorrow.

Ack -- let me know if I can help out, I'll be responsive on (internal) slack.

I'm sure you probably found some --vmodule combo which works already, but when debugging TestFlowControl.* integration tests, I have been using these flags:

./dev test pkg/kv/kvserver -v --vmodule='raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f  TestFlowControlTokenReturnsPiggybackedV2/v2_enabled_when_leader_level=2 --stress --race

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 11, 2024

@kvoli This is helpful, thanks. I couldn't yet get a single repro locally.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 11, 2024

One of the flakes is this:

        -  kvflowcontrol.tokens.eval.elastic.returned.disconnect             | 0 B
        +  kvflowcontrol.tokens.eval.elastic.returned.disconnect             | 2.0 MiB

After some extensive stressing, I managed to see a repro which has the following in the log:

replica_raft.go:2647 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 696  transferring raft leadership to replica ID 3
...
range_controller.go:1351 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 699  r70 setting range leaseholder replica_id=3
range_controller.go:2496 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 700  r70:(n1,s1):1 stream t‹1›/s‹1› changing to probe                                        
range_controller.go:2496 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 701  r70:(n2,s2):2 stream t‹1›/s‹2› changing to probe                                        
range_controller.go:2496 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 702  r70:(n3,s3):3 stream t‹1›/s‹3› changing to probe                                        
token_counter.go:568 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 703  adjusted flow tokens (wc=elastic stream=t‹1›/s‹3› delta=‹+1.0 MiB› flag=‹disconnect›): regul    ar=‹+16 MiB› elastic=‹+8.0 MiB›                                                                                                                                                                                                     
token_counter.go:568 ⋮ [T1,Vsystem,n1,s1,r70/1:‹/{Table/Max-Max}›,raft] 704  adjusted flow tokens (wc=elastic stream=t‹1›/s‹3› delta=‹+1.0 MiB› flag=‹disconnect›): regul    ar=‹+16 MiB› elastic=‹+8.0 MiB›

So this one flake is due to a leader and lease move. We should disable lease moves if we want this test stable. This race probably affects other tests which only disable election, but not the lease moves. A slow run can move the lease, and the leader will follow.

I think I'm getting to the bottom of the second flake, which is most likely related to a race between MsgAppResp and piggybacking. If MsgAppResp makes it first, the admitted vector is never piggybacked, but the test expects a piggybacked delivery. Should either relax the test, or make sure to block some things to prevent this race.

@pav-kv pav-kv force-pushed the rac2-token-return-tests branch from dc46773 to e56f03a Compare October 11, 2024 02:59
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 11, 2024

The "disconnect" flake seems fixed.

The second flake still occurs, and various attempts to get rid of it were not fruitful. Maybe it's not worth trying to make a "piggybacking-only" version of this test (testing that this delivery channel does work if conditions are "ideal"). Since we're interested in a higher-level guarantee that, overall, token returns are reliable, and the mechanism that guarantees it is the MsgApp pings, we shouldn't completely disable it in a test like this.

@pav-kv pav-kv force-pushed the rac2-token-return-tests branch from e56f03a to bcee20b Compare October 11, 2024 11:30
Epic: none
Release note: none
@pav-kv pav-kv force-pushed the rac2-token-return-tests branch from bcee20b to ffd5301 Compare October 11, 2024 11:31
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 11, 2024

Since we're interested in a higher-level guarantee that, overall, token returns are reliable, and the mechanism that guarantees it is the MsgApp pings, we shouldn't completely disable it in a test like this.

I picked this approach, and removed the test which disables pings. We now only have TestFlowControlUnquiescedRangeV2 which tests that: the range does not quiesce until everything is returned; and mixes MsgAppResp-only returns with MsgAppResp+piggybacking (yet MsgAppResp still being the main one that ensures reliability).

The TestFlowControlQuiescedRangeV2 is not needed because we can't have a quiesced range.

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 11, 2024

bors r+

craig bot pushed a commit that referenced this pull request Oct 11, 2024
132345: kvserver: fix flaky flow token return tests r=pav-kv a=pav-kv

Part of #129581

Co-authored-by: Pavel Kalinnikov <[email protected]>
@craig
Copy link
Contributor

craig bot commented Oct 11, 2024

Build failed:

@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 11, 2024

bors retry

@craig craig bot merged commit 3005436 into cockroachdb:master Oct 11, 2024
22 of 23 checks passed
@pav-kv pav-kv deleted the rac2-token-return-tests branch October 11, 2024 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants