kvserver: fix flaky flow token return tests #132345
TODO:
Reviewed 4 of 5 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/testdata/flow_control_integration_v2/quiesced_range_v1_encoding
line 48 at r1 (raw file):
-- (Allow below-raft admission to proceed, and enable piggybacking. All tokens
-- are returned via the piggybacking mechanism.)
RaftMessageRequest.AdmittedState was a mechanism not available to RACv1. Is that not returning tokens since by the time of the MsgAppResp the work is not admitted?
And why were we not able to disable piggybacking the way the v1 tests do? The code in raftTransport that looks at t.knobs.DisablePiggyBackedFlowTokenDispatch also applies to v2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
pkg/kv/kvserver/testdata/flow_control_integration_v2/quiesced_range_v1_encoding
line 48 at r1 (raw file):
This is a V2 test. One version of it tests that returns via AdmittedState are reliable (because MsgApp pings always happen if there are unadmitted deductions). Another version of the test sets AdmittedState to empty in all messages and makes sure the piggybacking channel delivers the thing. The second test might not be reliable though, now that I think of it. Is the piggybacker's guarantee that the admitted vector is only going to be sent once?
And why were we not able to disable piggybacking the way the v1 tests do?
Not sure what you mean. We were able to, see TestFlowControlTokenReturnsPiggybackedV2, which uses the same knobs as v1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/testdata/flow_control_integration_v2/quiesced_range_v1_encoding
line 48 at r1 (raw file):
Ack for the rest.
Is the piggybacker's guarantee that the admitted vector is only going to be sent once?
The piggybacker tries once since it has a queue and it pops from the queue and sends. The message could get dropped.
The MsgApp pinging is supposed to give us liveness.
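To make the distinction above concrete, here is a minimal, self-contained Go sketch of the two delivery paths. It is not the kvserver code: the types `admitted`, `piggybacker`, `leader`, and `follower`, and the function `simulate`, are all hypothetical names invented for illustration. It only models the idea that a pop-and-send queue delivers at most once, while a ping that repeats as long as deductions are unadmitted eventually returns the tokens.

```go
package main

import "fmt"

// admitted is a hypothetical stand-in for an admitted-vector update.
type admitted struct{ index uint64 }

// piggybacker models an at-most-once channel: each queued update is
// popped and handed to send exactly once, whether or not it arrives.
type piggybacker struct{ queue []admitted }

func (p *piggybacker) add(a admitted) { p.queue = append(p.queue, a) }

// flush pops every queued update. Nothing is re-queued if send fails.
func (p *piggybacker) flush(send func(admitted) bool) (delivered int) {
	for _, a := range p.queue {
		if send(a) {
			delivered++
		}
	}
	p.queue = p.queue[:0]
	return delivered
}

// leader tracks token deductions until the follower reports admission.
type leader struct {
	deducted uint64 // highest index with deducted tokens
	returned uint64 // highest index whose tokens were returned
}

func (l *leader) onAdmitted(a admitted) {
	if a.index > l.returned {
		l.returned = a.index
	}
}

func (l *leader) hasUnadmitted() bool { return l.returned < l.deducted }

// follower has admitted everything up to admittedIndex.
type follower struct{ admittedIndex uint64 }

// respond models the reply to a ping; it carries the current admitted state.
func (f *follower) respond() admitted { return admitted{index: f.admittedIndex} }

// simulate reports whether all tokens are returned, given whether the
// piggybacked message is dropped and whether pings are enabled.
func simulate(dropPiggyback, pings bool) bool {
	l := &leader{deducted: 5}
	f := &follower{admittedIndex: 5}
	p := &piggybacker{}
	p.add(admitted{index: 5})
	p.flush(func(a admitted) bool {
		if dropPiggyback {
			return false // message lost; the queue will not retry
		}
		l.onAdmitted(a)
		return true
	})
	// Pings repeat only while there are unadmitted deductions.
	for tick := 0; tick < 3 && pings && l.hasUnadmitted(); tick++ {
		l.onAdmitted(f.respond())
	}
	return !l.hasUnadmitted()
}

func main() {
	fmt.Println(simulate(false, false)) // piggyback delivered
	fmt.Println(simulate(true, false))  // dropped, no pings: tokens stuck
	fmt.Println(simulate(true, true))   // dropped, pings recover
}
```

In this toy model a piggybacking-only configuration succeeds only when the single send is not dropped, which is why a test relying on that channel alone can flake.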
-- modulo the extended CI failure for both:
TestFlowControlTokenReturnsPiggybackedV2/v2_enabled_when_leader_level=2
TestFlowControlTokenReturnsPiggybackedV2/v2_enabled_when_leader_level=1
Let's discuss these tomorrow when we sync.
Reviewed all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)
@kvoli I won't be able to sync tomorrow, but I'm still trying to fix these flakes, and hopefully will by tomorrow.
Ack -- let me know if I can help out, I'll be responsive on (internal) slack. I'm sure you probably found some
./dev test pkg/kv/kvserver -v --vmodule='raft=1,admission=1,replica_flow_control=1,work_queue=1,replica_raft=1,replica_proposal_buf=1,raft_transport=2,kvadmission=1,work_queue=1,replica_flow_control=1,client_raft_helpers_test=1,range_controller=2,token_counter=2,token_tracker=2,processor=2,kvflowhandle=1' -f TestFlowControlTokenReturnsPiggybackedV2/v2_enabled_when_leader_level=2 --stress --race
@kvoli This is helpful, thanks. I couldn't yet get a single repro locally.
One of the flakes is this:
After some extensive stressing, I managed to see a repro which has the following in the log:
So this one flake is due to a leader and lease move. We should disable lease moves if we want this test stable. This race probably affects other tests which only disable election, but not the lease moves. A slow run can move the lease, and the leader will follow.
I think I'm getting to the bottom of the second flake, which is most likely related to a race between
Force-pushed from dc46773 to e56f03a.
The "disconnect" flake seems fixed. The second flake still occurs, and various attempts to get rid of it were not fruitful.
Maybe it's not worth trying to make a "piggybacking-only" version of this test (testing that this delivery channel does work if conditions are "ideal"). Since we're interested in a higher-level guarantee that, overall, token returns are reliable, and the mechanism that guarantees it is the
Force-pushed from e56f03a to bcee20b.
Epic: none Release note: none
Force-pushed from bcee20b to ffd5301.
I picked this approach, and removed the test which disables pings. We now only have The
bors r+
132345: kvserver: fix flaky flow token return tests r=pav-kv a=pav-kv
Part of #129581
Co-authored-by: Pavel Kalinnikov
Build failed:
bors retry
Part of #129581