-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: test TestCannotCommitWithoutNewTermEntry w/ & wo/ store liveness #132242
raft: test TestCannotCommitWithoutNewTermEntry w/ & wo/ store liveness #132242
Conversation
312c840
to
34ea5c0
Compare
34ea5c0
to
bd9f2b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raft_test.go
line 4170 at r1 (raw file):
// bump the support epoch. type mockStoreLiveness struct { supportEpoch pb.Epoch
I worry we'll need something more general than this if we're to use this in all tests. This version of StoreLiveness is less expressive than the one we have in the datadriven tests, and I'm struggling with how restrictive that is over in #132108.
The reason I'm struggling there is that if you return hlc.MaxTimestamp
in SupportFrom
during the LeadSupportUntil
calculation, a leader will never de-fortify (because we don't de-fortify until the maxLeadSupportUntil
is in the past). This means that all tests where the leader gets demoted to a learner and steps down while switching configurations (read: the cases that require de-fortification to elect a new leader) will not work with this simplified version of StoreLiveness.
Separately, does this simplified version work for all tests that we'll need to change here? Are there ever test cases that need a per-store support epoch? Or is the plan to expand this if/when we find cases like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/raft/raft_test.go
line 4170 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I worry we'll need something more general than this if we're to use this in all tests. This version of StoreLiveness is less expressive than the one we have in the datadriven tests, and I'm struggling with how restrictive that is over in #132108.
The reason I'm struggling there is that if you return
hlc.MaxTimestamp
inSupportFrom
during theLeadSupportUntil
calculation, a leader will never de-fortify (because we don't de-fortify until themaxLeadSupportUntil
is in the past). This means that all tests where the leader gets demoted to a learner and steps down while switching configurations (read: the cases that require de-fortification to elect a new leader) will not work with this simplified version of StoreLiveness.Separately, does this simplified version work for all tests that we'll need to change here? Are there ever test cases that need a per-store support epoch? Or is the plan to expand this if/when we find cases like that?
For the tests that we need to change, I prepared this document to list them all.
It seems that all the cases can be fixed by either withdrawing liveness support, or bumping the support epoch.
Regarding the tests that require defortification. Do we have such tests as unit tests? Or do you mean that we will have them in the future?
One idea that comes to my mind if we ever need this is to utilize the fact that we use SupportExpired()
after calling LeadSupportUntil()
. So we change the behaviour of SupportExpired()
in the unit test mock to make it return true when we need that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raft_test.go
line 4170 at r1 (raw file):
For the tests that we need to change, I prepared this document to list them all.
Nice, thanks for doing this!
Regarding the tests that require defortification. Do we have such tests as unit tests? Or do you mean that we will have them in the future?
I think tests that require the leader to step down because of CheckQuorum
(from your doc) will fall into this category as well. The leader will not step down because of CheckQuorum
until its MaxLeadSupportUntil
is in the past. With hlc.MaxTimestamp
, that'll never be true.
We can continue this discussion in the pod meeting later today.
bd9f2b4
to
3017a72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r4, 1 of 1 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raft_test.go
line 4165 at r4 (raw file):
} // mockStoreLiveness is a simple mock implementation of StoreLiveness that
nits:
s/simple//g
"It initially treats.."
"bump the supported epoch"
pkg/raft/raft_test.go
line 716 at r6 (raw file):
tt.maybeWithdrawSupportAllPeers() tt.send(pb.Message{From: 2, To: 2, Type: pb.MsgHup})
We've previously spoken about checking whether we have support from a quorum of peers when campaigning. Would the current structure, where we withdraw support from all peers, campaign, and then grant support from all peers work well with that change?
This seems like a case where we'd want a more general implementation of store liveness. Would we be better served if we introduced that now?
pkg/raft/raft_test.go
line 4341 at r6 (raw file):
} if mock, ok := p.(*raft).storeLiveness.(*mockStoreLiveness); ok {
I like this pattern!
3017a72
to
6742c11
Compare
6742c11
to
c844b87
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r7, 1 of 1 files at r8, 2 of 2 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raft_test.go
line 4165 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nits:
s/simple//g
"It initially treats.."
"bump the supported epoch"
Did you mean to still address these?
pkg/raft/raft_test.go
line 4167 at r7 (raw file):
// livenessEntry is an entry in the mockStoreLiveness state. type livenessEntry struct { // Controls whether supportFor returns true or false.
nit: we have a soft recommendation for starting comments with the field name. So something like: "isSupporting controls whether..."
Here and below.
pkg/raft/raft_test.go
line 4191 at r7 (raw file):
// as live, but it can be configured to withdraw support, grant support, and // bump the support epoch to/from any two nodes. // Each node's state can independently be altered. This makes it possible to
nit: new line before the paragraph.
"can be altered independently"
pkg/raft/raft_test.go
line 4192 at r7 (raw file):
// bump the support epoch to/from any two nodes. // Each node's state can independently be altered. This makes it possible to // not have a shared liveness fabric between all the nodes in the test. This is
It also makes it possible to construct a uni-directional partition, if we want.
pkg/raft/raft_test.go
line 4213 at r7 (raw file):
// createStateEntryIfNotExist creates a new state entry for the given node if // it doesn't exist already. func (m mockStoreLiveness) createStateEntryIfNotExist(id pb.PeerID) {
s/createStateEntryIfNotExist/maybeCreateStoreLivenessEntry/g
pkg/raft/raft_test.go
line 4270 at r7 (raw file):
m.createStateEntryIfNotExist(id) entry := m.state[id] entry.supportFromEpoch++
With this (and other) methods, we're not trying to keep the supported/supporter node in sync. That places the burden of keeping them in sync on the test writer, which makes it easy to make mistakes.
I wonder if we should have common case wrappers for transitions we'll use frequently. For example, the case where we're bumping the SupportFrom
epoch, we also want to have the SupportFor
on the supporting node return the same epoch. Wdyt?
pkg/raft/raft_test.go
line 4165 at r8 (raw file):
} // livenessEntry is an entry in the mockStoreLiveness state.
Let's move this to a new file.
pkg/raft/raft_test.go
line 4419 at r9 (raw file):
// allWithdrawSupportForAndFromPeer makes all nodes withdraw support for and // from the given peer. func (nw *network) allWithdrawSupportForAndFromPeer(id pb.PeerID) {
The naming is a bit off here -- maybe let's call it withdrawSupportForAndFromAllPeers
?
(might be superseded by the comment to tie withdrawing support from/for in a single function)
c844b87
to
027f300
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/raft/raft_test.go
line 4170 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
For the tests that we need to change, I prepared this document to list them all.
Nice, thanks for doing this!
Regarding the tests that require defortification. Do we have such tests as unit tests? Or do you mean that we will have them in the future?
I think tests that require the leader to step down because of
CheckQuorum
(from your doc) will fall into this category as well. The leader will not step down because ofCheckQuorum
until itsMaxLeadSupportUntil
is in the past. Withhlc.MaxTimestamp
, that'll never be true.We can continue this discussion in the pod meeting later today.
Done.
pkg/raft/raft_test.go
line 4165 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Did you mean to still address these?
Done.
pkg/raft/raft_test.go
line 716 at r6 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
We've previously spoken about checking whether we have support from a quorum of peers when campaigning. Would the current structure, where we withdraw support from all peers, campaign, and then grant support from all peers work well with that change?
This seems like a case where we'd want a more general implementation of store liveness. Would we be better served if we introduced that now?
Done
pkg/raft/raft_test.go
line 4341 at r6 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I like this pattern!
Done.
pkg/raft/raft_test.go
line 4192 at r7 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
It also makes it possible to construct a uni-directional partition, if we want.
Done.
pkg/raft/raft_test.go
line 4213 at r7 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
s/createStateEntryIfNotExist/maybeCreateStoreLivenessEntry/g
Done.
pkg/raft/raft_test.go
line 4270 at r7 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
With this (and other) methods, we're not trying to keep the supported/supporter node in sync. That places the burden of keeping them in sync on the test writer, which makes it easy to make mistakes.
I wonder if we should have common case wrappers for transitions we'll use frequently. For example, the case where we're bumping the
SupportFrom
epoch, we also want to have theSupportFor
on the supporting node return the same epoch. Wdyt?
I don't think it's feasible to do it from this layer given that there is no shared fabric.
pkg/raft/raft_test.go
line 4165 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Let's move this to a new file.
Done.
pkg/raft/raft_test.go
line 4419 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
The naming is a bit off here -- maybe let's call it
withdrawSupportForAndFromAllPeers
?(might be superseded by the comment to tie withdrawing support from/for in a single function)
Done.
c2190a1
to
4016d7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raft_test.go
line 4270 at r7 (raw file):
Previously, iskettaneh wrote…
I don't think it's feasible to do it from this layer given that there is no shared fabric.
Does it make sense to create a shared fabric? We'll still need these methods, to drive actions for store liveness entries, but we can have it such that tests interact with the shared fabric for the most part.
4016d7e
to
866e3b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @iskettaneh)
pkg/raft/raft_test.go
line 4270 at r7 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Does it make sense to create a shared fabric? We'll still need these methods, to drive actions for store liveness entries, but we can have it such that tests interact with the shared fabric for the most part.
Discussed offline. I updated the mock implementation based on our discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be good to go after this round of comments.
Reviewed 2 of 2 files at r13, 4 of 4 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 75 at r13 (raw file):
// SupportFrom implements the StoreLiveness interface. func (m MockStoreLiveness) SupportFrom(id pb.PeerID) (pb.Epoch, hlc.Timestamp) {
Goland is complaining that some of the methods have pointer receivers and others have value receivers. Let's convert all of these to pointer receivers?
You'll need to make this change above when we do that:
var _ StoreLiveness = &MockStoreLiveness{}
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 133 at r13 (raw file):
// grantSupportFrom grants support from the given peer. func (m *MockStoreLiveness) grantSupportFrom(id pb.PeerID) { m.maybeCreateStoreLivenessEntry(id)
Do we still need these maybeCreateStoreLivenessEntry
calls for each of these methods? Or can we get rid of them (and assert that an entry exists?)
entry, ok := m.state[id]
if !ok {
panic("...")
}
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 140 at r13 (raw file):
// WithdrawSupportFor withdraws support for the given peer. func (m *MockStoreLiveness) WithdrawSupportFor(id pb.PeerID) {
Does this need to be public?
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 177 at r13 (raw file):
// MaybeAddPeer adds a peer to the liveness fabric if it doesn't already exist. // Also, it iterates over all existing peers and adds the new peer to their
nit: this "Also..." comment is an implementation detail which should be closer to the code and not in the top level function. I see you already have that below, so let's just drop this.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 179 at r13 (raw file):
// Also, it iterates over all existing peers and adds the new peer to their // state. func (l *LivenessFabric) MaybeAddPeer(id pb.PeerID) {
Related to my comment below about not needing MaybeAddPeer
from other method calls in this file -- can we drop the "maybe", and instead have this be AddPeer
?
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 197 at r13 (raw file):
} // BumpEpoch bidirectionally bumps the support epoch from `from` to `to`.
nit: bidirectionally is a bit confusing here -- consider rewording this to talk about the state transition here instead. Something like:
// BumpEpoch bumps the epoch supported by "from" for "to" and starts supporting the new epoch. We also update state on "to" to reflect support at this new epoch"
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 198 at r13 (raw file):
// BumpEpoch bidirectionally bumps the support epoch from `from` to `to`. func (l *LivenessFabric) BumpEpoch(from pb.PeerID, to pb.PeerID) {
nit: here, and elsehwere, instead of "from" and "to", should these be "from" and "for"?
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 199 at r13 (raw file):
// BumpEpoch bidirectionally bumps the support epoch from `from` to `to`. func (l *LivenessFabric) BumpEpoch(from pb.PeerID, to pb.PeerID) { l.MaybeAddPeer(from)
Do we need these calls to MaybeAddPeer
? Or can we have the test initialize the StoreLiveness
state correctly instead?
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 201 at r13 (raw file):
l.MaybeAddPeer(from) l.state[from].bumpSupportForEpoch(to)
Should we be setting IsSupported
to true here explicitly as well? (my comment suggestion assumes this).
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 206 at r13 (raw file):
} // WithdrawSupport bidirectionally withdraws support from `from` to `to`.
nit: Same comment about bidirectionally -- let's talk about the state transition instead.
"WithdrawSupport withdraws support from from
for to
. We also update state on to
to reflect withdrawal of support.."
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 215 at r13 (raw file):
} // WithdrawSupport bidirectionally grants support `from` from to `to`.
nit: same comment about bidirectional as above.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 225 at r13 (raw file):
// SetSupportExpired explicitly controls what SupportExpired returns regardless // of the timestamp.
"timestamp supplied to it".
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 231 at r13 (raw file):
} // WithdrawAllSupportForPeer causes all peers in the liveness fabric to withdraw
nit: "WithdrawSupportForPeerFromAllPeers withdraws support for the target peer from all peers in the liveness fabric"
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 233 at r13 (raw file):
// WithdrawAllSupportForPeer causes all peers in the liveness fabric to withdraw // their support to the target peer. func (l *LivenessFabric) WithdrawAllSupportForPeer(target pb.PeerID) {
nit: "WithdrawSupportForPeerFromAllPeers"
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 240 at r13 (raw file):
// GrantAllSupportForPeer causes all peers in the liveness fabric to grant // their support to the target peer.
nit: "GrantSupportForPeerFromAllPeers grants support for the target peer from all peers in the liveness fabric."
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 241 at r13 (raw file):
// GrantAllSupportForPeer causes all peers in the liveness fabric to grant // their support to the target peer. func (l *LivenessFabric) GrantAllSupportForPeer(targetID pb.PeerID) {
nit: "GrantSupportForPeerFromAllPeers
pkg/raft/raft_test.go
line 717 at r15 (raw file):
// Bumping all epochs will make all followers stop supporting the current // fortified leader.
Do we still need this? Or does the call to WithdrawAllSupportForPeer(1)
give us this already?
866e3b5
to
00e00b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 75 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Goland is complaining that some of the methods have pointer receivers and others have value receivers. Let's convert all of these to pointer receivers?
You'll need to make this change above when we do that:
var _ StoreLiveness = &MockStoreLiveness{}
Done.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 133 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do we still need these
maybeCreateStoreLivenessEntry
calls for each of these methods? Or can we get rid of them (and assert that an entry exists?)entry, ok := m.state[id] if !ok { panic("...") }
Done.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 140 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Does this need to be public?
Done.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 179 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Related to my comment below about not needing
MaybeAddPeer
from other method calls in this file -- can we drop the "maybe", and instead have this beAddPeer
?
Done.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 199 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do we need these calls to
MaybeAddPeer
? Or can we have the test initialize theStoreLiveness
state correctly instead?
Done.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 201 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we be setting
IsSupported
to true here explicitly as well? (my comment suggestion assumes this).
I did it for now, but I am not entirely sure about it. My thought that we might need to bump the epoch without granting support to simulate something
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 225 at r13 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
"timestamp supplied to it".
Done.
pkg/raft/raft_test.go
line 717 at r15 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do we still need this? Or does the call to
WithdrawAllSupportForPeer(1)
give us this already?
Now, BumpAllSupportEpochs() will restore support (but at a higher epoch).
So the workflow is this:
- Withdraw support
- Bump epoch (and implicitly restore support)
We need (1) to avoid the leader refortifying the follower at the higher epoch (in 2). Technically though, we can just do (1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r16, 4 of 4 files at r17, 1 of 1 files at r18, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @iskettaneh)
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 201 at r13 (raw file):
Previously, iskettaneh wrote…
I did it for now, but I am not entirely sure about it. My thought that we might need to bump the epoch without granting support to simulate something
I think you can still do that with bump + withdraw. You can do things the other way around too, so the question really is which default makes sense -- I think this one does.
pkg/raft/raftstoreliveness/mock_store_liveness.go
line 253 at r16 (raw file):
} // WithdrawSupport withdraws the support by "fromID" for "forID". We also update
nit: now that we've renamed these, you could get rid of the quotes around "fromID" and "forID" -- typically, we don't wrap parameter names in quotes.
(here and elsewhere).
pkg/raft/raft_test.go
line 717 at r15 (raw file):
Previously, iskettaneh wrote…
Now, BumpAllSupportEpochs() will restore support (but at a higher epoch).
So the workflow is this:
- Withdraw support
- Bump epoch (and implicitly restore support)
We need (1) to avoid the leader refortifying the follower at the higher epoch (in 2). Technically though, we can just do (1)
Ack. I think we should just do 1 and not do 2. Like we spoke about offline, doing 2 can lead to a race where the old leader is able to re-fortify.
00e00b1
to
cb581a9
Compare
This commit creates MockStoreLiveness to be used in Raft unit tests. Initially, it acts like the `AlwaysLive` store liveness, but it has methods that can alter its behavior to withdraw/grant support to other stores, and bump the support epoch. Epic: None Release note: None
This commit adds the test modifier withStoreLiveness which allows us to provide any StoreLiveness implementation to the test. Epic: None Release note: None
This commit changes TestCannotCommitWithoutNewTermEntry to allow testing it when store liveness is both enabled and disabled. References: cockroachdb#132241 Release note: None
cb581a9
to
36958cc
Compare
bors r+ |
blathers backport 24.3 |
This commit changes TestCannotCommitWithoutNewTermEntry to allow testing it when store liveness is both enabled and disabled.
In order to achieve that, it adds a mockStoreLiveness for raft unit tests. It acts like AlwaysLive store liveness by default, but it could be to change the supportFor/SupportFrom between any two nodes, also, it allows to bump the support epochs.
References: #132241
Release note: None