storepool: consider stores suspect on join #97532
Conversation
Force-pushed from ccf4194 to 4caed55 (compare)
The commit message mentions both new and rejoined nodes are treated as suspect. Is that still correct?
Force-pushed from 4caed55 to 91a6b25 (compare)
Fixed the commit note.
Reviewable status: complete! 0 of 0 LGTMs obtained
A few questions and minor nits. Great stuff.
We should consider adding a unit test which asserts (unless something like this is already around; a rough sketch follows the list):
- stores on new nodes go from unknown -> available
- stores on rejoining nodes go from unavailable (or dead) -> suspect
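For concreteness, a rough, self-contained sketch of what such a table-driven test might assert; `statusOnJoin` and the status constants are hypothetical stand-ins, not the actual storepool API:

```go
package main

import "testing"

type storeStatus string

const (
	statusUnknown     storeStatus = "unknown"
	statusAvailable   storeStatus = "available"
	statusUnavailable storeStatus = "unavailable"
	statusDead        storeStatus = "dead"
	statusSuspect     storeStatus = "suspect"
)

// statusOnJoin is a hypothetical stand-in for the transition under test:
// brand-new stores (no prior status) become available, while stores that
// were unavailable or dead become suspect when their node rejoins.
func statusOnJoin(prev storeStatus) storeStatus {
	switch prev {
	case statusUnavailable, statusDead:
		return statusSuspect
	default:
		return statusAvailable
	}
}

func TestStatusOnJoin(t *testing.T) {
	cases := []struct {
		prev, want storeStatus
	}{
		{statusUnknown, statusAvailable},   // store on a new node
		{statusUnavailable, statusSuspect}, // store on a rejoining node
		{statusDead, statusSuspect},        // store on a node rejoining after being declared dead
	}
	for _, c := range cases {
		if got := statusOnJoin(c.prev); got != c.want {
			t.Errorf("statusOnJoin(%s) = %s, want %s", c.prev, got, c.want)
		}
	}
}
```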
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)
-- commits
line 4 at r2:
nit: than the
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 165 at r2 (raw file):
	l livenesspb.Liveness, now time.Time, deadThreshold time.Duration,
) livenesspb.NodeLivenessStatus {
	// If we don't have a liveness expiration time, treat the status as unknown.
Does this prevent new nodes being treated as suspect? The comment kind of implies this, but it would be good to be very explicit so we don't break this behavior by accident.
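For context, a minimal self-contained sketch of the behavior being asked about, using simplified stand-in types rather than the real `livenesspb` definitions; only the shape loosely mirrors the quoted signature:

```go
package main

import (
	"fmt"
	"time"
)

type NodeLivenessStatus string

const (
	StatusUnknown NodeLivenessStatus = "unknown"
	StatusDead    NodeLivenessStatus = "dead"
	StatusLive    NodeLivenessStatus = "live"
)

type Liveness struct {
	// Expiration is the zero time when no liveness record has been seen yet,
	// e.g. for a node that has only just joined.
	Expiration time.Time
}

func livenessStatus(l Liveness, now time.Time, deadThreshold time.Duration) NodeLivenessStatus {
	if l.Expiration.IsZero() {
		// No liveness expiration recorded: report unknown rather than dead,
		// which keeps brand-new nodes from being classified as suspect later.
		return StatusUnknown
	}
	if now.After(l.Expiration.Add(deadThreshold)) {
		return StatusDead
	}
	return StatusLive
}

func main() {
	now := time.Now()
	fmt.Println(livenessStatus(Liveness{}, now, 5*time.Minute))                                       // unknown (new node)
	fmt.Println(livenessStatus(Liveness{Expiration: now.Add(-10 * time.Minute)}, now, 5*time.Minute)) // dead
	fmt.Println(livenessStatus(Liveness{Expiration: now.Add(time.Minute)}, now, 5*time.Minute))       // live
}
```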
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 268 at r2 (raw file):
	// heartbeat
	//
	// The store is considered dead if it hasn't been updated via gossip
Was this code redundant with the liveness dead check below?
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 293 at r2 (raw file):
	}
	// check whether the store is currently suspect. We measure that by
nit: Check
pkg/kv/kvserver/client_replica_test.go
line 1559 at r2 (raw file):
	// Set the clock far enough forward to cause a lease extension, but not far
	// enough to make the node appear unhealthy.
	l.manualClock.Increment((lease.Expiration.WallTime - l.manualClock.UnixNano()) / 2)
Why was this necessary?
I'll add tests on this. There is some partial coverage, but not very complete.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)
Previously, kvoli (Austen) wrote…
nit: than the
Done
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 165 at r2 (raw file):
Previously, kvoli (Austen) wrote…
Does this prevent new nodes being treated as suspect? The comment kind of implies this, but it would be good to be very explicit so we don't break this behavior by accident.
Cleaned up the comment to make this more clear.
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 268 at r2 (raw file):
Previously, kvoli (Austen) wrote…
Was this code redundant with the liveness dead check below?
Unfortunately no. This is a little messy, but there is a check for sd.Desc.Node.NodeID, so we first have to check if sd.Desc is nil. Even after that is set as non-nil, there is still a window where we don't have a valid liveness heartbeat, so both checks are necessary.
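A rough sketch of the ordering described above, with stand-in types (the real `storeDetail` and liveness plumbing in store_pool.go carry much more state):

```go
package main

import (
	"fmt"
	"time"
)

// Stand-in shapes; the real storepool storeDetail carries much more state.
type storeDescriptor struct{ NodeID int }

type storeDetail struct {
	Desc            *storeDescriptor // nil until the store has gossiped a descriptor
	LastUpdatedTime time.Time        // last time the descriptor was refreshed via gossip
}

type storeStatus string

const (
	statusUnknown storeStatus = "unknown"
	statusDead    storeStatus = "dead"
	statusOK      storeStatus = "ok"
)

// storeStatusFor sketches why both checks are needed: before any gossip
// arrives, Desc is nil and we cannot even tell which node to run the
// liveness check against; once a descriptor exists, stale gossip is still
// treated as dead independently of node liveness, because there is a window
// where no valid liveness heartbeat is available either.
func storeStatusFor(
	sd storeDetail, now time.Time, deadThreshold time.Duration, livenessDead bool,
) storeStatus {
	if sd.Desc == nil {
		return statusUnknown
	}
	if now.Sub(sd.LastUpdatedTime) > deadThreshold {
		return statusDead
	}
	if livenessDead {
		return statusDead
	}
	return statusOK
}

func main() {
	now := time.Now()
	fresh := storeDetail{Desc: &storeDescriptor{NodeID: 1}, LastUpdatedTime: now}
	stale := storeDetail{Desc: &storeDescriptor{NodeID: 1}, LastUpdatedTime: now.Add(-10 * time.Minute)}

	fmt.Println(storeStatusFor(storeDetail{}, now, 5*time.Minute, false)) // unknown: no descriptor yet
	fmt.Println(storeStatusFor(stale, now, 5*time.Minute, false))         // dead: gossip went stale
	fmt.Println(storeStatusFor(fresh, now, 5*time.Minute, false))         // ok
}
```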
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 293 at r2 (raw file):
Previously, kvoli (Austen) wrote…
nit : Check
Done
pkg/kv/kvserver/client_replica_test.go
line 1559 at r2 (raw file):
Previously, kvoli (Austen) wrote…
Why was this necessary?
This test used to move the clock forward by almost 6 seconds. This was somewhat racy before, and resulted in a store being marked as suspect and then this test failing. Changing it to advance by only about 3 seconds forces a lease extension without making the store unhealthy. This test only passed before because it wasn't always clearing the suspect state.
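A toy illustration of the arithmetic in the quoted line; the `manualClock` and the 6-second lease here are simplified stand-ins for the actual test harness:

```go
package main

import "fmt"

// Toy manual clock keeping nanoseconds, standing in for the test's manual clock.
type manualClock struct{ nowNanos int64 }

func (c *manualClock) UnixNano() int64   { return c.nowNanos }
func (c *manualClock) Increment(d int64) { c.nowNanos += d }

func main() {
	clock := &manualClock{nowNanos: 0}
	leaseExpirationWallTime := int64(6e9) // lease expires 6s from "now", in nanoseconds

	// Advance by half of the time remaining until lease expiration (~3s here):
	// enough to force a lease extension, but well short of the thresholds that
	// would make the store look unhealthy or suspect.
	clock.Increment((leaseExpirationWallTime - clock.UnixNano()) / 2)

	fmt.Printf("clock advanced to %.1fs\n", float64(clock.UnixNano())/1e9) // 3.0s
}
```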
Force-pushed from 91a6b25 to 9ae2ae9 (compare)
ack
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist)
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 165 at r2 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Cleaned up the comment to make this more clear.
Thanks that is clearer.
pkg/kv/kvserver/client_replica_test.go
line 1559 at r2 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This test used to move the clock forward by almost 6 seconds. This was somewhat racy before, and resulted in a store being marked as suspect and then this test failing. Changing to only updating be 3 seconds forces a lease extension without making the store unhealthy. This test only passed before because it wasn't always clearing the suspect state.
Makes sense. Thanks for clarifying.
Previously, nodes that failed to heartbeat for longer than the lease expiration interval (currently 6 seconds) would only be considered suspect when re-joining the cluster within 5 minutes. After 5 minutes the store would be marked as dead, and if a store rejoins after it is dead it could immediately begin accepting leases. This caused problems since there is often a lot of recovery work to do when a node rejoins. This change always marks a node that rejoins a cluster as suspect.

Release note (ops change): Nodes will now be considered suspect when re-joining a cluster and will not accept lease transfers for one `server.time_after_store_suspect` window, which defaults to 30s.

Epic: none
Force-pushed from 9ae2ae9 to a653180 (compare)
bors r=kvoli
Build failed (retrying...)
This PR was included in a batch that was canceled, it will be automatically retried
Build failed (retrying...)
Build failed (retrying...)
Build succeeded
In cockroachdb#97532 we removed a check that would consider stores as dead if they hadn't gossiped their store descriptor within the last `server.time_until_store_dead` period. This patch adds back the dead check that was removed, so that stores which have not updated their store descriptor in gossip in the last `server.time_until_store_dead` are considered dead. This check is mostly redundant with node liveness, which mirrors this behavior, but it is necessary in some tests.

Resolves: cockroachdb#97794

Release note: None
97889: server: cleanups around a tenant's metrics registry r=andreimatei a=andreimatei

Please see individual commits.

Release note: None
Epic: None

98101: storepool: consider nodes dead after no gossip r=andrewbaptist a=kvoli

In #97532 we removed a check that would consider stores as dead if they hadn't gossiped their store descriptor within the last `server.time_until_store_dead` period. This patch adds back the dead check that was removed, so that stores which have not updated their store descriptor in gossip in the last `server.time_until_store_dead` are considered dead. This check is mostly redundant with node liveness, which mirrors this behavior, but it is necessary in some tests.

Resolves: #97794

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Previously, stores would not be considered suspect when joining the cluster. This patch updates the logic so that stores are now considered suspect on joining the cluster (rejoining or startup).
Release note (ops change): Stores will now be considered suspect when joining the cluster.
Epic: none