Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storepool: consider stores suspect on join #97532

Merged
merged 1 commit into from
Feb 27, 2023

Conversation

andrewbaptist
Copy link
Collaborator

Previously, stores would not be considered suspect when joining the cluster. This patch updates logic so that stores are now considered suspect on joining the cluster (rejoining or startup).

Release note (ops change): Stores will now be considered suspect when joining the cluster.

Epic: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 230222.start_suspect branch 6 times, most recently from ccf4194 to 4caed55 Compare February 23, 2023 21:58
@andrewbaptist andrewbaptist marked this pull request as ready for review February 23, 2023 22:50
@andrewbaptist andrewbaptist requested a review from a team as a code owner February 23, 2023 22:50
@andrewbaptist andrewbaptist requested a review from kvoli February 23, 2023 22:50
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message mentions both new and rejoined nodes are treated as suspect. Is that correct still?

Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the commit note.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

A few questions and minor nits. Great stuff.

We should consider adding a unit test which asserts (unless something like this is already around):

  • stores on new nodes go from unknown -> available
  • stores on rejoining nodes go from unavailable (or dead) -> suspect

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)


-- commits line 4 at r2:
nit: than the


pkg/kv/kvserver/allocator/storepool/store_pool.go line 165 at r2 (raw file):

	l livenesspb.Liveness, now time.Time, deadThreshold time.Duration,
) livenesspb.NodeLivenessStatus {
	// If we don't have a liveness expiration time, treat the status as unknown.

Does this prevent new nodes being treated as suspect? The comment kind of infers this but it would be good to be very explicit so we don't break this behavior by accident.


pkg/kv/kvserver/allocator/storepool/store_pool.go line 268 at r2 (raw file):

	//                                                heartbeat
	//
	// The store is considered dead if it hasn't been updated via gossip

Was this code redundant with the liveness dead check below?


pkg/kv/kvserver/allocator/storepool/store_pool.go line 293 at r2 (raw file):

	}

	// check whether the store is currently suspect. We measure that by

nit : Check


pkg/kv/kvserver/client_replica_test.go line 1559 at r2 (raw file):

	// Set the clock far enough forward to cause a lease extension, but not far
	// enough to make the node appear unhealthy.
	l.manualClock.Increment((lease.Expiration.WallTime - l.manualClock.UnixNano()) / 2)

Why was this necessary?

Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add tests on this. There is some partial coverage, but not very complete.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)


-- commits line 4 at r2:

Previously, kvoli (Austen) wrote…

nit: than the

Done


pkg/kv/kvserver/allocator/storepool/store_pool.go line 165 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Does this prevent new nodes being treated as suspect? The comment kind of infers this but it would be good to be very explicit so we don't break this behavior by accident.

Cleaned up the comment to make this more clear.


pkg/kv/kvserver/allocator/storepool/store_pool.go line 268 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Was this code redundant with the liveness dead check below?

Unfortunately no. This is a little messy, but there is a check for sd.Desc.Node.NodeID so we first have to check if sd.Desc is null. Even after that is set as non-null, there is a still a window where we don't have a valid liveness heartbeat so both checks are necessary.


pkg/kv/kvserver/allocator/storepool/store_pool.go line 293 at r2 (raw file):

Previously, kvoli (Austen) wrote…

nit : Check

Done


pkg/kv/kvserver/client_replica_test.go line 1559 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Why was this necessary?

This test used to move the clock forward by almost 6 seconds. This was somewhat racy before, and resulted in a store being marked as suspect and then this test failing. Changing to only updating be 3 seconds forces a lease extension without making the store unhealthy. This test only passed before because it wasn't always clearing the suspect state.

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist)


pkg/kv/kvserver/allocator/storepool/store_pool.go line 165 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Cleaned up the comment to make this more clear.

Thanks that is clearer.


pkg/kv/kvserver/client_replica_test.go line 1559 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This test used to move the clock forward by almost 6 seconds. This was somewhat racy before, and resulted in a store being marked as suspect and then this test failing. Changing to only updating be 3 seconds forces a lease extension without making the store unhealthy. This test only passed before because it wasn't always clearing the suspect state.

Makes sense. Thanks for clarifying.

Previously, nodes that failed to heartbeat for longer than the
lease expiration interval (currently 6 seconds) would only be considered
suspect when re-joining the cluster within 5 minutes. After 5 minutes
the store would be marked as dead, and if a store rejoins after it is
dead it could immediately begin accepting leases. This caused problems
since there is often a lot of recovery work to do when a node rejoins.

This change always marks a node that rejoins a cluster as suspect.

Release note (ops change): Nodes will now be considered suspect when
re-joining a cluster and not accept lease transfers for one
server.time_after_store_suspect window, which defaults to 30s.

Epic: none
@andrewbaptist
Copy link
Collaborator Author

bors r=kvoli

@craig
Copy link
Contributor

craig bot commented Feb 27, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 27, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented Feb 27, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 27, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 27, 2023

Build succeeded:

@craig craig bot merged commit 46120ff into cockroachdb:master Feb 27, 2023
kvoli added a commit to kvoli/cockroach that referenced this pull request Mar 6, 2023
In cockroachdb#97532 we removed a check that would consider stores as dead if they
hadn't gossiped their store descriptor within the last
`server.time_until_store_dead` period.

This patch updates adds back in the dead check that was removed, so
that stores which have not updated their store descriptor in gossip, in
the last `server.time_until_store_dead` are considered dead.

This is a mostly redundant check with node liveness mirroring this
behavior, however necessary in some tests.

Resolves: cockroachdb#97794

Release note: None
craig bot pushed a commit that referenced this pull request Mar 6, 2023
97889: server: cleanups around a tenant's metrics registry r=andreimatei a=andreimatei

Please see individual commits.

Release note: None
Epic: None

98101: storepool: consider nodes dead after no gossip r=andrewbaptist a=kvoli

In #97532 we removed a check that would consider stores as dead if they
hadn't gossiped their store descriptor within the last
`server.time_until_store_dead` period.

This patch updates adds back in the dead check that was removed, so
that stores which have not updated their store descriptor in gossip, in
the last `server.time_until_store_dead` are considered dead.

This is a mostly redundant check with node liveness mirroring this
behavior, however necessary in some tests.

Resolves: #97794

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
kvoli added a commit to kvoli/cockroach that referenced this pull request Aug 29, 2023
In cockroachdb#97532 we removed a check that would consider stores as dead if they
hadn't gossiped their store descriptor within the last
`server.time_until_store_dead` period.

This patch updates adds back in the dead check that was removed, so
that stores which have not updated their store descriptor in gossip, in
the last `server.time_until_store_dead` are considered dead.

This is a mostly redundant check with node liveness mirroring this
behavior, however necessary in some tests.

Resolves: cockroachdb#97794

Release note: None
@andrewbaptist andrewbaptist deleted the 230222.start_suspect branch September 6, 2023 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants