New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

storepool: consider stores suspect on join #97532

Merged

craig merged 1 commit into cockroachdb:master from andrewbaptist:230222.start_suspect

Feb 27, 2023

Collaborator

andrewbaptist commented Feb 22, 2023

Previously, stores would not be considered suspect when joining the cluster. This patch updates logic so that stores are now considered suspect on joining the cluster (rejoining or startup).

Release note (ops change): Stores will now be considered suspect when joining the cluster.

Epic: none

Member

cockroach-teamcity commented Feb 22, 2023

This change is

andrewbaptist force-pushed the 230222.start_suspect branch 6 times, most recently from ccf4194 to 4caed55 Compare

February 23, 2023 21:58

andrewbaptist marked this pull request as ready for review

February 23, 2023 22:50

andrewbaptist requested a review from a team as a code owner

February 23, 2023 22:50

andrewbaptist requested a review from kvoli

February 23, 2023 22:50

kvoli reviewed

View reviewed changes

Collaborator

kvoli left a comment

The commit message mentions both new and rejoined nodes are treated as suspect. Is that correct still?

andrewbaptist force-pushed the 230222.start_suspect branch from 4caed55 to 91a6b25 Compare

February 24, 2023 19:02

andrewbaptist commented

View reviewed changes

Collaborator Author

andrewbaptist left a comment

Fixed the commit note.

Reviewable status: complete! 0 of 0 LGTMs obtained

kvoli approved these changes

View reviewed changes

Collaborator

kvoli left a comment

A few questions and minor nits. Great stuff.

We should consider adding a unit test which asserts (unless something like this is already around):

stores on new nodes go from unknown -> available
stores on rejoining nodes go from unavailable (or dead) -> suspect

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)

-- commits line 4 at r2:
nit: than the

pkg/kv/kvserver/allocator/storepool/store_pool.go line 165 at r2 (raw file):

	l livenesspb.Liveness, now time.Time, deadThreshold time.Duration,
) livenesspb.NodeLivenessStatus {
	// If we don't have a liveness expiration time, treat the status as unknown.

Does this prevent new nodes being treated as suspect? The comment kind of infers this but it would be good to be very explicit so we don't break this behavior by accident.

pkg/kv/kvserver/allocator/storepool/store_pool.go line 268 at r2 (raw file):

	//                                                heartbeat
	//
	// The store is considered dead if it hasn't been updated via gossip

Was this code redundant with the liveness dead check below?

pkg/kv/kvserver/allocator/storepool/store_pool.go line 293 at r2 (raw file):

	}

	// check whether the store is currently suspect. We measure that by

nit : Check

pkg/kv/kvserver/client_replica_test.go line 1559 at r2 (raw file):

	// Set the clock far enough forward to cause a lease extension, but not far
	// enough to make the node appear unhealthy.
	l.manualClock.Increment((lease.Expiration.WallTime - l.manualClock.UnixNano()) / 2)

Why was this necessary?

andrewbaptist commented

View reviewed changes

Collaborator Author

andrewbaptist left a comment

I'll add tests on this. There is some partial coverage, but not very complete.

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)

-- commits line 4 at r2:

Previously, kvoli (Austen) wrote…

nit: than the

Done

pkg/kv/kvserver/allocator/storepool/store_pool.go line 165 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Does this prevent new nodes being treated as suspect? The comment kind of infers this but it would be good to be very explicit so we don't break this behavior by accident.

Cleaned up the comment to make this more clear.

pkg/kv/kvserver/allocator/storepool/store_pool.go line 268 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Was this code redundant with the liveness dead check below?

Unfortunately no. This is a little messy, but there is a check for sd.Desc.Node.NodeID so we first have to check if sd.Desc is null. Even after that is set as non-null, there is a still a window where we don't have a valid liveness heartbeat so both checks are necessary.

pkg/kv/kvserver/allocator/storepool/store_pool.go line 293 at r2 (raw file):

Previously, kvoli (Austen) wrote…

nit : Check

Done

pkg/kv/kvserver/client_replica_test.go line 1559 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Why was this necessary?

This test used to move the clock forward by almost 6 seconds. This was somewhat racy before, and resulted in a store being marked as suspect and then this test failing. Changing to only updating be 3 seconds forces a lease extension without making the store unhealthy. This test only passed before because it wasn't always clearing the suspect state.

andrewbaptist force-pushed the 230222.start_suspect branch from 91a6b25 to 9ae2ae9 Compare

February 27, 2023 15:38

kvoli reviewed

View reviewed changes

Collaborator

kvoli left a comment

ack

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist)

pkg/kv/kvserver/allocator/storepool/store_pool.go line 165 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Cleaned up the comment to make this more clear.

Thanks that is clearer.

pkg/kv/kvserver/client_replica_test.go line 1559 at r2 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This test used to move the clock forward by almost 6 seconds. This was somewhat racy before, and resulted in a store being marked as suspect and then this test failing. Changing to only updating be 3 seconds forces a lease extension without making the store unhealthy. This test only passed before because it wasn't always clearing the suspect state.

Makes sense. Thanks for clarifying.


          storepool: consider stores suspect on re-join

a653180

Previously, nodes that failed to heartbeat for longer than the
lease expiration interval (currently 6 seconds) would only be considered
suspect when re-joining the cluster within 5 minutes. After 5 minutes
the store would be marked as dead, and if a store rejoins after it is
dead it could immediately begin accepting leases. This caused problems
since there is often a lot of recovery work to do when a node rejoins.

This change always marks a node that rejoins a cluster as suspect.

Release note (ops change): Nodes will now be considered suspect when
re-joining a cluster and not accept lease transfers for one
server.time_after_store_suspect window, which defaults to 30s.

Epic: none

andrewbaptist force-pushed the 230222.start_suspect branch from 9ae2ae9 to a653180 Compare

February 27, 2023 17:14

Collaborator Author

andrewbaptist commented Feb 27, 2023

bors r=kvoli

Contributor

craig bot commented Feb 27, 2023

Build failed (retrying...):

Bazel Essential CI (Cockroach)

Contributor

craig bot commented Feb 27, 2023

This PR was included in a batch that was canceled, it will be automatically retried

Contributor

craig bot commented Feb 27, 2023

Build failed (retrying...):

Bazel Essential CI (Cockroach)

Contributor

craig bot commented Feb 27, 2023

Build failed (retrying...):

Bazel Essential CI (Cockroach)

Contributor

craig bot commented Feb 27, 2023

Build succeeded:

Bazel Essential CI (Cockroach)

craig bot merged commit 46120ff into cockroachdb:master

cockroach-teamcity mentioned this pull request

PR #97532 - storepool: consider stores suspect on re-join cockroachdb/docs#16375

Open

This was referenced Mar 1, 2023

kvserver: avoid lease transfers to restarted nodes that could become overloaded #96521

Closed

kvserver: add configurable wait period before transferring leases to a restarted node #96519

Closed

roachtest: replicate/wide failed #97794

Closed

kvoli mentioned this pull request

storepool: consider nodes dead after no gossip #98101

Merged

kvoli added a commit to kvoli/cockroach that referenced this pull request


          storepool: consider nodes dead after no gossip

2f8156c

In cockroachdb#97532 we removed a check that would consider stores as dead if they
hadn't gossiped their store descriptor within the last
`server.time_until_store_dead` period.

This patch updates adds back in the dead check that was removed, so
that stores which have not updated their store descriptor in gossip, in
the last `server.time_until_store_dead` are considered dead.

This is a mostly redundant check with node liveness mirroring this
behavior, however necessary in some tests.

Resolves: cockroachdb#97794

Release note: None

craig bot pushed a commit that referenced this pull request


          Merge #97889 #98101

b0f6985

97889: server: cleanups around a tenant's metrics registry r=andreimatei a=andreimatei

Please see individual commits.

Release note: None
Epic: None

98101: storepool: consider nodes dead after no gossip r=andrewbaptist a=kvoli

In #97532 we removed a check that would consider stores as dead if they
hadn't gossiped their store descriptor within the last
`server.time_until_store_dead` period.

This patch updates adds back in the dead check that was removed, so
that stores which have not updated their store descriptor in gossip, in
the last `server.time_until_store_dead` are considered dead.

This is a mostly redundant check with node liveness mirroring this
behavior, however necessary in some tests.

Resolves: #97794

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>

This was referenced Mar 7, 2023

kvserver: Always treat restarted nodes as suspect #97263

Closed

A node that is offline for a short period can cause a performance issue when it rejoins the cluster #95159

Closed

kvoli mentioned this pull request

roachtest: kv/restart/nodes=12 failed #102655

Closed

kvoli mentioned this pull request

release-22.2: storepool: consider stores suspect on join #107670

Merged

kvoli added a commit to kvoli/cockroach that referenced this pull request


          storepool: consider nodes dead after no gossip

37e69a2

In cockroachdb#97532 we removed a check that would consider stores as dead if they
hadn't gossiped their store descriptor within the last
`server.time_until_store_dead` period.

This patch updates adds back in the dead check that was removed, so
that stores which have not updated their store descriptor in gossip, in
the last `server.time_until_store_dead` are considered dead.

This is a mostly redundant check with node liveness mirroring this
behavior, however necessary in some tests.

Resolves: cockroachdb#97794

Release note: None

andrewbaptist deleted the 230222.start_suspect branch

September 6, 2023 21:12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet