
kvserver: count draining nodes as live when computing quorum #68652

Merged: 3 commits into cockroachdb:master from 20210805_fixStuckDraining on Sep 2, 2021

Conversation

@aayushshah15 (Contributor) commented Aug 10, 2021

Similar to #67714

Draining nodes were considered non-live by the allocator when it determined
whether a range could achieve quorum. This meant that, for instance, on a
cluster with a replication factor of 5, if 3 or more nodes were marked
draining, we (with high likelihood) wouldn't be able to decommission any other
nodes from the cluster.

For the same reason, the system would also incorrectly decide not to rebalance
ranges that had more than a quorum of replicas on draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
subset of "stuck decommissioning" issues we've seen in the wild.

Follows from cockroachlabs/support#1105

Release justification: bug fix

Release note (bug fix): Previously, draining a quorum of nodes (i.e. >= 2 nodes
with a replication factor of 3, >= 3 nodes with a replication factor of 5, and
so on) would block the subsequent decommissioning of any other nodes in the
cluster. This patch fixes this bug. Now, if the lowest replication factor of
some zone in the cluster is RF, operators should be able to safely drain up to
RF-1 nodes simultaneously.

@cockroach-teamcity (Member)

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20210805_fixStuckDraining branch 6 times, most recently from dfa244d to 5665822 on August 13, 2021 00:27
@aayushshah15 aayushshah15 marked this pull request as ready for review August 13, 2021 00:27
@aayushshah15 aayushshah15 requested review from a team as code owners August 13, 2021 00:27
@aayushshah15 aayushshah15 requested review from erikgrinaker and nvanbenschoten and removed request for a team and erikgrinaker August 13, 2021 00:27
@erikgrinaker (Contributor) left a comment

:lgtm: modulo test failures

Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@nvanbenschoten (Member) left a comment

This mostly :lgtm:, though I'm a little unclear on why we want draining nodes to be considered live for the purposes of computing quorum. Is my understanding correct that once a node has finished draining, it will terminate and become non-live? If so, then why are we finding ourselves in states where we have 3 nodes that are draining and also have a node performing a decommission and we need these draining nodes to contribute to quorum? If the nodes completed draining and terminated then they certainly wouldn't contribute to quorum, so aren't the expectations here racy? Put differently, if a draining node can at any moment finish draining and terminate, then isn't a draining node as good as dead?

Reviewed 5 of 5 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/store_pool.go, line 706 at r1 (raw file):

		// No-op.
		case storeStatusSuspect, storeStatusDraining:
			if includeSuspectAndDrainingStores {

Did any unit tests pick up this change? If not, are we missing some test coverage?


pkg/kv/kvserver/store_rebalancer.go, line 429 at r2 (raw file):

		preferred := sr.rq.allocator.preferredLeaseholders(zone, candidates)

		// Filter both the list of preferred stores as well as the list of all

Do we have a way to unit test this change in behavior? Ideally in something targeted like TestChooseLeaseToTransfer.


pkg/kv/kvserver/store_rebalancer.go, line 434 at r2 (raw file):

		const includeSuspectAndDrainingStores = false
		preferred, _ = sr.rq.allocator.storePool.liveAndDeadReplicas(preferred, includeSuspectAndDrainingStores)
		candidates, _ = sr.rq.allocator.storePool.liveAndDeadReplicas(candidates, includeSuspectAndDrainingStores)

Were we previously able to transfer the lease to dead nodes as well? I don't see what was preventing that.

@aayushshah15 (Contributor, Author)

Is my understanding correct that once a node has finished draining, it will terminate and become non-live?

The nodes won't terminate once the draining process completes; they stay in this state of not accepting new leases, replicas, or SQL connections until they're either restarted or terminated explicitly.

One of our customers had a workflow where, if they had to remove, say, 10 nodes from the cluster, they would first mark all 10 of these nodes as draining and then gradually decommission them (this was to avoid moving replicas onto nodes that were soon to be decommissioned anyway). One of the main reasons this decommissioning would get stuck is that the allocator would incorrectly think that a bunch of ranges were unavailable (and thus would not take any action on them).

You're right about the lacking unit test coverage, I'll update the PR with some.

@nvanbenschoten (Member)

Got it, thanks for the explanation. In that case, I agree with the changes that this PR is making.

@aayushshah15 aayushshah15 force-pushed the 20210805_fixStuckDraining branch 3 times, most recently from 1de5f62 to 58494cf on August 13, 2021 23:53
@knz (Contributor) commented Aug 16, 2021

I really like this PR a lot.
I also think it needs a release note, because it answers a question that @taroface had a few months ago: "what would be a good use case for a separate drain command?"

The explanation you provide above is such a good use case, and it deserves to be in the release note so the docs team can pick it up once this change is in.

@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @erikgrinaker, and @nvanbenschoten)


pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):

Otherwise, they are excluded from the returned slices.

Is it true that replicas can be excluded from both results? So, if this flag is not passed, the caller needs to consider three categories of replicas - live, dead, neither, with the latter not being explicitly represented? This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?


pkg/kv/kvserver/store_pool.go, line 239 at r4 (raw file):

	storeStatusAvailable
	// The store is decommissioning.
	storeStatusDecommissioning

maybe take the opportunity to explain the priority between this and dead/suspect/draining.


pkg/kv/kvserver/store_pool.go, line 240 at r4 (raw file):

	// The store is decommissioning.
	storeStatusDecommissioning
	// The store failed its liveness heartbeat recently and is considered

maybe take the opportunity to explain that stores move from dead to suspicious.


pkg/kv/kvserver/store_pool.go, line 833 at r4 (raw file):

// the required attributes and their associated stats. The storeList is filtered
// according to the provided storeFilter. It also returns the total number of
// alive and throttled stores.

The comment on returning the number of throttled stores seems obsolete. Perhaps take the opportunity to explain the last retval; it seems like a weird return value because it doesn't include store ids.

@aayushshah15 aayushshah15 force-pushed the 20210805_fixStuckDraining branch from e97935e to c63875d on August 21, 2021 00:42
@aayushshah15 (Contributor, Author) left a comment

The explanation you provide above is such a good use case

Even with this patch, we cannot actually support draining >= RF nodes at the same time, where RF is the lowest replication factor of any zone in the cluster. This is because decommissioning any of these draining nodes requires lease transfers away from the decommissioning node (since the leaseholders on the decommissioning node cannot remove themselves). These lease transfers need to succeed, so every range needs at least one replica on a non-draining node.

This patch does let the user simultaneously drain RF-1 nodes from their cluster though. I've amended the release note to reflect this.
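
To make the RF-1 bound concrete, here is an illustrative sketch (not code from
this PR; it reuses the hypothetical status lookup from the sketch in the PR
description):

	// hasLeaseTransferTarget: decommissioning a draining node requires that
	// each range holding a lease on it has at least one replica on a fully
	// available (non-draining, non-suspect) store to accept that lease.
	func hasLeaseTransferTarget(replicas []roachpb.ReplicaDescriptor, status func(roachpb.StoreID) storeStatus) bool {
		for _, repl := range replicas {
			if status(repl.StoreID) == storeStatusAvailable {
				return true
			}
		}
		return false
	}

With at most RF-1 nodes draining, every range keeps at least one replica on an
available store, so a transfer target always exists.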

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)


pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):

This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?

The callers use that deadReplicas slice essentially as the list of replicas that need to be replaced -- which is not what we'd like to do for replicas on draining or suspect nodes. So in this case, it feels like we do need the disambiguation.
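
For context, the caller shape being described is roughly this (a hypothetical
fragment against the liveAndDeadReplicas signature quoted in the diff above;
storePool and replicas are stand-ins):

	// Only the dead slice triggers replacement. With
	// includeSuspectAndDrainingStores=false, draining and suspect replicas
	// land in *neither* slice and are deliberately left alone.
	live, dead := storePool.liveAndDeadReplicas(replicas, false /* includeSuspectAndDrainingStores */)
	if len(dead) > 0 {
		// Plan a replace-dead allocator action.
	}
	_ = live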


pkg/kv/kvserver/store_pool.go, line 706 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did any unit tests pick up this change? If not, are we missing some test coverage?

Added a short new test for it.


pkg/kv/kvserver/store_pool.go, line 239 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe take the opportunity to explain the priority between this and dead/suspect/draining.

Done.


pkg/kv/kvserver/store_pool.go, line 240 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe take the opportunity to explain that stores move from dead to suspicious.

Done.


pkg/kv/kvserver/store_pool.go, line 833 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

The comment on returning the number of throttled stores seems obsolete. Perhaps take the opportunity to explain the last retval; it seems like a weird return value because it doesn't include store ids.

Fixed.


pkg/kv/kvserver/store_rebalancer.go, line 429 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we have a way to unit test this change in behavior? Ideally in something targetted like TestChooseLeaseToTransfer.

TestChooseLeaseToTransfer is getting overhauled in #65379. Would you mind if I handled it there?


pkg/kv/kvserver/store_rebalancer.go, line 434 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Were we previously able to transfer the lease to dead nodes as well? I don't see what was preventing that.

Yeah, nothing was preventing that.

@aayushshah15 aayushshah15 force-pushed the 20210805_fixStuckDraining branch from c63875d to a7e323f on August 22, 2021 19:59
@andreimatei (Contributor) left a comment

:lgtm:

One of our customers had a workflow where, if they had to remove, say, 10 nodes from the cluster, they would first mark all 10 of these nodes as draining and then gradually decommission them (this was to avoid moving replicas onto nodes that were soon to be decommissioned anyway). One of the main reasons this decommissioning would get stuck is that the allocator would incorrectly think that a bunch of ranges were unavailable (and thus would not take any action on them).

Please include this paragraph in the commit message somehow.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @aayushshah15, @erikgrinaker, and @nvanbenschoten)


pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

This seems like a weird contract to me. Would any caller complain if the suspect/draining replicas were included in the deadReplicas slice?

The callers use that deadReplicas slice essentially as the list of replicas that need to be replaced -- which is not what we'd like to do for replicas on draining or suspect nodes. So in this case, it feels like we do need the disambiguation.

Ack on the distinction between dead and suspect/draining.
I won't make a big deal out of it, but I'd still encourage figuring out a different format for the returned values. Perhaps get rid of includeSuspectStores and return a replicasWithStatus object with filtering methods that let the caller more explicitly choose what they're looking at.
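
One possible shape for that suggestion (hypothetical, not part of this PR):

	// replicasWithStatus classifies replicas once; callers then filter
	// explicitly instead of decoding a two-slice return value.
	type replicasWithStatus struct {
		byStatus map[storeStatus][]roachpb.ReplicaDescriptor
	}

	func (r replicasWithStatus) withStatus(statuses ...storeStatus) []roachpb.ReplicaDescriptor {
		var out []roachpb.ReplicaDescriptor
		for _, s := range statuses {
			out = append(out, r.byStatus[s]...)
		}
		return out
	}

A caller could then request, say, only dead replicas when planning
replacements, and additionally include the draining and suspect statuses when
computing quorum.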


pkg/kv/kvserver/store_pool.go, line 239 at r13 (raw file):

	storeStatusAvailable
	// The store is decommissioning. If draining or suspect stores are
	// decommissioned, this status takes precedence over `storeStatusDraining`

nit: s/decommissioned/decommissioning

@andreimatei (Contributor)

how come the 2nd commit doesn't close cockroachlabs/support#1105 ?

@aayushshah15 aayushshah15 force-pushed the 20210805_fixStuckDraining branch from a7e323f to 812c74c on August 30, 2021 15:28
@aayushshah15 (Contributor, Author) left a comment

how come the 2nd commit doesn't close cockroachlabs/support#1105 ?

I think it's possible we still have an undiagnosed issue with draining nodes that are failing their liveness heartbeats. I think #67714 plus this patch may solve it, but I'd like to test it out with that customer before we close that ticket.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)


pkg/kv/kvserver/replica_command.go, line 2047 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Ack on the distinction between dead and suspect/draining.
I won't make a big deal out of it, but I'd still encourage figuring out a different format for the returned values. Perhaps get rid of includeSuspectStores and return a replicasWithStatus object with filtering methods that let the caller more explicitly choose what they're looking at.

👍 Agreed. Added a TODO for now since there's a ton of callers for this function and I'd like to keep the patch small.


pkg/kv/kvserver/store_pool.go, line 239 at r13 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: s/decommissioned/decommissioning

done.

Similar to cockroachdb#67714

Draining nodes were considered non-live by the allocator when it determined
whether a range could achieve quorum. This meant that, for instance, on a
cluster with a replication factor of 5, if 3 or more nodes were marked
draining, we (with high likelihood) wouldn't be able to decommission nodes
from the cluster.

For the same reason, the system would also incorrectly decide not to rebalance
ranges that had more than a quorum of replicas on draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
considerable subset of "stuck decommissioning" issues we've seen in the wild.

Follows from https://github.com/cockroachlabs/support/issues/1105

Release note: None
…toreRebalancer

This commit prevents the StoreRebalancer from transferring leases to replicas
on draining or suspect nodes. In some cases, we've seen such transfers push new
leases onto nodes that take too long to drain or that get stuck while draining
due to other bugs.

Informs https://github.com/cockroachlabs/support/issues/1105

Release note: None
This commit adds a roachtest that's meant to be a regression test against the
hazard addressed by the first commit in this PR.

This roachtest is meant to ensure that nodes marked "draining" are considered
"live" by the allocator when it determines whether a range can achieve quorum.

Release note: None
@aayushshah15 aayushshah15 force-pushed the 20210805_fixStuckDraining branch from 812c74c to ff2ae38 on September 2, 2021 16:11
@aayushshah15 (Contributor, Author)

CI was failing due to missing release justification.

TFTRs!

bors r+

@craig craig bot merged commit 2a33514 into cockroachdb:master Sep 2, 2021
@craig (bot) commented Sep 2, 2021

Build succeeded.
