
kvserver: avoid redundant liveness heartbeats under a thundering herd #51888

Merged

Conversation

nvanbenschoten
Member

When a node's liveness expires, either because its liveness record's
epoch is incremented or it is just slow to heartbeat its record, all of
its epoch-based leases immediately become invalid. As a result, we often
see a thundering herd of requests attempt to synchronously heartbeat the
node's liveness record, on the order of the number of ranges that lost
their lease.

We already limit the concurrency of these heartbeats to 1, so there is
not a huge concern that this will lead to overwhelming the liveness
range, but it does cause other issues. For one, it means that we end up
heartbeating the liveness record very rapidly, which causes large growth
in MVCC history. It also means that heartbeats at the end of the queue
have to wait for all other heartbeats in front of them to complete. Even
if these heartbeats only take 5ms each, if there are 100 of them
waiting, then the last one in line will wait for 500ms and its range
will be unavailable during this time. This also has the potential to
starve the liveness heartbeat loop, which isn't a problem in and of
itself as long as other synchronous heartbeats are succeeding, but
leads to concerning log warnings. Finally, this was an instance where
we were adding additional load to a cluster once it was close to being
overloaded. That's generally a bad property for a system that wants
to stay stable, and this change helps avoid it.

The solution here is to detect redundant heartbeats and make them no-ops
where possible. This has a similar effect to explicitly coalescing
heartbeats, but it's easier to reason about and requires us to
maintain less state.

The commit is conservative about this, providing a fairly strong
guarantee that a heartbeat attempt, if successful, will ensure that the
liveness record's expiration will be at least the liveness threshold
above the time that the method was called. We may be able to relax this
and say that the heartbeat attempt will just ensure that the expiration
is now above that of the oldLiveness provided, but this weakened
guarantee seems harder to reason about as a consumer of this interface.
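
To make the shape of this concrete, here is a minimal Go sketch of the redundancy check, under assumed names: `liveness`, `nodeLiveness`, and `writeLivenessRecord` are hypothetical stand-ins, not the actual CockroachDB types or methods.

```go
package livenesssketch

import (
	"context"
	"sync"
	"time"
)

// liveness is a simplified stand-in for a node liveness record.
type liveness struct {
	Epoch      int64
	Expiration time.Time
}

// nodeLiveness is a simplified stand-in for the node liveness state.
type nodeLiveness struct {
	mu        sync.Mutex
	self      liveness
	hbSem     chan struct{} // heartbeat concurrency is limited to 1
	threshold time.Duration // how far ahead the expiration must be on success
}

// heartbeat attempts to extend the local liveness record. The guarantee it
// aims for: if it returns nil, the record's expiration is at least
// `threshold` above the time the method was called. If an earlier queued
// heartbeat already provided that (same epoch, far-enough expiration), this
// attempt is redundant and becomes a no-op instead of another write.
func (nl *nodeLiveness) heartbeat(ctx context.Context, old liveness) error {
	beforeQueueTS := time.Now()
	minExpiration := beforeQueueTS.Add(nl.threshold)

	// Wait in line behind any in-flight heartbeat.
	select {
	case nl.hbSem <- struct{}{}:
		defer func() { <-nl.hbSem }()
	case <-ctx.Done():
		return ctx.Err()
	}

	// While we were queued, a previous heartbeat may already have extended
	// the record past what we need. If so, skip the write entirely.
	nl.mu.Lock()
	cur := nl.self
	nl.mu.Unlock()
	if cur.Epoch == old.Epoch && !cur.Expiration.Before(minExpiration) {
		return nil // redundant heartbeat; no-op
	}

	return nl.writeLivenessRecord(ctx, old, minExpiration)
}

// writeLivenessRecord stands in for the conditional update of the record
// stored on the liveness range.
func (nl *nodeLiveness) writeLivenessRecord(_ context.Context, old liveness, exp time.Time) error {
	nl.mu.Lock()
	defer nl.mu.Unlock()
	nl.self = liveness{Epoch: old.Epoch, Expiration: exp}
	return nil
}
```

In the actual implementation the clock, semaphore, and record update go through CockroachDB's own abstractions (and, as the review thread below notes, `errNodeAlreadyLive` handling complicates the guarantee); the sketch only illustrates where the redundancy check sits relative to the queue.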

Release note (performance improvement): ranges recover moderately faster
when their leaseholder is briefly down before becoming live again.

@cockroach-teamcity
Member

This change is Reviewable

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. By
doing so through a new `lagging_quiescence` set attached to quiescence
requests, it allows all quiesced replicas to agree on which node liveness
events would need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
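
As a rough illustration of the quiescence change described in the commit message above, the following Go sketch shows a quiesced replica consulting a lagging set before deciding whether a node-liveness change should unquiesce it. The names (`laggingSet`, `shouldUnquiesceOnLivenessChange`) are hypothetical and do not correspond to the real CockroachDB code.

```go
package quiescencesketch

// laggingSet is a hypothetical stand-in for the `lagging_quiescence` set
// carried on quiescence requests: the IDs of the nodes whose replicas were
// behind on the Raft log when the range quiesced.
type laggingSet map[int64]struct{}

// shouldUnquiesceOnLivenessChange reports whether a quiesced replica needs to
// unquiesce when nodeID transitions to live. Only nodes that were lagging at
// quiescence time matter; liveness blips of up-to-date nodes are ignored,
// which is what keeps a transient blip from unquiescing every range on the
// node.
func shouldUnquiesceOnLivenessChange(lagging laggingSet, nodeID int64, nowLive bool) bool {
	if !nowLive {
		return false
	}
	_, wasLagging := lagging[nodeID]
	return wasLagging
}
```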
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/redundantLiveness branch from 0fd1044 to c128770 on July 24, 2020 22:42
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jul 27, 2020
This test is currently flaking on master. Let's skip it until
@nvanbenschoten's PRs (cockroachdb#51894 and cockroachdb#51888) resulting from the fallout
from the Go1.14 upgrade are in.

Release note: None
Contributor

@bdarnell bdarnell left a comment

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)

Contributor

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/node_liveness.go, line 702 at r1 (raw file):

	// [*]: see TODO below about how errNodeAlreadyLive handling does not
	//      enforce this guarantee.
	beforeQueue := nl.clock.Now()

super nit: for consistency with timestamp variables in other places, how do you feel about beforeQueueTs and afterQueueTs below?

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/redundantLiveness branch from c128770 to fade883 on August 7, 2020 19:19
Member Author

@nvanbenschoten nvanbenschoten left a comment

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @bdarnell)


pkg/kv/kvserver/node_liveness.go, line 702 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

super nit: for consistency with timestamp variables in other places, how do you feel about beforeQueueTs and afterQueueTs below?

Done.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. By
doing so through a new `lagging_quiescence` set attached to quiescence
requests, it allows all quiesced replicas to agree on which node liveness
events would need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it
reduces the set of ranges with a replica on a given dead node that
unquiesce when that node becomes live to only those ranges that had
write activity after the node went down. This is typically only a small
fraction of the total number of ranges on each node (operating on the
usual assumption that in an average large cluster, most data is
write-cold), especially when the outage is short. However, if all ranges
had write activity and subsequently quiesced before the outage resolved,
we would still have to unquiesce all ranges upon the node coming back
up. For this reason, the change is primarily targeted towards reducing
the impact of node liveness blips and not reducing the impact of
bringing nodes back up after sustained node outages. This is
intentional, as the former category of outage is the one more likely to
be caused by overload due to unquiescence traffic itself (as we see
in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
@irfansharif
Contributor

CI failed here (looks like lint failures), which will cause bors to silently crap out (which sucks, and we should probably change bors code to not do that).

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/redundantLiveness branch from fade883 to 1dc18df on August 10, 2020 03:33
@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Aug 11, 2020

Build succeeded:

@craig craig bot merged commit a458630 into cockroachdb:master Aug 11, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/redundantLiveness branch August 14, 2020 22:05
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 18, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. By
doing so through a new `lagging_followers_on_quiesce` set attached to
quiescence requests, it allows all quiesced replicas to agree on which
node liveness events would need to result in the range being unquiesced
and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it
reduces the set of ranges with a replica on a given dead node that
unquiesce when that node becomes live to only those ranges that had
write activity after the node went down. This is typically only a small
fraction of the total number of ranges on each node (operating on the
usual assumption that in an average large cluster, most data is
write-cold), especially when the outage is short. However, if all ranges
had write activity and subsequently quiesced before the outage resolved,
we would still have to unquiesce all ranges upon the node coming back
up. For this reason, the change is primarily targeted towards reducing
the impact of node liveness blips and not reducing the impact of
bringing nodes back up after sustained node outages. This is
intentional, as the former category of outage is the one more likely to
be caused by overload due to unquiescence traffic itself (as we see
in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 20, 2020
craig bot pushed a commit that referenced this pull request Aug 20, 2020
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865.
Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back online, we began aggressively unquiescing when nodes moved from not-live to live.

This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. By doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <[email protected]>