
kvserver: avoid redundant liveness heartbeats under a thundering herd #51888

Merged

Conversation

nvanbenschoten
Member

When a node's liveness expires, either because its liveness record's
epoch is incremented or it is just slow to heartbeat its record, all of
its epoch-based leases immediately become invalid. As a result, we often
see a thundering herd of requests attempt to synchronously heartbeat the
node's liveness record, on the order of the number of ranges that lost
their lease.

We already limit the concurrency of these heartbeats to 1, so there is
not a huge concern that this will lead to overwhelming the liveness
range, but it does cause other issues. For one, it means that we end up
heartbeating the liveness record very rapidly, which causes large growth
in MVCC history. It also means that heartbeats at the end of the queue
have to wait for all other heartbeats in front of them to complete. Even
if these heartbeats only take 5ms each, if there are 100 of them
waiting, then the last one in line will wait for 500ms and its range
will be unavailable during this time. This also has the potential to
starve the liveness heartbeat loop, which isn't a problem in and of
itself as long as other synchronous heartbeats are succeeding, but
leads to concerning log warnings. Finally, this was an instance where
we were adding additional load to a cluster once it was close to being
overloaded. That's generally a bad property for a system that wants
to stay stable, and this change helps avoid it.

The solution here is to detect redundant heartbeats and make them no-ops
where possible. This has a similar effect to explicitly coalescing
heartbeats, but it's easier to reason about and requires us to
maintain less state.

The commit is conservative about this, providing a fairly strong
guarantee that a heartbeat attempt, if successful, will ensure that the
liveness record's expiration will be at least the liveness threshold
above the time that the method was called. We may be able to relax this
and say that the heartbeat attempt will just ensure that the expiration
is now above that of the oldLiveness provided, but this weakened
guarantee seems harder to reason about as a consumer of this interface.
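
To make the shape of this concrete, here is a minimal Go sketch of the redundancy check, under assumed names: `liveness`, `nodeLiveness`, and `writeLivenessRecord` are hypothetical stand-ins, not the actual CockroachDB types or methods.

```go
package livenesssketch

import (
	"context"
	"sync"
	"time"
)

// liveness is a simplified stand-in for a node liveness record.
type liveness struct {
	Epoch      int64
	Expiration time.Time
}

// nodeLiveness is a simplified stand-in for the node liveness state.
type nodeLiveness struct {
	mu        sync.Mutex
	self      liveness
	hbSem     chan struct{} // heartbeat concurrency is limited to 1
	threshold time.Duration // how far ahead the expiration must be on success
}

// heartbeat attempts to extend the local liveness record. The guarantee it
// aims for: if it returns nil, the record's expiration is at least
// `threshold` above the time the method was called. If an earlier queued
// heartbeat already provided that (same epoch, far-enough expiration), this
// attempt is redundant and becomes a no-op instead of another write.
func (nl *nodeLiveness) heartbeat(ctx context.Context, old liveness) error {
	beforeQueueTS := time.Now()
	minExpiration := beforeQueueTS.Add(nl.threshold)

	// Wait in line behind any in-flight heartbeat.
	select {
	case nl.hbSem <- struct{}{}:
		defer func() { <-nl.hbSem }()
	case <-ctx.Done():
		return ctx.Err()
	}

	// While we were queued, a previous heartbeat may already have extended
	// the record past what we need. If so, skip the write entirely.
	nl.mu.Lock()
	cur := nl.self
	nl.mu.Unlock()
	if cur.Epoch == old.Epoch && !cur.Expiration.Before(minExpiration) {
		return nil // redundant heartbeat; no-op
	}

	return nl.writeLivenessRecord(ctx, old, minExpiration)
}

// writeLivenessRecord stands in for the conditional update of the record
// stored on the liveness range.
func (nl *nodeLiveness) writeLivenessRecord(_ context.Context, old liveness, exp time.Time) error {
	nl.mu.Lock()
	defer nl.mu.Unlock()
	nl.self = liveness{Epoch: old.Epoch, Expiration: exp}
	return nil
}
```

In the actual implementation the clock, semaphore, and record update go through CockroachDB's own abstractions (and, as the review thread below notes, `errNodeAlreadyLive` handling complicates the guarantee); the sketch only illustrates where the redundancy check sits relative to the queue.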

Release note (performance improvement): ranges recover moderately faster
when their leaseholder is briefly down before becoming live again.

@cockroach-teamcity
Member

This change is Reviewable

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. By
doing so through a new `lagging_quiescence` set attached to quiescence
requests, it allows all quiesced replicas to agree on which node liveness
events would need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
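
As a rough illustration of the quiescence change described in the commit message above, the following Go sketch shows a quiesced replica consulting a lagging set before deciding whether a node-liveness change should unquiesce it. The names (`laggingSet`, `shouldUnquiesceOnLivenessChange`) are hypothetical and do not correspond to the real CockroachDB code.

```go
package quiescencesketch

// laggingSet is a hypothetical stand-in for the `lagging_quiescence` set
// carried on quiescence requests: the IDs of the nodes whose replicas were
// behind on the Raft log when the range quiesced.
type laggingSet map[int64]struct{}

// shouldUnquiesceOnLivenessChange reports whether a quiesced replica needs to
// unquiesce when nodeID transitions to live. Only nodes that were lagging at
// quiescence time matter; liveness blips of up-to-date nodes are ignored,
// which is what keeps a transient blip from unquiescing every range on the
// node.
func shouldUnquiesceOnLivenessChange(lagging laggingSet, nodeID int64, nowLive bool) bool {
	if !nowLive {
		return false
	}
	_, wasLagging := lagging[nodeID]
	return wasLagging
}
```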
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/redundantLiveness branch from 0fd1044 to c128770 on July 24, 2020 22:42
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jul 27, 2020
This test is currently flaking on master. Let's skip it until
@nvanbenschoten's PRs (cockroachdb#51894 and cockroachdb#51888) resulting from the fallout
from the Go1.14 upgrade are in.

Release note: None
Contributor

@bdarnell bdarnell left a comment

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)

Contributor

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/node_liveness.go, line 702 at r1 (raw file):

	// [*]: see TODO below about how errNodeAlreadyLive handling does not
	//      enforce this guarantee.
	beforeQueue := nl.clock.Now()

super nit: for consistency with timestamp variables in other places, how do you feel about beforeQueueTs and afterQueueTs below?

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/redundantLiveness branch from c128770 to fade883 on August 7, 2020 19:19
Member Author

@nvanbenschoten nvanbenschoten left a comment

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @bdarnell)


pkg/kv/kvserver/node_liveness.go, line 702 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

super nit: for consistency with timestamp variables in other places, how do you feel about beforeQueueTs and afterQueueTs below?

Done.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. By
doing so through a new `lagging_quiescence` set attached to quiescence
requests, it allows all quiesced replicas to agree on which node liveness
events would need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it
reduces the set of ranges with a replica on a given dead node that
unquiesce when that node becomes live to only those ranges that had
write activity after the node went down. This is typically only a small
fraction of the total number of ranges on each node (operating on the
usual assumption that in an average large cluster, most data is
write-cold), especially when the outage is short. However, if all ranges
had write activity and subsequently quiesced before the outage resolved,
we would still have to unquiesce all ranges upon the node coming back
up. For this reason, the change is primarily targeted towards reducing
the impact of node liveness blips and not reducing the impact of
bringing nodes back up after sustained node outages. This is
intentional, as the former category of outage is the one more likely to
be caused by overload due to unquiescence traffic itself (as we see
in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
@irfansharif
Contributor

CI failed here (looks like lint failures), which will cause bors to silently crap out (which sucks, and we should probably change bors code to not do that).

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/redundantLiveness branch from fade883 to 1dc18df on August 10, 2020 03:33
@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Aug 11, 2020

Build succeeded:

@craig craig bot merged commit a458630 into cockroachdb:master Aug 11, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/redundantLiveness branch August 14, 2020 22:05
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 18, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node was brought back
online, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. By
doing so through a new `lagging_followers_on_quiesce` set attached to
quiescence requests, it allows all quiesced replicas to agree on which
node liveness events would need to result in the range being unquiesced
and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it
reduces the set of ranges with a replica on a given dead node that
unquiesce when that node becomes live to only those ranges that had
write activity after the node went down. This is typically only a small
fraction of the total number of ranges on each node (operating on the
usual assumption that in an average large cluster, most data is
write-cold), especially when the outage is short. However, if all ranges
had write activity and subsequently quiesced before the outage resolved,
we would still have to unquiesce all ranges upon the node coming back
up. For this reason, the change is primarily targeted towards reducing
the impact of node liveness blips and not reducing the impact of
bringing nodes back up after sustained node outages. This is
intentional, as the former category of outage is the one more likely to
be caused by overload due to unquiescence traffic itself (as we see
in cockroachdb#50865), so it is the category we choose to focus on here.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 20, 2020
craig bot pushed a commit that referenced this pull request Aug 20, 2020
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865.
Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back online, we began aggressively unquiescing when nodes moved from not-live to live.

This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. By doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <[email protected]>