
docs: new RFC on range leaseholder cache invalidation #52593

Open
wants to merge 1 commit into master from the 20200810-rfc-cache branch

Conversation


@knz knz commented Aug 10, 2020


cockroach-teamcity commented Aug 10, 2020

This change is Reviewable


knz commented Aug 10, 2020

@nvanbenschoten @andreimatei
For context, as soon as I got PR #52572 out, Nathan came back to me with "wait maybe this gossip cost is too high".

I want to have all the solutions side-by-side and ensure that everyone is clear on the pros/cons of each. I am ok with changing course but I don't want to do it multiple times.

Please review this carefully, and indicate whether you think we have enough evidence to abandon solution 3 (the current approach) and explore solutions 4/5 instead.

@knz knz force-pushed the 20200810-rfc-cache branch 3 times, most recently from 6117c43 to 161288a Compare August 10, 2020 19:38
@bdarnell bdarnell left a comment

  1. As a general principle, we need to limit how bad stale cache entries can be. We should time out and move on to the next node in no more than a few seconds.
  2. We should make use of the gossiped liveness information to either proactively invalidate the lease cache or bypass entries pointing to down nodes.
  3. It's not out of the question to gossip lease acquisitions with a short TTL as a best-effort supplement to pre-populate caches. It sounds potentially expensive so we'd need to do a lot of testing, but I think if this were available as an option I'd turn it on.
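To make the first point concrete, here is a minimal sketch of bounding each per-replica attempt with a timeout before falling back to the next replica. `trySendWithFallback` and `sendFn` are hypothetical stand-ins for the DistSender's actual send path, not real code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// trySendWithFallback bounds each attempt to a replica by a short timeout so a
// stale cache entry costs at most a few seconds before we move on to the next
// replica. sendFn stands in for the actual per-replica RPC.
func trySendWithFallback(ctx context.Context, replicas []int32, perReplicaTimeout time.Duration,
	sendFn func(context.Context, int32) error) error {
	var lastErr error
	for _, r := range replicas {
		attemptCtx, cancel := context.WithTimeout(ctx, perReplicaTimeout)
		err := sendFn(attemptCtx, r)
		cancel()
		if err == nil {
			return nil
		}
		lastErr = err // stale entry or down node: fall through to the next replica
	}
	return lastErr
}

func main() {
	slowNode := int32(1) // pretend the cache points at n1, which never answers
	err := trySendWithFallback(context.Background(), []int32{1, 2, 3}, 50*time.Millisecond,
		func(ctx context.Context, n int32) error {
			if n == slowNode {
				<-ctx.Done() // simulate an unreachable node; the attempt times out
				return ctx.Err()
			}
			fmt.Printf("request served by n%d\n", n)
			return nil
		})
	if err != nil && !errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("unexpected error:", err)
	}
}
```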



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 82 at r1 (raw file):

Worse even, if there is a network partition, the attempt to
connect to the now-unreachable node may encounter a much higher delay
(TCP timeout, 30s by default).

If something's blocking this long, it's a bug. We're supposed to be using grpc heartbeats to fail any rpcs that can't be sent in IIRC 3 seconds. (3s may still be too long, but we can tweak this timeout relatively easily instead of dealing with the black-box tcp timeout)

We used to have a parallel retry mechanism in DistSender to further avoid problems around network partitions, but it was removed in #16088 because it also caused ambiguity errors.

In the event of an unclean node shutdown, we want the largest factor in recovery time to be the time it takes for the old liveness to expire. So anything that could wait longer than the liveness expiration really needs to be fixed.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 174 at r1 (raw file):

end-users of CockroachDB.

For the CockroachDB eng teams, the "bird eye's view" of the selected

This is not the same as any of the 5 solutions detailed below, right? It should probably be written up in the same format with pros/cons.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 218 at r1 (raw file):

  range entry for that node from their lease cache.
- upon the next first request, the nodes would go through
  the discovery process to gain knowledge of the new leaseholder.

This discovery process could be skipped if the InvalidateNodeRequest included the new leaseholders (as in the "drain record" proposed for option 2). OTOH, this would mean pushing a lot of data out of the draining node in a less network-efficient way than gossip (but more memory-efficient).
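To make that concrete, here is a rough sketch of what such a request could carry; the message and field names are hypothetical, not part of the RFC:

```go
package main

import "fmt"

// LeaseUpdate is a hypothetical payload element for the InvalidateNode() RPC.
// Including the new leaseholder per range lets recipients update their range
// caches directly instead of re-running lease discovery.
type LeaseUpdate struct {
	RangeID          int64
	NewLeaseholderID int32 // replica ID of the new leaseholder
}

type InvalidateNodeRequest struct {
	DrainingNodeID int32
	// Optional: where the leases went. If empty, recipients merely evict
	// entries pointing at DrainingNodeID and rediscover lazily.
	NewLeaseholders []LeaseUpdate
}

func main() {
	req := InvalidateNodeRequest{
		DrainingNodeID: 1,
		NewLeaseholders: []LeaseUpdate{
			{RangeID: 17, NewLeaseholderID: 2},
			{RangeID: 42, NewLeaseholderID: 3},
		},
	}
	fmt.Printf("invalidate n%d, %d lease updates\n",
		req.DrainingNodeID, len(req.NewLeaseholders))
}
```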


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 220 at r1 (raw file):

  the discovery process to gain knowledge of the new leaseholder.

**Pros:**

The biggest pro of the RPC solution is that it doesn't rely on gossip, so assuming the cluster is otherwise healthy it will reliably invalidate/update all the caches as a synchronous part of the draining process.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 274 at r1 (raw file):

  We can partially alleviate this cost by populating the lease expiry
  time as the value of [gossip TTL](#gossip). This reduces
  the space complexity from O(#ranges) to O(#active ranges).

If I'm understanding you correctly, this isn't really about "active ranges" but about "ranges leased by a node that recently restarted".

Choosing a timeout here is tricky (there's no particular reason for it to be related to the lease expiry time). Too long and you increase the gossip memory requirements. Too short and it may expire before getting propagated to all nodes and you're back to dealing with stale caches. (and because it's all on one gossip key, expiration is all or nothing. It may even be possible that large gossip values are more likely to experience delays propagating through the system).


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 284 at r1 (raw file):

  process by a "sleep period" to let gossip propagate, calibrated to
  the standard cross-cluster gossip delivery latency (for reference,
  that's around 10-20s for up to 100 geo-distributed nodes in normal

It's good to have a number for this. We should have a tech note or wiki page about the experimental setup for determining it. I suspect there's low-hanging fruit to improve this, but first we need a good metric to target. (since this is slower than the lease expiration, I think it's worth putting some effort into).


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 306 at r1 (raw file):

This approach was prototyped here: https://github.com/cockroachdb/cockroach/pull/52572

**Pros:**

Gossiping lease acquisition has the nice advantage that it pre-populates the caches for rebalancing even when no nodes are restarting.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 319 at r1 (raw file):

**Cons:**

- There is one "lease record" in gossip per range.

There is some room for cleverness here - since we're gossiping per range instead of per node, we could choose not to gossip every lease acquisition (and similarly, we could use a rather short TTL and not worry if some records get dropped before propagating). Maybe have a rate limit and prioritize those ranges with higher traffic, for example.

With a short enough TTL, we can think of the gossip costs here primarily in terms of the communications cost instead of the memory size.
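A minimal sketch of that kind of best-effort, rate-limited publication; the `leaseGossiper` type and its fields are hypothetical, and the actual gossip call is only hinted at in a comment:

```go
package main

import (
	"fmt"
	"time"
)

// leaseGossiper publishes lease acquisitions best-effort: at most maxPerInterval
// records per interval, everything else is dropped on the assumption that stale
// entries still heal via NotLeaseHolder redirects.
type leaseGossiper struct {
	maxPerInterval int
	interval       time.Duration
	ttl            time.Duration

	windowStart  time.Time
	sentInWindow int
}

func (g *leaseGossiper) maybeGossip(rangeID int64, leaseholder int32, now time.Time) bool {
	if now.Sub(g.windowStart) >= g.interval {
		g.windowStart, g.sentInWindow = now, 0
	}
	if g.sentInWindow >= g.maxPerInterval {
		return false // dropped; a short TTL means we never rely on delivery anyway
	}
	g.sentInWindow++
	// Placeholder for the actual gossip call, e.g. AddInfo(key, value, ttl).
	fmt.Printf("gossip lease r%d -> replica %d (ttl %s)\n", rangeID, leaseholder, g.ttl)
	return true
}

func main() {
	g := &leaseGossiper{maxPerInterval: 2, interval: time.Second, ttl: 10 * time.Second}
	now := time.Now()
	for i := int64(1); i <= 4; i++ {
		g.maybeGossip(i, 2, now) // only the first two are published in this window
	}
}
```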


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 390 at r1 (raw file):

  where the range caches are stored.

## 5. Liveness-based cache invalidation with bypass

I think overall this is my first choice. (in fact, I thought we were already doing this).


Today the mechanism for graceful shutdown moves all the leases from
the node being shut down to other nodes. Both the stopped node and
the other nodes with the transferred leases have up-to-date
andreimatei (Andrei Matei) commented:

Currently lease transfers are not communicated even to the cache on the replicas for the respective range. So for the purposes of kvclient, nobody has "up-to-date knowledge".


Worse even, if there is a network partition, the attempt to
connect to the now-unreachable node may encounter a much higher delay
(TCP timeout, 30s by default).
andreimatei (Andrei Matei) commented:

I think maybe it's not quite as bad as the TCP timeout, since we have our own heartbeats and lower timeouts on the gRPC connections (in the case of established connections). I'm not sure what timeouts apply to connection establishment - there's a lot of indirection behind the circuit breaker.

shutting down calls a new RPC `InvalidateNode()` on every other node
in the cluster to announce itself as now-unavailable.
- upon receiving `InvalidateNode()`, each other node removes any
range entry for that node from their lease cache.
andreimatei (Andrei Matei) commented:

nit: there's no longer a "lease cache". There's a range cache which can also have leases.


## General CockroachDB design principles

“Don't pay for what you don't use”
andreimatei (Andrei Matei) commented:

Depending on who the implied subject of this sentence is, I wish we'd take this principle to heart in much more consequential areas - like closed timestamps or mvcc versions for backup with revision_history :)

The way I see it, it's fine for every node to pay some cost for the common good of the cluster; nodes don't inspect every gossipped key to see if it does anything for them and throw it away if it doesn't. One node forwards something for another node today, and that other node will return the favor tomorrow.
The altruism should have a limit, so indeed I think it'd be bad if info on too many ranges ends up stored in gossip at once.

- n3 is not part of the replication group, and mistakenly tries
to reach out to n1 which it believes is still leaseholder.

We want to avoid this mistaken routing and associated delays.
andreimatei (Andrei Matei) commented:

For me, the motivations for being proactive with updating these caches are:

  1. reduce tail latencies by reducing mis-routing
  2. help with the pathological behaviors in cases when a node is unresponsive and people contact it even though it lost its leases (e.g. storage: restarted node in need of snapshots can be wedged for long time #37906). So, whereas I think Nathan sees more gossipping to be a risk to stability, I see it as aiming to help stability. It sucks for a node to lose its leases in whatever ways (e.g. say it shed them away through load balancing, or it's so overloaded that it's failing to heartbeat its node liveness) but then that has no effect because nobody can find out about where the new leases are. For each individual situation where something like this happens there'll be something else to fix, but still I see people being up to date with lease locations as a stabilizing factor.

- upon a cache lookup, the range cache consults liveness.

If the entry in the cache point to a node for which liveness
currently says the node is not live, the cache entry is ignored (is
andreimatei (Andrei Matei) commented:

I think we should do some lip service around here to the fact that we can determine separately whether a lease is stale (its epoch is lower than the corresponding liveness record's epoch) and whether a node's expiration record is expired. I don't think the first info is very useful - everything else being equal, the stale leaseholder node is still the most likely node to be the leaseholder I think (e.g. if it had a liveness blip but everybody behaves like today, it'll get all its leases back).
A case that I think deserves thought is a valid lease on a dead node (so, an expiration-based lease on a node with an expired liveness record).

Btw there's another source of liveness info that we sometimes use - the connection health exposed by the RPCContext; I don't know if you want to bring it into this discussion though.
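To spell out the two checks distinguished above, here is a simplified sketch of what a cache lookup could do with a gossiped liveness record; the types and field names are stand-ins, not the real ones:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the real liveness and cached-lease types.
type liveness struct {
	NodeID     int32
	Epoch      int64
	Expiration time.Time
	Draining   bool
}

type cachedLease struct {
	RangeID int64
	NodeID  int32
	Epoch   int64
}

// useCachedLease applies the two independent checks discussed above:
// (1) the lease epoch is behind the liveness epoch (the lease is provably stale),
// (2) the node's liveness looks expired or draining (the node looks down), in
// which case we de-prioritize it for discovery rather than trusting the entry.
func useCachedLease(l cachedLease, lv liveness, now time.Time) bool {
	if l.Epoch < lv.Epoch {
		return false // stale lease; though the node may well reacquire it
	}
	if now.After(lv.Expiration) || lv.Draining {
		return false // node looks dead or draining; try another replica first
	}
	return true
}

func main() {
	lv := liveness{NodeID: 1, Epoch: 5, Expiration: time.Now().Add(-time.Second)}
	fmt.Println(useCachedLease(cachedLease{RangeID: 7, NodeID: 1, Epoch: 5}, lv, time.Now()))
}
```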

- No new data stored in the range cache to mark nodes
as poisoned.

**Cons:**
andreimatei (Andrei Matei) commented:

the biggest con from 4 remains for me - it doesn't help regular transfers


## Scenarios

This RFC looks at two scenarios:
andreimatei (Andrei Matei) commented:

For me an important thing is also looking at how quickly caches react to regular cooperative transfers, not just transfers done while draining or new leases issued after a node dies. It'd be nice to eliminate as much bad routing as we can after such a transfer, to help with tail latencies. Like, ideally, no request would ever be routed according to a stale lease. In the particular case of draining, while a node is draining, a lot of requests could be mis-routed according to most of the proposals until the node finally shuts down.

as poisoned.

**Cons:**

andreimatei (Andrei Matei) commented:

Does this option generally do anything for "scenario A"? If a node drains and restarts quickly, its liveness record is never expired.
That's not necessarily a bad thing, but I think our goal should be to react to transfers (particularly to large numbers of them).

lease. In contrast, solutions 2 & 3 pre-populate the caches with the
location of the new lease.

Solution 5 is simpler to implement than 4 but may incur lock
andreimatei (Andrei Matei) commented:

I personally still stand in the Gossip camp :)
I think we'd benefit from caches being kept up to date under all conditions (well, maybe not under extreme conditions where there's too many updates to propagate), and I'm also interested in expanding a gossiping mechanism that I hope we introduce here to also deal with range splits/merges. Range descriptor changes are even more disruptive to traffic than lease changes, so I want to be proactive about them too.

@knz knz left a comment

Thanks for your mindshare. I have worked all your remarks into the text, in particular the introduction of a new "scenario C" that describes natural lease transfers that occur from rebalancing.

I also added a new solution 6 "best-effort gossip of lease acquisition", the one suggested by Ben, which keeps caches up to date for scenario C, but does not answer scenarios A & B.

I'd also like to highlight a natural tension in the team that stems from individual interests and priorities:

  • Andrei (and, I believe, Ben - edit from below: not Ben) are prioritizing scenario C (behavior when cluster is stable) above and before scenarios A&B
  • whereas I (edit from below: and Ben) am prioritizing scenario A due to our roadmap objectives, with scenario C as a vague second.

The synthesis so far is this:

  • Andrei was pushing for solution 3 (gossip lease acquisitions) because it comprehensively solves scenario C (his interest), and gives a good solution to scenarios A&B (my interest).

    Nathan (and Ben) don't love solution 3 because of the gossip cost.

  • Nathan and Ben like solution 4&5. They address scenarios A&B well, so they work for me.

    However 4&5 don't address scenario C at all, so Andrei doesn't like them.

  • Ben goes so far as to say we should implement 4 or 5 for scenario B no matter what we do for scenarios A&C. I don't disagree.

  • Ben acknowledges that 4&5 don't work for scenario C, and acknowledges Nathan's concern about solution 3, and thus proposes a new gossip solution 6 with limited TTL, which works for scenario C but not A&B.

    Ben's position is that we should implement solutions 4 or 5 for scenarios A/B (my interest), and solution 6 for scenario C (Andrei's and Ben's interest).

I'd like to point out that the roadmap mandates a solution for scenario A, and does not say anything about scenario C. I don't want to dismiss it entirely (hence my commitment to cover all 3 in the RFC) but I'd like to focus on the right priorities.

So from my perspective I don't care whether we do solutions 3, 4 or 5 first (they all address scenario A, which I care about), but I don't like starting with solution 6 (best-effort gossip) or degrading solution 3 to become best-effort.



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 45 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

For me an important thing is also looking at how quickly caches react to regular cooperative transfers, not just transfers done while draining or new leases issued after a node dies. It'd be nice to eliminate as much bad routing as we can after such a transfer, to help with tail latencies. Like, ideally, no request would ever be routed according to a stale lease. In the particular case of draining, while a node is draining, a lot of requests could be mis-routed according to most of the proposals until the node finally shuts down.

Good point, I had missed this entirely. Added a scenario C to cover exactly this.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 53 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Currently lease transfers are not communicated even to the cache on the replicas for the respective range. So for the purposes of kvclient, nobody has "up-to-date knowledge".

I did not know this. Updated the text accordingly.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 82 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think maybe it's not quite as bad as the TCP timeout, since we have our own heartbeats and lower timeouts on the gRPC connections (in the case of established connections). I'm not sure what timeouts apply to connection establishment - there's a lot of indirection behind the circuit breaker.

Added a note that my comment only pertains to new connections.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 82 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If something's blocking this long, it's a bug. We're supposed to be using grpc heartbeats to fail any rpcs that can't be sent in IIRC 3 seconds. (3s may still be too long, but we can tweak this timeout relatively easily instead of dealing with the black-box tcp timeout)

We used to have a parallel retry mechanism in DistSender to further avoid problems around network partitions, but it was removed in #16088 because it also caused ambiguity errors.

In the event of an unclean node shutdown, we want the largest factor in recovery time to be the time it takes for the old liveness to expire. So anything that could wait longer than the liveness expiration really needs to be fixed.

I believe (from past observations) our RPC heartbeat code only kicks in for previously-established connections.

New connections are subject to different timeouts, which I believe have not been configured in our code. The gRPC defaults apply, in particular the default timeouts for DNS resolution (30s) and TCP connect/handshake timeout (30s or 5min, depending on OS). If I am wrong, please point me to the right place in the code.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 93 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

For me, the motivations for being proactive with updating these caches are:

  1. reduce tail latencies by reducing mis-routing
  2. help with the pathological behaviors in cases when a node is unresponsive and people contact it even though it lost its leases (e.g. storage: restarted node in need of snapshots can be wedged for long time #37906). So, whereas I think Nathan sees more gossipping to be a risk to stability, I see it as aiming to help stability. It sucks for a node to lose its leases in whatever ways (e.g. say it shed them away through load balancing, or it's so overloaded that it's failing to heartbeat its node liveness) but then that has no effect because nobody can find out about where the new leases are. For each individual situation where something like this happens there'll be something else to fix, but still I see people being up to date with lease locations as a stabilizing factor.

Pulled your phrasing into the text. Thanks.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 174 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is not the same as any of the 5 solutions detailed below, right? It should probably be written up in the same format with pros/cons.

Changed to forward to the end directly.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 216 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: there's no longer a "lease cache". There's a range cache which can also have leases.

Fixed.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 218 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This discovery process could be skipped if the InvalidateNodeRequest included the new leaseholders (as in the "drain record" proposed for option 2). OTOH, this would mean pushing a lot of data out of the draining node in a less network-efficient way than gossip (but more memory-efficient).

Thanks, added this note in the text.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 220 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The biggest pro of the RPC solution is that it doesn't rely on gossip, so assuming the cluster is otherwise healthy it will reliably invalidate/update all the caches as a synchronous part of the draining process.

Added this. thanks.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 274 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If I'm understanding you correctly, this isn't really about "active ranges" but about "ranges leased by a node that recently restarted".

Choosing a timeout here is tricky (there's no particular reason for it to be related to the lease expiry time). Too long and you increase the gossip memory requirements. Too short and it may expire before getting propagated to all nodes and you're back to dealing with stale caches. (and because it's all on one gossip key, expiration is all or nothing. It may even be possible that large gossip values are more likely to experience delays propagating through the system).

Added your comment in the text.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 306 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Gossiping lease acquisition has the nice advantage that it pre-populates the caches for rebalancing even when no nodes are restarting.

Added this as a reference to the newly described scenario C.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 319 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There is some room for cleverness here - since we're gossiping per range instead of per node, we could choose not to gossip every lease acquisition (and similarly, we could use a rather short TTL and not worry if some records get dropped before propagating). Maybe have a rate limit and prioritize those ranges with higher traffic, for example.

With a short enough TTL, we can think of the gossip costs here primarily in terms of the communications cost instead of the memory size.

Added your remark in the text.

Disclaimer: dropping records will not play well with scenario A, when a node drains and a large number of leases are transferred at once.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 322 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Most leases don't have an expiration, so what TTL do we use for those? Even for leases that do have an expiration, why should that be connected to the TTL? The lease expiration is about how often the owner needs to renew it. The TTL is about how much memory you're willing to use for gossipped keys (the higher the TTL, the more memory you use, the likelier that the record lives long enough for all the nodes to hear about it).
I'd use a fixed TTL.

And besides the TTL, the other big possible mitigation here is various rate limits - like a rate limit per node dictating how many acquisitions per second it can add to gossip, and maybe also one in gossip about how many of these updates it forwards every second (like, once that is reached, a node could pretend that the TTL is hit for everything else it receives). I don't know how feasible the latter is.

Added your remark in the text.

Disclaimer: dropping records will not play well with scenario A, when a node drains and a large number of leases are transferred at once.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 347 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

You don't necessarily even need to update the cache in order to take advantage of a liveness update; since we generally have full leases in the cache, a DistSender knows the epoch of each lease. If it's lower than the liveness info, then it knows that that lease is not valid (that's another reason why the cache maintains full lease info and also why we update the cache even on successful RPCs when the client had stale info).

So I don't think you need code to explicitly deal with any "poisoning"; I think between the node info telling you if the cache is stale, and the liveness record expiration timestamp and also the connection health info exposed by the RPC context, we have enough to determine both if we trust a lease and if a node currently looks to be up or down (and thus should be de-prioritized for discovery).

later update: I see now option 5. Let's get rid of 4 then; it's dominated isn't it?

Nathan thinks that solution 5 will cause lock contention on the liveness struct and slow everything down.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 368 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the biggest con here, in my opinion, is that this doesn't help avoid bad routing on regular lease transfers; it just helps with nodes restarting or losing liveness or such.

Added this remark as reference to the new scenario C.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 390 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think overall this is my first choice. (in fact, I thought we were already doing this).

I don't dislike it (it's also Nathan's first choice) but it does not play nicely with scenario C.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 400 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think we should do some lip service around here to the fact that we can determine separately whether a lease is stale (its epoch is lower than the corresponding liveness record's epoch) and whether a node's expiration record is expired. I don't think the first info is very useful - everything else being equal, the stale leaseholder node is still the most likely node to be the leaseholder I think (e.g. if it had a liveness blip but everybody behaves like today, it'll get all its leases back).
A case that I think deserves thought is a valid lease on a dead node (so, an expiration-based lease on a node with an expired liveness record).

Btw there's another source of liveness info that we sometimes use - the connection health exposed by the RPCContext; I don't know if you want to bring it into this discussion though.

I do not know what node expiration records are. Can you explain?


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 416 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the biggest con from 4 remains for me - it doesn't help regular transfers

Added this remark.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 417 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Does this option generally do anything for "scenario A"? If a node drains and restarts quickly, its liveness record is never expired.
That's not necessarily a bad thing, but I think our goal should be to react to transfers (particularly to large numbers of them).

That's incorrect. The "drain" boolean is set in the liveness record, which triggers a gossip update. The other nodes can take the drain bool into account to mark the node as unusable.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 459 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I personally still stand in the Gossip camp :)
I think we'd benefit from caches being kept up to date under all conditions (well, maybe not under extreme conditions where there's too many updates to propagate), and I'm also interested in expanding a gossiping mechanism that I hope we introduce here to also deal with range splits/merges. Range descriptor changes are even more disruptive to traffic than lease changes, so I want to be proactive about them too.

Added your remark in the text.

@knz knz force-pushed the 20200810-rfc-cache branch from 161288a to fa7341d Compare August 11, 2020 09:49
@knz knz marked this pull request as ready for review August 11, 2020 09:51
@knz knz requested a review from a team as a code owner August 11, 2020 09:51
@knz knz force-pushed the 20200810-rfc-cache branch 4 times, most recently from 30dee8d to 12ba2a6 Compare August 11, 2020 10:09

knz commented Aug 11, 2020

Something to consider about solution 3 (gossip all lease acquisitions)

  • as per Andrei, a lease protobuf is ~100 bytes
  • so if we have 100k ranges total, we'd be looking at a 10MB gossip payload total
  • 10MB takes 0.1s to propagate over a 100 MB/s link. Even with 100 nodes (less than 5 hops of gossip network depth), I don't believe it would significantly impact the whole-cluster propagation latency

So it's possible that solution 3 is not so expensive after all? @nvanbenschoten what do you think?
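For reference, the arithmetic behind those bullets, with the figures above treated as assumptions rather than measurements:

```go
package main

import "fmt"

func main() {
	const (
		leaseRecordBytes = 100               // assumed size of a gossiped lease protobuf
		numRanges        = 100_000           // assumed total ranges in the cluster
		linkBytesPerSec  = 100 * 1000 * 1000 // ~100 MB/s of usable bandwidth per gossip hop
	)
	payload := leaseRecordBytes * numRanges
	fmt.Printf("payload ≈ %.1f MB, ≈ %.2f s per hop\n",
		float64(payload)/1e6, float64(payload)/float64(linkBytesPerSec))
}
```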

@petermattis petermattis left a comment

I gave this a quick read and left some comments. I'm a bit time crunched today, so I likely missed context or additional details in the RFC.



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 218 at r1 (raw file):

Previously, knz (kena) wrote…

Thanks, added this note in the text.

I think it is worthwhile to quantify how much data this would be. Assuming 100k ranges per node and that a node is leaseholder for 1/3 of those replicas, we'd need to send out leaseholder info for 30k replicas. What is the minimal info we can send about a new leaseholder for a range? We need the range ID (8 bytes) and the new leaseholder replica ID (4 bytes)? Maybe something about the epoch of the new leaseholder node. So ballpark 512KB - 1MB. That is large, but not horrifically so. The bigger problem I see is that we'd have to send this to every node, which doesn't scale well to large cluster sizes.

Update: I bet a clever encoding could reduce the space overhead further. For example, rather than a list of invalidations, we could maintain a map from nodeID to a list of new leaseholder info. The new leaseholder info itself could be sorted by range ID and the range IDs delta encoded. I didn't put much thought into this, but it seems like we could get a fairly compact representation with some elbow grease.
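An illustrative sketch of that delta encoding; it ignores node epochs and whatever framing the real message would need:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

type leaseholderUpdate struct {
	rangeID   uint64
	replicaID uint64
}

// encodeUpdates sorts updates by range ID and delta-encodes the range IDs
// with varints, roughly the compact representation sketched above.
func encodeUpdates(updates []leaseholderUpdate) []byte {
	sort.Slice(updates, func(i, j int) bool { return updates[i].rangeID < updates[j].rangeID })
	var buf []byte
	var prev uint64
	for _, u := range updates {
		buf = binary.AppendUvarint(buf, u.rangeID-prev) // small deltas -> 1-2 bytes each
		buf = binary.AppendUvarint(buf, u.replicaID)
		prev = u.rangeID
	}
	return buf
}

func main() {
	updates := []leaseholderUpdate{{1205, 2}, {1201, 3}, {1203, 2}}
	fmt.Printf("%d updates in %d bytes\n", len(updates), len(encodeUpdates(updates)))
}
```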


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 28 at r2 (raw file):

happens to still be running, this works; however, if n1 is down (either
planned or unplanned), the request fails to reach n1, then
n3 starts trying every other node and discover the lease on n2.

To be more precise, n3 starts trying every other replica. If you have 10 nodes and 3 replicas, it is only going to try 3 nodes, not 10.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 30 at r2 (raw file):

n3 starts trying every other node and discover the lease on n2.

This stall and subsequent discovery process have been observed to

Nit: s/have been/has been/g


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 90 at r2 (raw file):

**Scenario C.** Leases get transferred from one node to another
as part of the background rebalancing that is continuously happening
throughout the culture, or in response to zone config changes.

s/culture/cluster/g (assuming you're not talking about https://en.wikipedia.org/wiki/Culture_series).


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 278 at r2 (raw file):

- By using gossip, the data will eventually reach every other
  node even those that were temporarily unavailable.
  (i.e. eliminates one cons of solution 1 above)

Nit: s/one cons/one con/g


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 355 at r2 (raw file):

  We can partially alleviate this cost by populating the lease expiry
  time as the value of [gossip TTL](#gossip). This reduces
  the space complexity from O(#ranges) to O(#active ranges).

Another thought is that we could batch the gossiped leaseholder updates from a node. For example, when a node acquires a lease it gossips that it is the leaseholder and starts a timer for X seconds. Any new lease acquisitions while the timer is running get bundled together and are only sent when the timer expires.
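A sketch of that timer-based batching, with `gossipFn` standing in for the real gossip call:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leaseBatcher accumulates lease acquisitions and flushes them as a single
// gossiped record after flushDelay, as in the batching idea above.
type leaseBatcher struct {
	mu         sync.Mutex
	pending    []int64 // range IDs acquired since the last flush
	timer      *time.Timer
	flushDelay time.Duration
	gossipFn   func(rangeIDs []int64)
}

func (b *leaseBatcher) recordAcquisition(rangeID int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, rangeID)
	if b.timer == nil { // first acquisition in this window starts the timer
		b.timer = time.AfterFunc(b.flushDelay, b.flush)
	}
}

func (b *leaseBatcher) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending, b.timer = nil, nil
	b.mu.Unlock()
	b.gossipFn(batch)
}

func main() {
	b := &leaseBatcher{
		flushDelay: 50 * time.Millisecond,
		gossipFn:   func(ids []int64) { fmt.Println("gossip batch:", ids) },
	}
	for i := int64(1); i <= 3; i++ {
		b.recordAcquisition(i)
	}
	time.Sleep(120 * time.Millisecond) // wait for the flush timer in this demo
}
```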


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 362 at r2 (raw file):

  > rate limits - like a rate limit per node dictating how many
  > acquisitions per second it can add to gossip, and maybe also one
  > in gossip about how many of these updates it forwards every second

I'd be very wary of adding additional smarts to gossip itself. The first rule of using gossip: do not modify gossip.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 388 at r2 (raw file):

  Like above, this last cons can be alleviated by extending step 5 of
  the draining process by a "sleep period".

Another con to gossiping lease transfers is that it changes the cache behavior. Right now, the range cache is populated as a node accesses ranges. If we populate it based on lease transfers and the size of the cache causes evictions to occur, then we may end up evicting leaseholder info that the node is using in favor of leaseholder info for ranges containing idle data.

Cc @andreimatei you seem keen on this approach and you're most up to date on the range cache behavior.

@knz knz force-pushed the 20200810-rfc-cache branch from 12ba2a6 to e4c97b8 Compare August 11, 2020 13:05
@knz knz left a comment



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 218 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it is worthwhile to quantify how much data this would be. Assuming 100k ranges per node and that a node is leaseholder for 1/3 of those replicas, we'd need to send out leaseholder info for 30k replicas. What is the minimal info we can send about a new leaseholder for a range? We need the range ID (8 bytes) and the new leaseholder replica ID (4 bytes)? Maybe something about the epoch of the new leaseholder node. So ballpark 512KB - 1MB. That is large, but not horrifically so. The bigger problem I see is that we'd have to send this to every node, which doesn't scale well to large cluster sizes.

Update: I bet a clever encoding could reduce the space overhead further. For example, rather than a list of invalidations, we could maintain a map from nodeID to a list of new leaseholder info. The new leaseholder info itself could be sorted by range ID and the range IDs delta encoded. I didn't put much thought into this, but it seems like we could get a fairly compact representation with some elbow grease.

Thanks Peter for this remark. I added it to the text. I also think it applies equally to the gossip-based solutions described below, so I'll add a note about that too.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 28 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

To be more precise, n3 starts trying every other replica. If you have 10 nodes and 3 replicas, it is only going to try 3 nodes, not 10.

Corrected.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 30 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: s/have been/has been/g

Done.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 90 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/culture/cluster/g(assuming you're not talking about https://en.wikipedia.org/wiki/Culture_series).

Done.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 355 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Another thought is that we could batch the gossiped leaseholder updates from a node. For example, when a node acquires a lease it gossips that it is the leaseholder and starts a timer for X seconds. Any new lease acquisitions while the timer is running get bundled together and are only sent when the timer expires.

Thanks added this to the text.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 362 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'd be very wary of adding additional smarts to gossip itself. The first rule of using gossip: do not modify gossip.

Added your reaction.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 388 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Another con to gossiping lease transfers is that it changes the cache behavior. Right now, the range cache is populated as a node accesses ranges. If we populate it based on lease transfers and the size of the cache causes evictions to occur, then we may end up evicting leaseholder info that the node is using in favor of leaseholder info for ranges containing idle data.

Cc @andreimatei you seem keen on this approach and you're most up to date on the range cache behavior.

We discussed this before: the agreement is that we would only update existing cache entries, and not add new entries. So the proposal is size-neutral for the cache.
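A sketch of that "update, never insert" rule, using a plain map as a stand-in for the range cache:

```go
package main

import "fmt"

// applyGossipedLease updates the cached leaseholder only for ranges the node
// already has in its cache, so gossip never grows the cache.
func applyGossipedLease(cache map[int64]int32, rangeID int64, leaseholder int32) {
	if _, ok := cache[rangeID]; !ok {
		return // not a range this node cares about; ignore the update
	}
	cache[rangeID] = leaseholder
}

func main() {
	cache := map[int64]int32{7: 1}
	applyGossipedLease(cache, 7, 2)  // updates an existing entry
	applyGossipedLease(cache, 99, 3) // dropped: it would add a new entry
	fmt.Println(cache)
}
```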

@andreimatei andreimatei left a comment



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 362 at r2 (raw file):

Previously, knz (kena) wrote…

Added your reaction.

It's only tangential to the issues here, but one thing I've been curious about is understanding how gossip works - when a node communicates with another one, does it pass all of the data in one big RPC or is there chunking/streaming of some sort?


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 379 at r2 (raw file):

  > memory size.

  NB: the above remarks do not take into account scenarios A and B,

In scenario A (graceful shutdown), there's natural throttling to the lease transfers I think (in particular if you don't drain too many nodes at once), so I thought it should work fine.

@andreimatei andreimatei left a comment



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 417 at r1 (raw file):

Previously, knz (kena) wrote…

That's incorrect. The "drain" boolean is set in the liveness record, which triggers a gossip update. The other nodes can take the drain bool into account to mark the node as unusable.

but is it a good idea for the DistSender to use that flag? How would it work exactly? If n2 sees that n1 is draining, and it knows about a lease on n1, what exactly does it do? Does it ignore n1 once for each request, even if that lease is staying on n1 for a while? That seems like the most disrupting thing.

@petermattis petermattis left a comment



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 362 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

It's only tangential to the issues here, but one thing I've been curious about is understanding how gossip works - when a node communicates with another one, does it pass all of the data in one big RPC or is there chunking/streaming of some sort?

Gossip sends a delta of its "info store" based on what the local gossip instance knows the remote node contains. If one node gossips a new record, only that new record will be sent to remote nodes. Because the gossip network is a graph, the same record may arrive at a node from multiple paths.
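An illustrative (and much simplified) model of that delta exchange, tracking a per-key sequence number and sending only what the peer hasn't seen; the real info store works differently in the details:

```go
package main

import "fmt"

type info struct {
	value string
	seq   int64 // monotonically increasing per key
}

// delta returns the infos in store that are newer than what we believe the
// peer already has (peerHighWater maps key -> highest seq seen).
func delta(store map[string]info, peerHighWater map[string]int64) map[string]info {
	out := map[string]info{}
	for k, i := range store {
		if i.seq > peerHighWater[k] {
			out[k] = i
		}
	}
	return out
}

func main() {
	store := map[string]info{
		"liveness:n1": {value: "live", seq: 12},
		"lease:r42":   {value: "n2", seq: 3},
	}
	peer := map[string]int64{"liveness:n1": 12} // peer already has the liveness record
	fmt.Println(delta(store, peer))             // only lease:r42 is sent
}
```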

@bdarnell commented:

I'd also like to highlight a natural tension in the team that stems from individual interests and priorities:

Andrei (and, I believe, Ben) are prioritizing scenario C (behavior when cluster is stable) above and before scenarios A&B
whereas I am prioritizing scenario A due to our roadmap objectives, with scenario C as a vague second.

No, I also think that A is the highest priority to fix. I'd like to fix scenario B at the same time as A, but that comes more from adherence to crash-only principles than from its practical importance. I would eliminate solution 2 because it does not address scenario B at all, and does not seem substantially better than solution 1 for scenario A. I don't like the fact that solution 1 only addresses scenario A, but I could be persuaded to choose it if gossip turns out to be too slow or unreliable for one of the gossip-based solutions to work.

Scenario C is also important, but any solution that addresses scenario C carries more risk since it puts a lot of stress on the gossip system (or requires a new non-gossip system for propagating this information). I think it's likely that we'd need to put some sort of opt-in/opt-out switch in place for this system, and I wouldn't want that to also give up our solution for scenario A, hence the recommendation for a combination like 4/5+3. (this also lets us split the work across multiple releases if need be)

@knz knz force-pushed the 20200810-rfc-cache branch from e4c97b8 to 4ec1e3f Compare August 11, 2020 22:07
@knz knz force-pushed the 20200810-rfc-cache branch from 4ec1e3f to 7e4ec18 Compare August 11, 2020 22:09

knz commented Aug 11, 2020

No, I also think that A is the highest priority to fix. I'd like to fix scenario B at the same time as A, but that comes more from adherence to crash-only principles than from its practical importance.

Thanks for the correction. I updated my comment above to reflect this.

I would eliminate solution 2 [...] this also lets us split the work across multiple releases if need be.

Thanks for this additional remark, I added it into the RFC.


knz commented Aug 11, 2020

So @andreimatei and @nvanbenschoten and I had a discussion today about this.

  • Nathan and Andrei have challenged the "problem to solve": that attempting to connect to a node that's currently off (e.g. due to a restart) is more latency-expensive than getting a NotLeaseHolder error from a different (live) node.

    To respond to this challenge, I propose to run some experiment (ie reproduce our scenario tests from last year).

  • Assuming the problem does indeed exist, Andrei and Nathan are challenging the benefit of preemptively marking the down node as "undesirable" in lease lookups, i.e. it's unclear that solutions 3-6 actually help with scenario A.

    From this challenge, a strawman "solution" appears, which I introduce in the RFC as "solution 7": an extra wait at the end of the drain process, to give other nodes a chance to get a NLE from the node being drained. Maybe that's all that's needed to refresh the caches on other nodes. This would address scenario A at a lower cost than solutions 4/5, and more clearly than solution 3.

  • The next step here is to try this out. If "solution 7" (just wait a bit) is good enough, then we'll do that for v20.2 and then delay solutions 3-6 to the next release.

    If it doesn't work out, then Ben's strategy to combine solution 6 with 4/5 is the next best step.

  • Regardless of the outcome in the previous step, we may still want to muck with the distsender lease discovery algorithm in this release though. Andrei feels that it doesn't hurt to make the discovery more aware of things. So exploring solution 5 "for research purposes" may still be desirable.

    • We floated a notion that we may also want to take the "decommissioned" bit into account to extend the scope of solution 5 to also improve the "repaving" use case, but Andrei pointed out that this is best served in the RPC connect method, so probably out of scope.

@nvanbenschoten (Member) commented:

Nathan and Andrei have challenged the "problem to solve": that attempting to connect to a node that's currently off (e.g. due to a restart) is more latency-expensive than getting a NotLeaseHolder error from a different (live) node.

I don't think I was questioning this. It sounded like we've seen this in practice before.

Assuming the problem does indeed exist, Andrei and Nathan are challenging the benefit of preemptively marking the down node as "undesirable" in lease lookups, i.e. it's unclear that solutions 3-6 actually help with scenario A.

From this challenge, a strawman "solution" appears, which I introduce in the RFC as "solution 7": an extra wait at the end of the drain process, to give other nodes a chance to get a NLE from the node being drained. Maybe that's all that's needed to refresh the caches on other nodes. This would address scenario A at a lower cost than solutions 4/5, and more clearly than solution 3.

Don't we know what this is going to show us? Assuming a fast NLE redirect before a node is offline and a slower connection failed redirect after a node is offline, we'd expect a minimal (and unavoidable) disruption of traffic to any ranges touched during the draining or "waiting" process and then a larger disruption of traffic to any ranges touched only after the node goes offline. So without anything else, I'm not sure what we're testing here.

I thought the idea was that we would have a node mark itself as dead in node liveness before this waiting period. We would then make the change in concert with solution 4 or 5 so that the waiting period only needs to be as long as we expect the liveness update to take to propagate throughout the system. Once it has, we don't expect any other requests to the node so it can feel free to terminate fully.
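A sketch of that ordering, with placeholder functions for the liveness update and the lease transfers, and a stand-in propagation delay (none of these names are from the actual server code):

```go
package main

import (
	"fmt"
	"time"
)

// drain sketches the ordering discussed above: mark the node as draining/dead
// in its liveness record first, transfer its leases, then wait roughly as long
// as the liveness update needs to propagate via gossip before shutting down,
// so that other nodes (with solution 4/5 in place) stop routing to us.
func drain(markDrainingInLiveness func() error, transferLeases func() error,
	gossipPropagationDelay time.Duration) error {
	if err := markDrainingInLiveness(); err != nil {
		return err
	}
	if err := transferLeases(); err != nil {
		return err
	}
	time.Sleep(gossipPropagationDelay) // give the liveness update time to reach everyone
	return nil
}

func main() {
	err := drain(
		func() error { fmt.Println("liveness: draining=true"); return nil },
		func() error { fmt.Println("leases transferred"); return nil },
		100*time.Millisecond, // stand-in; a real value would track measured gossip latency
	)
	fmt.Println("shutdown ok:", err == nil)
}
```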


knz commented Aug 12, 2020

I thought the idea was that we would have a node mark itself as dead in node liveness before this waiting period. We would then make the change in concert with solution 4 or 5 so that the waiting period only needs to be as long as we expect the liveness update to take to propagate throughout the system. Once it has, we don't expect any other requests to the node so it can feel free to terminate fully.

I had missed that. Thanks for the reminder.

@knz knz left a comment



docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 281 at r1 (raw file):

If the node is not alive any more, generally the request should fail fast

We know this is not true in some cases, it's dependent on a fast RST by the IP stack. See issue #53410

will cause the sender to try another one of the two nodes. With 1/2 probability, it will select the wrong one

Only with 3 replicas. With 5 replicas, it will have a 75% probability of selecting the wrong one; with 7 replicas, it's an ~83% chance, etc.
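In general, if the sender skips the stale leaseholder and retries uniformly at random among the remaining replicas of an r-way replicated range, the chance of picking another non-leaseholder is:

```latex
P(\text{wrong pick}) = \frac{r-2}{r-1}
  \qquad \tfrac{1}{2}\ (r=3), \quad \tfrac{3}{4}\ (r=5), \quad \tfrac{5}{6} \approx 83\%\ (r=7)
```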


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 417 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but is it a good idea for the DistSender to use that flag? How would it work exactly? If n2 sees that n1 is draining, and it knows about a lease on n1, what exactly does it do? Does it ignore n1 once for each request, even if that lease is staying on n1 for a while? That seems like the most disrupting thing.

That is a good point.

The proposal calls for trying the next live replica. Presumably that will either have the lease, or know about a better location for it.

If the lease hasn't moved yet, then the "next replica" will point back to the one being drained. However: we are talking about the LH cache here. The KV request will still make it through. I suppose what this text here means is that we don't want to keep the draining node in the cache. I think it's OK to have the KV request take an extra hop in the network.


docs/RFCS/20200810_leaseholder_cache_invalidation.md, line 379 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

In scenario A (graceful shutdown), there's natural throttling to the lease transfers I think (in particular if you don't drain too many nodes at once), so I thought it should work fine.

Not really, no: the "throttling" is just that the initiation of transfers is batched (100 at a time), but the transfers are not waited on.
So the amount of changes delivered to gossip can spike without bound.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@nvanbenschoten nvanbenschoten removed their request for review September 27, 2022 19:38