kvserver: Always treat restarted nodes as suspect #97263

andrewbaptist · 2023-02-16T21:25:36Z

Is your feature request related to a problem? Please describe.

As seen in #95159, a store that is offline for longer than 5 minutes (server.time_until_store_dead) is immediately seen as Available after its first gossip liveness update. Unfortunately, the store may not be fully healthy at this point, as it is likely behind on Raft updates, and it should not be a target for "unnecessary" lease transfers.

We already have a Suspect status, used for stores that are down for less than 5 minutes, which keeps lease and replica transfers away for 30s (server.time_after_store_suspect). So after a short outage the store becomes Suspect, but once a store has been marked dead, it transitions immediately to Available.

Describe the solution you'd like

Any store that is offline and rejoins should be treated as Suspect for the 30s window until it has had a chance to recover. Lease and replica transfers to suspect nodes are not prohibited, but they are only done in emergency cases. This will decrease the impact of a store being offline for an extended period.

Describe alternatives you've considered
An alternative that was explored in #96980 was to start nodes in a different state and post a different liveness update until they are healthy enough. This is unnecessarily complex, however, as we already send the IO overload status through gossip and could reasonably figure out whether Raft is healthy enough on the range using #96304.

Additional context
Performance after restart is a complex issue requiring a number of moving parts to fully address. This change alone will be a strict improvement; however, without some of the other issues mentioned, it won't fully address all impacts of restarted nodes.

Jira issue: CRDB-24600

kvoli · 2023-03-07T14:01:37Z

Completed by #97532.

andrewbaptist added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 16, 2023

andrewbaptist assigned andrewbaptist and kvoli and unassigned andrewbaptist Feb 16, 2023

andrewbaptist changed the title ~~Always treat restarted nodes as suspect~~ kvserver: Always treat restarted nodes as suspect Feb 16, 2023

andrewbaptist added the A-kv-distribution Relating to rebalancing and leasing. label Feb 16, 2023

kvoli closed this as completed Mar 7, 2023

andrewbaptist mentioned this issue Mar 14, 2024

kvclient: follower reads can be sent to slow node resulting in high latency #120519

Open
