kvserver: Always treat restarted nodes as suspect #97263

Closed
andrewbaptist opened this issue Feb 16, 2023 · 1 comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

Collaborator

andrewbaptist commented Feb 16, 2023

Is your feature request related to a problem? Please describe.

As seen in #95159, a store that is offline for longer than 5 minutes (server.time_until_store_dead) is immediately seen as Available after its first gossip liveness update. Unfortunately, the store may not be fully healthy at this point, as it is likely behind on Raft updates, so it should not be a target for "unnecessary" lease transfers.

We already have a Suspect status that is used for stores that were down for less than 5 minutes; it keeps lease and replica transfers away for 30s (server.time_after_store_suspect). So after a short outage, the store becomes Suspect. However, once a store has been marked dead, it transitions immediately to Available when it returns.

Describe the solution you'd like

Any store that is offline and rejoins should be treated as Suspect for the 30s window until it has had a chance to recover. Lease and replica transfers are not prohibited to suspect nodes, but they are only done in emergency cases. This will decrease the impact of a store being offline for an extended period.

Describe alternatives you've considered
An alternative that was explored in #96980 was to start nodes in a different state and post a different liveness update until they are healthy enough. This is unnecessarily complex, however, as we already send the IO overload status through gossip and could reasonably figure out whether Raft is healthy enough on the range using #96304.

Additional context
Performance after restart is a complex issue with a number of moving parts that all need to be addressed. This change alone will be a strict improvement; however, without fixes for some of the other issues mentioned, it won't fully address all impacts of restarted nodes.

Jira issue: CRDB-24600

andrewbaptist added the C-enhancement label Feb 16, 2023
andrewbaptist changed the title from "Always treat restarted nodes as suspect" to "kvserver: Always treat restarted nodes as suspect" Feb 16, 2023
andrewbaptist added the A-kv-distribution label Feb 16, 2023
Collaborator

kvoli commented Mar 7, 2023

completed by #97532
