Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: quota pool does not consider node liveness #84202

Open
erikgrinaker opened this issue Jul 11, 2022 · 4 comments
Open

kvserver: quota pool does not consider node liveness #84202

erikgrinaker opened this issue Jul 11, 2022 · 4 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 11, 2022

The replica quota pool does not take node liveness into account. Instead, it only considers the RPC status:

// Only consider followers that that have "healthy" RPC connections.
if err := r.store.cfg.NodeDialer.ConnHealth(rep.NodeID, r.connectionClass.get()); err != nil {
return
}

We should check the liveness record instead or in addition.

Jira issue: CRDB-17522

Epic CRDB-39898

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication labels Jul 11, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 11, 2022

cc @cockroachdb/replication

@blathers-crl blathers-crl bot added the A-kv-replication Relating to Raft, consensus, and coordination. label Jul 11, 2022
@erikgrinaker
Copy link
Contributor Author

As seen in #84943, we should get rid of the ConnHealth call entirely, and use liveness instead.

pav-kv added a commit to pav-kv/cockroach that referenced this issue Aug 3, 2022
The proposal quota release procedure checked node connection health for every
node that appears active after replica activity checks. This is expensive, and
previously caused issues like cockroachdb#84943.

This change replaces the node connection health check with the cheaper
NodeLiveness check.

Fixes cockroachdb#84202

Release note: None
pav-kv added a commit to pav-kv/cockroach that referenced this issue Aug 4, 2022
The proposal quota release procedure checked node connection health for every
node that appears active after replica activity checks. This is expensive, and
previously caused issues like cockroachdb#84943.

This change replaces the node connection health check with the cheaper
NodeLiveness check.

Fixes cockroachdb#84202

Release note: None
@pav-kv
Copy link
Collaborator

pav-kv commented Aug 5, 2022

After more discussion with @erikgrinaker, an edge case came up. If we check liveness instead of ConnHealth, the quota is subject to be ignored if a majority of followers are slow and fail to update their liveness records. This could result in a growing backlog.

We think it's okay to just remove the ConnHealth, because there is a subsequent paused replicas check which anyway covers the cases when followers are lagging behind.

craig bot pushed a commit that referenced this issue Aug 5, 2022
85565: kvserver: don't check ConnHealth when releasing proposal quota r=erikgrinaker a=pavelkalinnikov

The proposal quota release procedure checked node connection health for every
node that appears active after replica activity checks. This is expensive, and
previously caused issues like #84943.

This change removes the ConnHealth check, because other checks, such as
isFollowerActiveSince and paused replicas, provide sufficient protection from
various kinds of overloads.

Touches #84202

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
@tbg
Copy link
Member

tbg commented Jul 3, 2023

This issue is obsolete if we disable/remove the quota pool, x-ref #106063

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

3 participants