Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: gossip frequency is implicitly 10 seconds #81669

Closed
kvoli opened this issue May 23, 2022 · 1 comment
Closed

kvserver: gossip frequency is implicitly 10 seconds #81669

kvoli opened this issue May 23, 2022 · 1 comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@kvoli
Copy link
Collaborator

kvoli commented May 23, 2022

Description

Store gossip occurs more than what is expected. It occurs every ~10 seconds under no changes whilst it should occur every minute currently.

These updates are triggered by capacity changes, specifically lease add events .

This function declares itself as idempotent however that is not the case, as each call may cause a new gossip to kick off of the latest store descriptor.

Reproduce

To reproduce, run https://github.com/cockroachdb/cockroach/compare/master...kvoli:220519.gossip-metrics?expand=1 and open up DB console. Using kv.allocator.staleness you can examine the histogram of gossip store descriptor staleness used in allocation decisions.

What triggers gossip updates

We only update the storepool state with newer information here. We also update the storepool state following lease transfers and replica changes with the estimated impact.

Gossip updates occur for the local store every 1 minute and will also be triggered if between now and the last gossip, any of these are true:

Condition Condition Check Trigger Description
(replica count delta) > 2 Replica change code
(replica count delta)/(last gossipped replica count) > 1% of last gossipped value Replica change code
(lease count delta) > 0 and (lease count delta)/(last gossipped lease count) >1% Lease transfer code
(qps delta)/(last gossipped qps) > 50% and qps delta > 100 Every 10 seconds (when updating metrics gauges) code

Expected behavior

Gossip should occur only when the above conditions are met. Additionally we may wish to investigate lowering the timer from 1 minute, given this bug has existed for some time without issue. Making the interval explicit rather than implicit.

Jira issue: CRDB-16019

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. labels May 23, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 23, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jul 5, 2022
In cockroachdb#81669 @kvoli discovered that we often gossip much more frequently
than at a 60s interval. Since we are also adding I/O liveness signals
to the store capacity to use in a solution for cockroachdb#79215, we want to make
this more reactive interval explicit.

This change has the side effect of reducing the minimum allowed value
for `server.time_until_store_dead` to 25s[^1]; this seems reasonable.

[^1]: https://github.com/cockroachdb/cockroach/blob/263fbb7c8fcf001fcf47d7d35894b5824c78dc14/pkg/kv/kvserver/allocator/storepool/store_pool.go#L94-L99

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jul 5, 2022
In cockroachdb#81669 @kvoli discovered that we often gossip much more frequently
than at a 60s interval. Since we are also adding I/O liveness signals
to the store capacity to use in a solution for cockroachdb#79215, we want to make
this more reactive interval explicit.

This change has the side effect of reducing the minimum allowed value
for `server.time_until_store_dead` to 25s[^1]; this seems reasonable.

[^1]: https://github.com/cockroachdb/cockroach/blob/263fbb7c8fcf001fcf47d7d35894b5824c78dc14/pkg/kv/kvserver/allocator/storepool/store_pool.go#L94-L99

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jul 5, 2022
In cockroachdb#81669 @kvoli discovered that we often gossip much more frequently
than at a 60s interval. Since we are also adding I/O liveness signals
to the store capacity to use in a solution for cockroachdb#79215, we want to make
this more reactive interval explicit.

This change has the side effect of reducing the minimum allowed value
for `server.time_until_store_dead` to 25s[^1]; this seems reasonable.

[^1]: https://github.com/cockroachdb/cockroach/blob/263fbb7c8fcf001fcf47d7d35894b5824c78dc14/pkg/kv/kvserver/allocator/storepool/store_pool.go#L94-L99

Release note: None
@kvoli
Copy link
Collaborator Author

kvoli commented Jul 5, 2022

Update: In the original experiments, the histogram that was being used could not record values greater than 10s - due to

MaxLatency = 10 * time.Second
.

In an updated experiment where this is set to 120s, we see expected theoretical results:

image

commit: kvoli@904f5c7

The updated values are:

% staleness
p99 60s
p90 53s
p75 45s
p50 25s

This invalidates the previous theory that the gossip interval is implicity 10s. Instead, under no triggers it is the stated value in the code, 60s.

Closing this issue in favor of another issue for lowering the gossip interval #83841.

@kvoli kvoli closed this as completed Jul 5, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jul 14, 2022
In cockroachdb#81669 @kvoli discovered that we often gossip much more frequently
than at a 60s interval. Since we are also adding I/O liveness signals
to the store capacity to use in a solution for cockroachdb#79215, we want to make
this more reactive interval explicit.

This change has the side effect of reducing the minimum allowed value
for `server.time_until_store_dead` to 25s[^1]; this seems reasonable.

[^1]: https://github.com/cockroachdb/cockroach/blob/263fbb7c8fcf001fcf47d7d35894b5824c78dc14/pkg/kv/kvserver/allocator/storepool/store_pool.go#L94-L99

Release note: None
craig bot pushed a commit that referenced this issue Jul 14, 2022
83808: kvserver: set StoresInterval to 10s r=kvoli a=tbg

In #81669 `@kvoli` discovered that we often gossip much more frequently
than at a 60s interval. Since we are also adding I/O liveness signals
to the store capacity to use in a solution for #79215, we want to make
this more reactive interval explicit.

This change has the side effect of reducing the minimum allowed value
for `server.time_until_store_dead` to 25s[^1]; this seems reasonable.

[^1]: https://github.com/cockroachdb/cockroach/blob/263fbb7c8fcf001fcf47d7d35894b5824c78dc14/pkg/kv/kvserver/allocator/storepool/store_pool.go#L94-L99

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

1 participant