kvserver: gossip frequency is implicitly 10 seconds #81669
In cockroachdb#81669 @kvoli discovered that we often gossip much more frequently than the stated 60s interval. Since we are also adding I/O liveness signals to the store capacity for use in a solution for cockroachdb#79215, we want to make this more reactive interval explicit. This change has the side effect of reducing the minimum allowed value for `server.time_until_store_dead` to 25s[^1]; this seems reasonable.

[^1]: https://github.com/cockroachdb/cockroach/blob/263fbb7c8fcf001fcf47d7d35894b5824c78dc14/pkg/kv/kvserver/allocator/storepool/store_pool.go#L94-L99

Release note: None
Update: in the original experiments, the histogram being used could not record values greater than 10s, due to the maximum value configured at cockroach/pkg/util/metric/metric.go line 30 in 933b684.
In an updated experiment where this maximum is raised to 120s, we see the expected theoretical results (commit: kvoli@904f5c7).
This invalidates the previous theory that the gossip interval is implicitly 10s; under no triggers it is the stated value in the code, 60s. Closing this issue in favor of #83841, which tracks lowering the gossip interval.
83808: kvserver: set StoresInterval to 10s r=kvoli a=tbg (PR description identical to the comment above). Co-authored-by: Tobias Grieger <[email protected]>
Description
Store gossip occurs more often than expected: every ~10 seconds even with no changes, when it should currently occur once per minute.
These updates are triggered by capacity changes, specifically lease add events.
This function declares itself as idempotent, but that is not the case: each call may kick off a new gossip of the latest store descriptor.
Reproduce
To reproduce, run https://github.com/cockroachdb/cockroach/compare/master...kvoli:220519.gossip-metrics?expand=1 and open the DB Console. Using
kv.allocator.staleness
you can examine the histogram of gossiped store descriptor staleness used in allocation decisions.
What triggers gossip updates
We only update the storepool state with newer information here. We also update the storepool state following lease transfers and replica changes, using the estimated impact.
Gossip updates occur for the local store every 1 minute, and are also triggered if, between now and the last gossip, any of a number of capacity-delta conditions are true.
Expected behavior
Gossip should occur only when the above conditions are met. Additionally, we may wish to investigate lowering the timer from 1 minute, making the interval explicit rather than implicit, given this bug has existed for some time without issue.
Jira issue: CRDB-16019