Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit statsz updates #5470

Merged
merged 6 commits into from
Jun 5, 2024
Merged

Limit statsz updates #5470

merged 6 commits into from
Jun 5, 2024

Conversation

wjordan
Copy link
Contributor

@wjordan wjordan commented May 23, 2024

This PR adds a filter to sendStatsz that limits statsz updates to the current heartbeat interval (max once per second), adding a time.Time field to track the time the last statsz update was sent. This limit should reduce overall STATSZ system event load in large clusters while still allowing initial statsz update to quickly reach newly-discovered nodes.

Fixes #5469.

Signed-off-by: Will Jordan [email protected]

@wjordan wjordan requested a review from a team as a code owner May 23, 2024 23:53
@wjordan wjordan changed the title Limit statsz updates per client Limit statsz updates May 24, 2024
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, few small comments.

server/events.go Outdated Show resolved Hide resolved
server/events.go Outdated Show resolved Hide resolved
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison
Copy link
Member

This change seems to be causing some tests to flap since the meta layer for jetstream does not have up to date information in order to complete the test.

e.g. TestJetStreamSuperClusterOverflowPlacement

@derekcollison
Copy link
Member

Might need to fix these flappers before merging.

We could set the cstatsz to a lower value possibly for those tests.

@derekcollison
Copy link
Member

Also seems we now see a race..

image

@derekcollison
Copy link
Member

TestRoutePings has the datarace now.

@wjordan
Copy link
Contributor Author

wjordan commented Jun 4, 2024

the race seems unrelated, neither (TestRoutePings) nor adjustPingInterval behavior is touched by this PR.

This test is timing-sensitive and a lower statsz rate limit
seems to reduce test flakiness.
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 958b61f into nats-io:main Jun 5, 2024
2 checks passed
@derekcollison
Copy link
Member

The race was because your new test TestClusterSetupMsgs was not shutting down the cluster when done, so it kept running and it was accessing that routeMaxPingInterval. Will clean up.

Also need to make the tweak to statszRateLimit global since other tests are flapping too.

Will take care of it.

wallyqs pushed a commit that referenced this pull request Jun 5, 2024
This PR adds a filter to `sendStatsz` that limits statsz updates to the
current heartbeat interval (max once per second), adding a `time.Time`
field to track the time the last statsz update was sent. This limit
should reduce overall `STATSZ` system event load in large clusters while
still allowing initial statsz update to quickly reach newly-discovered
nodes.

Fixes #5469.

Signed-off-by: Will Jordan <[email protected]>
wallyqs added a commit that referenced this pull request Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flood of STATSZ system events starting nodes in a large cluster
2 participants