Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert Timer#max to use a windowed max rather than last publishing interval. #1993

Closed
jkschneider opened this issue Apr 13, 2020 · 3 comments
Labels
release notes Noteworthy change to call out in the release notes type: task A general task
Milestone

Comments

@jkschneider
Copy link
Contributor

jkschneider commented Apr 13, 2020

In #1856, the behavior of StepTimer was changed to publish max only for the last publishing interval. For the purpose of availability monitoring, it is better to allow maximum to decay over a longer period.

cc / @crankydillo

Further explanation as to why:

Micrometer decays the maximum rather than aligning it to the publishing interval like it does for sum and count. If we perfectly aligned the view of maximum time to the push-interval then a dropped metrics payload means we potentially miss out on seeing a particularly high maximum value (because in the next interval we'd only consider samples that occurred in that interval).

Practically, there are many reasons why a high maximum latency and a dropped metrics payload would be correlated. For example, if the application is under heavy resource pressure (like a saturated network interface), a response to the user for an API endpoint that is being timed (and for which a maximum value is being tracked) may be exceedingly high at the same time that a metrics post request to the monitoring system fails with a read timeout. But such conditions can be (and many times are) temporary.

Perhaps you have a client-side load balancing strategy that recognizes that (from the client's perspective) API latency has gone up sharply for this instance that is under resource pressure, and begins preferring other instances. By relieving pressure on this instance it recovers.

In some subsequent interval, after the instance has recovered, it's nice to be able to push a maximum latency seen during this time of trouble that would otherwise have been skipped. In fact, it's precisely these times of duress that we care about the most, not the maximum latency under fair-weather conditions!

@jkschneider jkschneider modified the milestones: 1.4.x, 1.4.2 Apr 13, 2020
@shakuzen shakuzen added release notes Noteworthy change to call out in the release notes type: task A general task labels Apr 14, 2020
@shakuzen
Copy link
Member

We'll also need to revert the changes made to the docs: micrometer-metrics/micrometer-docs#120

@crankydillo
Copy link
Contributor

Thanks for the heads up. I can understand how the importance of max values makes it behavior different from other 'step values'. It appears you left the refactoring around isolating the rolling logic, which will allow us to continue align max with the last step (and live with the ramifications of missing values) without copying too much of your code. If this comes up in the future, you might consider providing 2 versions of StepTimer.

Please consider making some additions to the javadoc/hosted doc/etc that makes it clear that max deviates from other 'step values' and why that is done.

Lastly, thanks for the project. It's working great for us so far!

@izeye
Copy link
Contributor

izeye commented Apr 14, 2020

I created micrometer-metrics/micrometer-docs#125 to revert micrometer-metrics/micrometer-docs#120. I also added the reason for time window max to the doc based on the comment from @jkschneider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes Noteworthy change to call out in the release notes type: task A general task
Projects
None yet
Development

No branches or pull requests

4 participants