Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align DeltaHistogram in SignalFx registry with count and total #3799

Merged

Conversation

lenin-jaganathan
Copy link
Contributor

@lenin-jaganathan lenin-jaganathan commented May 2, 2023

fixes #3774, and partially this one too

Known Issues,

  • Need help from someone from SignalFx to confirm the expected behaviors when both percentiles and sla are enabled on a Timer. This will help in writing failing tests based on the expected behavior and then fixing the. - Kept the old behavior as is and fixed the issue by using StepBucketHistogram for Delta Usecases.
  • pollMetersToRollover() in StepMeterregistry should be overridden so that when the SignalFx registry runs in deltaHistogram mode, it rolls over the step and reports the right data.

@lenin-jaganathan
Copy link
Contributor Author

@atoulme / @bogdandrutu Need your input on the first item in the description. I believe this is critical to be part of 1.11.0 to align with the latest changes made to the micrometer-core.

@breedx-splk
Copy link
Contributor

Hey @lenin-jaganathan. We set up a test to examine the current behavior of the Timer with both percentiles and slo enabled. You can see the code and the results here: https://github.com/breedx-splk/micrometer_perc_sla_timer

The current behavior results in 4 gauges and 2 sums.

Most interestingly are the my.timer.histogram and my.timer.percentile gauges. The my.timer.histogram has a timeseries that corresponds with each of the SLO values configured into the timer, and the my.timer.percentile has a timeseries per configured percentile.

Given that some users might rely on this existing behavior, it would be ideal to maintain the shape of those metrics. 👍🏻

@lenin-jaganathan lenin-jaganathan changed the base branch from main to 1.11.x May 17, 2023 19:07
@lenin-jaganathan
Copy link
Contributor Author

@breedx-splk Thanks for confirming this. Since the feature is completely broken in micrometer 1.11.0, I have targeted this PR for 1.11.x and it makes sense to only fix the broken behavior in the patch release.

@jonatan-ivanov Does it makes sense for this to be targeted for 1.11.x? #3774 completely blocks us from using the 1.11.x micrometer.

@lenin-jaganathan lenin-jaganathan force-pushed the signalfx_deltahistogram branch from ad5b63c to b011789 Compare May 17, 2023 19:35
@lenin-jaganathan
Copy link
Contributor Author

@breedx-splk Can you have a look at this PR? Kept the existing behavior in SignalFxMeterRegistry for histograms and only the DeltaHistogram is altered so that it address the issue (i,e align with Sum and Count).

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented May 18, 2023

I need to take a deeper look but since it seems a bug it makes sense to fix it in a patch release. Also, if we decide to back-port it to earlier releases (1.9.x), maybe this can be fixed earlier.

@breedx-splk
Copy link
Contributor

@breedx-splk Can you have a look at this PR? Kept the existing behavior in SignalFxMeterRegistry for histograms and only the DeltaHistogram is altered so that it address the issue (i,e align with Sum and Count).

Hey @lenin-jaganathan . Respectfully, without knowing much about micrometer implementation internals and not knowing much about the SignalFxMeterRegistry implementation, I'm having a hard time following what you're trying to fix here. I've read the description in #3774 maybe 10 times and have tried to recreate it here.

Expected behavior: Irrespective of the publishing time, the count metric and the count of histogram buckets should match.

I don't think I follow this. With delta histograms turned on in the registry, I see a Sum named my.timer.histogram for each of the configure SLO values. In my sample code, there are two -- one for the 1.75 slo and one for the 0.999 slo, and a third one with the same name for +Inf. So, on the first report I see something like this:

  • Sum my.timer.histogram (le=0.999) value = 82.000
  • Sum my.timer.histogram (le=1.75) value = 82.000
  • Sum my.timer.histogram (le=+Inf) value = 82.000
  • Sum my.timer.count (le=0.999) value = 51

I take this to mean that the "histogram" for the slo has 3 buckets -- one for each value and one for +inf. Am I understanding that incorrectly? Is the problem then that the 51 and the 82 don't match?

Regardless, the count does seem to be notably lower than the actual number of ticks issued by the app.

@lenin-jaganathan
Copy link
Contributor Author

@jonatan-ivanov That would be good. Just a note,

So, the only major version where this is broken is 1.11.0, which is why I based this on 1.11.x. It would be nice to have some feedback on this from you guys and get this in time for the next patch.

Respectfully, without knowing much about micrometer implementation internals and not knowing much about the SignalFxMeterRegistry implementation, I'm having a hard time following what you're trying to fix here

@breedx-splk I get you and I don't want to be hard on you here. But yeah, the exact thing you mentioned in your comment where the histogram reports 82 values and the count reports 51 is the issue this PR is trying to fix. (Do note that this is the behavior when publishCumulativeHistogram and publishDeltaHistogram are enabled).
Thanks for confirming this from the SignalFx side. As suggested in this PR (and with the test cases added in this commit ), this PR keeps all the existing behavior and addresses only what is described in the bug.

@jonatan-ivanov / @shakuzen Need your help in getting this merged.

@lenin-jaganathan lenin-jaganathan force-pushed the signalfx_deltahistogram branch 3 times, most recently from 6213de6 to 0d3a461 Compare May 19, 2023 05:57
@sonatype-lift
Copy link
Contributor

sonatype-lift bot commented May 20, 2023

🛠 Lift Auto-fix

Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.1

# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3799.diff -o lift-autofixes.diff

# Apply the patch with git
git apply lift-autofixes.diff

# Review the changes
git diff

Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:

curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3799.diff | git apply

Once you're satisfied, commit and push your changes in your project.

Footnotes

  1. You can preview the patch by opening the patch URL in the browser.

@lenin-jaganathan lenin-jaganathan force-pushed the signalfx_deltahistogram branch from 23871ba to d02789c Compare May 22, 2023 05:17
@jonatan-ivanov
Copy link
Member

I think it is totally fine that this is based on 1.11.x (cannot really be elswhere), what I meant above if we decide to back-port #3750, we might have these options:

  • Fix this on 1.11.x and back-port the two together to 1.9.x and 1.10.x
  • Back-port the feature to 1.9.x, fix it there and forward-merge the fix to 1.10.x and 1.11.x

@lenin-jaganathan
Copy link
Contributor Author

Got it. Thanks @jonatan-ivanov for the explanation 👍🏽.

I guess 1.11.1 is scheduled for first week June right? So ajy help with the review will be appreciated. Thanks a lot.

@lenin-jaganathan
Copy link
Contributor Author

@jonatan-ivanov/ @shakuzen Any help with the review is appreciated. This is kind of a blocker for us to move to 1.11.x which has some fixes and improvements. Basically, if this makes it in time for 1.11.1, it would be really helpful.

Thank you!!

Copy link
Member

@shakuzen shakuzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request, and for updating it with a separate test. I think overall it's good. My biggest concern is opening up API we'll need to maintain in a patch release. I'd like to explore more if we can reasonably avoid that.

@lenin-jaganathan lenin-jaganathan force-pushed the signalfx_deltahistogram branch from 7bf45e9 to 1793b11 Compare June 8, 2023 12:40
- rollOver StepBucketHistogram on calls to count()
- PollingAwareMockStepClock was introduced to ensure pollMetersToRollover as soon as step crosses
- Restrict the visibility of pollMetersToRollover to package-private
@lenin-jaganathan lenin-jaganathan force-pushed the signalfx_deltahistogram branch from 07597a2 to 5eb7eeb Compare June 8, 2023 20:54
Copy link
Member

@shakuzen shakuzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I left a couple comments on small things to fix. Thanks for the work on this.

- additional comments on SignalfxDistributionSummary and SignalfxTimer
@lenin-jaganathan
Copy link
Contributor Author

Addressed the open queries and comments.

@shakuzen shakuzen merged commit 03a227a into micrometer-metrics:1.11.x Jun 12, 2023
shakuzen added a commit that referenced this pull request Jun 12, 2023
@lenin-jaganathan lenin-jaganathan deleted the signalfx_deltahistogram branch June 20, 2023 04:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DeltaHistogram in SignalFx registry doesn't align with count and total
4 participants