pkg/util/metric: option to use legacy hdrhistogram model, increase bucket fidelity #96029
@aadityasondhi I've updated the commit messages/release notes to make this more obvious. The hdrhistograms are reintroduced only as a band-aid mitigation, in the event that our fixed bucket boundaries once again cause problems for customers. Ideally, they should never have to be used.
EDIT: Never mind - keeping in mind that we need to backport this change, I think we should keep the changes as minimal as possible. If this has been working fine, we can leave it as is so long as it's not an exported metric.
cockroach/pkg/cli/syncbench/syncbench.go Lines 52 to 66 in 136ef1b
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @irfansharif, @jordanlewis, @miretskiy, @tbg, and @ZhouXing19)
pkg/util/metric/metric_test.go
line 249 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
`Write()` would transform the windowed histogram into the `prometheusgo.Metric`, which we then made assertions against.
Now with the new interface we have functions like `TotalSumWindowed()`, which implicitly transforms the hdrhistogram into a windowed Prometheus histogram and then pulls values from that, so I don't believe we need this `Write` function anymore since that transformation is handled elsewhere.
cc @aadityasondhi for a sanity check on this.
That sounds correct to me.
I've finished testing these changes in roachprod. I wanted to run these reproduction steps against the following environments:
The results look good. The improvement in precision is obvious when comparing the v22.2.2 build, which uses the old buckets, against this commit's build, which uses the new Prometheus histogram buckets. See screenshots from DB Console below.

Based on this test, the core issue seems to be fixed. On the commit builds, reported latency quantiles are nearly equivalent (P90 of ~230ms) between HDR and Prometheus histograms. On the v22.2.2 build, which uses the old Prometheus histogram buckets, the reported latency is off at ~490ms.

There is still a noticeable difference in fidelity between Prometheus and HDR. Prometheus seems to have a "smoothing" effect on the histogram charts, whereas HDR appears to capture "peaks" in latency more accurately. However, keep in mind that quantile calculations for these HDR histograms are somewhat broken, since empty histogram buckets are omitted for HDR: #89532

cc @irfansharif @tbg - curious to hear your opinions on whether these results are acceptable, or if you have any further tests you'd like me to run. Thanks for your continued input 🥇

1. Commit build,
bors r=tbg,aadityasondhi |
Build failed: |
Addresses cockroachdb#95833

This patch reintroduces the old HdrHistogram model, which can optionally be enabled in place of the new Prometheus model. It is gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, giving users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this happening again with no real mitigation - ideally, this environment variable should never have to be used).

Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
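For readers following along, the gating described in the commit message above could be sketched roughly as follows. This is only an illustration, not the code from the patch: `COCKROACH_ENABLE_HDR_HISTOGRAMS`, `HistogramOptions`, and `ForceUsePrometheus` come from the commit message, while `HistogramMode`, `hdrEnabled`, `chooseMode`, the `Name` field, and the handling of unset or unparsable values are assumptions made up for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// HistogramMode is a hypothetical label for the backing histogram model.
type HistogramMode string

const (
	ModePrometheus HistogramMode = "prometheus"
	ModeHdr        HistogramMode = "hdr"
)

// HistogramOptions is a trimmed-down, hypothetical stand-in for the struct
// named in the commit message. Only ForceUsePrometheus is taken from the
// source; Name exists purely for illustration.
type HistogramOptions struct {
	Name               string
	ForceUsePrometheus bool
}

// hdrEnabled reports whether COCKROACH_ENABLE_HDR_HISTOGRAMS is set to a
// true value. Treating unset or unparsable values as false is an assumption.
// The lookup function is injected so the logic can be exercised without
// touching the real process environment.
func hdrEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup("COCKROACH_ENABLE_HDR_HISTOGRAMS")
	if !ok {
		return false
	}
	b, err := strconv.ParseBool(v)
	return err == nil && b
}

// chooseMode picks the legacy model only when the env var asks for it and
// the histogram has not opted out via ForceUsePrometheus.
func chooseMode(opts HistogramOptions, hdr bool) HistogramMode {
	if hdr && !opts.ForceUsePrometheus {
		return ModeHdr
	}
	return ModePrometheus
}

func main() {
	hdr := hdrEnabled(os.LookupEnv)
	// A histogram that predates the Prometheus model follows the env var.
	fmt.Println(chooseMode(HistogramOptions{Name: "older.latency"}, hdr))
	// A histogram added after the switch always stays on Prometheus.
	fmt.Println(chooseMode(HistogramOptions{Name: "newer.latency", ForceUsePrometheus: true}, hdr))
}
```

Running this with `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` in the environment would select the legacy model for the first histogram only; the second one ignores the variable because it never used HdrHistogram.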
This patch increases the fidelity of the histogram buckets for the new Prometheus model. This is primarily done by increasing the bucket counts for all latency buckets, but may also be manually tweaked according to feedback from various engineering teams for their own use cases.

Release note (ops change): Prometheus histograms will now export more buckets across the board to improve precision & fidelity of information reported by histogram metrics, such as quantiles. This will lead to an increase in storage requirements to process these histogram metrics in downstream systems like Prometheus, but should still be a marked improvement when compared to the legacy HdrHistogram model. If users have issues with the precision of these bucket boundaries, they can set the environment variable `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` to revert to using the legacy HdrHistogram model instead, although this is not recommended otherwise as the HdrHistogram strains systems like Prometheus with excessive numbers of histogram buckets. Note that HdrHistograms are slated for full deprecation in upcoming releases.
bors r=tbg,aadityasondhi |
Build failed (retrying...): |
This PR was included in a batch that was canceled, it will be automatically retried |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from a28aa6c to blathers/backport-release-22.2-96029: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
cockroachdb#96029 The above patch introduced some `fmt.Println` statements in a test accidentally. This patch removes them. Release note: none
The v22.2 backport of cockroachdb#96029 experienced some linter issues that didn't occur in the original patch. This patch fixes those linter errors. Release note: none
This patch reintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against
this from happening again with no real mitigation - ideally,
this environment variable should never have to be used).
It also updates the pre-defined bucket boundaries used by the
Prometheus-backed histograms with more buckets. This aims to improve precision,
especially for latency histograms, when calculating quantiles (the low precision
being the core cause of the issue at hand).
Note: some histograms were introduced after the new
Prometheus histograms were added to CockroachDB. In this
case, we use the `ForceUsePrometheus` option in the `HistogramOptions`
struct to ignore the value of the env var, since there never was a time
where these specific histograms used the HdrHistogram model.
Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes.
Note that this is not recommended unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.
Fixes: #95833