-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-22.2: pkg/util/metric: optionally reintroduce legacy hdrhistogram model #96514
release-22.2: pkg/util/metric: optionally reintroduce legacy hdrhistogram model #96514
Conversation
Ref: cockroachdb#96514 The above patch migrated all histogram constructors over to use a generic constructor, which has the ability to specify a HistogramMode. While preparing a backport of this patch, I noticed that the following streamingest histograms were incorrectly tagged with a HistogramMode of HistogramModePreferHdrLatency: 1. FlushHistNanos 2. CommitLatency 3. AdmitLatency If you look at the original migration, when HDR was still being used, these histograms were *not* using the HDR latency defaults: cockroachdb@a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143 This patch fixes these histograms to no longer incorrectly specify the HistogramModePreferHdrLatency mode in the histogram options. This was also fixed in the backport PR. Release note: none
94b2125
to
5b84381
Compare
@abarganier looks like you're missing some test files for |
Addresses cockroachdb#95833 This patch reeintroduces the old HdrHistogram model to optionally be enabled in favor of the new Prometheus model, gated behind an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`, allowing users a means to "fall back" to the old model in the event that the new model does not adequately serve their needs (think of this as an "insurance policy" to protect against this from happening again with no real mitigation - ideally, this environment variable should never have to be used). Note: some histograms were introduced *after* the new Prometheus histograms were added to CockroachDB. In this case, we use the `ForceUsePrometheus` option in the `HistogramOptions` struct to ignore the value of the env var, since there never was a time where these specific histograms used the HdrHistogram model. Release note (ops change): Histogram metrics can now optionally use the legacy HdrHistogram model by setting the environment var `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes. **Note that this is not recommended** unless users are having difficulties with the newer Prometheus-backed histogram model. Enabling can cause performance issues with timeseries databases like Prometheus, as processing and storing the increased number of buckets is taxing on both CPU and storage. Note that the HdrHistogram model is slated for full deprecation in upcoming releases.
This patch increases the fidelity of the histogram buckets for the new Prometheus model. This is primarily done by increasing the bucket counts for all latency buckets, but may also be manually tweaked according to feedback from various engineering teams for their own use cases. Release note (ops change): Prometheus histograms will now export more buckets across the board to improve precision & fidelity of information reported by histogram metrics, such as quantiles. This will lead to an increase in storage requirements to process these histogram metrics in downstream systems like Prometheus, but should still be a marked improvement when compared to the legacy HdrHistogram model. If users have issues with the precision of these bucket boundaries, they can set the environment variable `COCKROACH_ENABLE_HDR_HISTOGRAMS=true` to revert to using the legacy HdrHistogram model instead, although this is not recommended otherwise as the HdrHistogram strains systems like Prometheus with excessive numbers of histogram buckets. Note that HdrHistograms are slated for full deprecation in upcoming releases.
5b84381
to
b916c5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need release-22.2-
prefix in the PR title?
Reviewed 34 of 38 files at r1, 1 of 2 files at r3, 1 of 1 files at r4, 7 of 7 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)
TFTR!
I see the PR title as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I posted with an old comment that was out of date, whoops.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)
96504: Revert "sql: improve stack trace for get-user-timeout timeouts" r=ecwall a=rafiss This reverts commit 662e19c. gRPC does not expect the context to be wrapped, so it can get confused. informs #95794 Release note: None 96516: pkg/util/metric: fix HistogramMode for streamingingest histograms r=dhartunian a=abarganier Ref: #96514 The above patch migrated all histogram constructors over to use a generic constructor, which has the ability to specify a HistogramMode. While preparing a backport of this patch, I noticed that the following streamingest histograms were incorrectly tagged with a HistogramMode of HistogramModePreferHdrLatency: 1. FlushHistNanos 2. CommitLatency 3. AdmitLatency If you look at the original migration, when HDR was still being used, these histograms were *not* using the HDR latency defaults: a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143 This patch fixes these histograms to no longer incorrectly specify the HistogramModePreferHdrLatency mode in the histogram options. This was also fixed in the backport PR #96514. Release note: none Part of #95833 Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Alex Barganier <[email protected]>
TFTR! |
This patch reeintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called
COCKROACH_ENABLE_HDR_HISTOGRAMS
,allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against
this from happening again with no real mitigation - ideally,
this environment variable should never have to be used).
It also updates the pre-defined bucket boundaries used by the Prometheus
backed histograms with more buckets. This aims to improve precision,
especially for latency histograms, when calculating quantiles (the low precision
being the core cause of the issue at hand).
Note: some histograms were introduced after the new
Prometheus histograms were added to CockroachDB. In this
case, we use the
ForceUsePrometheus
option in theHistogramOptions
struct to ignore the value of the envvar, since there never was a time where these specific
histograms used the HdrHistogram model.
Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
COCKROACH_ENABLE_HDR_HISTOGRAMS=true
on CockroachDB nodes.Note that this is not recommended unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.
Fixes: #95833
Backport 2/2 commits from #96029.
/cc @cockroachdb/release
Release justification: Fixes for high-priority or high-severity bugs in existing functionality