Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: pkg/util/metric: optionally reintroduce legacy hdrhistogram model #96514

Merged
merged 2 commits into from
Feb 6, 2023

Conversation

abarganier
Copy link
Contributor

@abarganier abarganier commented Feb 3, 2023

This patch reeintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called COCKROACH_ENABLE_HDR_HISTOGRAMS,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against
this from happening again with no real mitigation - ideally,
this environment variable should never have to be used).

It also updates the pre-defined bucket boundaries used by the Prometheus
backed histograms with more buckets. This aims to improve precision,
especially for latency histograms, when calculating quantiles (the low precision
being the core cause of the issue at hand).

Note: some histograms were introduced after the new
Prometheus histograms were added to CockroachDB. In this
case, we use the ForceUsePrometheus option in the
HistogramOptions struct to ignore the value of the env
var, since there never was a time where these specific
histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
COCKROACH_ENABLE_HDR_HISTOGRAMS=true on CockroachDB nodes.
Note that this is not recommended unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.

Fixes: #95833

Backport 2/2 commits from #96029.

/cc @cockroachdb/release

Release justification: Fixes for high-priority or high-severity bugs in existing functionality

@abarganier abarganier requested review from dhartunian, aadityasondhi and a team February 3, 2023 17:46
@abarganier abarganier requested review from a team as code owners February 3, 2023 17:46
@abarganier abarganier requested review from a team February 3, 2023 17:46
@abarganier abarganier requested a review from a team as a code owner February 3, 2023 17:46
@abarganier abarganier requested review from samiskin and msirek and removed request for a team February 3, 2023 17:46
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@abarganier abarganier removed request for a team, samiskin and msirek February 3, 2023 17:47
abarganier added a commit to abarganier/cockroach that referenced this pull request Feb 3, 2023
Ref: cockroachdb#96514

The above patch migrated all histogram constructors over to
use a generic constructor, which has the ability to specify
a HistogramMode. While preparing a backport of this patch,
I noticed that the following streamingest histograms were
incorrectly tagged with a HistogramMode of
HistogramModePreferHdrLatency:

1. FlushHistNanos
2. CommitLatency
3. AdmitLatency

If you look at the original migration, when HDR was still being
used, these histograms were *not* using the HDR latency defaults:

cockroachdb@a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143

This patch fixes these histograms to no longer incorrectly specify
the HistogramModePreferHdrLatency mode in the histogram options.

This was also fixed in the backport PR.

Release note: none
@dhartunian
Copy link
Collaborator

@abarganier looks like you're missing some test files for AggMetrics.

@rafiss rafiss changed the title pkg/util/metric: optionally reintroduce legacy hdrhistogram model release-22.2: pkg/util/metric: optionally reintroduce legacy hdrhistogram model Feb 4, 2023
Addresses cockroachdb#95833

This patch reeintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against
this from happening again with no real mitigation - ideally,
this environment variable should never have to be used).

Note: some histograms were introduced *after* the new
Prometheus histograms were added to CockroachDB. In this
case, we use the `ForceUsePrometheus` option in the
`HistogramOptions` struct to ignore the value of the env
var, since there never was a time where these specific
histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes.
**Note that this is not recommended** unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.
This patch increases the fidelity of the histogram buckets for
the new Prometheus model. This is primarily done by increasing the
bucket counts for all latency buckets, but may also be manually
tweaked according to feedback from various engineering teams for
their own use cases.

Release note (ops change): Prometheus histograms will now export
more buckets across the board to improve precision & fidelity of
information reported by histogram metrics, such as quantiles.
This will lead to an increase in storage requirements to process
these histogram metrics in downstream systems like Prometheus,
but should still be a marked improvement when compared to the
legacy HdrHistogram model. If users have issues with the precision
of these bucket boundaries, they can set the environment variable
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` to revert to using the
legacy HdrHistogram model instead, although this is not recommended
otherwise as the HdrHistogram strains systems like Prometheus with
excessive numbers of histogram buckets. Note that HdrHistograms are
slated for full deprecation in upcoming releases.
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need release-22.2- prefix in the PR title?

Reviewed 34 of 38 files at r1, 1 of 2 files at r3, 1 of 1 files at r4, 7 of 7 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)

@abarganier
Copy link
Contributor Author

TFTR!

do we need release-22.2- prefix in the PR title?

I see the PR title as release-22.2: pkg/util/metric: optionally reintroduce legacy hdrhistogram model 🤔

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I posted with an old comment that was out of date, whoops.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)

craig bot pushed a commit that referenced this pull request Feb 6, 2023
96504: Revert "sql: improve stack trace for get-user-timeout timeouts" r=ecwall a=rafiss

This reverts commit 662e19c.

gRPC does not expect the context to be wrapped, so it can get confused.

informs #95794
Release note: None

96516: pkg/util/metric: fix HistogramMode for streamingingest histograms r=dhartunian a=abarganier

Ref: #96514

The above patch migrated all histogram constructors over to
use a generic constructor, which has the ability to specify
a HistogramMode. While preparing a backport of this patch,
I noticed that the following streamingest histograms were
incorrectly tagged with a HistogramMode of
HistogramModePreferHdrLatency:

1. FlushHistNanos
2. CommitLatency
3. AdmitLatency

If you look at the original migration, when HDR was still being
used, these histograms were *not* using the HDR latency defaults:

a82aa82#diff-1bc5bdba63149e8efeadce17e7eb62bb5cd1dcee22974b37881a627e13c0501dL137-L143

This patch fixes these histograms to no longer incorrectly specify
the HistogramModePreferHdrLatency mode in the histogram options.

This was also fixed in the backport PR #96514.

Release note: none

Part of #95833

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
@abarganier
Copy link
Contributor Author

TFTR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants