Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/util/metric: option to use legacy hdrhistogram model, increase bucket fidelity #96029

Merged
merged 2 commits into from
Feb 3, 2023

Conversation

dhartunian
Copy link
Collaborator

@dhartunian dhartunian commented Jan 26, 2023

This patch reeintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called COCKROACH_ENABLE_HDR_HISTOGRAMS,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against
this from happening again with no real mitigation - ideally,
this environment variable should never have to be used).

It also updates the pre-defined bucket boundaries used by the Prometheus
backed histograms with more buckets. This aims to improve precision,
especially for latency histograms, when calculating quantiles (the low precision
being the core cause of the issue at hand).

Note: some histograms were introduced after the new
Prometheus histograms were added to CockroachDB. In this
case, we use the ForceUsePrometheus option in the
HistogramOptions struct to ignore the value of the env
var, since there never was a time where these specific
histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
COCKROACH_ENABLE_HDR_HISTOGRAMS=true on CockroachDB nodes.
Note that this is not recommended unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.

Fixes: #95833

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@abarganier abarganier force-pushed the improve-histogram-granularity branch 13 times, most recently from e685d2f to 6e08154 Compare January 31, 2023 15:22
@abarganier abarganier changed the title [DRAFT; IN PROGRESS] Improve histogram granularity pkg/util/metric: option to use legacy hdrhistogram model, increase bucket fidelity Jan 31, 2023
@abarganier abarganier marked this pull request as ready for review January 31, 2023 15:28
@abarganier abarganier requested a review from a team as a code owner January 31, 2023 15:28
@abarganier abarganier requested a review from a team January 31, 2023 15:28
@abarganier abarganier requested review from a team as code owners January 31, 2023 15:28
@abarganier abarganier requested a review from a team January 31, 2023 15:28
@abarganier abarganier requested review from a team as code owners January 31, 2023 15:28
@abarganier abarganier requested review from a team January 31, 2023 15:28
@abarganier abarganier requested a review from a team as a code owner January 31, 2023 15:28
@abarganier abarganier requested review from dt and removed request for a team January 31, 2023 15:28
@abarganier abarganier force-pushed the improve-histogram-granularity branch from 435cf40 to 31deaba Compare February 1, 2023 17:04
@abarganier
Copy link
Contributor

Maybe I lack context here, but just curious about why we are doing both of (1) increasing bucket counts and (2) re-introducing hdr histograms which we worked to remove earlier.

@aadityasondhi I've updated the commit messages/release notes to make this more obvious. Reintroduction of the hdrhistograms exists only as a band-aid mitigation, in the event that our fixed bucket boundaries once again cause problems for customers. Ideally, they should never have to be used.

@abarganier
Copy link
Contributor

abarganier commented Feb 1, 2023

EDIT: Nevermind - keeping in mind that we need to backport this change, I think we should keep the changes as minimal as possible. If this has been working fine, we can leave it as is so long as it's not an exported metric.

@jbowens Thoughts on whether we should migrate this over histogram used in sysbench over to use the Prometheus-backed histogram? It doesn't look like this histogram is exported to TSDB/Prometheus, so our primary purpose for migrating from HdrHistogram->PrometheusHistogram isn't super important in this case AFAICT. As a bit more context - HdrHistograms caused issues for Prometheus because it exported hundreds of buckets, which Prometheus would choke on. If we're not exporting the histogram it might not be necessary to migrate it.

type worker struct {
db storage.Engine
latency struct {
syncutil.Mutex
*hdrhistogram.WindowedHistogram
}
logOnly bool
}
func newWorker(db storage.Engine) *worker {
w := &worker{db: db}
w.latency.WindowedHistogram = hdrhistogram.NewWindowed(1,
minLatency.Nanoseconds(), maxLatency.Nanoseconds(), 1)
return w
}

Copy link
Collaborator

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, @irfansharif, @jordanlewis, @miretskiy, @tbg, and @ZhouXing19)


pkg/util/metric/metric_test.go line 249 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Write() would transform the windowed histogram into the prometheusgo.Metric, which we then made assertions against.

Now with the new interface we have functions like TotalSumWindowed() which implicitly transformw the hdrhistogram into a windowed Prometheus histogram and then pulls values from that, so I don't believe we need this Write function anymore since that transformation is handled elsewhere.

cc @aadityasondhi for a sanity check on this.

That sounds correct to me.

@abarganier
Copy link
Contributor

I've finished testing these changes in roachprod. I wanted to run these reproduction steps against the following environments:

  • A build using this commit, with COCKROACH_ENABLE_HDR_HISTOGRAMS=true
  • A build using this commit, with COCKROACH_ENABLE_HDR_HISTOGRAMS=false
  • A build using CockroachDB v22.2.2, where the issue was originally reported.

The results look good. The improvement in precision is very obvious when comparing the v22.2.2 build against this commit's build making use of the new Prometheus histogram buckets. See screenshots from DB Console below.

Based on this test, the core issue seems to be fixed. HDR vs. Prometheus reported latencies are quite similar. When compared to v22.2.2, which uses the old buckets, the improvement is obvious. On the commit builds, reported latency quantiles are nearly equivalent (P90 of ~230ms) between HDR and Prometheus histograms. On the broken v22.2.2 build, reported latency is broken @ ~490ms, which is using the old Prometheus histogram buckets.

There's a bit of a noticeable difference in fidelity still between Prometheus & HDR. Prometheus seems to have a "smoothing" effect on the histogram charts, whereas HDR appears to more accurately capture "peaks" in latency. However, keep in mind that quantile calculations for these HDR histograms are somewhat broken, since empty histogram buckets are omitted for HDR: #89532

cc @irfansharif @tbg - curious to hear your opinions on whether these results are acceptable, or if you have any further tests you'd like me to run. Thanks for your continued input 🥇

1. Commit build, COCKROACH_ENABLE_HDR_HISTOGRAMS=false

master-hdr-disabled

2. v22.2.2 build, using old Prometheus histogram bucket boundaries

v22 2 2-old-buckets

3. Commit build, COCKROACH_ENABLE_HDR_HISTOGRAMS=true

master-hdr-enabled

@irfansharif irfansharif removed their request for review February 2, 2023 16:56
@abarganier
Copy link
Contributor

bors r=tbg,aadityasondhi

@craig
Copy link
Contributor

craig bot commented Feb 2, 2023

Build failed:

dhartunian and others added 2 commits February 2, 2023 15:06
Addresses cockroachdb#95833

This patch reeintroduces the old HdrHistogram model to optionally be
enabled in favor of the new Prometheus model, gated behind
an environment variable called `COCKROACH_ENABLE_HDR_HISTOGRAMS`,
allowing users a means to "fall back" to the old model in the
event that the new model does not adequately serve their needs
(think of this as an "insurance policy" to protect against
this from happening again with no real mitigation - ideally,
this environment variable should never have to be used).

Note: some histograms were introduced *after* the new
Prometheus histograms were added to CockroachDB. In this
case, we use the `ForceUsePrometheus` option in the
`HistogramOptions` struct to ignore the value of the env
var, since there never was a time where these specific
histograms used the HdrHistogram model.

Release note (ops change): Histogram metrics can now optionally
use the legacy HdrHistogram model by setting the environment var
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` on CockroachDB nodes.
**Note that this is not recommended** unless users are having
difficulties with the newer Prometheus-backed histogram model.
Enabling can cause performance issues with timeseries databases
like Prometheus, as processing and storing the increased number
of buckets is taxing on both CPU and storage. Note that the
HdrHistogram model is slated for full deprecation in upcoming
releases.
This patch increases the fidelity of the histogram buckets for
the new Prometheus model. This is primarily done by increasing the
bucket counts for all latency buckets, but may also be manually
tweaked according to feedback from various engineering teams for
their own use cases.

Release note (ops change): Prometheus histograms will now export
more buckets across the board to improve precision & fidelity of
information reported by histogram metrics, such as quantiles.
This will lead to an increase in storage requirements to process
these histogram metrics in downstream systems like Prometheus,
but should still be a marked improvement when compared to the
legacy HdrHistogram model. If users have issues with the precision
of these bucket boundaries, they can set the environment variable
`COCKROACH_ENABLE_HDR_HISTOGRAMS=true` to revert to using the
legacy HdrHistogram model instead, although this is not recommended
otherwise as the HdrHistogram strains systems like Prometheus with
excessive numbers of histogram buckets. Note that HdrHistograms are
slated for full deprecation in upcoming releases.
@abarganier abarganier force-pushed the improve-histogram-granularity branch from 31deaba to 4b32a98 Compare February 2, 2023 19:06
@abarganier
Copy link
Contributor

bors r=tbg,aadityasondhi

@craig
Copy link
Contributor

craig bot commented Feb 2, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 2, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig craig bot merged commit dd97d0c into cockroachdb:master Feb 3, 2023
@craig
Copy link
Contributor

craig bot commented Feb 3, 2023

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Feb 3, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from a28aa6c to blathers/backport-release-22.2-96029: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

abarganier added a commit to abarganier/cockroach that referenced this pull request Feb 3, 2023
cockroachdb#96029

The above patch introduced some `fmt.Println` statements
in a test accidentally. This patch removes them.

Release note: none
abarganier added a commit to abarganier/cockroach that referenced this pull request Feb 3, 2023
The v22.2 backport of cockroachdb#96029
experienced some linter issues that didn't occur in the original patch.

This patch fixes those linter errors.

Release note: none
abarganier added a commit to abarganier/cockroach that referenced this pull request Feb 3, 2023
The v22.2 backport of cockroachdb#96029
experienced some linter issues that didn't occur in the original patch.

This patch fixes those linter errors.

Release note: none
@dhartunian dhartunian deleted the improve-histogram-granularity branch February 5, 2024 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metric: changes in histogram buckets affects the connection latency graph in db console
6 participants