Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/metrics: use experimental native histograms. #104302

Merged
merged 1 commit into from
Aug 29, 2023

Conversation

jmcarp
Copy link
Collaborator

@jmcarp jmcarp commented Jun 4, 2023

Optionally enable prometheus native histograms. This feature is still
marked as experimental in prometheus, so flag using an environment
variable, and use additional environment variables to tune precision and
max bucket count. Prometheus native histograms currently only support
exponentially spaced buckets, so we also limit their use to metrics with
an exponential distribution. Note that native histograms currently use
the protobuf exposition format, so we add content negotiation to the
prometheus exporter.

Epic: https://cockroachlabs.atlassian.net/browse/CC-9716

Release note: None

@jmcarp jmcarp requested a review from a team June 4, 2023 00:02
@jmcarp jmcarp requested a review from a team as a code owner June 4, 2023 00:02
@blathers-crl
Copy link

blathers-crl bot commented Jun 4, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Josh! Can you provide us more information about this experiment? Some information about the problem this aims to solve, the impact of moving away from pre-defined buckets, and what we hope to learn from these experiments would be helpful to review this PR. We've gotten ourselves into some trouble recently regarding bucket counts and boundaries for histogram metrics, so the obs-infra considers this code sensitive and we want to make sure we use caution when making changes. That's not to say we're completely opposed, but we want to make sure we fully understand what's happening. Happy to talk through this more!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@jmcarp jmcarp force-pushed the prom-native-histogram branch 4 times, most recently from 601c141 to 1635bcc Compare July 7, 2023 02:06
@jmcarp jmcarp requested review from a team as code owners July 7, 2023 02:06
@jmcarp jmcarp force-pushed the prom-native-histogram branch from 1635bcc to 03f50b9 Compare July 7, 2023 03:56
@jmcarp jmcarp requested review from a team as code owners July 7, 2023 03:56
@jmcarp jmcarp requested review from smg260 and renatolabs and removed request for a team July 7, 2023 03:56
@jmcarp jmcarp force-pushed the prom-native-histogram branch 9 times, most recently from 18147dc to 42ec8f7 Compare July 10, 2023 15:20
@jmcarp jmcarp force-pushed the prom-native-histogram branch 3 times, most recently from 78c6b81 to 8ad2230 Compare August 11, 2023 20:29
@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 12, 2023

Rebased on master after @ericharmeling's metrics refactor was merged, so now we can enable native histograms only if the environment flag is set and if the given metric uses an exponential distribution. I have actually never written a PR against this repo before and am probably missing things, but I'm calling this ready for review—ping @abarganier @ericharmeling.

@jmcarp jmcarp requested a review from ericharmeling August 12, 2023 01:50
@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 14, 2023

Some additional context:

  • We found that switching from conventional to native histograms dropped overall prometheus memory use by 30-50%. This didn't involve any loss of precision, since the default bucket exponential factor of 1.1 is more precise than any of the current sets of buckets in pkg/util/metric. We can also tune the precision of native histograms by changing the bucket factor via environment variable if necessary.
  • Under this patch, if native histograms are enabled by environment flag, prometheus can be configured to scrape conventional histograms, native histograms, or both. In other words, switching between histogram implementations doesn't even require reconfiguring or restarting the database.
  • The promql for working with conventional vs native histograms is almost identical. For example, the histogram_quantile function works in the same way for conventional and native histograms. We just remove the _bucket suffix from the metric name to query native histograms.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmcarp this all sounds good. I like the configuration knobs.

Can you update the PR with a detailed commit message that explains the new functionality, how it's configured, and how we and the customer scan take advantage (w/ release notes since our self-hosted customers will want to use this too). I think that and some native histogram test cases are the biggest things missing here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @jmcarp, @renatolabs, and @smg260)


go.mod line 33 at r2 (raw file):

	google.golang.org/genproto v0.0.0-20230227214838-9b19f0bdc514
	google.golang.org/grpc v1.53.0
	google.golang.org/protobuf v1.30.0

I think usually bumping the protobuf version is sensitive. Can we avoid in this PR?


pkg/util/metric/metric.go line 213 at r2 (raw file):

// rather than a collection of per-bucket counter series. If enabled, both
// conventional and native histograms are exported.
const useNativeHistogramsEnvVar = "COCKROACH_ENABLE_NATIVE_HISTOGRAMS"

should the env var clarify that these are PROMETHEUS_NATIVE_HISTOGRAMS? (same with 2 below)


pkg/util/metric/prometheus_exporter_test.go line 22 at r2 (raw file):

)

func TestPrometheusExporter(t *testing.T) {

Add test case for native histogram

@jmcarp jmcarp force-pushed the prom-native-histogram branch 3 times, most recently from 3cbc14c to 0df4929 Compare August 15, 2023 15:54
@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 15, 2023

Wrote a better commit message.

@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 15, 2023

go.mod line 33 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think usually bumping the protobuf version is sensitive. Can we avoid in this PR?

Done. I see that prometheus/client_golang bumped this automatically recently in prometheus/client_golang#1243, but I believe it's fine to pin to an older version.

@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 15, 2023

pkg/util/metric/metric.go line 213 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

should the env var clarify that these are PROMETHEUS_NATIVE_HISTOGRAMS? (same with 2 below)

Sounds good, I renamed the envvars.

@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 15, 2023

pkg/util/metric/prometheus_exporter_test.go line 22 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Add test case for native histogram

Added.

@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 21, 2023

Ready for another look @dhartunian! I'll fix merge conflicts.

@jmcarp jmcarp force-pushed the prom-native-histogram branch from 0df4929 to c17735f Compare August 21, 2023 19:35
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 27 files at r2, 8 of 10 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @renatolabs, and @smg260)

@jmcarp jmcarp force-pushed the prom-native-histogram branch from c17735f to 6d1d7c9 Compare August 29, 2023 21:02
Optionally enable prometheus native histograms. Native histograms are
sparse histograms with exponentially spaced buckets that require minimal
configuration. They are also represented as a single series, rather than
a series per bucket as for conventional histograms. As a result, native
histograms typically require less memory and storage and offer higher
precision relative to conventional histograms.

This patch enables support for native histograms behind the
COCKROACH_ENABLE_PROMETHEUS_NATIVE_HISTOGRAMS environment variable flag.
We use a flag because native histograms are currently considered
experimental in prometheus. To use this feature, set the
COCKROACH_ENABLE_PROMETHEUS_NATIVE_HISTOGRAMS environment variable to
"true", and enable native histogram support in prometheus by setting
--enable-feature=native-histograms in the server arguments. Note that
crdb exports both conventional and native histograms when native
histogram support is enabled; prometheus can be configured to scrape
either or both.

Prometheus native histograms currently only support exponentially spaced
buckets, so we also limit their use to metrics with an exponential
distribution. Note that native histograms currently use the protobuf
exposition format, so we add content negotiation to the prometheus
exporter.

Epic: https://cockroachlabs.atlassian.net/browse/CC-9716

Release note (ops change): Added support for prometheus native
histograms behind an environment variable flag.
@jmcarp jmcarp force-pushed the prom-native-histogram branch from 6d1d7c9 to 407017d Compare August 29, 2023 21:46
@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 29, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 29, 2023

Build succeeded:

@jmcarp
Copy link
Collaborator Author

jmcarp commented Aug 31, 2023

blathers backport 23.1

@blathers-crl
Copy link

blathers-crl bot commented Aug 31, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 407017d to blathers/backport-release-23.1-104302: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants