Add native histogram specification #2539

beorn7 · 2024-11-04T22:55:12Z

Here is finally the opus magnum that was in the making for far too long. Partially addressing prometheus/prometheus#11561 .

It is partially more documentation than specification, but I assume the specification section is still the least bad place for it.

This is intentionally not yet complete. TODOs mark the missing parts, most of which cannot be resolved at the moment because implementation work is still missing and/or decisions still have to be made.

The intention here is to finally get something out here, even if it is pretty much still a "living document". It will probably take us another year or so to resolve all TODOs.

Dear reviewers, primary goal of the review is to detect what I got wrong here and what I have forgotten. Language improvements are of course welcome, but of secondary importance. And as said above, the TODOs should mostly stay for this first version. Of course, if you have an easy way of resolving one, be my guest.

Tagging potential reviewers here (simply focus on your areas of expertise):
@bwplotka @juliusv @krajorama @zenador @fatsheep9146 @NeerajGartia21 @Maniktherana @fpetkovski @vesari @carrieedwards @fionaliao @codesome @ArthurSens @csmarchbanks @fstab

It is partially more documentation than specification, but I assume the specification section is still the least bad place for it. This is intentionally not yet complete. TODOs mark the missing parts, most of which cannot be resolved at the moment because implementation work is still missing and/or decisions still have to be made. The intention here is to finally get something out here, even if it is pretty much still a "living document". It will probably take us another year or so to resolve all TODOs. Signed-off-by: beorn7 <[email protected]>

beorn7 · 2024-11-04T23:00:20Z

https://deploy-preview-2539--prometheus-docs.netlify.app/docs/specs/native_histograms/ is the preview, very convenient to see the rendered layout.

krajorama · 2024-11-06T08:44:55Z

cc @jhesketh @charleskorn have also dealt with native histograms due to their query engine work.

krajorama

Couple of comments. Haven't gone through yet even once, on line 738 Instrumentation libraries.

content/docs/specs/native_histograms.md

krajorama

done with first read through

content/docs/specs/native_histograms.md

To be squashed before merging. Signed-off-by: beorn7 <[email protected]>

beorn7 · 2024-11-13T17:37:04Z

Thanks everyone for the review so far.

In the commit I have just pushed I have addressed all the feedback and resolved the conversation where I assumed that my response is uncontroversial. The conversations still open might require more discussion.

I have not yet reworded the counter reset handling according to our recent discussions. I will do that tomorrow with a refreshed brain.

This now takes into account that we (currently) have to return "unknown counter reset" for every 1st histogram from a counter histogram chunk and links the tracking issue to eventually change that. In fact, the text is now slightly shorter and simpler. Signed-off-by: beorn7 <[email protected]>

beorn7 · 2024-11-14T16:49:20Z

OK, counter reset handling should be up to date now.

beorn7 · 2024-12-04T15:31:18Z

Note: Apparently, I forgot to send all the 52 pending replies I wrote several weeks ago.
They are submitted now, so now it makes way more sense where I resolved a conversation and where not.

Squash this before merging. Signed-off-by: beorn7 <[email protected]>

content/docs/specs/native_histograms.md

Signed-off-by: beorn7 <[email protected]>

beorn7 · 2024-12-05T14:08:05Z

OK, I think we are done. @krajorama please let me know if I have missed anything.

bwplotka

Nice! Quite an effort and level of the details. Thanks for this work!

Added a few comments, I wonder if we could prepare the reader for nhcb dimension in the glossary and flavors 🤔 Also should we version this spec (like we do with PRW)

I also have a general questions to counter reset hints. We could simplify by using the CT feature on native histograms instead of reset hints. Those feature do the same, just CT allows a bit more information to be passed (when exactly reset happened). I understand the reset hints are implemented and "baking" for stability, so perhaps it should be improved in some 2.0 🤔

content/docs/specs/native_histograms.md

bwplotka · 2024-12-09T12:04:00Z

content/docs/specs/native_histograms.md

+defines the way bucket boundaries are calculated. The currently valid values
+are -53 and the range between and including -4 and +8. More schemas may be
+added in the future. -53 is a schema for so-called _custom bucket boundaries_
+or short _custom buckets_, while the other schema numbers represent the


Let's add those names to glossary 🤗

Maybe we can consider extending the glossary to all the new terms introduced by this document. However, there would be a lot. If this was a printed document, it might still be helpful for cross reference, but since this is electronic, the reader will simply search for any term they don't remember. (The few terms in the glossary so far are those that might easily be misunderstood or confused.)

content/docs/specs/native_histograms.md

bwplotka · 2024-12-09T12:16:52Z

content/docs/specs/native_histograms.md

+approach, which might be improved in the future to find a better trade-off
+between chunk size and compression ratio.
+
+### Counter reset considerations


This is quite complex. What do you think about switching to Created Timestamp feature for it? Would it simplify it? We will have it for all counters soon.

CT is in itself a new and experimental feature. I wouldn't conflate them.

Even if CT were already established, I don't think it can be relied upon for NH ingestion. We need to detect counter reset as part of the encoding anyway. CT is more an additional source of information: We can mark (and store) a counter reset even if it is not detectable by conventional means, see below.

bwplotka · 2024-12-09T12:18:08Z

content/docs/specs/native_histograms.md

+The counter reset information of the synthetic zero sample is always set to
+`CounterReset`. (TODO: Currently, Prometheus probably sets it to
+`UnknownCounterReset` for the first sample of a series, which is not wrong, but
+I think setting it to `CounterReset` makes more sense.)


Yea, I think we want to switch to native CT handling, not synthetic zeros. So switching counter reset to CT would be one way of doing this, going forward 🤔

krajorama

LGTM as starting point, awesome work .

beorn7 · 2024-12-10T19:41:43Z

Thanks @krajorama & @bwplotka

@bwplotka : I'll address the individual comments soon. Just a few general things in advance:

I wonder if we could prepare the reader for nhcb dimension in the glossary and flavors

I tried to include NHCB wherever it was already possible. NHCB is a schema, though, and thus an already defined dimension of the data model. I wouldn't put it under "flavor" (which is anyway a bit of a makeshift solution to have a word for the counter/gauge and integer/float dichotomies).

WRT glossary: My intention was to nail down the words that the reader might get wrong (because of historical baggage ("sparse histogram") or confusion with previously existing concepts (classic vs. native histograms)). Adding all the terms newly introduced by this document would be quite repetitive, as one of the main purposes of this document is to introduce and explain those terms.

Also should we version this spec (like we do with PRW)

Maybe. But not while it is not really done. It is kind-of 0.x right now. But even once it is finalized, I'm not sure if versioning makes sense because this is more a concept document than strictly a spec. Let's discuss once we have declared native histograms a stable feature.

I also have a general questions to counter reset hints. We could simplify by using the CT feature on native histograms instead of reset hints. Those feature do the same, just CT allows a bit more information to be passed (when exactly reset happened). I understand the reset hints are implemented and "baking" for stability, so perhaps it should be improved in some 2.0

Counter resets are more deeply baked into NH. For example, it is impossible to have a counter reset in the middle of a chunk, based on the way the encoding works. Thus, we cannot rely on an external feature, that might or might not be present, to mark counter resets. Or even if we did rely on it, we would still store the counter reset implicitly. In any case, we have to see how CT shakes out in practice, how clock skew is handled and all of that.

Signed-off-by: beorn7 <[email protected]>

beorn7 · 2024-12-10T20:50:38Z

One more thought about the terms I put into the glossary: I think I understand my reluctance to put all the new terms into it better now. What I want is that the reader can just read through the whole document in order, including the glossary. This means it should only contain terms the reader is already somewhat familiar with. A "conventional" glossary is more something you would put at the end of the document where the reader can refer to if they don't remember a certain term in the middle of the text. But the glossary I'm using here is more along the lines of "let's get some common misunderstandings out of the way first".

fatsheep9146 · 2024-12-11T02:22:33Z

content/docs/specs/native_histograms.md

+- The next negative bucket (index _i_+1 relative to the bucket from the
+  previous item) has an upper exclusive limit of `MinFloat64` and an lower
+  inclusive limit of `-Inf`. (It could be called a _negative overflow bucket_.)
+- Buckets beyond the `+Inf` and `-Inf` buckets described above MUST NOT be used.


Using the positive overflow bucket and negative overflow bucket might be more accurate than +Inf and -Inf buckets?

Those terms are only introduced in parentheses with "they could be called". I'd rather refer to the limits as that's firmly established before. It's also very accurate because the only observed value that will go into the +Inf bucket is, well, +Inf, and likewise for -Inf.

bwplotka · 2024-12-11T16:24:50Z

Counter resets are more deeply baked into NH. For example, it is impossible to have a counter reset in the middle of a chunk, based on the way the encoding works. Thus, we cannot rely on an external feature, that might or might not be present, to mark counter resets. Or even if we did rely on it, we would still store the counter reset implicitly. In any case, we have to see how CT shakes out in practice, how clock skew is handled and all of that.

We have exactly similar problem with CT to solve, so I think it makes sense to think how in future CT could replace reset hints. The only reason "why not" is that reset hints obviously take less bytes.

However, happy to stop pressing this topic into the stabilization and docs of 1.0 NH here, too early agree 👍🏽

beorn7 · 2024-12-12T10:27:38Z

@bwplotka is there anything left to address right now? Can I merge this?

bwplotka

@bwplotka is there anything left to address right now? Can I merge this?

All good, thanks for an amazing and deep work here! 💪🏽

beorn7 · 2024-12-12T13:02:39Z

There we go. And this is just the beginning. 43 TODOs to address… 😅

fatsheep9146 · 2024-12-13T02:16:06Z

content/docs/specs/native_histograms.md

+  from zero, a relatively large threshold of the zero bucket helps to avoid
+  many high resolution buckets for a range that is not of interest.
+
+The threshold of the zero bucket SHOULD coincide with a boundary of a regular


Do this should be "SHOULD NOT" instead of "SHOULD"? I take "coincide with" as "same or overlapped".

It should be SHOULD.

What I want to express is that the zero bucket threshold should be the the same as a boundary of one of the regular buckets. My understanding is that "coincide" is precisely that. But happy to be corrected by a native speaker.

Thanks for clarifying, I understand you now. It should be SHOULD.

krajorama self-assigned this Nov 6, 2024

krajorama reviewed Nov 6, 2024

View reviewed changes

krajorama reviewed Nov 7, 2024

View reviewed changes

fionaliao reviewed Nov 7, 2024

View reviewed changes

content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved

fionaliao reviewed Nov 8, 2024

View reviewed changes

carrieedwards reviewed Nov 8, 2024

View reviewed changes

Addressed review feedback.

4d58904

To be squashed before merging. Signed-off-by: beorn7 <[email protected]>

beorn7 mentioned this pull request Nov 14, 2024

Re-organize overlapping sections (Concepts, Best Practices, Tutorials, Guides, Specifications) #2547

Open

beorn7 mentioned this pull request Nov 20, 2024

Some aggregations and functions produce incorrect results for native histograms prometheus/prometheus#13934

Open

More review feedback addressed

49a5fe9

Squash this before merging. Signed-off-by: beorn7 <[email protected]>

krajorama reviewed Dec 5, 2024

View reviewed changes

content/docs/specs/native_histograms.md Outdated Show resolved Hide resolved

review responses, TBS

216fe84

Signed-off-by: beorn7 <[email protected]>

bwplotka mentioned this pull request Dec 9, 2024

OM 2.0: Native Histogram Support in Text format. prometheus/OpenMetrics#279

Open

bwplotka reviewed Dec 9, 2024

View reviewed changes

krajorama approved these changes Dec 10, 2024

View reviewed changes

beorn7 added 2 commits December 10, 2024 20:52

TBS

6c180a8

Signed-off-by: beorn7 <[email protected]>

TBS

86dda6c

Signed-off-by: beorn7 <[email protected]>

fatsheep9146 reviewed Dec 11, 2024

View reviewed changes

bwplotka approved these changes Dec 12, 2024

View reviewed changes

beorn7 merged commit bf0e251 into main Dec 12, 2024
5 checks passed

beorn7 deleted the beorn7/histogram branch December 12, 2024 12:59

fatsheep9146 reviewed Dec 13, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add native histogram specification #2539

Add native histogram specification #2539

beorn7 commented Nov 4, 2024

beorn7 commented Nov 4, 2024

krajorama commented Nov 6, 2024

krajorama left a comment

krajorama left a comment

beorn7 commented Nov 13, 2024

beorn7 commented Nov 14, 2024

beorn7 commented Dec 4, 2024

beorn7 commented Dec 5, 2024

bwplotka left a comment

bwplotka Dec 9, 2024

beorn7 Dec 10, 2024

bwplotka Dec 9, 2024

beorn7 Dec 10, 2024

bwplotka Dec 9, 2024

beorn7 Dec 10, 2024

krajorama left a comment

beorn7 commented Dec 10, 2024

beorn7 commented Dec 10, 2024

fatsheep9146 Dec 11, 2024

beorn7 Dec 11, 2024

bwplotka commented Dec 11, 2024

beorn7 commented Dec 12, 2024

bwplotka left a comment

beorn7 commented Dec 12, 2024

fatsheep9146 Dec 13, 2024

beorn7 Dec 13, 2024

fatsheep9146 Dec 13, 2024 •

edited

Loading

Add native histogram specification #2539

Add native histogram specification #2539

Conversation

beorn7 commented Nov 4, 2024

beorn7 commented Nov 4, 2024

krajorama commented Nov 6, 2024

krajorama left a comment

Choose a reason for hiding this comment

krajorama left a comment

Choose a reason for hiding this comment

beorn7 commented Nov 13, 2024

beorn7 commented Nov 14, 2024

beorn7 commented Dec 4, 2024

beorn7 commented Dec 5, 2024

bwplotka left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

krajorama left a comment

Choose a reason for hiding this comment

beorn7 commented Dec 10, 2024

beorn7 commented Dec 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bwplotka commented Dec 11, 2024

beorn7 commented Dec 12, 2024

bwplotka left a comment

Choose a reason for hiding this comment

beorn7 commented Dec 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fatsheep9146 Dec 13, 2024 • edited Loading

Choose a reason for hiding this comment

fatsheep9146 Dec 13, 2024 •

edited

Loading