Refine definition for *.limit metric name suffix -or- update current usage to match definition #438

Open
arminru opened this issue Oct 23, 2023 · 4 comments
@arminru
Member

arminru commented Oct 23, 2023

In #409 (comment) it was brought up that certain metrics with the .limit suffix are defined as Gauges, whereas most others are defined as UpDownCounters:

.limits that are UpDownCounters:

.limits that are Gauges:


The UpDownCounters are consistent with our current definition for .limit at https://github.com/open-telemetry/semantic-conventions/blob/v1.22.0/docs/general/metrics.md#instrument-naming:

  • limit - an instrument that measures the constant, known total amount of
    something should be called entity.limit. For example, system.memory.limit
    for the total amount of memory on a system.

One can sum up the existing memory, disk space, network bandwidth, or power supply within a given system or compositions of them and get a meaningful aggregate representing the "total amount" available.

The Gauge metrics, however, don't represent an available "total amount". One cannot add the maximum permissible temperature (°C) over multiple components, battery charge fraction for stable operation (%) over multiple batteries, or permissible voltage (V) over multiple components. The aggregated sum breaks the definition and expectation for the individual metric observations.

For example, two CPUs that can sustain 100 °C each won't sustain 200 °C together (or 40 °C on one and 160 °C on the other). Three SSDs that operate at 3.3 V won't tolerate 9.9 V on the shared power supply. Nor is a maximum charge level of 300% for three (potentially different) batteries a helpful aggregation.
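
To make the difference concrete, here is a minimal sketch in plain Python (no OpenTelemetry SDK involved). The `Observation` type, the `sum_by_metric` helper, the component names, and all values are hypothetical and only serve to illustrate the argument: the same sum aggregation gives a useful total for a memory limit and a meaningless one for a temperature limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One reported value of a metric (hypothetical illustration type)."""
    metric: str
    component: str
    value: float
    unit: str


def sum_by_metric(observations):
    """Sum all observations per metric name, as a sum aggregation would."""
    totals = {}
    for obs in observations:
        totals[obs.metric] = totals.get(obs.metric, 0.0) + obs.value
    return totals


# Made-up sample data: two hosts and two CPUs.
observations = [
    Observation("system.memory.limit", "host-a", 16.0, "GiB"),
    Observation("system.memory.limit", "host-b", 32.0, "GiB"),
    Observation("hw.temperature.limit", "cpu-0", 100.0, "Cel"),
    Observation("hw.temperature.limit", "cpu-1", 100.0, "Cel"),
]

totals = sum_by_metric(observations)

# 48 GiB is a meaningful "total amount" of memory across both hosts.
print(totals["system.memory.limit"])
# 200 is not a temperature that either CPU can sustain.
print(totals["hw.temperature.limit"])
```

The memory total matches the "total amount" wording of the current definition; the temperature total does not correspond to any real limit.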


I think our options to resolve this are:

  1. Adapt the definition of limit to allow for both use cases or interpretations.
    We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.
  2. Keep the current definition of limit and introduce a new, well-known suffix for the non-aggregatable limits and change the current Gauge metrics to use this suffix instead.
  3. Keep the current definition of limit and change the current Gauge metrics to use some other suffix that's not defined by our naming conventions.

I'm looking for feedback on which direction we should pursue and potential suggestions for the respective naming/wording.

@arminru
Member Author

arminru commented Oct 23, 2023

cc @bertysentry since you initially proposed most of the hardware semantic conventions.
Would be great to get some context and suggestions from you.

@bertysentry
Contributor

That's an excellent question!

TBH, I had not read the exact definition of the .limit suffix when I wrote the hw. semconv. I just inferred that limit was literally a limit for the underlying instrument.

I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount. My opinion is that this definition is too restrictive.

My recommendation is to update the definition of the .limit suffix as below:

An instrument that measures the limit of another entity instrument should be named entity.limit.

Examples:

  • system.memory.limit
  • hw.temperature.limit

Different types of limits can be specified with the limit_type attribute (max, min, degraded, critical, throttled, etc.).

The type and unit of the .limit metric must be the same as those of the underlying metric. Example: hw.temperature is a Gauge in degrees Celsius, therefore hw.temperature.limit must be defined as a Gauge in degrees Celsius as well.
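
The rule above could be checked mechanically. The following is only a sketch in plain Python, not part of any OpenTelemetry tooling: `MetricDef` and `limit_matches_base` are hypothetical names, and the unit string `Cel` is assumed here as the unit code for degrees Celsius.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDef:
    """Minimal description of a metric definition (hypothetical type)."""
    name: str
    instrument: str  # e.g. "gauge" or "updowncounter"
    unit: str


def limit_matches_base(base, limit):
    """Return True if `limit` is named <base>.limit and shares the
    instrument type and unit of the underlying metric."""
    return (
        limit.name == base.name + ".limit"
        and limit.instrument == base.instrument
        and limit.unit == base.unit
    )


hw_temperature = MetricDef("hw.temperature", "gauge", "Cel")
hw_temperature_limit = MetricDef("hw.temperature.limit", "gauge", "Cel")
# A limit declared with a different instrument type than its base metric.
mismatched = MetricDef("hw.temperature.limit", "updowncounter", "Cel")

print(limit_matches_base(hw_temperature, hw_temperature_limit))
print(limit_matches_base(hw_temperature, mismatched))
```

The first check passes because name, instrument type, and unit all line up; the second fails on the instrument type alone.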

@eero-t

eero-t commented May 22, 2024

> I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount.
> ...
> Different types of limits can be specified with the limit_type attribute (max, min, degraded, critical, throttled, etc.).

For example, with GPUs and their memory, there's the physical amount of memory, how much of that the kernel exposes to user space (after deducting its own overhead), and how much of that the user-space API exposes to applications.

In the oneAPI Level Zero Sysman API, the first two limits are named total and available.

In the OpenCL API, the first and last are named GLOBAL_MEM_SIZE and MAX_MEM_ALLOC_SIZE.

@bertysentry
Contributor

Conclusion to @arminru's question: option 1 (Adapt the definition of limit to allow for both use cases or interpretations.
We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.)
