From 1040fc24fc671c3ae903edfc0a259ef3755aaa03 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Wed, 9 Sep 2020 19:16:01 +0000 Subject: [PATCH 01/26] System metrics semantic conventions Conventions from [OTEP 119](https://github.com/open-telemetry/oteps/pull/119) --- CHANGELOG.md | 2 + .../metrics/semantic_conventions/README.md | 7 +- .../semantic_conventions/process-metrics.md | 21 +++ .../semantic_conventions/runtime-metrics.md | 42 +++++ .../semantic_conventions/system-metrics.md | 153 ++++++++++++++++++ 5 files changed, 224 insertions(+), 1 deletion(-) create mode 100644 specification/metrics/semantic_conventions/process-metrics.md create mode 100644 specification/metrics/semantic_conventions/runtime-metrics.md create mode 100644 specification/metrics/semantic_conventions/system-metrics.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 2d96cd1ef85..67fb3629f64 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -23,6 +23,8 @@ New: ([#697](https://github.com/open-telemetry/opentelemetry-specification/pull/697)) * API was extended to allow adding arbitrary event attributes ([#874](https://github.com/open-telemetry/opentelemetry-specification/pull/874)) * `exception.escaped` was added ([#784](https://github.com/open-telemetry/opentelemetry-specification/pull/784)) +- Add semantic conventions for system metrics + ([#937](https://github.com/open-telemetry/opentelemetry-specification/pull/937)) Updates: diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 6c63439e4e3..0588fcde98f 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -1,6 +1,11 @@ # Metrics Semantic Conventions -TODO: Add semantic conventions for metric names and labels. +The following semantic conventions surrounding metrics are defined: + +* [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics. 
+* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics. +* [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics. +* [Runtime Metrics](runtime-metrics.md): Semantic conventions and instruments for runtime metrics. Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md), OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md new file mode 100644 index 00000000000..66479b22f77 --- /dev/null +++ b/specification/metrics/semantic_conventions/process-metrics.md @@ -0,0 +1,21 @@ +# Semantic Conventions for Process Metrics + +This document describes instruments and labels for common process level +metrics in OpenTelemetry. Also consider the general [semantic conventions for +system metrics](system-metrics.md#semantic-conventions) when creating +instruments not explicitly defined in this document. + + + + + +- [Metric Instruments](#metric-instruments) + * [Standard Process Metrics - `process.`](#standard-process-metrics---process) + + + +## Metric Instruments + +### Standard Process Metrics - `process.` + +TODO diff --git a/specification/metrics/semantic_conventions/runtime-metrics.md b/specification/metrics/semantic_conventions/runtime-metrics.md new file mode 100644 index 00000000000..7f4bd729ad0 --- /dev/null +++ b/specification/metrics/semantic_conventions/runtime-metrics.md @@ -0,0 +1,42 @@ +# Semantic Conventions for Runtime Metrics + +This document describes instruments and labels for common runtime level +metrics in OpenTelemetry. Also consider the general [semantic conventions for +system metrics](system-metrics.md#semantic-conventions) when creating +instruments not explicitly defined in this document. 
+ + + + + +- [Metric Instruments](#metric-instruments) + * [Runtime Metrics - `runtime.`](#runtime-metrics---runtime) + + [Runtime Specific Metrics - `runtime.{environment}.`](#runtime-specific-metrics---runtimeenvironment) + + + +## Metric Instruments + +### Runtime Metrics - `runtime.` + +Runtime environments vary widely in their terminology, implementation, and +relative values for a given metric. For example, Go and Python are both +garbage collected languages, but comparing heap usage between the two +runtimes directly is not meaningful. For this reason, this document does not +propose any standard top-level runtime metric instruments. See [OTEP +108](https://github.com/open-telemetry/oteps/pull/108/files) for additional +discussion. + +#### Runtime Specific Metrics - `runtime.{environment}.` + +Runtime level metrics specific to a certain runtime environment should be +prefixed with `runtime.{environment}.` and follow the semantic conventions +outlined in [semantic conventions for system +metrics](system-metrics.md#semantic-conventions). For example, Go runtime +metrics use `runtime.go.` as a prefix. + +Some programming languages have multiple runtime environments that vary +significantly in their implementation, for example [Python has many +implementations](https://wiki.python.org/moin/PythonImplementations). For +these languages, consider using specific `environment` prefixes to avoid +ambiguity, like `runtime.cpython.` and `runtime.pypy.`. diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md new file mode 100644 index 00000000000..ca6f9f85e65 --- /dev/null +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -0,0 +1,153 @@ +# Semantic Conventions for System Metrics + +This document describes instruments and labels for common system level +metrics in OpenTelemetry. 
Also included are general semantic conventions for +system, process, and runtime metrics, which should be considered when +creating instruments not explicitly defined in the specification. + + + + + +- [Semantic Conventions](#semantic-conventions) + * [Instrument Names](#instrument-names) + * [Units](#units) +- [Metric Instruments](#metric-instruments) + * [Standard System Metrics - `system.`](#standard-system-metrics---system) + + [`system.cpu.`](#systemcpu) + + [`system.memory.`](#systemmemory) + + [`system.swap.`](#systemswap) + + [`system.disk.`](#systemdisk) + + [`system.filesystem.`](#systemfilesystem) + + [`system.network.`](#systemnetwork) + + [`system.process.`](#systemprocess) + + [OS Specific System Metrics - `system.{os}.`](#os-specific-system-metrics---systemos) + + + +## Semantic Conventions + +The following semantic conventions aim to keep naming consistent. They +provide guidelines for most of the cases in this specification and should be +followed for other instruments not explicitly defined in this document. + +### Instrument Names + +- **usage** - an instrument that measures an amount used out of a known total +amount should be called `entity.usage`. For example, +`system.filesystem.usage` for the amount of disk spaced used. A measure of +the amount of an unlimited resource consumed is differentiated from +**usage**. This may be time, data, etc. +- **utilization** - an instrument that measures a *value ratio* of usage +(like percent, but in the range `[0, 1]`) should be called +`entity.utilization`. For example, `system.memory.utilization` for the ratio +of memory in use. +- **time** - an instrument that measures passage of time should be called +`entity.time`. For example, `system.cpu.time` with varying values of label +`state` for idle, user, etc. +- **io** - an instrument that measures bidirectional data flow should be +called `entity.io` and have labels for direction. For example, +`system.network.io`. 
+- Other instruments that do not fit the above descriptions may be named more +freely. For example, `system.swap.page_faults` and `system.network.packets`. +Units do not need to be specified in the names since they are included during +instrument creation, but can be added if there is ambiguity. + +### Units + +- Instruments for utilization metrics (that measure the ratio out of a total) +SHOULD use units of `1`. Such values represent a *value ratio* and are always +in the range `[0, 1]`. +- Instruments that measure an integer count of something SHOULD use semantic +units like `packets`, `errors`, `faults`, etc. + +## Metric Instruments + +### Standard System Metrics - `system.` + +#### `system.cpu.` + +**Description:** System level processor metrics. +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------- | ------- | ----------------- | ---------- | --------- | ----------------------------------- | +| system.cpu.time | seconds | SumObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | cpu | 1 - #cores | +| system.cpu.utilization | 1 | UpDownSumObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | cpu | 1 - #cores | + +#### `system.memory.` + +**Description:** System level memory metrics. +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ------------------------- | ----- | ----------------- | ---------- | --------- | ------------------------ | +| system.memory.usage | bytes | UpDownSumObserver | Int64 | state | used, free, cached, etc. | +| system.memory.utilization | 1 | ValueObserver | Double | state | used, free, cached, etc. | + +#### `system.swap.` + +**Description:** System level swap/paging metrics. 
+| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------------- | ---------- | ----------------- | ---------- | --------- | ------------ | +| system.swap.usage | pages | UpDownSumObserver | Int64 | state | used, free | +| system.swap.utilization | 1 | ValueObserver | Double | state | used, free | +| system.swap.page\_faults | faults | SumObserver | Int64 | type | major, minor | +| system.swap.page\_operations | operations | SumObserver | Int64 | type | major, minor | +| | | | | direction | in, out | + +#### `system.disk.` + +**Description:** System level disk performance metrics. +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------------- | ---------- | --------------- | ---------- | --------- | ------------ | +| system.disk.io | bytes | SumObserver | Int64 | device | (identifier) | +| | | | | direction | read, write | +| system.disk.operations | operations | SumObserver | Int64 | device | (identifier) | +| | | | | direction | read, write | +| system.disk.time | seconds | SumObserver | Double | device | (identifier) | +| | | | | direction | read, write | +| system.disk.merged | 1 | SumObserver | Int64 | device | (identifier) | +| | | | | direction | read, write | + +#### `system.filesystem.` + +**Description:** System level filesystem metrics. +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ----------------------------- | ----- | ----------------- | ---------- | --------- | -------------------- | +| system.filesystem.usage | bytes | UpDownSumObserver | Int64 | device | (identifier) | +| | | | | state | used, free, reserved | +| system.filesystem.utilization | 1 | ValueObserver | Double | device | (identifier) | +| | | | | state | used, free, reserved | + +#### `system.network.` + +**Description:** System level network metrics. 
+| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.network.dropped\_packets | packets | SumObserver | Int64 | device | (identifier) | +| | | | | direction | transmit, receive | +| system.network.packets | packets | SumObserver | Int64 | device | (identifier) | +| | | | | direction | transmit, receive | +| system.network.errors | errors | SumObserver | Int64 | device | (identifier) | +| | | | | direction | transmit, receive | +| system.network.io | bytes | SumObserver | Int64 | device | (identifier) | +| | | | | direction | transmit, receive | +| system.network.connections | connections | UpDownSumObserver | Int64 | device | (identifier) | +| | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | +| | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | + +#### `system.process.` + +**Description:** System level aggregate process metrics. For metrics at the +individual process level, see [process metrics](process-metrics.md). +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| -------------------- | --------- | --------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.process.count | processes | SumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | + +#### OS Specific System Metrics - `system.{os}.` + +Instrument names for system level metrics that have different and conflicting +meaning across multiple OSes should be prefixed with `system.{os}.` and +follow the hierarchies listed above for different entities like CPU, memory, +and network. 
For example, an instrument for measuring the load average on +Linux could be named `system.linux.cpu.load`, reusing the `cpu` name proposed +above. From f7f2ef7c1efdaf02888fdcbd3d24dca7b6e0713f Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 10 Sep 2020 20:00:06 -0400 Subject: [PATCH 02/26] change process count to UpDownSumObserver --- specification/metrics/semantic_conventions/system-metrics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index ca6f9f85e65..ad5d789952b 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -141,7 +141,7 @@ units like `packets`, `errors`, `faults`, etc. individual process level, see [process metrics](process-metrics.md). | Name | Units | Instrument Type | Value Type | Label Key | Label Values | | -------------------- | --------- | --------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.process.count | processes | SumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | +| system.process.count | processes | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | #### OS Specific System Metrics - `system.{os}.` From 98d72a165a38e4847c9954506b75f9e9e208e6eb Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Fri, 11 Sep 2020 16:15:25 +0000 Subject: [PATCH 03/26] fix system.cpu.utilization, use better example --- .../semantic_conventions/system-metrics.md | 51 ++++++++++--------- 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md 
index ad5d789952b..acec33166ea 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -9,19 +9,20 @@ creating instruments not explicitly defined in the specification. -- [Semantic Conventions](#semantic-conventions) - * [Instrument Names](#instrument-names) - * [Units](#units) -- [Metric Instruments](#metric-instruments) - * [Standard System Metrics - `system.`](#standard-system-metrics---system) - + [`system.cpu.`](#systemcpu) - + [`system.memory.`](#systemmemory) - + [`system.swap.`](#systemswap) - + [`system.disk.`](#systemdisk) - + [`system.filesystem.`](#systemfilesystem) - + [`system.network.`](#systemnetwork) - + [`system.process.`](#systemprocess) - + [OS Specific System Metrics - `system.{os}.`](#os-specific-system-metrics---systemos) +- [Semantic Conventions for System Metrics](#semantic-conventions-for-system-metrics) + - [Semantic Conventions](#semantic-conventions) + - [Instrument Names](#instrument-names) + - [Units](#units) + - [Metric Instruments](#metric-instruments) + - [Standard System Metrics - `system.`](#standard-system-metrics---system) + - [`system.cpu.`](#systemcpu) + - [`system.memory.`](#systemmemory) + - [`system.swap.`](#systemswap) + - [`system.disk.`](#systemdisk) + - [`system.filesystem.`](#systemfilesystem) + - [`system.network.`](#systemnetwork) + - [`system.process.`](#systemprocess) + - [OS Specific System Metrics - `system.{os}.`](#os-specific-system-metrics---systemos) @@ -34,9 +35,9 @@ followed for other instruments not explicitly defined in this document. ### Instrument Names - **usage** - an instrument that measures an amount used out of a known total -amount should be called `entity.usage`. For example, -`system.filesystem.usage` for the amount of disk spaced used. A measure of -the amount of an unlimited resource consumed is differentiated from +amount should be called `entity.usage`. 
For example, `system.memory.usage` +for the amount of memory used. A measure of the amount of an unlimited +resource consumed is differentiated from **usage**. This may be time, data, etc. - **utilization** - an instrument that measures a *value ratio* of usage (like percent, but in the range `[0, 1]`) should be called @@ -68,12 +69,12 @@ units like `packets`, `errors`, `faults`, etc. #### `system.cpu.` **Description:** System level processor metrics. -| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ---------------------- | ------- | ----------------- | ---------- | --------- | ----------------------------------- | -| system.cpu.time | seconds | SumObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | cpu | 1 - #cores | -| system.cpu.utilization | 1 | UpDownSumObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | cpu | 1 - #cores | +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------- | ------- | --------------- | ---------- | --------- | ----------------------------------- | +| system.cpu.time | seconds | SumObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | cpu | 1 - #cores | +| system.cpu.utilization | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | cpu | 1 - #cores | #### `system.memory.` @@ -139,9 +140,9 @@ units like `packets`, `errors`, `faults`, etc. **Description:** System level aggregate process metrics. For metrics at the individual process level, see [process metrics](process-metrics.md). 
-| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| -------------------- | --------- | --------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.process.count | processes | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | +| Name | Units | Instrument Type | Value Type | Label Key | Label Values | +| -------------------- | --------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.process.count | processes | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | #### OS Specific System Metrics - `system.{os}.` From 9d20079f032300bbf154854c902193493b716179 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 24 Sep 2020 20:59:42 +0000 Subject: [PATCH 04/26] first several comments --- .../metrics/semantic_conventions/README.md | 2 +- .../runtime-environment-metrics.md | 47 ++++++++++ .../semantic_conventions/runtime-metrics.md | 42 --------- .../semantic_conventions/system-metrics.md | 88 ++++++++++++------- 4 files changed, 103 insertions(+), 76 deletions(-) create mode 100644 specification/metrics/semantic_conventions/runtime-environment-metrics.md delete mode 100644 specification/metrics/semantic_conventions/runtime-metrics.md diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 0588fcde98f..cea4f0975b0 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -5,7 +5,7 @@ The following semantic conventions surrounding metrics are defined: * [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics. 
* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics. * [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics. -* [Runtime Metrics](runtime-metrics.md): Semantic conventions and instruments for runtime metrics. +* [Runtime Environment Metrics](runtime-environment-metrics.md): Semantic conventions and instruments for runtime environment metrics. Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md), OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md new file mode 100644 index 00000000000..c4e057a18bd --- /dev/null +++ b/specification/metrics/semantic_conventions/runtime-environment-metrics.md @@ -0,0 +1,47 @@ +# Semantic Conventions for Runtime Environment Metrics + +This document includes semantic conventions for runtime environment level +metrics in OpenTelemetry. Also consider the general semantic conventions for +[system metrics](system-metrics.md#semantic-conventions) and [process +metrics](process-metrics.md) when instrumenting runtime environments. + + + + + +- [Semantic Conventions for Runtime Environment Metrics](#semantic-conventions-for-runtime-environment-metrics) + - [Metric Instruments](#metric-instruments) + - [Runtime Environment Metrics - `runtime.`](#runtime-environment-metrics---runtime) + - [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment) + + + +## Metric Instruments + +### Runtime Environment Metrics - `runtime.` + +Runtime environments vary widely in their terminology, implementation, and +relative values for a given metric. 
For example, Go and Python are both +garbage collected languages, but comparing heap usage between the Go and +CPython runtimes directly is not meaningful. For this reason, this document +does not propose any standard top-level runtime metric instruments. See [OTEP +108](https://github.com/open-telemetry/oteps/pull/108/files) for additional +discussion. + +#### Runtime Environment Specific Metrics - `runtime.{environment}.` + +Metrics specific to a certain runtime environment should be prefixed with +`runtime.{environment}.` and follow the semantic conventions outlined in +[semantic conventions for system +metrics](system-metrics.md#semantic-conventions). Authors of runtime +instrumentations are responsible for the choice of `{environment}` to avoid +ambiguity when interpreting a metric's name or values. + +For example, some programming languages have multiple runtime environments +that vary significantly in their implementation, like [Python which has many +implementations](https://wiki.python.org/moin/PythonImplementations). For +such languages, consider using specific `{environment}` prefixes to avoid +ambiguity, like `runtime.cpython.` and `runtime.pypy.`. + +There are other dimensions even within a given runtime environment to +consider, for example pthreads vs green thread implementations. diff --git a/specification/metrics/semantic_conventions/runtime-metrics.md b/specification/metrics/semantic_conventions/runtime-metrics.md deleted file mode 100644 index 7f4bd729ad0..00000000000 --- a/specification/metrics/semantic_conventions/runtime-metrics.md +++ /dev/null @@ -1,42 +0,0 @@ -# Semantic Conventions for Runtime Metrics - -This document describes instruments and labels for common runtime level -metrics in OpenTelemetry. Also consider the general [semantic conventions for -system metrics](system-metrics.md#semantic-conventions) when creating -instruments not explicitly defined in this document. 
- - - - - -- [Metric Instruments](#metric-instruments) - * [Runtime Metrics - `runtime.`](#runtime-metrics---runtime) - + [Runtime Specific Metrics - `runtime.{environment}.`](#runtime-specific-metrics---runtimeenvironment) - - - -## Metric Instruments - -### Runtime Metrics - `runtime.` - -Runtime environments vary widely in their terminology, implementation, and -relative values for a given metric. For example, Go and Python are both -garbage collected languages, but comparing heap usage between the two -runtimes directly is not meaningful. For this reason, this document does not -propose any standard top-level runtime metric instruments. See [OTEP -108](https://github.com/open-telemetry/oteps/pull/108/files) for additional -discussion. - -#### Runtime Specific Metrics - `runtime.{environment}.` - -Runtime level metrics specific to a certain runtime environment should be -prefixed with `runtime.{environment}.` and follow the semantic conventions -outlined in [semantic conventions for system -metrics](system-metrics.md#semantic-conventions). For example, Go runtime -metrics use `runtime.go.` as a prefix. - -Some programming languages have multiple runtime environments that vary -significantly in their implementation, for example [Python has many -implementations](https://wiki.python.org/moin/PythonImplementations). For -these languages, consider using specific `environment` prefixes to avoid -ambiguity, like `runtime.cpython.` and `runtime.pypy.`. diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index acec33166ea..5ba855fdb27 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -11,18 +11,18 @@ creating instruments not explicitly defined in the specification. 
- [Semantic Conventions for System Metrics](#semantic-conventions-for-system-metrics) - [Semantic Conventions](#semantic-conventions) - - [Instrument Names](#instrument-names) + - [Instrument Naming](#instrument-naming) - [Units](#units) - [Metric Instruments](#metric-instruments) - [Standard System Metrics - `system.`](#standard-system-metrics---system) - - [`system.cpu.`](#systemcpu) - - [`system.memory.`](#systemmemory) - - [`system.swap.`](#systemswap) - - [`system.disk.`](#systemdisk) - - [`system.filesystem.`](#systemfilesystem) - - [`system.network.`](#systemnetwork) - - [`system.process.`](#systemprocess) - - [OS Specific System Metrics - `system.{os}.`](#os-specific-system-metrics---systemos) + - [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) + - [`system.memory.` - Memory metrics](#systemmemory---memory-metrics) + - [`system.swap.` - Swap/paging metrics](#systemswap---swappaging-metrics) + - [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics) + - [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics) + - [`system.network.` - Network metrics](#systemnetwork---network-metrics) + - [`system.process.` - Aggregate system process metrics](#systemprocess---aggregate-system-process-metrics) + - [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics) @@ -32,23 +32,40 @@ The following semantic conventions aim to keep naming consistent. They provide guidelines for most of the cases in this specification and should be followed for other instruments not explicitly defined in this document. -### Instrument Names +### Instrument Naming + +- **limit** - an instrument that measures the constant, known total amount of +something should be called `entity.limit`. For example, `system.memory.limit` +for the total amount of memory on a system. - **usage** - an instrument that measures an amount used out of a known total -amount should be called `entity.usage`. 
For example, `system.memory.usage` -for the amount of memory used. A measure of the amount of an unlimited -resource consumed is differentiated from -**usage**. This may be time, data, etc. -- **utilization** - an instrument that measures a *value ratio* of usage -(like percent, but in the range `[0, 1]`) should be called -`entity.utilization`. For example, `system.memory.utilization` for the ratio -of memory in use. +(**limit**) amount should be called `entity.usage`. For example, +`system.memory.usage` with label `state = used | cached | free | ...` for the +amount of memory in a each state. In many cases, the sum of **usage** over +all label values is equal to the **limit**. + + A measure of the amount of an unlimited resource consumed is differentiated + from **usage**. + +- **utilization** - an instrument that measures the *fraction* of **usage** +out of its **limit** should be called `entity.utilization`. For example, +`system.memory.utilization` for the fraction of memory in use. Utilization +values are in the range `[0, 1]`. + - **time** - an instrument that measures passage of time should be called -`entity.time`. For example, `system.cpu.time` with varying values of label -`state` for idle, user, etc. +`entity.time`. For example, `system.cpu.time` with label `state = idle | user +| system | ...`. **time** measurements are not necessarily wall time and can be less than + or greater than the real wall time between measurements. + + **time** instruments are a special case of **usage** metrics, where the + **limit** can usually be calculated as the sum of **time** over all label + values. **utilization** can also be calculated and useful, for example + `system.cpu.utilization`. + - **io** - an instrument that measures bidirectional data flow should be called `entity.io` and have labels for direction. For example, `system.network.io`. + - Other instruments that do not fit the above descriptions may be named more freely. 
For example, `system.swap.page_faults` and `system.network.packets`. Units do not need to be specified in the names since they are included during @@ -56,17 +73,22 @@ instrument creation, but can be added if there is ambiguity. ### Units -- Instruments for utilization metrics (that measure the ratio out of a total) -SHOULD use units of `1`. Such values represent a *value ratio* and are always -in the range `[0, 1]`. -- Instruments that measure an integer count of something SHOULD use semantic -units like `packets`, `errors`, `faults`, etc. +Units should follow the [UCUM](http://unitsofmeasure.org/ucum.html) (need +more clarification in +[#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)). + +- Instruments for **utilization** metrics (that measure the fraction out of a total) +SHOULD use units of `1`. +- Instruments that measure an integer count of something have +["non-units"](https://ucum.org/ucum.html#section-Examples-for-some-Non-Units.) +and SHOULD use [annotations](https://ucum.org/ucum.html#para-curly) with curly +braces. For example `{packets}`, `{errors}`, `{faults}`, etc. ## Metric Instruments ### Standard System Metrics - `system.` -#### `system.cpu.` +#### `system.cpu.` - Processor metrics **Description:** System level processor metrics. | Name | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -76,7 +98,7 @@ units like `packets`, `errors`, `faults`, etc. | system.cpu.utilization | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | | | | | | cpu | 1 - #cores | -#### `system.memory.` +#### `system.memory.` - Memory metrics **Description:** System level memory metrics. | Name | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -84,7 +106,7 @@ units like `packets`, `errors`, `faults`, etc. | system.memory.usage | bytes | UpDownSumObserver | Int64 | state | used, free, cached, etc. 
| | system.memory.utilization | 1 | ValueObserver | Double | state | used, free, cached, etc. | -#### `system.swap.` +#### `system.swap.` - Swap/paging metrics **Description:** System level swap/paging metrics. | Name | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -95,7 +117,7 @@ units like `packets`, `errors`, `faults`, etc. | system.swap.page\_operations | operations | SumObserver | Int64 | type | major, minor | | | | | | direction | in, out | -#### `system.disk.` +#### `system.disk.` - Disk controller metrics **Description:** System level disk performance metrics. | Name | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -109,7 +131,7 @@ units like `packets`, `errors`, `faults`, etc. | system.disk.merged | 1 | SumObserver | Int64 | device | (identifier) | | | | | | direction | read, write | -#### `system.filesystem.` +#### `system.filesystem.` - Filesystem metrics **Description:** System level filesystem metrics. | Name | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -119,7 +141,7 @@ units like `packets`, `errors`, `faults`, etc. | system.filesystem.utilization | 1 | ValueObserver | Double | device | (identifier) | | | | | | state | used, free, reserved | -#### `system.network.` +#### `system.network.` - Network metrics **Description:** System level network metrics. | Name | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -136,7 +158,7 @@ units like `packets`, `errors`, `faults`, etc. | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | -#### `system.process.` +#### `system.process.` - Aggregate system process metrics **Description:** System level aggregate process metrics. For metrics at the individual process level, see [process metrics](process-metrics.md). 
@@ -144,7 +166,7 @@ individual process level, see [process metrics](process-metrics.md). | -------------------- | --------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | | system.process.count | processes | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | -#### OS Specific System Metrics - `system.{os}.` +#### `system.{os}.` - OS Specific System Metrics Instrument names for system level metrics that have different and conflicting meaning across multiple OSes should be prefixed with `system.{os}.` and From fd6375e1a1f569d787dad44b6368b7bf257d7171 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 24 Sep 2020 22:56:13 +0000 Subject: [PATCH 05/26] add description columns, update units to UCUM --- .../semantic_conventions/system-metrics.md | 105 +++++++++--------- 1 file changed, 54 insertions(+), 51 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 5ba855fdb27..dc76372ecc9 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -91,80 +91,83 @@ braces. For example `{packets}`, `{errors}`, `{faults}`, etc. #### `system.cpu.` - Processor metrics **Description:** System level processor metrics. -| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ---------------------- | ------- | --------------- | ---------- | --------- | ----------------------------------- | -| system.cpu.time | seconds | SumObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | cpu | 1 - #cores | -| system.cpu.utilization | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. 
| -| | | | | cpu | 1 - #cores | + +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------- | ----------- | ----- | --------------- | ---------- | --------- | ----------------------------------- | +| system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | | cpu | CPU number (0..n) | +| system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | | cpu | CPU number (0..n) | #### `system.memory.` - Memory metrics -**Description:** System level memory metrics. -| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ------------------------- | ----- | ----------------- | ---------- | --------- | ------------------------ | -| system.memory.usage | bytes | UpDownSumObserver | Int64 | state | used, free, cached, etc. | -| system.memory.utilization | 1 | ValueObserver | Double | state | used, free, cached, etc. | +**Description:** System level memory metrics. This does not include [paging/swap +memory](#systemswap---swappaging-metrics). + +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ------------------------- | ----------- | ----- | ----------------- | ---------- | --------- | ------------------------ | +| system.memory.usage | | By | UpDownSumObserver | Int64 | state | used, free, cached, etc. | +| system.memory.utilization | | 1 | ValueObserver | Double | state | used, free, cached, etc. | #### `system.swap.` - Swap/paging metrics -**Description:** System level swap/paging metrics. 
-| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ---------------------------- | ---------- | ----------------- | ---------- | --------- | ------------ | -| system.swap.usage | pages | UpDownSumObserver | Int64 | state | used, free | -| system.swap.utilization | 1 | ValueObserver | Double | state | used, free | -| system.swap.page\_faults | faults | SumObserver | Int64 | type | major, minor | -| system.swap.page\_operations | operations | SumObserver | Int64 | type | major, minor | -| | | | | direction | in, out | +**Description:** System level paging/swap memory metrics. +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------------- | ----------------------------------- | ------------ | ----------------- | ---------- | --------- | ------------ | +| system.swap.usage | Unix swap or windows pagefile usage | By | UpDownSumObserver | Int64 | state | used, free | +| system.swap.utilization | | 1 | ValueObserver | Double | state | used, free | +| system.swap.page\_faults | | {faults} | SumObserver | Int64 | type | major, minor | +| system.swap.page\_operations | | {operations} | SumObserver | Int64 | type | major, minor | +| | | | | | direction | in, out | #### `system.disk.` - Disk controller metrics **Description:** System level disk performance metrics. 
-| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ---------------------------- | ---------- | --------------- | ---------- | --------- | ------------ | -| system.disk.io | bytes | SumObserver | Int64 | device | (identifier) | -| | | | | direction | read, write | -| system.disk.operations | operations | SumObserver | Int64 | device | (identifier) | -| | | | | direction | read, write | -| system.disk.time | seconds | SumObserver | Double | device | (identifier) | -| | | | | direction | read, write | -| system.disk.merged | 1 | SumObserver | Int64 | device | (identifier) | -| | | | | direction | read, write | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------------- | ----------- | ------------ | --------------- | ---------- | --------- | ------------ | +| system.disk.io | | By | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.time | | s | SumObserver | Double | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | #### `system.filesystem.` - Filesystem metrics **Description:** System level filesystem metrics. 
-| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ----------------------------- | ----- | ----------------- | ---------- | --------- | -------------------- | -| system.filesystem.usage | bytes | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | state | used, free, reserved | -| system.filesystem.utilization | 1 | ValueObserver | Double | device | (identifier) | -| | | | | state | used, free, reserved | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ----------------------------- | ----------- | ----- | ----------------- | ---------- | --------- | -------------------- | +| system.filesystem.usage | | By | UpDownSumObserver | Int64 | device | (identifier) | +| | | | | | state | used, free, reserved | +| system.filesystem.utilization | | 1 | ValueObserver | Double | device | (identifier) | +| | | | | | state | used, free, reserved | #### `system.network.` - Network metrics **Description:** System level network metrics. 
-| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| ------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.network.dropped\_packets | packets | SumObserver | Int64 | device | (identifier) | -| | | | | direction | transmit, receive | -| system.network.packets | packets | SumObserver | Int64 | device | (identifier) | -| | | | | direction | transmit, receive | -| system.network.errors | errors | SumObserver | Int64 | device | (identifier) | -| | | | | direction | transmit, receive | -| system.network.io | bytes | SumObserver | Int64 | device | (identifier) | -| | | | | direction | transmit, receive | -| system.network.connections | connections | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | -| | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ------------------------------- | ----------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.network.dropped\_packets | | {packets} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.errors | | {errors} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.io | | By | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device 
| (identifier) | +| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | +| | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | #### `system.process.` - Aggregate system process metrics **Description:** System level aggregate process metrics. For metrics at the individual process level, see [process metrics](process-metrics.md). -| Name | Units | Instrument Type | Value Type | Label Key | Label Values | -| -------------------- | --------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.process.count | processes | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| -------------------- | --------------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.process.count | Total number of processes in each state | {processes} | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | #### `system.{os}.` - OS Specific System Metrics From a0e3e2d5c550d1595412015fdd8889417088623b Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 24 Sep 2020 22:59:54 +0000 Subject: [PATCH 06/26] markdown-toc --- .../runtime-environment-metrics.md | 7 +++-- .../semantic_conventions/system-metrics.md | 27 +++++++++---------- 2 files changed, 16 insertions(+), 18 deletions(-) diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md index c4e057a18bd..c842809a6bf 100644 --- 
a/specification/metrics/semantic_conventions/runtime-environment-metrics.md +++ b/specification/metrics/semantic_conventions/runtime-environment-metrics.md @@ -9,10 +9,9 @@ metrics](process-metrics.md) when instrumenting runtime environments. -- [Semantic Conventions for Runtime Environment Metrics](#semantic-conventions-for-runtime-environment-metrics) - - [Metric Instruments](#metric-instruments) - - [Runtime Environment Metrics - `runtime.`](#runtime-environment-metrics---runtime) - - [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment) +- [Metric Instruments](#metric-instruments) + * [Runtime Environment Metrics - `runtime.`](#runtime-environment-metrics---runtime) + + [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index dc76372ecc9..c333d4ec488 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -9,20 +9,19 @@ creating instruments not explicitly defined in the specification. 
-- [Semantic Conventions for System Metrics](#semantic-conventions-for-system-metrics) - - [Semantic Conventions](#semantic-conventions) - - [Instrument Naming](#instrument-naming) - - [Units](#units) - - [Metric Instruments](#metric-instruments) - - [Standard System Metrics - `system.`](#standard-system-metrics---system) - - [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) - - [`system.memory.` - Memory metrics](#systemmemory---memory-metrics) - - [`system.swap.` - Swap/paging metrics](#systemswap---swappaging-metrics) - - [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics) - - [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics) - - [`system.network.` - Network metrics](#systemnetwork---network-metrics) - - [`system.process.` - Aggregate system process metrics](#systemprocess---aggregate-system-process-metrics) - - [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics) +- [Semantic Conventions](#semantic-conventions) + * [Instrument Naming](#instrument-naming) + * [Units](#units) +- [Metric Instruments](#metric-instruments) + * [Standard System Metrics - `system.`](#standard-system-metrics---system) + + [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) + + [`system.memory.` - Memory metrics](#systemmemory---memory-metrics) + + [`system.swap.` - Swap/paging metrics](#systemswap---swappaging-metrics) + + [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics) + + [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics) + + [`system.network.` - Network metrics](#systemnetwork---network-metrics) + + [`system.process.` - Aggregate system process metrics](#systemprocess---aggregate-system-process-metrics) + + [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics) From 4f7d3e1176df2edd4396e667f1bc124dd0749ee8 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: 
Mon, 28 Sep 2020 18:19:19 +0000 Subject: [PATCH 07/26] clarify OS process level metrics --- .../metrics/semantic_conventions/process-metrics.md | 11 +++++++---- .../runtime-environment-metrics.md | 2 +- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md index 66479b22f77..f40840704d0 100644 --- a/specification/metrics/semantic_conventions/process-metrics.md +++ b/specification/metrics/semantic_conventions/process-metrics.md @@ -1,9 +1,12 @@ -# Semantic Conventions for Process Metrics +# Semantic Conventions for OS Process Metrics -This document describes instruments and labels for common process level +This document describes instruments and labels for common OS process level metrics in OpenTelemetry. Also consider the general [semantic conventions for -system metrics](system-metrics.md#semantic-conventions) when creating -instruments not explicitly defined in this document. +system metrics](system-metrics.md) when creating instruments not explicitly +defined in this document. OS process metrics are not related to the specific +runtime environment of the program, and should take measurements from the +operating system. For runtime environment metrics see [semantic conventions +for runtime environment metrics](runtime-environment-metrics.md). diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md index c842809a6bf..d1b336155c1 100644 --- a/specification/metrics/semantic_conventions/runtime-environment-metrics.md +++ b/specification/metrics/semantic_conventions/runtime-environment-metrics.md @@ -2,7 +2,7 @@ This document includes semantic conventions for runtime environment level metrics in OpenTelemetry. 
Also consider the general semantic conventions for -[system metrics](system-metrics.md#semantic-conventions) and [process +[system metrics](system-metrics.md) and [OS Process metrics](process-metrics.md) when instrumenting runtime environments. From dc13aa2ed1477248c886a914b952ee652f1a0e82 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Mon, 28 Sep 2020 20:17:47 +0000 Subject: [PATCH 08/26] clarify load average exapmle --- .../semantic_conventions/system-metrics.md | 27 ++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index c333d4ec488..1af8ca5acc6 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -173,6 +173,27 @@ individual process level, see [process metrics](process-metrics.md). Instrument names for system level metrics that have different and conflicting meaning across multiple OSes should be prefixed with `system.{os}.` and follow the hierarchies listed above for different entities like CPU, memory, -and network. For example, an instrument for measuring the load average on -Linux could be named `system.linux.cpu.load`, reusing the `cpu` name proposed -above. +and network. This follows the rule of thumb that [aggregations over all the +dimensions of a given metric SHOULD be +meaningful.](https://prometheus.io/docs/practices/naming/#metric-names:~:text=As%20a%20rule%20of%20thumb%2C%20either,be%20meaningful%20(though%20not%20necessarily%20useful).) 
+ +For example, [UNIX load +average](https://en.wikipedia.org/wiki/Load_(computing)) over a given +interval is not well standardized and its value across different UNIX like +OSes may vary despite being under similar load: + +> Without getting into the vagaries of every Unix-like operating system in +existence, the load average more or less represents the average number of +processes that are in the running (using the CPU) or runnable (waiting for +the CPU) states. One notable exception exists: Linux includes processes in +uninterruptible sleep states, typically waiting for some I/O activity to +complete. This can markedly increase the load average on Linux systems. + +([source of +quote](https://github.com/torvalds/linux/blob/e4cbce4d131753eca271d9d67f58c6377f27ad21/kernel/sched/loadavg.c#L11-L18), +[linux source +code](https://github.com/torvalds/linux/blob/e4cbce4d131753eca271d9d67f58c6377f27ad21/kernel/sched/loadavg.c#L11-L18)) + +An instrument for load average over 1 minute on Linux could be named +`system.linux.cpu.load_1m`, reusing the `cpu` name proposed above and having +an `{os}` prefix to split this metric across OSes. 
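
Reviewer note on the `system.{os}.` rule in the patch above: the naming scheme can be sketched as a small helper. This is only an illustration under stated assumptions. The function `instrument_name` and its parameters are hypothetical and are not part of the specification or of any OpenTelemetry API; the only names taken from the patch are `system.cpu.time` and `system.linux.cpu.load_1m`.

```python
from typing import Optional


def instrument_name(entity: str, metric: str, os_name: Optional[str] = None) -> str:
    """Build a dotted instrument name under the `system.` hierarchy.

    Hypothetical helper for illustration only. When `os_name` is given,
    an `{os}` segment is placed right after `system.`, so that metrics
    whose meaning differs across operating systems are kept apart.
    """
    parts = ["system"]
    if os_name is not None:
        # Lower-case the OS segment so the name stays consistent.
        parts.append(os_name.lower())
    parts.extend([entity, metric])
    return ".".join(parts)


# A metric with the same meaning on every OS keeps the plain hierarchy.
portable = instrument_name("cpu", "time")

# Load average differs between Unix-like systems, so it gets an OS prefix.
linux_only = instrument_name("cpu", "load_1m", os_name="Linux")
```

Keeping the OS segment directly after `system.` leaves the entity hierarchy (`cpu`, `memory`, `network`) unchanged underneath it, which appears to be the intent of "follow the hierarchies listed above".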
From ceb99bb8411f3bd17a1360df11f9fddf996d5128 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 1 Oct 2020 17:53:16 +0000 Subject: [PATCH 09/26] move general conventions + OTEP 108 into README.md --- .../metrics/semantic_conventions/README.md | 100 ++++++++++++++++++ .../semantic_conventions/system-metrics.md | 67 +----------- 2 files changed, 103 insertions(+), 64 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index cea4f0975b0..23baf5241c8 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -10,3 +10,103 @@ The following semantic conventions surrounding metrics are defined: Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md), OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own [Resource Semantic Conventions](../../resource/semantic_conventions/README.md). + +## General Guidelines + +Metric names and labels exist within a single universe and a single +hierarchy. Metric names and labels MUST be considered within the universe of +all existing metric names. When defining new metric names and labels, +consider the prior art of existing standard metrics and metrics from +frameworks/libraries. + +Associated metrics SHOULD be nested together in a hierarchy based on their +usage. Define a top-level hierarchy for common metric categories: for OS +metrics, like CPU and network; for app runtimes, like GC internals. Libraries +and frameworks should nest their metrics into a hierarchy as well. This aids +in discovery and ad hoc comparison. This allows a user to find similar metrics +given a certain metric. + +The hierarchical structure of metrics defines the namespacing.
Supporting +OpenTelemetry artifacts define the metric structures and hierarchies for some +categories of metrics, and these can assist decisions when creating future +metrics. + +Common labels SHOULD be consistently named. This aids in discoverability and +disambiguates similar labels to metric names. + +["As a rule of thumb, **aggregations** over all the dimensions of a given +metric **SHOULD** be +meaningful,"](https://prometheus.io/docs/practices/naming/#metric-names) as +Prometheus recommends. + +Semantic ambiguity SHOULD be avoided. Use prefixed metric names in cases +where similar metrics have significantly different implementations across the +breadth of all existing metrics. For example, every garbage collected runtime +has slightly different strategies and measures. Using a single set of metric +names for GC, not divided by the runtime, could create dissimilar comparisons +and confusion for end users. (For example, prefer `runtime.java.gc*` over +`runtime.gc.*`.) Measures of many operating system metrics are similar. + +For conventional metrics or metrics that have their units included in +OpenTelemetry metadata (eg `metric.WithUnit` in Go), SHOULD NOT include the +units in the metric name. Units may be included when it provides additional +meaning to the metric name. Metrics MUST, above all, be understandable and +usable. + +## General Metric Semantic Conventions + +The following semantic conventions aim to keep naming consistent. They +provide guidelines for most of the cases in this specification and should be +followed for other instruments not explicitly defined in this document. + +### Instrument Naming + +- **limit** - an instrument that measures the constant, known total amount of +something should be called `entity.limit`. For example, `system.memory.limit` +for the total amount of memory on a system. + +- **usage** - an instrument that measures an amount used out of a known total +(**limit**) amount should be called `entity.usage`. 
For example, +`system.memory.usage` with label `state = used | cached | free | ...` for the +amount of memory in each state. In many cases, the sum of **usage** over +all label values is equal to the **limit**. + + A measure of the amount of an unlimited resource consumed is differentiated + from **usage**. + +- **utilization** - an instrument that measures the *fraction* of **usage** +out of its **limit** should be called `entity.utilization`. For example, +`system.memory.utilization` for the fraction of memory in use. Utilization +values are in the range `[0, 1]`. + +- **time** - an instrument that measures passage of time should be called +`entity.time`. For example, `system.cpu.time` with label `state = idle | user +| system | ...`. **time** measurements are not necessarily wall time and can be less than + or greater than the real wall time between measurements. + + **time** instruments are a special case of **usage** metrics, where the + **limit** can usually be calculated as the sum of **time** over all label + values. **utilization** can also be calculated and useful, for example + `system.cpu.utilization`. + +- **io** - an instrument that measures bidirectional data flow should be +called `entity.io` and have labels for direction. For example, +`system.network.io`. + +- Other instruments that do not fit the above descriptions may be named more +freely. For example, `system.swap.page_faults` and `system.network.packets`. +Units do not need to be specified in the names since they are included during +instrument creation, but can be added if there is ambiguity. + +### Units + +Units should follow the [UCUM](http://unitsofmeasure.org/ucum.html) (need +more clarification in +[#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)). + +- Instruments for **utilization** metrics (that measure the fraction out of a total) +SHOULD use units of `1`.
+- Instruments that measure an integer count of something have +["non-units"](https://ucum.org/ucum.html#section-Examples-for-some-Non-Units.) +and SHOULD use [annotations](https://ucum.org/ucum.html#para-curly) with curly +braces. For example `{packets}`, `{errors}`, `{faults}`, etc. diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 1af8ca5acc6..92a74397a4d 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -1,17 +1,14 @@ # Semantic Conventions for System Metrics This document describes instruments and labels for common system level -metrics in OpenTelemetry. Also included are general semantic conventions for -system, process, and runtime metrics, which should be considered when -creating instruments not explicitly defined in the specification. +metrics in OpenTelemetry. Consider the [General Metric Semantic +Conventions](README.md#general-metric-semantic-conventions) when creating +instruments not explicitly defined in the specification. -- [Semantic Conventions](#semantic-conventions) - * [Instrument Naming](#instrument-naming) - * [Units](#units) - [Metric Instruments](#metric-instruments) * [Standard System Metrics - `system.`](#standard-system-metrics---system) + [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) @@ -25,64 +22,6 @@ creating instruments not explicitly defined in the specification. -## Semantic Conventions - -The following semantic conventions aim to keep naming consistent. They -provide guidelines for most of the cases in this specification and should be -followed for other instruments not explicitly defined in this document. - -### Instrument Naming - -- **limit** - an instrument that measures the constant, known total amount of -something should be called `entity.limit`. 
For example, `system.memory.limit` -for the total amount of memory on a system. - -- **usage** - an instrument that measures an amount used out of a known total -(**limit**) amount should be called `entity.usage`. For example, -`system.memory.usage` with label `state = used | cached | free | ...` for the -amount of memory in a each state. In many cases, the sum of **usage** over -all label values is equal to the **limit**. - - A measure of the amount of an unlimited resource consumed is differentiated - from **usage**. - -- **utilization** - an instrument that measures the *fraction* of **usage** -out of its **limit** should be called `entity.utilization`. For example, -`system.memory.utilization` for the fraction of memory in use. Utilization -values are in the range `[0, 1]`. - -- **time** - an instrument that measures passage of time should be called -`entity.time`. For example, `system.cpu.time` with label `state = idle | user -| system | ...`. **time** measurements are not necessarily wall time and can be less than - or greater than the real wall time between measurements. - - **time** instruments are a special case of **usage** metrics, where the - **limit** can usually be calculated as the sum of **time** over all label - values. **utilization** can also be calculated and useful, for example - `system.cpu.utilization`. - -- **io** - an instrument that measures bidirectional data flow should be -called `entity.io` and have labels for direction. For example, -`system.network.io`. - -- Other instruments that do not fit the above descriptions may be named more -freely. For example, `system.swap.page_faults` and `system.network.packets`. -Units do not need to be specified in the names since they are included during -instrument creation, but can be added if there is ambiguity. 
- -### Units - -Units should follow the [UCUM](http://unitsofmeasure.org/ucum.html) (need -more clarification in -[#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)). - -- Instruments for **utilization** metrics (that measure the fraction out of a total) -SHOULD use units of `1`. -- Instruments that measure an integer count of something have -["non-units"](https://ucum.org/ucum.html#section-Examples-for-some-Non-Units.) -and SHOULD use [annotations](https://ucum.org/ucum.html#para-curly) with curly -braces. For example `{packets}`, `{errors}`, `{faults}`, etc. - ## Metric Instruments ### Standard System Metrics - `system.` From 45ae1f8a74eb55ea71a7796d835376ed664175e4 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 1 Oct 2020 17:57:33 +0000 Subject: [PATCH 10/26] renamed swap -> paging --- .../metrics/semantic_conventions/README.md | 2 +- .../semantic_conventions/system-metrics.md | 20 +++++++++---------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 23baf5241c8..e96556bf151 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -94,7 +94,7 @@ called `entity.io` and have labels for direction. For example, `system.network.io`. - Other instruments that do not fit the above descriptions may be named more -freely. For example, `system.swap.page_faults` and `system.network.packets`. +freely. For example, `system.paging.faults` and `system.network.packets`. Units do not need to be specified in the names since they are included during instrument creation, but can be added if there is ambiguity. 
diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 92a74397a4d..46b4ec17f5a 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -13,7 +13,7 @@ instruments not explicitly defined in the specification. * [Standard System Metrics - `system.`](#standard-system-metrics---system) + [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) + [`system.memory.` - Memory metrics](#systemmemory---memory-metrics) - + [`system.swap.` - Swap/paging metrics](#systemswap---swappaging-metrics) + + [`system.paging.` - Paging/swap metrics](#systempaging---pagingswap-metrics) + [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics) + [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics) + [`system.network.` - Network metrics](#systemnetwork---network-metrics) @@ -40,23 +40,23 @@ instruments not explicitly defined in the specification. #### `system.memory.` - Memory metrics **Description:** System level memory metrics. This does not include [paging/swap -memory](#systemswap---swappaging-metrics). +memory](#systempaging---pagingswap-metrics). | Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | | ------------------------- | ----------- | ----- | ----------------- | ---------- | --------- | ------------------------ | | system.memory.usage | | By | UpDownSumObserver | Int64 | state | used, free, cached, etc. | | system.memory.utilization | | 1 | ValueObserver | Double | state | used, free, cached, etc. | -#### `system.swap.` - Swap/paging metrics +#### `system.paging.` - Paging/swap metrics **Description:** System level paging/swap memory metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | -| ---------------------------- | ----------------------------------- | ------------ | ----------------- | ---------- | --------- | ------------ | -| system.swap.usage | Unix swap or windows pagefile usage | By | UpDownSumObserver | Int64 | state | used, free | -| system.swap.utilization | | 1 | ValueObserver | Double | state | used, free | -| system.swap.page\_faults | | {faults} | SumObserver | Int64 | type | major, minor | -| system.swap.page\_operations | | {operations} | SumObserver | Int64 | type | major, minor | -| | | | | | direction | in, out | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ------------------------- | ----------------------------------- | ------------ | ----------------- | ---------- | --------- | ------------ | +| system.paging.usage | Unix swap or windows pagefile usage | By | UpDownSumObserver | Int64 | state | used, free | +| system.paging.utilization | | 1 | ValueObserver | Double | state | used, free | +| system.paging.faults | | {faults} | SumObserver | Int64 | type | major, minor | +| system.paging.operations | | {operations} | SumObserver | Int64 | type | major, minor | +| | | | | | direction | in, out | #### `system.disk.` - Disk controller metrics From b3f7508e9d06c724bd67af6301a4c853c15130a1 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 1 Oct 2020 18:44:58 +0000 Subject: [PATCH 11/26] add addition fs labels --- .../semantic_conventions/system-metrics.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 46b4ec17f5a..0d5bf8ec400 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -75,12 +75,18 @@ 
memory](#systempaging---pagingswap-metrics). #### `system.filesystem.` - Filesystem metrics **Description:** System level filesystem metrics. -| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | -| ----------------------------- | ----------- | ----- | ----------------- | ---------- | --------- | -------------------- | -| system.filesystem.usage | | By | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | | state | used, free, reserved | -| system.filesystem.utilization | | 1 | ValueObserver | Double | device | (identifier) | -| | | | | | state | used, free, reserved | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ----------------------------- | ----------- | ----- | ----------------- | ---------- | ---------- | -------------------- | +| system.filesystem.usage | | By | UpDownSumObserver | Int64 | device | (identifier) | +| | | | | | state | used, free, reserved | +| | | | | | type | ext4, tmpfs, etc. | +| | | | | | mode | rw, ro, etc. | +| | | | | | mountpoint | (path) | +| system.filesystem.utilization | | 1 | ValueObserver | Double | device | (identifier) | +| | | | | | state | used, free, reserved | +| | | | | | type | ext4, tmpfs, etc. | +| | | | | | mode | rw, ro, etc. 
| +| | | | | | mountpoint | (path) | #### `system.network.` - Network metrics From 2512dd7532cc57ca83fa7379ffd2c360642ef2e9 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 1 Oct 2020 18:49:40 +0000 Subject: [PATCH 12/26] fix links --- .../metrics/semantic_conventions/process-metrics.md | 13 +++++++------ .../runtime-environment-metrics.md | 7 ++++--- .../metrics/semantic_conventions/system-metrics.md | 4 ++-- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md index f40840704d0..84818a28f46 100644 --- a/specification/metrics/semantic_conventions/process-metrics.md +++ b/specification/metrics/semantic_conventions/process-metrics.md @@ -1,12 +1,13 @@ # Semantic Conventions for OS Process Metrics This document describes instruments and labels for common OS process level -metrics in OpenTelemetry. Also consider the general [semantic conventions for -system metrics](system-metrics.md) when creating instruments not explicitly -defined in this document. OS process metrics are not related to the specific -runtime environment of the program, and should take measurements from the -operating system. For runtime environment metrics see [semantic conventions -for runtime environment metrics](runtime-environment-metrics.md). +metrics in OpenTelemetry. Also consider the [general metric semantic +conventions](README.md#general-metric-semantic-conventions) when creating +instruments not explicitly defined in this document. OS process metrics are +not related to the specific runtime environment of the program, and should +take measurements from the operating system. For runtime environment metrics +see [semantic conventions for runtime environment +metrics](runtime-environment-metrics.md). 
diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md index d1b336155c1..194baeaf94e 100644 --- a/specification/metrics/semantic_conventions/runtime-environment-metrics.md +++ b/specification/metrics/semantic_conventions/runtime-environment-metrics.md @@ -1,9 +1,10 @@ # Semantic Conventions for Runtime Environment Metrics This document includes semantic conventions for runtime environment level -metrics in OpenTelemetry. Also consider the general semantic conventions for -[system metrics](system-metrics.md) and [OS Process -metrics](process-metrics.md) when instrumenting runtime environments. +metrics in OpenTelemetry. Also consider the [general +metric](README.md#general-metric-semantic-conventions), [system +metrics](system-metrics.md) and [OS Process metrics](process-metrics.md) +semantic conventions when instrumenting runtime environments. diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 0d5bf8ec400..d098910b39f 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -1,8 +1,8 @@ # Semantic Conventions for System Metrics This document describes instruments and labels for common system level -metrics in OpenTelemetry. Consider the [General Metric Semantic -Conventions](README.md#general-metric-semantic-conventions) when creating +metrics in OpenTelemetry. Consider the [general metric semantic +conventions](README.md#general-metric-semantic-conventions) when creating instruments not explicitly defined in the specification. 
From 964c535e711b44947bd92d980eb7ce02fc557687 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 1 Oct 2020 19:12:13 +0000 Subject: [PATCH 13/26] fix link --- .../semantic_conventions/runtime-environment-metrics.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md index 194baeaf94e..15f6ea98507 100644 --- a/specification/metrics/semantic_conventions/runtime-environment-metrics.md +++ b/specification/metrics/semantic_conventions/runtime-environment-metrics.md @@ -32,10 +32,10 @@ discussion. Metrics specific to a certain runtime environment should be prefixed with `runtime.{environment}.` and follow the semantic conventions outlined in -[semantic conventions for system -metrics](system-metrics.md#semantic-conventions). Authors of runtime -instrumentations are responsible for the choice of `{environment}` to avoid -ambiguity when interpreting a metric's name or values. +[general metric semantic +conventions](README.md#general-metric-semantic-conventions). Authors of +runtime instrumentations are responsible for the choice of `{environment}` to +avoid ambiguity when interpreting a metric's name or values. 
For example, some programming languages have multiple runtime environments that vary significantly in their implementation, like [Python which has many From cde23930814d36a81ad434a65b09e4f1ea4529ea Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Tue, 6 Oct 2020 16:42:19 -0400 Subject: [PATCH 14/26] Update specification/metrics/semantic_conventions/README.md Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- specification/metrics/semantic_conventions/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index e96556bf151..3e48ae844fb 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -48,7 +48,7 @@ and confusion for end users. (For example, prefer `runtime.java.gc*` over `runtime.gc.*`.) Measures of many operating system metrics are similar. For conventional metrics or metrics that have their units included in -OpenTelemetry metadata (eg `metric.WithUnit` in Go), SHOULD NOT include the +OpenTelemetry metadata (e.g. `metric.WithUnit` in Go), SHOULD NOT include the units in the metric name. Units may be included when it provides additional meaning to the metric name. Metrics MUST, above all, be understandable and usable. 
From b758d248de4045156cac752de8537f96d7737dd3 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Tue, 6 Oct 2020 17:19:06 -0400 Subject: [PATCH 15/26] Update specification/metrics/semantic_conventions/README.md Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- specification/metrics/semantic_conventions/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 3e48ae844fb..4056becdd31 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -105,7 +105,7 @@ more clarification in [#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)). - Instruments for **utilization** metrics (that measure the fraction out of a total) -SHOULD use units of `1`. +SHOULD use the default unit `1` (the unity). - Instruments that measure an integer count of something have ["non-units"](https://ucum.org/ucum.html#section-Examples-for-some-Non-Units.) and SHOULD use [annotations](https://ucum.org/ucum.html#para-curly) with curly From c9a37fb5c156cbb2c6f4e427539400e395df8ca0 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 8 Oct 2020 18:05:20 -0400 Subject: [PATCH 16/26] Apply suggestions from code review Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- specification/metrics/semantic_conventions/system-metrics.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index d098910b39f..29f012f989c 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -30,10 +30,10 @@ instruments not explicitly defined in the specification. 
**Description:** System level processor metrics. -| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| Name | Description | Units | Instrument Type | Value Type | Label Key(s) | Label Values | | ---------------------- | ----------- | ----- | --------------- | ---------- | --------- | ----------------------------------- | | system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | | cpu | CPU number (0..n) | +| | | | | | cpu | CPU number [0..n-1] | | system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | | | | | | | cpu | CPU number (0..n) | From 6c1c579a78e83dffb0a19137c0856d0f37a562dc Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Tue, 6 Oct 2020 21:37:27 +0000 Subject: [PATCH 17/26] fix tigran comments --- .../metrics/semantic_conventions/README.md | 20 ++++++++++--------- .../semantic_conventions/process-metrics.md | 6 +++--- .../semantic_conventions/system-metrics.md | 6 ++---- 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 4056becdd31..9721413b3f9 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -45,7 +45,8 @@ breadth of all existing metrics. For example, every garbage collected runtime has slightly different strategies and measures. Using a single set of metric names for GC, not divided by the runtime, could create dissimilar comparisons and confusion for end users. (For example, prefer `runtime.java.gc*` over -`runtime.gc.*`.) Measures of many operating system metrics are similar. +`runtime.gc.*`.) Measures of many operating system metrics are similarly +ambiguous. For conventional metrics or metrics that have their units included in OpenTelemetry metadata (e.g. 
`metric.WithUnit` in Go), SHOULD NOT include the @@ -81,8 +82,8 @@ values are in the range `[0, 1]`. - **time** - an instrument that measures passage of time should be called `entity.time`. For example, `system.cpu.time` with label `state = idle | user -| system | ...`. **time** measurements are not necessarily wall time and can be less than - or greater than the real wall time between measurements. +| system | ...`. **time** measurements are not necessarily wall time and can +be less than or greater than the real wall time between measurements. **time** instruments are a special case of **usage** metrics, where the **limit** can usually be calculated as the sum of **time** over all label @@ -104,9 +105,10 @@ Units should follow the [UCUM](http://unitsofmeasure.org/ucum.html) (need more clarification in [#705](https://github.com/open-telemetry/opentelemetry-specification/issues/705)). -- Instruments for **utilization** metrics (that measure the fraction out of a total) -SHOULD use the default unit `1` (the unity). -- Instruments that measure an integer count of something have -["non-units"](https://ucum.org/ucum.html#section-Examples-for-some-Non-Units.) -and SHOULD use [annotations](https://ucum.org/ucum.html#para-curly) with curly -braces. For example `{packets}`, `{errors}`, `{faults}`, etc. +- Instruments for **utilization** metrics (that measure the fraction out of a +total) are dimensionless and SHOULD use the default unit `1` (the unity). +- Instruments that measure an integer count of something SHOULD use the +default unit `1` (the unity) and +[annotations](https://ucum.org/ucum.html#para-curly) with curly braces to +give additional meaning. For example `{packets}`, `{errors}`, `{faults}`, +etc. 
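As an illustration outside the patch itself, the relationship between **usage**, **limit**, and **utilization** described in the README conventions can be sketched in a few lines of Python. The sketch assumes the common case the text mentions, where the sum of **usage** over all `state` label values equals the **limit**; the helper name `utilization_by_state` and the sample byte counts are hypothetical.

```python
def utilization_by_state(usage_by_state):
    """Convert per-state usage (e.g. bytes) into per-state utilization.

    Assumption: the limit is the sum of usage over all states, which the
    conventions say holds "in many cases" but not always.
    """
    limit = sum(usage_by_state.values())
    if limit == 0:
        # Avoid dividing by zero when nothing is reported.
        return {state: 0.0 for state in usage_by_state}
    # Each value is a dimensionless fraction in [0, 1], reported with unit `1`.
    return {state: value / limit for state, value in usage_by_state.items()}


# Hypothetical system.memory.usage observations, in bytes (unit `By`).
memory_usage = {
    "used": 6 * 1024**3,
    "cached": 1 * 1024**3,
    "free": 1 * 1024**3,
}

# Corresponding system.memory.utilization values, one per `state` label.
memory_utilization = utilization_by_state(memory_usage)
```

The same calculation would apply to `system.cpu.time` and `system.cpu.utilization`, where the limit is the sum of **time** over all states, provided the inputs are deltas between two observations rather than cumulative totals.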
diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md index 84818a28f46..f6d8026df56 100644 --- a/specification/metrics/semantic_conventions/process-metrics.md +++ b/specification/metrics/semantic_conventions/process-metrics.md @@ -4,9 +4,9 @@ This document describes instruments and labels for common OS process level metrics in OpenTelemetry. Also consider the [general metric semantic conventions](README.md#general-metric-semantic-conventions) when creating instruments not explicitly defined in this document. OS process metrics are -not related to the specific runtime environment of the program, and should -take measurements from the operating system. For runtime environment metrics -see [semantic conventions for runtime environment +not related to the runtime environment of the program, and should take +measurements from the operating system. For runtime environment metrics see +[semantic conventions for runtime environment metrics](runtime-environment-metrics.md). diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 29f012f989c..b541cea10b1 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -93,7 +93,7 @@ memory](#systempaging---pagingswap-metrics). **Description:** System level network metrics. 
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | | ------------------------------- | ----------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.network.dropped\_packets | | {packets} | SumObserver | Int64 | device | (identifier) | +| system.network.dropped_packets | | {packets} | SumObserver | Int64 | device | (identifier) | | | | | | | direction | transmit, receive | | system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | | | | | | | direction | transmit, receive | @@ -118,9 +118,7 @@ individual process level, see [process metrics](process-metrics.md). Instrument names for system level metrics that have different and conflicting meaning across multiple OSes should be prefixed with `system.{os}.` and follow the hierarchies listed above for different entities like CPU, memory, -and network. This follows the rule of thumb that [aggregations over all the -dimensions of a given metric SHOULD be -meaningful.](https://prometheus.io/docs/practices/naming/#metric-names:~:text=As%20a%20rule%20of%20thumb%2C%20either,be%20meaningful%20(though%20not%20necessarily%20useful).) +and network. 
For example, [UNIX load average](https://en.wikipedia.org/wiki/Load_(computing)) over a given From 5ffcb5826073035e7f53fdac50fb458c701cd0e4 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 8 Oct 2020 22:35:07 +0000 Subject: [PATCH 18/26] add disk io_time and operation_time --- .../semantic_conventions/system-metrics.md | 44 ++++++++++++------- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index b541cea10b1..8bafe4c59e1 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -31,11 +31,11 @@ instruments not explicitly defined in the specification. **Description:** System level processor metrics. | Name | Description | Units | Instrument Type | Value Type | Label Key(s) | Label Values | -| ---------------------- | ----------- | ----- | --------------- | ---------- | --------- | ----------------------------------- | -| system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | | cpu | CPU number [0..n-1] | -| system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | -| | | | | | cpu | CPU number (0..n) | +| ---------------------- | ----------- | ----- | --------------- | ---------- | ------------ | ----------------------------------- | +| system.cpu.time | | s | SumObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | | cpu | CPU number [0..n-1] | +| system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | +| | | | | | cpu | CPU number (0..n) | #### `system.memory.` - Memory metrics @@ -61,16 +61,30 @@ memory](#systempaging---pagingswap-metrics). #### `system.disk.` - Disk controller metrics **Description:** System level disk performance metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | -| ---------------------------- | ----------- | ------------ | --------------- | ---------- | --------- | ------------ | -| system.disk.io | | By | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.time | | s | SumObserver | Double | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| --------------------------------------------------------- | -------------------------------------------------- | ------------ | --------------- | ---------- | --------- | ------------ | +| system.disk.io | | By | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.io_time[1](#io_time) | The actual time the queue and disks were busy | s | SumObserver | Double | device | (identifier) | +| system.disk.operation_time[2](#operation_time) | The sum of the time each request took to complete. | s | SumObserver | Double | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | + +1: I.e. the real elapsed time ("wall clock") used in the I/O +path (time from operations running in parallel are not counted). 
+- Linux: Field 13 from +[procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) +- Windows: Inverse of "Disk/% Idle Time" perf counter divided by elapsed time + +2: Because it is the sum of time each request +took, parallel-issued requests each contribute to make the count grow. +- Fields 7 & 11 from +[procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) +- Windows: "Avg. Disk sec/Read" perf counter multiplied by "Disk Reads/sec" +perf counter (similar for Writes) #### `system.filesystem.` - Filesystem metrics From 1b90514f5483c2d9bf42a511a7684f8b137cb48e Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 8 Oct 2020 23:20:07 +0000 Subject: [PATCH 19/26] add descriptions/footnotes for dropped packets and net errors --- .../semantic_conventions/system-metrics.md | 37 ++++++++++++------- 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 8bafe4c59e1..6b0144c09fe 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -105,19 +105,30 @@ perf counter (similar for Writes) #### `system.network.` - Network metrics **Description:** System level network metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | -| ------------------------------- | ----------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.network.dropped_packets | | {packets} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.errors | | {errors} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.io | | By | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | -| | | | | | state | [e.g. 
for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| -------------------------------------------------------------- | -------------------------------------------------------------------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.network.dropped_packets[1](#dropped_packets) | Packets that are dropped or discarded even though there was no error | {packets} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.errors[2](#errors) | Number of network errors detected | {errors} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.io | | By | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) | +| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | +| | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | + +1: Measured on Windows as +`InDiscards`/`OutDiscards` +([source](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2)). +On Linux, the `drop` column in `/proc/net/dev` +([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). +2: Measured on Windows as `InErrors`/`OutErrors` +([source](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2)). 
+On Linux, the `errs` column in `/proc/net/dev` +([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). #### `system.process.` - Aggregate system process metrics From 7b14a93fe0328e5a4e19e69ad89d6187210d1851 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 8 Oct 2020 23:39:25 +0000 Subject: [PATCH 20/26] lint, more info for net dropped packets/errors --- .../metrics/semantic_conventions/system-metrics.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 6b0144c09fe..1cada649166 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -75,12 +75,14 @@ memory](#systempaging---pagingswap-metrics). 1: I.e. the real elapsed time ("wall clock") used in the I/O path (time from operations running in parallel are not counted). + - Linux: Field 13 from [procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) - Windows: Inverse of "Disk/% Idle Time" perf counter divided by elapsed time 2: Because it is the sum of time each request took, parallel-issued requests each contribute to make the count grow. + - Fields 7 & 11 from [procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) - Windows: "Avg. Disk sec/Read" perf counter multiplied by "Disk Reads/sec" @@ -120,13 +122,16 @@ perf counter (similar for Writes) | | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | 1: Measured on Windows as -`InDiscards`/`OutDiscards` -([source](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2)). 
+[`InDiscards`/`OutDiscards`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2) +from +[`GetIfEntry2`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/nf-netioapi-getifentry2). On Linux, the `drop` column in `/proc/net/dev` ([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). -2: Measured on Windows as `InErrors`/`OutErrors` -([source](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2)). +2: Measured on Windows as +[`InErrors`/`OutErrors`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2) +from +[`GetIfEntry2`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/nf-netioapi-getifentry2). On Linux, the `errs` column in `/proc/net/dev` ([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). From a9037833c2ec7c68b482002e358a80ab95c11ccb Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Fri, 9 Oct 2020 00:04:54 +0000 Subject: [PATCH 21/26] "dropped_packets" -> "dropped" --- .../semantic_conventions/system-metrics.md | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 1cada649166..77fe5260b6e 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -107,21 +107,21 @@ perf counter (similar for Writes) #### `system.network.` - Network metrics **Description:** System level network metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | -| -------------------------------------------------------------- | -------------------------------------------------------------------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | -| system.network.dropped_packets[1](#dropped_packets) | Packets that are dropped or discarded even though there was no error | {packets} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.errors[2](#errors) | Number of network errors detected | {errors} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.io | | By | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | transmit, receive | -| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) | -| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | -| | | | | | state | [e.g. 
for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | - -1: Measured on Windows as +| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| ---------------------------------------------- | ----------------------------------------------------------------------------- | ------------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | +| system.network.dropped[1](#dropped) | Count of packets that are dropped or discarded even though there was no error | {packets} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.packets | | {packets} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.errors[2](#errors) | Count of network errors detected | {errors} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.io | | By | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | transmit, receive | +| system.network.connections | | {connections} | UpDownSumObserver | Int64 | device | (identifier) | +| | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | +| | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | + +1: Measured on Windows as [`InDiscards`/`OutDiscards`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2) from [`GetIfEntry2`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/nf-netioapi-getifentry2). 
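As a rough illustration of how the `device` and `direction` labels in the network table map onto the Linux counters, the sketch below parses text laid out like the per-interface statistics file (found at `/proc/net/dev` on Linux) and emits one value per device and direction for `system.network.errors` and `system.network.dropped`. The sample values, the `parse_net_dev` helper, and the tuple shape it returns are assumptions made for this example; they are not part of the specification or of any OpenTelemetry API.

```python
# Sample text in the layout of the Linux per-interface statistics file.
# All counter values here are made up for illustration.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000    4000    2    7    0     0          0         0   250000    3000    1    3    0     0       0          0
"""

# Each direction has 8 columns; `errs` is the third and `drop` the fourth.
COLUMNS_PER_DIRECTION = 8
ERRS_INDEX = 2
DROP_INDEX = 3


def parse_net_dev(text):
    """Hypothetical helper: return (metric name, labels, value) tuples."""
    points = []
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        device, _, rest = line.partition(":")
        fields = [int(value) for value in rest.split()]
        receive = fields[:COLUMNS_PER_DIRECTION]
        transmit = fields[COLUMNS_PER_DIRECTION:2 * COLUMNS_PER_DIRECTION]
        for direction, columns in (("receive", receive), ("transmit", transmit)):
            labels = {"device": device.strip(), "direction": direction}
            points.append(("system.network.errors", labels, columns[ERRS_INDEX]))
            points.append(("system.network.dropped", labels, columns[DROP_INDEX]))
    return points
```

Calling `parse_net_dev(SAMPLE)` yields four values per interface, one for each combination of metric and direction. A real implementation would read the file on each observation and report the cumulative counters as they are.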
From c218cac8c39b841a54e32821d8c1918d351ddcfe Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Mon, 12 Oct 2020 12:34:15 -0400 Subject: [PATCH 22/26] Apply suggestions from James' code review Co-authored-by: James Bebbington --- specification/metrics/semantic_conventions/README.md | 4 ++-- specification/metrics/semantic_conventions/system-metrics.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 9721413b3f9..48fc0fe0338 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -48,8 +48,8 @@ and confusion for end users. (For example, prefer `runtime.java.gc*` over `runtime.gc.*`.) Measures of many operating system metrics are similarly ambiguous. -For conventional metrics or metrics that have their units included in -OpenTelemetry metadata (e.g. `metric.WithUnit` in Go), SHOULD NOT include the +Conventional metrics or metrics that have their units included in +OpenTelemetry metadata (e.g. `metric.WithUnit` in Go) SHOULD NOT include the units in the metric name. Units may be included when it provides additional meaning to the metric name. Metrics MUST, above all, be understandable and usable. diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 77fe5260b6e..421c8f30fdd 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -73,7 +73,7 @@ memory](#systempaging---pagingswap-metrics). | system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | | | | | | | direction | read, write | -1: I.e. the real elapsed time ("wall clock") used in the I/O +1 i.e. 
the real elapsed time ("wall clock") used in the I/O path (time from operations running in parallel are not counted). - Linux: Field 13 from @@ -83,7 +83,7 @@ path (time from operations running in parallel are not counted). 2: Because it is the sum of time each request took, parallel-issued requests each contribute to make the count grow. -- Fields 7 & 11 from +- Linux: Fields 7 & 11 from [procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) - Windows: "Avg. Disk sec/Read" perf counter multiplied by "Disk Reads/sec" perf counter (similar for Writes) From 09a31b788237897b6ad448abd122858f4db8ec27 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Mon, 12 Oct 2020 16:35:36 +0000 Subject: [PATCH 23/26] comments from James' code review --- .../metrics/semantic_conventions/README.md | 4 +- .../semantic_conventions/process-metrics.md | 3 - .../runtime-environment-metrics.md | 7 +- .../semantic_conventions/system-metrics.md | 85 ++++++++++--------- 4 files changed, 48 insertions(+), 51 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 48fc0fe0338..62f61569f20 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -69,8 +69,8 @@ for the total amount of memory on a system. - **usage** - an instrument that measures an amount used out of a known total (**limit**) amount should be called `entity.usage`. For example, `system.memory.usage` with label `state = used | cached | free | ...` for the -amount of memory in a each state. In many cases, the sum of **usage** over -all label values is equal to the **limit**. +amount of memory in a each state. Where appropriate, the sum of **usage** +over all label values SHOULD be equal to the **limit**. A measure of the amount of an unlimited resource consumed is differentiated from **usage**. 
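The relationship between **usage** and **limit** in the README hunk above can be sketched with a few lines of Python. The state names follow the `used | cached | free` example in the text; the byte values and both helper functions are hypothetical and exist only for this illustration. Deriving a unit-`1` fraction by dividing usage by the limit is likewise an assumption about how a utilization value could be computed, not something this hunk prescribes.

```python
# Hypothetical sample: bytes of memory in each state, as an instrument such
# as system.memory.usage might report with the label `state`.
usage_by_state = {
    "used": 6_000_000_000,
    "cached": 1_500_000_000,
    "free": 500_000_000,
}

# Hypothetical limit: the total amount of memory on the system, in bytes.
limit = 8_000_000_000


def usage_matches_limit(usage, limit):
    """True when usage summed over all label values equals the limit."""
    return sum(usage.values()) == limit


def fraction_by_state(usage, limit):
    """Assumed derivation: share of the limit in each state (unit 1)."""
    return {state: value / limit for state, value in usage.items()}
```

With the sample data, `usage_matches_limit(usage_by_state, limit)` holds, and the fractions returned by `fraction_by_state` add up to 1. When the states reported do not cover the whole resource, the sum falls short of the limit, which is why the text qualifies the requirement with "where appropriate".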
diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md index f6d8026df56..3d2d4e28e75 100644 --- a/specification/metrics/semantic_conventions/process-metrics.md +++ b/specification/metrics/semantic_conventions/process-metrics.md @@ -14,12 +14,9 @@ metrics](runtime-environment-metrics.md). - [Metric Instruments](#metric-instruments) - * [Standard Process Metrics - `process.`](#standard-process-metrics---process) ## Metric Instruments -### Standard Process Metrics - `process.` - TODO diff --git a/specification/metrics/semantic_conventions/runtime-environment-metrics.md b/specification/metrics/semantic_conventions/runtime-environment-metrics.md index 15f6ea98507..a1abb095162 100644 --- a/specification/metrics/semantic_conventions/runtime-environment-metrics.md +++ b/specification/metrics/semantic_conventions/runtime-environment-metrics.md @@ -11,15 +11,12 @@ semantic conventions when instrumenting runtime environments. - [Metric Instruments](#metric-instruments) - * [Runtime Environment Metrics - `runtime.`](#runtime-environment-metrics---runtime) - + [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment) + * [Runtime Environment Specific Metrics - `runtime.{environment}.`](#runtime-environment-specific-metrics---runtimeenvironment) ## Metric Instruments -### Runtime Environment Metrics - `runtime.` - Runtime environments vary widely in their terminology, implementation, and relative values for a given metric. For example, Go and Python are both garbage collected languages, but comparing heap usage between the Go and @@ -28,7 +25,7 @@ does not propose any standard top-level runtime metric instruments. See [OTEP 108](https://github.com/open-telemetry/oteps/pull/108/files) for additional discussion. 
-#### Runtime Environment Specific Metrics - `runtime.{environment}.` +### Runtime Environment Specific Metrics - `runtime.{environment}.` Metrics specific to a certain runtime environment should be prefixed with `runtime.{environment}.` and follow the semantic conventions outlined in diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 421c8f30fdd..50a1f6e52b7 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -10,23 +10,20 @@ instruments not explicitly defined in the specification. - [Metric Instruments](#metric-instruments) - * [Standard System Metrics - `system.`](#standard-system-metrics---system) - + [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) - + [`system.memory.` - Memory metrics](#systemmemory---memory-metrics) - + [`system.paging.` - Paging/swap metrics](#systempaging---pagingswap-metrics) - + [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics) - + [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics) - + [`system.network.` - Network metrics](#systemnetwork---network-metrics) - + [`system.process.` - Aggregate system process metrics](#systemprocess---aggregate-system-process-metrics) - + [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics) + * [`system.cpu.` - Processor metrics](#systemcpu---processor-metrics) + * [`system.memory.` - Memory metrics](#systemmemory---memory-metrics) + * [`system.paging.` - Paging/swap metrics](#systempaging---pagingswap-metrics) + * [`system.disk.` - Disk controller metrics](#systemdisk---disk-controller-metrics) + * [`system.filesystem.` - Filesystem metrics](#systemfilesystem---filesystem-metrics) + * [`system.network.` - Network metrics](#systemnetwork---network-metrics) + * [`system.process.` - Aggregate system process 
metrics](#systemprocess---aggregate-system-process-metrics) + * [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics) ## Metric Instruments -### Standard System Metrics - `system.` - -#### `system.cpu.` - Processor metrics +### `system.cpu.` - Processor metrics **Description:** System level processor metrics. @@ -37,7 +34,7 @@ instruments not explicitly defined in the specification. | system.cpu.utilization | | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. | | | | | | | cpu | CPU number (0..n) | -#### `system.memory.` - Memory metrics +### `system.memory.` - Memory metrics **Description:** System level memory metrics. This does not include [paging/swap memory](#systempaging---pagingswap-metrics). @@ -47,7 +44,7 @@ memory](#systempaging---pagingswap-metrics). | system.memory.usage | | By | UpDownSumObserver | Int64 | state | used, free, cached, etc. | | system.memory.utilization | | 1 | ValueObserver | Double | state | used, free, cached, etc. | -#### `system.paging.` - Paging/swap metrics +### `system.paging.` - Paging/swap metrics **Description:** System level paging/swap memory metrics. | Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -58,37 +55,39 @@ memory](#systempaging---pagingswap-metrics). | system.paging.operations | | {operations} | SumObserver | Int64 | type | major, minor | | | | | | | direction | in, out | -#### `system.disk.` - Disk controller metrics +### `system.disk.` - Disk controller metrics **Description:** System level disk performance metrics. 
-| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | -| --------------------------------------------------------- | -------------------------------------------------- | ------------ | --------------- | ---------- | --------- | ------------ | -| system.disk.io | | By | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.io_time[1](#io_time) | The actual time the queue and disks were busy | s | SumObserver | Double | device | (identifier) | -| system.disk.operation_time[2](#operation_time) | The sum of the time each request took to complete. | s | SumObserver | Double | device | (identifier) | -| | | | | | direction | read, write | -| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | -| | | | | | direction | read, write | - -1 i.e. the real elapsed time ("wall clock") used in the I/O -path (time from operations running in parallel are not counted). 
+| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | +| --------------------------------------------------------- | ----------------------------------------------- | ------------ | --------------- | ---------- | --------- | ------------ | +| system.disk.io | | By | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.operations | | {operations} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.io_time[1](#io_time) | Time disk spent activated | s | SumObserver | Double | device | (identifier) | +| system.disk.operation_time[2](#operation_time) | Sum of the time each operation took to complete | s | SumObserver | Double | device | (identifier) | +| | | | | | direction | read, write | +| system.disk.merged | | {operations} | SumObserver | Int64 | device | (identifier) | +| | | | | | direction | read, write | + +1 The real elapsed time ("wall clock") +used in the I/O path (time from operations running in parallel are not +counted). Measured as: - Linux: Field 13 from [procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) - Windows: Inverse of "Disk/% Idle Time" perf counter divided by elapsed time -2: Because it is the sum of time each request -took, parallel-issued requests each contribute to make the count grow. +2 Because it is the sum of time each +request took, parallel-issued requests each contribute to make the count +grow. Measured as: - Linux: Fields 7 & 11 from [procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) - Windows: "Avg. Disk sec/Read" perf counter multiplied by "Disk Reads/sec" perf counter (similar for Writes) -#### `system.filesystem.` - Filesystem metrics +### `system.filesystem.` - Filesystem metrics **Description:** System level filesystem metrics. 
| Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -104,7 +103,7 @@ perf counter (similar for Writes) | | | | | | mode | rw, ro, etc. | | | | | | | mountpoint | (path) | -#### `system.network.` - Network metrics +### `system.network.` - Network metrics **Description:** System level network metrics. | Name | Description | Units | Instrument Type | Value Type | Label Key | Label Values | @@ -121,21 +120,25 @@ perf counter (similar for Writes) | | | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | | | | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) | -1: Measured on Windows as +1 Measured as: + +- Linux: the `drop` column in `/proc/dev/net` +([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). +- Windows: [`InDiscards`/`OutDiscards`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2) from [`GetIfEntry2`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/nf-netioapi-getifentry2). -On Linux, the `drop` column in `/proc/dev/net` -([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). -2: Measured on Windows as +2 Measured as: + +- Linux: the `errs` column in `/proc/dev/net` +([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). +- Windows: [`InErrors`/`OutErrors`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/ns-netioapi-mib_if_row2) from [`GetIfEntry2`](https://docs.microsoft.com/en-us/windows/win32/api/netioapi/nf-netioapi-getifentry2). -On Linux, the `errs` column in `/proc/dev/net` -([source](https://web.archive.org/web/20180321091318/http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html)). 
-#### `system.process.` - Aggregate system process metrics +### `system.process.` - Aggregate system process metrics **Description:** System level aggregate process metrics. For metrics at the individual process level, see [process metrics](process-metrics.md). @@ -143,7 +146,7 @@ individual process level, see [process metrics](process-metrics.md). | -------------------- | --------------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- | | system.process.count | Total number of processes in each state | {processes} | UpDownSumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) | -#### `system.{os}.` - OS Specific System Metrics +### `system.{os}.` - OS Specific System Metrics Instrument names for system level metrics that have different and conflicting meaning across multiple OSes should be prefixed with `system.{os}.` and From 8fec8f99c50643c0770a1995ba2580cba8ec8d7f Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Mon, 12 Oct 2020 22:59:01 +0000 Subject: [PATCH 24/26] clarify windows perf counter --- specification/metrics/semantic_conventions/system-metrics.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md index 50a1f6e52b7..7468ed34aa0 100644 --- a/specification/metrics/semantic_conventions/system-metrics.md +++ b/specification/metrics/semantic_conventions/system-metrics.md @@ -76,7 +76,9 @@ counted). 
Measured as: - Linux: Field 13 from [procfs-diskstats](https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats) -- Windows: Inverse of "Disk/% Idle Time" perf counter divided by elapsed time +- Windows: The complement of ["Disk\% Idle +Time"](https://docs.microsoft.com/en-us/archive/blogs/askcore/windows-performance-monitor-disk-counters-explained#windows-performance-monitor-disk-counters-explained:~:text=%25%20Idle%20Time,Idle\)%20to%200%20(meaning%20always%20busy).) +performance counter: `uptime * (100 - "Disk\% Idle Time") / 100` 2 Because it is the sum of time each request took, parallel-issued requests each contribute to make the count From aa5e16eb09ddb626f873be6b120b4dab5eb2c917 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 15 Oct 2020 14:07:17 -0400 Subject: [PATCH 25/26] Update specification/metrics/semantic_conventions/README.md Co-authored-by: Joshua MacDonald --- specification/metrics/semantic_conventions/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 62f61569f20..6de918b3c98 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -87,8 +87,9 @@ be less than or greater than the real wall time between measurements. **time** instruments are a special case of **usage** metrics, where the **limit** can usually be calculated as the sum of **time** over all label - values. **utilization** can also be calculated and useful, for example - `system.cpu.utilization`. + values. **utilization** for time instruments can be derived automatically using + metric event timestamps. For example, `system.cpu.utilization` is defined as the difference + in `system.cpu.time` measurements divided by the elapsed time. - **io** - an instrument that measures bidirectional data flow should be called `entity.io` and have labels for direction. 
For example, From aa2856609ee63c2a2706bda312108c63fbade351 Mon Sep 17 00:00:00 2001 From: Aaron Abbott Date: Thu, 15 Oct 2020 18:10:04 +0000 Subject: [PATCH 26/26] reflow text --- .../metrics/semantic_conventions/README.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md index 6de918b3c98..935709d16af 100644 --- a/specification/metrics/semantic_conventions/README.md +++ b/specification/metrics/semantic_conventions/README.md @@ -7,9 +7,11 @@ The following semantic conventions surrounding metrics are defined: * [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics. * [Runtime Environment Metrics](runtime-environment-metrics.md): Semantic conventions and instruments for runtime environment metrics. -Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md), -OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own -[Resource Semantic Conventions](../../resource/semantic_conventions/README.md). +Apart from semantic conventions for metrics and +[traces](../../trace/semantic_conventions/README.md), OpenTelemetry also +defines the concept of overarching [Resources](../../resource/sdk.md) with +their own [Resource Semantic +Conventions](../../resource/semantic_conventions/README.md). ## General Guidelines @@ -87,9 +89,10 @@ be less than or greater than the real wall time between measurements. **time** instruments are a special case of **usage** metrics, where the **limit** can usually be calculated as the sum of **time** over all label - values. **utilization** for time instruments can be derived automatically using - metric event timestamps. For example, `system.cpu.utilization` is defined as the difference - in `system.cpu.time` measurements divided by the elapsed time. + values. 
**utilization** for time instruments can be derived automatically + using metric event timestamps. For example, `system.cpu.utilization` is + defined as the difference in `system.cpu.time` measurements divided by the + elapsed time. - **io** - an instrument that measures bidirectional data flow should be called `entity.io` and have labels for direction. For example,