- Stage: 0 (strawman)
- Date: 2023-03-01
The following high-level metrics should be available per host to indicate its health:
- CPU used (in %) and load
- Memory used (in %, used, total)
- Disk usage (in %) and io -> summary
- Network (traffic in / out)
This translates to the following metrics. The goal is to keep the number of fields as small as possible.
- host.cpu.system.norm.pct
- host.cpu.user.norm.pct
- host.fsstats.total_size.used (in bytes)
- host.fsstats.total_size.total (in bytes)
- host.fsstats.total_size.used.pct
- host.load.norm.1
- host.load.norm.5
- host.load.norm.15
- host.memory.actual.used.bytes
- host.memory.actual.used.pct
- host.memory.total
- host.network.egress.bytes
- host.network.ingress.bytes
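As a rough illustration of how a shipper might derive these fields from raw readings, the following Python sketch builds one flat document. The `HostSample` type, its attribute names, and the `ratio` and `build_host_metrics` helpers are hypothetical and not part of this proposal. It also assumes that `norm` means "divided by the number of CPU cores", as in the existing Metricbeat system metrics, and that `pct` values are ratios between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """Raw readings from one host (hypothetical input type)."""
    cores: int
    cpu_system: float      # system CPU ratio summed over all cores
    cpu_user: float        # user CPU ratio summed over all cores
    load_1: float
    load_5: float
    load_15: float
    mem_used_bytes: int    # "actual" used memory
    mem_total_bytes: int
    fs_used_bytes: int     # summed over all file systems
    fs_total_bytes: int
    net_in_bytes: int
    net_out_bytes: int


def ratio(part: float, whole: float) -> float:
    """Return part / whole, or 0.0 when whole is zero."""
    return part / whole if whole else 0.0


def build_host_metrics(s: HostSample) -> dict:
    """Return the proposed host.* fields as a flat dict (illustrative only)."""
    return {
        # Assumption: "norm" divides by the core count.
        "host.cpu.system.norm.pct": ratio(s.cpu_system, s.cores),
        "host.cpu.user.norm.pct": ratio(s.cpu_user, s.cores),
        "host.fsstats.total_size.used": s.fs_used_bytes,
        "host.fsstats.total_size.total": s.fs_total_bytes,
        "host.fsstats.total_size.used.pct": ratio(s.fs_used_bytes, s.fs_total_bytes),
        "host.load.norm.1": ratio(s.load_1, s.cores),
        "host.load.norm.5": ratio(s.load_5, s.cores),
        "host.load.norm.15": ratio(s.load_15, s.cores),
        "host.memory.actual.used.bytes": s.mem_used_bytes,
        "host.memory.actual.used.pct": ratio(s.mem_used_bytes, s.mem_total_bytes),
        "host.memory.total": s.mem_total_bytes,
        "host.network.egress.bytes": s.net_out_bytes,
        "host.network.ingress.bytes": s.net_in_bytes,
    }
```

The dict is kept flat with dotted keys so that it lines up one-to-one with the field list above; a real shipper would likely emit nested objects instead.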
cgroup metrics were left out of the proposal by design and might be added later on. More details around cgroups can be found in the cgroup RFC.
These metrics can be used to give a quick overview of how a specific host is doing. Some examples:
- An agent is running on a host and reports metrics about some services running on it. The host metrics are shipped alongside them to show how the host is doing.
- A user is looking at service metrics delivered by APM. These metrics are used to show how the host the service is running on is doing.
In the context of usage, it is also important to note what is NOT part of the fields by design:
- Process metrics: Details about individual processes are not included. For these, detailed process collection must be enabled.
- Cgroup metrics: cgroup metrics might follow at a later stage
This data comes from monitoring a host such as a Linux machine, a laptop, or a k8s node. It can be delivered through different shippers such as Elastic Agent system metrics inputs, APM agents, Prometheus Node Exporter, and other host metric collectors.
Currently Elastic Agent and Metricbeat ship host/system metrics under the system.* prefix. This proposal would change the prefix to host.*. One reason is that some network metrics already exist under this prefix in ECS, so conflicts can be prevented. Another advantage is that some of these fields might use newer field types like gauge and counter, delivered by TSDB in Elasticsearch, which is possible without a breaking change.
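To make the point about metric types more concrete, here is a minimal sketch of how a few of the proposed fields could be declared with the `time_series_metric` mapping parameter in Elasticsearch. The helper names `metric_field` and `build_properties`, the choice of `scaled_float` and `long` as field types, the scaling factor, and treating every example field as a `gauge` are assumptions for illustration only; the RFC does not define them.

```python
def metric_field(name: str, kind: str = "gauge") -> dict:
    """Return an assumed mapping entry for one metric field."""
    if kind not in ("gauge", "counter"):
        raise ValueError(f"unsupported metric kind: {kind}")
    if name.endswith(".pct"):
        # Assumption: ratios are stored as scaled_float.
        return {
            "type": "scaled_float",
            "scaling_factor": 1000,
            "time_series_metric": kind,
        }
    # Assumption: byte values are stored as long.
    return {"type": "long", "time_series_metric": kind}


# A subset of the proposed fields, chosen for illustration.
EXAMPLE_FIELDS = [
    "host.cpu.system.norm.pct",
    "host.memory.actual.used.bytes",
    "host.memory.actual.used.pct",
    "host.network.ingress.bytes",
]


def build_properties(names: list) -> dict:
    """Build a 'properties' fragment keyed by dotted field name."""
    return {name: metric_field(name) for name in names}


properties = build_properties(EXAMPLE_FIELDS)
```

A `counter` would only be appropriate for a field that holds a monotonically increasing value, which is why the sketch leaves the kind as a parameter rather than fixing it per field.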
- It still needs to be figured out how existing shippers will migrate to the new fields.
- Not all metrics might be available on all operating systems. How will we deal with this limitation?
- host.cpu.usage already exists. How do the new fields relate to it?
The following are the people that consulted on the contents of this RFC.
- @ruflin | author
- @andrewkroh | reviewer
- @felixbarny | reviewer
- @gizas | reviewer
- @lalit-satapathy | reviewer
- @neptunian | reviewer
- @tommyers-elastic | reviewer
- Schema for metrics in ECS
- Otel host metrics
- ECS cgroup rfc
- Prometheus Node Exporter
- APM System metrics fields
- APM Agent system metrics fields
- APM addition of Cgroup metrics
- Host metrics used in Inventory view of Kibana (related queries)
- Stage 0: #2129