Agents periodically collect and report various metrics, described below.
All agents (excluding JavaScript RUM) should record the following basic system/process metrics:
`system.cpu.total.norm.pct`
: system CPU usage since the last report, in the range `[0,1]` (0-100%)

`system.process.cpu.total.norm.pct`
: process CPU usage since the last report, in the range `[0,1]` (0-100%)

`system.memory.total`
: total usable (but not necessarily available) memory on the system, in bytes

`system.memory.actual.free`
: total available memory on the system, in bytes

`system.process.memory.size`
: process virtual memory size, in bytes

`system.process.memory.rss.bytes`
: process resident set size, in bytes
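For illustration only, here is a minimal Python sketch of how these values could be gathered on a Linux host using the third-party psutil library (an assumption; each agent will use whatever facilities its own runtime provides):

```python
# Minimal sketch, not any agent's actual implementation: collect the basic
# system/process metrics with psutil and return them keyed by spec name.
import psutil

_process = psutil.Process()

def collect_system_metrics() -> dict:
    cpu_count = psutil.cpu_count() or 1
    virtual_mem = psutil.virtual_memory()
    mem_info = _process.memory_info()
    return {
        # cpu_percent() reports usage since the previous call, in percent;
        # normalise to the [0,1] range required by the spec.
        "system.cpu.total.norm.pct": psutil.cpu_percent() / 100.0,
        # Process CPU percent can exceed 100 on multi-core hosts, so also
        # divide by the number of CPUs.
        "system.process.cpu.total.norm.pct": _process.cpu_percent() / 100.0 / cpu_count,
        "system.memory.total": virtual_mem.total,
        "system.memory.actual.free": virtual_mem.available,
        "system.process.memory.size": mem_info.vms,
        "system.process.memory.rss.bytes": mem_info.rss,
    }
```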
Where applicable, all agents (excluding JavaScript RUM) should record the following cgroup metrics:
- `system.process.cgroup.memory.mem.limit.bytes`
- `system.process.cgroup.memory.mem.usage.bytes`

For cgroup v1:

- `system.process.cgroup.memory.mem.limit.bytes` - based on the `memory.limit_in_bytes` file
- `system.process.cgroup.memory.mem.usage.bytes` - based on the `memory.usage_in_bytes` file

For cgroup v2:

- `system.process.cgroup.memory.mem.limit.bytes` - based on the `memory.max` file
- `system.process.cgroup.memory.mem.usage.bytes` - based on the `memory.current` file
All of the files mentioned above are located in the same directory. Ideally, we can discover this directory by parsing the `/proc/self/mountinfo` file, looking for the memory mount line and extracting the path from it. An example of such a line is:
436 431 0:33 /docker/5042cfbb4ab36fcef9ca5f1eda54f40265c6ef3fe0694dfe34b9b474e70f8df5 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:22 - cgroup memory rw,memory
The regex `^\d+? \d+? .+? .+? (.*?) .*cgroup.*memory.*` works on the cgroup-v1 systems tested so far, where the first (and only) capture group should be the directory path. However, it will probably take a few iterations and tests on different container runtimes and OSs to get it right.
There is no regex currently suggested for cgroup v2; look at other agents' PRs for ideas.
Whenever agents fail to discover the memory mount path, they should default to `/sys/fs/cgroup/memory`.
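As a sketch of the discovery logic described above, assuming the cgroup-v1 regex and the default fallback path (function and constant names here are illustrative, not part of the spec):

```python
# Minimal sketch: find the cgroup-v1 memory mount point by scanning
# /proc/self/mountinfo, falling back to the conventional path on failure.
import re

CGROUP_V1_MEMORY_MOUNT_RE = re.compile(r"^\d+? \d+? .+? .+? (.*?) .*cgroup.*memory.*")
DEFAULT_CGROUP_V1_MEMORY_MOUNT = "/sys/fs/cgroup/memory"

def discover_memory_mount_path(mountinfo_path: str = "/proc/self/mountinfo") -> str:
    try:
        with open(mountinfo_path) as f:
            for line in f:
                match = CGROUP_V1_MEMORY_MOUNT_RE.match(line)
                if match:
                    # The first capture group is the mount point, e.g.
                    # /sys/fs/cgroup/memory in the example line above.
                    return match.group(1)
    except OSError:
        pass
    # Discovery failed: default to the conventional mount point.
    return DEFAULT_CGROUP_V1_MEMORY_MOUNT
```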
Special values are used to indicate that the cgroup is not configured with a memory limit. In cgroup v1, this value is the numeric `0x7ffffffffffff000`; in cgroup v2, it is represented by the string `max`.
Agents should not send the `system.process.cgroup.memory.mem.limit.bytes` metric whenever one of these special values is set.
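A hedged sketch of reading these values while honouring the "no limit" sentinels (the file-layout handling is simplified; a real agent must also resolve the correct cgroup-v2 path for the current process):

```python
# Minimal sketch: read the cgroup memory usage and limit, and skip the limit
# metric when the cgroup has no memory limit configured.
CGROUP_V1_UNLIMITED = 0x7FFFFFFFFFFFF000  # cgroup v1 "no limit" sentinel
CGROUP_V2_UNLIMITED = "max"               # cgroup v2 "no limit" sentinel

def read_cgroup_memory_metrics(mount_path: str, cgroup_v2: bool) -> dict:
    usage_file = "memory.current" if cgroup_v2 else "memory.usage_in_bytes"
    limit_file = "memory.max" if cgroup_v2 else "memory.limit_in_bytes"
    metrics = {}

    with open(f"{mount_path}/{usage_file}") as f:
        metrics["system.process.cgroup.memory.mem.usage.bytes"] = int(f.read().strip())

    with open(f"{mount_path}/{limit_file}") as f:
        raw_limit = f.read().strip()
    # Do not report the limit metric when a "no limit" special value is set.
    if raw_limit != CGROUP_V2_UNLIMITED and int(raw_limit) != CGROUP_V1_UNLIMITED:
        metrics["system.process.cgroup.memory.mem.limit.bytes"] = int(raw_limit)
    return metrics
```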
Agents should record runtime-specific metrics, such as garbage collection pauses. Due to their runtime-specific nature, these will differ for each agent.
When capturing runtime metrics, keep in mind the end use-case: how will they be used? Is the format in which they are recorded appropriate for visualisation in Kibana? Do not record metrics just because it is easy; record them because they are useful.
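As one hedged, Python-specific illustration (the aggregation strategy and variable names are assumptions, not part of this spec), GC pause time could be captured with the standard `gc.callbacks` hook and reported with the next metricset:

```python
# Minimal sketch: time garbage collection pauses via gc.callbacks and
# accumulate them for periodic reporting.
import gc
import time

_gc_start = None
gc_pause_total_seconds = 0.0  # aggregated and reported with the next metricset

def _gc_callback(phase, info):
    global _gc_start, gc_pause_total_seconds
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop" and _gc_start is not None:
        gc_pause_total_seconds += time.perf_counter() - _gc_start
        _gc_start = None

gc.callbacks.append(_gc_callback)
```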
Agents should record "breakdown metrics", which is a summarisation of how much time is spent per span type/subtype in each transaction group. This is described in detail in the Breakdown Graphs document, so we do not repeat it here.
Agents should make an effort to flush any metrics before shutting down. If this cannot be achieved with shutdown hooks provided by the language/runtime, the agent should provide a public API that the user can call to flush any remaining data.
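For example, an agent written in Python might register a shutdown hook with the standard `atexit` module; the `MetricsBuffer` class below is a hypothetical stand-in for the agent's reporter, not a real API:

```python
# Minimal sketch: flush any buffered metrics when the process exits normally.
import atexit

class MetricsBuffer:
    """Illustrative stand-in for an agent's metrics buffer/reporter."""
    def __init__(self):
        self._pending = []

    def add(self, metricset):
        self._pending.append(metricset)

    def flush(self):
        # In a real agent this would serialise and send pending metricsets.
        self._pending.clear()

buffer = MetricsBuffer()
# Register a shutdown hook so buffered metrics are not lost on exit. If the
# runtime offers no reliable hook, expose flush() in the agent's public API.
atexit.register(buffer.flush)
```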