node_cpu_core_throttles_total: per core, not per cpu #659
Comments
Yes. Right now the metric exposes how often a certain cpu was impacted by a core or package throttle.
Regarding the PR: yeah, I was thinking about that higher level, but it would require exporting the package/node/cpu topology to correlate this with e.g. high cpu load.
Anyway, I am mainly using this to detect machines that have an unusual throttle frequency and either way of reporting it would work for me.
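(For reference, the kernel already exposes the logical-cpu-to-package/core mapping in sysfs. The snippet below is only an illustrative way to inspect it on a machine like the one in the report, not node_exporter code.)
```
# Print the logical cpu -> physical package/core mapping from sysfs.
# topology/physical_package_id and topology/core_id are standard Linux sysfs attributes.
for d in /sys/bus/cpu/devices/cpu[0-9]*; do
  echo "$(basename "$d"): package=$(cat "$d/topology/physical_package_id") core=$(cat "$d/topology/core_id")"
done
```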
On 23 August 2017 at 17:05:13 CEST, Karsten Weiss <[email protected]> wrote:
@rtreffer @SuperQ
### Host operating system: output of `uname -a`
```
# uname -a
Linux haswell 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```
### node_exporter version: output of `node_exporter --version`
```
$ ./node_exporter --version
node_exporter, version 0.14.0 (branch: package_throttles_total, revision: 60ee361e86cb1457151753f0aa8c0da976c6bc26)
```
### node_exporter command line flags
```
./node_exporter --collectors.enabled=cpu --log.level="debug"
```
### Are you running node_exporter in Docker?
No, on physical multi-core systems.
### What did you do that produced an error?
I am testing the `node_cpu_core_throttles_total` metric. As the metric name indicates, this is a **per (physical) core** metric and **not** a per (logical) cpu metric.
However, node_exporter currently creates **two identical time series** for each physical core if Hyper-Threading is enabled.
```
# HELP node_cpu_core_throttles_total Number of times this cpu core has been throttled.
# TYPE node_cpu_core_throttles_total counter
node_cpu_core_throttles_total{cpu="cpu1"} 61
node_cpu_core_throttles_total{cpu="cpu2"} 3
node_cpu_core_throttles_total{cpu="cpu4"} 108
node_cpu_core_throttles_total{cpu="cpu9"} 49
node_cpu_core_throttles_total{cpu="cpu25"} 61
node_cpu_core_throttles_total{cpu="cpu26"} 3
node_cpu_core_throttles_total{cpu="cpu28"} 108
node_cpu_core_throttles_total{cpu="cpu33"} 49
```
(I've omitted the metrics with value 0.)
### What did you expect to see?
I expected a `node_cpu_core_throttles_total` metric for each physical core (24 in my case), not for each logical cpu (48).
This creates lots of redundant time series on multi-core systems.
### What did you see instead?
The `/sys` file system of one of my test machines looks like this:
```
# for i in /sys/bus/cpu/devices/cpu{0..47}/thermal_throttle/core_throttle_count; do \
    echo "$i : $(cat $i)"; done | grep -vw 0
/sys/bus/cpu/devices/cpu1/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu2/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu4/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu9/thermal_throttle/core_throttle_count : 49
/sys/bus/cpu/devices/cpu25/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu26/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu28/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu33/thermal_throttle/core_throttle_count : 49
# lscpu | grep ^NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
```
Notice how each core metric is present twice in the filesystem (once for each HT sibling) and how the cpu collector replicates this in its `node_cpu_core_throttles_total` metrics.
See my PR #657 for another, similar problem.
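(To illustrate why the duplicates carry no extra information, the counters can be keyed on the kernel's (package, core) topology instead of on the logical cpu. The shell sketch below is purely illustrative and relies on HT siblings reporting the same count, as shown above; it is not the collector's actual code.)
```
# One line per physical core instead of per logical cpu: key on
# physical_package_id/core_id. HT siblings produce identical lines
# (same package, core and count), so `sort -u` collapses them.
for d in /sys/bus/cpu/devices/cpu[0-9]*; do
  pkg=$(cat "$d/topology/physical_package_id")
  core=$(cat "$d/topology/core_id")
  count=$(cat "$d/thermal_throttle/core_throttle_count")
  echo "package=$pkg core=$core core_throttles=$count"
done | sort -u
```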
Yes, it's tricky, because both logical CPUs on a physical core are affected by throttle events.
@rtreffer I would appreciate any commit that reduces the number of time series to the lowest possible number without losing information. Right now, the cpu collector already exposes lots of series on modern multi-core systems. (Imagine e.g. AMD EPYC server chips with 64 logical CPUs per socket. These are going to be used in HPC clusters with thousands of nodes, i.e. the number of time series is going to explode...)
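(A rough back-of-the-envelope with made-up but plausible numbers, just to put that scale in perspective; these figures are not from the issue.)
```
# Hypothetical 2-socket host with 64 logical cpus (32 physical cores) per socket,
# in a 1000-node cluster. Series count for this one metric alone:
echo "per logical cpu:   $((2 * 64 * 1000)) series"   # 128000
echo "per physical core: $((2 * 32 * 1000)) series"   # 64000
```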
Hm, actually cpufreq_cur should be a good (or better) proxy for whether throttles are causing issues on a core.
So I think it's fine to reduce the cardinality of the throttles:
- throttles act as a proxy metric for a broken thermal design (esp. if you also see high temps through hwmon)
- cpufreq can be used to detect a cpu that is regularly at a low rate (despite e.g. load / non-idle cpu usage)
So it sounds reasonable to reduce the throttle metric's cardinality.
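(For the cpufreq side, the per-cpu frequency is already visible in sysfs; the snippet below is illustrative only, and the exact file can vary with the cpufreq driver.)
```
# Current frequency (kHz) of each logical cpu as exposed by cpufreq.
# A busy cpu that sits far below its base clock is a hint that throttling hurts.
# Depending on the driver, cpuinfo_cur_freq may be the relevant file instead.
grep . /sys/bus/cpu/devices/cpu[0-9]*/cpufreq/scaling_cur_freq
```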
Another option would be to drop the `node_cpu_core_throttles_total` metric altogether. Can you imagine a use case where the package-level throttling metrics are not good enough to detect a problem like bad heat-transfer paste or broken fans? You can't fix a thermal problem at the core level anyway - only at the package level or higher (e.g. chassis or rack). This would reduce the number of time series a lot (especially combined with PR #657).
The core-level counter can trigger without the package-level one: a pinned service running on a single CPU can hit the core throttle without hitting the package throttle.
You basically need both to see that a thermal design is insufficient.
Has this been fixed?