node_cpu_core_throttles_total: per core, not per cpu #659
Comments
Yes. Right now the metric exposes how often a certain cpu was impacted by a core or package throttle.
Regarding the PR: yeah, I was thinking about that higher level, but it would require exporting the package/node/cpu topology to correlate this with e.g. high cpu load.
Anyway, I am mainly using this to detect machines that have an unusual throttle frequency and either way of reporting it would work for me.
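(For reference, the kernel already exposes the logical-cpu-to-package/core mapping in sysfs. The snippet below is only an illustrative way to inspect it on a machine like the one in the report, not node_exporter code.)
```
# Print the logical cpu -> physical package/core mapping from sysfs.
# topology/physical_package_id and topology/core_id are standard Linux sysfs attributes.
for d in /sys/bus/cpu/devices/cpu[0-9]*; do
  echo "$(basename "$d"): package=$(cat "$d/topology/physical_package_id") core=$(cat "$d/topology/core_id")"
done
```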
On 23 August 2017 at 17:05:13 CEST, Karsten Weiss <[email protected]> wrote:
@rtreffer @SuperQ
### Host operating system: output of `uname -a`
```
# uname -a
Linux haswell 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```
### node_exporter version: output of `node_exporter --version`
```
$ ./node_exporter --version
node_exporter, version 0.14.0 (branch: package_throttles_total, revision: 60ee361e86cb1457151753f0aa8c0da976c6bc26)
```
### node_exporter command line flags
```
./node_exporter --collectors.enabled=cpu --log.level="debug"
```
### Are you running node_exporter in Docker?
No, on physical multi-core systems.
### What did you do that produced an error?
I am testing the `node_cpu_core_throttles_total` metric. As the metric name indicates, this is a **per (physical) core** metric and **not** a per (logical) cpu metric.
However, node_exporter currently creates **two identical time series** for each physical core if Hyper-Threading is enabled.
```
# HELP node_cpu_core_throttles_total Number of times this cpu core has been throttled.
# TYPE node_cpu_core_throttles_total counter
node_cpu_core_throttles_total{cpu="cpu1"} 61
node_cpu_core_throttles_total{cpu="cpu2"} 3
node_cpu_core_throttles_total{cpu="cpu4"} 108
node_cpu_core_throttles_total{cpu="cpu9"} 49
node_cpu_core_throttles_total{cpu="cpu25"} 61
node_cpu_core_throttles_total{cpu="cpu26"} 3
node_cpu_core_throttles_total{cpu="cpu28"} 108
node_cpu_core_throttles_total{cpu="cpu33"} 49
```
(I've omitted the metrics with value 0.)
### What did you expect to see?
I expected a `node_cpu_core_throttles_total` metric for each physical core (24 in my case), not for each logical cpu (48).
This creates lots of redundant time series on multi-core systems.
### What did you see instead?
The `/sys` file system of one of my test machines looks like this:
```
# for i in /sys/bus/cpu/devices/cpu{0..47}/thermal_throttle/core_throttle_count; do \
    echo "$i : $(cat $i)"; done | grep -vw 0
/sys/bus/cpu/devices/cpu1/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu2/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu4/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu9/thermal_throttle/core_throttle_count : 49
/sys/bus/cpu/devices/cpu25/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu26/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu28/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu33/thermal_throttle/core_throttle_count : 49
# lscpu | grep ^NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
```
Notice how each core metric is present twice in the filesystem (once for each HT sibling) and how the cpu collector replicates this in its `node_cpu_core_throttles_total` metrics.
See my PR #657 for another, similar problem.
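(To illustrate why the duplicates carry no extra information, the counters can be keyed on the kernel's (package, core) topology instead of on the logical cpu. The shell sketch below is purely illustrative and relies on HT siblings reporting the same count, as shown above; it is not the collector's actual code.)
```
# One line per physical core instead of per logical cpu: key on
# physical_package_id/core_id. HT siblings produce identical lines
# (same package, core and count), so `sort -u` collapses them.
for d in /sys/bus/cpu/devices/cpu[0-9]*; do
  pkg=$(cat "$d/topology/physical_package_id")
  core=$(cat "$d/topology/core_id")
  count=$(cat "$d/thermal_throttle/core_throttle_count")
  echo "package=$pkg core=$core core_throttles=$count"
done | sort -u
```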
Yes, it's tricky, because both logical CPUs on a physical core are affected by throttle events.
@rtreffer I would appreciate any commit that reduces the number of time series to the lowest possible number without losing information. Right now, the cpu collector already exposes lots of series on modern multi-core systems. (Imagine e.g. AMD EPYC server chips with 64 logical CPUs per socket. These are going to be used in HPC clusters with thousands of nodes, i.e. the number of time series is going to explode...)
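(A rough back-of-the-envelope with made-up but plausible numbers, just to put that scale in perspective; these figures are not from the issue.)
```
# Hypothetical 2-socket host with 64 logical cpus (32 physical cores) per socket,
# in a 1000-node cluster. Series count for this one metric alone:
echo "per logical cpu:   $((2 * 64 * 1000)) series"   # 128000
echo "per physical core: $((2 * 32 * 1000)) series"   # 64000
```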
Hm, actually cpufreq_cur should be a good (or better) proxy for whether throttles are causing issues on a core.
So I think it's fine to reduce the cardinality of the throttles:
- throttles act as a proxy metric for a broken thermal design (esp. if you also see high temps through hwmon)
- cpufreq can be used to detect a cpu that is regularly at a low rate (despite e.g. load / non-idle cpu usage)
So it sounds reasonable to reduce the throttle metric's cardinality.
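(For the cpufreq side, the per-cpu frequency is already visible in sysfs; the snippet below is illustrative only, and the exact file can vary with the cpufreq driver.)
```
# Current frequency (kHz) of each logical cpu as exposed by cpufreq.
# A busy cpu that sits far below its base clock is a hint that throttling hurts.
# Depending on the driver, cpuinfo_cur_freq may be the relevant file instead.
grep . /sys/bus/cpu/devices/cpu[0-9]*/cpufreq/scaling_cur_freq
```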
Another option would be to drop the `node_cpu_core_throttles_total` metric altogether. Can you imagine a use case where the package-level throttling metrics are not good enough to detect a problem like bad heat-transfer paste or broken fans? You can't fix a thermal problem at the core level anyway - only at the package level or higher (e.g. chassis or rack). This would reduce the number of time series a lot (especially combined with PR #657).
The core-level counter can trigger without the package-level one: a pinned service running on a single CPU can hit the core throttle without hitting the package throttle.
You basically need both to see that a thermal design is insufficient.
Has this been fixed?