
node_cpu_core_throttles_total: per core, not per cpu #659

Closed
knweiss opened this issue Aug 23, 2017 · 7 comments

knweiss (Contributor) commented Aug 23, 2017

@rtreffer @SuperQ

Host operating system: output of uname -a

# uname -a
Linux haswell 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

$ ./node_exporter --version
node_exporter, version 0.14.0 (branch: package_throttles_total, revision: 60ee361e86cb1457151753f0aa8c0da976c6bc26)

node_exporter command line flags

./node_exporter --collectors.enabled=cpu --log.level="debug"

Are you running node_exporter in Docker?

No, on physical multi-core systems.

What did you do that produced an error?

I am testing the node_cpu_core_throttles_total metric. As the metric name indicates, this is a per (physical) core metric and not a per (logical) cpu metric.

However, node_exporter currently creates two identical time series for each physical core if Hyper-Threading is enabled.

# HELP node_cpu_core_throttles_total Number of times this cpu core has been throttled.
# TYPE node_cpu_core_throttles_total counter
node_cpu_core_throttles_total{cpu="cpu1"} 61
node_cpu_core_throttles_total{cpu="cpu2"} 3
node_cpu_core_throttles_total{cpu="cpu4"} 108
node_cpu_core_throttles_total{cpu="cpu9"} 49
node_cpu_core_throttles_total{cpu="cpu25"} 61
node_cpu_core_throttles_total{cpu="cpu26"} 3
node_cpu_core_throttles_total{cpu="cpu28"} 108
node_cpu_core_throttles_total{cpu="cpu33"} 49

(I've omitted the metrics with value 0.)

What did you expect to see?

I expected one node_cpu_core_throttles_total time series for each physical core (24 in my case) and not one for each logical CPU (48).

This creates lots of redundant time series on multi-core systems.

What did you see instead?

The /sys file system of one of my test machines looks like this:

# for i in /sys/bus/cpu/devices/cpu{0..47}/thermal_throttle/core_throttle_count; do\
 echo "$i : $(cat $i)"; done | grep -vw 0
/sys/bus/cpu/devices/cpu1/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu2/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu4/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu9/thermal_throttle/core_throttle_count : 49
/sys/bus/cpu/devices/cpu25/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu26/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu28/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu33/thermal_throttle/core_throttle_count : 49
# lscpu | grep ^NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

Notice how each core's counter is present twice in the filesystem (once for each HT sibling) and how the cpu collector replicates this in its node_cpu_core_throttles_total metrics.
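
For reference, here is a rough way to see the duplication from the shell (this only uses the standard sysfs topology attributes; it is not something node_exporter does): join each logical CPU with its physical_package_id and core_id and let sort -u collapse the HT siblings.

# for i in /sys/bus/cpu/devices/cpu{0..47}; do\
 echo "package=$(cat $i/topology/physical_package_id) core=$(cat $i/topology/core_id) : $(cat $i/thermal_throttle/core_throttle_count)"; done | sort -u

After sort -u every physical core appears exactly once, because the two HT siblings produce identical lines - which is essentially what keying the metric on (package, core) instead of the logical cpu number would give us.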

See my PR #657 for another, similar problem.

rtreffer (Contributor) commented Aug 23, 2017 via email

SuperQ (Member) commented Aug 23, 2017

Yes, it's tricky, because both logical CPUs on a physical core are affected by throttle events.
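
The pairing is visible in the sysfs topology: each logical CPU lists its HT sibling(s) in thread_siblings_list, and both siblings of a core expose the same core_throttle_count, e.g.:

# cat /sys/bus/cpu/devices/cpu1/topology/thread_siblings_list

On the machine above this should list cpu1 together with its sibling (presumably cpu25, judging by the identical counters), which is why both sysfs files carry the same value.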

knweiss (Contributor, Author) commented Aug 23, 2017

@rtreffer I would appreciate any commit that reduces the number of time series to the lowest possible number without losing information.

Right now, the cpu collector already exposes lots of series on modern multi-core systems.

(Imagine e.g. AMD EPYC server chips with 64 logical CPUs per socket. These are going to be used in HPC clusters with thousands of nodes, i.e. the number of time series is going to explode...)

rtreffer (Contributor) commented Aug 23, 2017 via email

knweiss (Contributor, Author) commented Aug 23, 2017

Another option would be to drop the node_cpu_core_throttles_total metric altogether.

Can you imagine a use case where the package-level throttling metrics are not good enough to detect a problem like bad heat transfer paste or broken fans? You can't fix a thermal problem at the core level anyway - only at the package level or higher (e.g. chassis or rack).
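
(For comparison, the package-level counter lives right next to the core-level one in sysfs, so - assuming the CPU exposes it - it can be read from any logical CPU of a socket:)

# cat /sys/bus/cpu/devices/cpu0/thermal_throttle/package_throttle_count

A single per-package series from that file would be enough to alert on thermal problems.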

This would reduce the number of time series a lot (especially combined with PR #657).

rtreffer (Contributor) commented Aug 23, 2017 via email

SuperQ (Member) commented Feb 22, 2018

Has this been fixed?
