Only report core throttles per core, not per cpu #836

Merged
merged 5 commits into from
Feb 27, 2018

rtreffer
Contributor

Fixes #659

// cpu loop
for _, cpu := range cpus {
_, cpuName := filepath.Split(cpu)
cpuNum := strings.TrimPrefix(cpuName, "cpu")

core_id := -1
if value, err := readUintFromFile(filepath.Join(cpu, "topology/core_id")); err != nil {
Member

Since we only need core_id for thermal throttles, let's move this read to after we check for thermal throttles. Then we can avoid the -1 handling.

Contributor Author

I moved it to an outer if as there is no need to read thermal throttles if we don't have the core_id.

CI is running....

Member

@SuperQ SuperQ left a comment

LGTM

@SuperQ SuperQ merged commit c504c7e into prometheus:master Feb 27, 2018
@knweiss
Contributor

knweiss commented Mar 27, 2018

I think this code is problematic on multi-socket systems:

The cpu loop iterates over all CPUs and writes each core_throttle_count value into the cpu_core_throttles array, indexed by core_id. However, the core_id numbers repeat on each processor/package, i.e. the 2nd processor's core_throttle_count overwrites the value of the 1st processor.

        // cpu loop
        for _, cpu := range cpus {
...
                if value, err := readUintFromFile(filepath.Join(cpu, "topology/core_id")); err != nil {
                        log.Debugf("CPU %v is missing topology/core_id", cpu)
                } else {
                        core_id := int(value)
                        if value, err = readUintFromFile(filepath.Join(cpu, "thermal_throttle", "core_throttle_count")); err != nil {
                                return err
                        }
                        cpu_core_throttles[core_id] = value
                }
        }

        // core throttles
        for core_id, value := range cpu_core_throttles {
                ch <- prometheus.MustNewConstMetric(c.cpuCoreThrottle, prometheus.CounterValue, float64(value), strconv.Itoa(core_id))
        }
}

Example (source):

	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
				-> [thread 1] -> Linux CPU 4
		    -> [core 1] -> [thread 0] -> Linux CPU 1
				-> [thread 1] -> Linux CPU 5

	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
				-> [thread 1] -> Linux CPU 6
		    -> [core 1] -> [thread 0] -> Linux CPU 3
				-> [thread 1] -> Linux CPU 7
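
To make the overwrite concrete, here is a minimal sketch that replays the example topology above in memory. The type and function names (`cpuTopology`, `coreKey`, `byCoreID`, `byPackageAndCore`) and the throttle counts are made up for illustration; they are not taken from node_exporter.

```go
package main

import "fmt"

// cpuTopology describes one logical CPU from the example topology.
// The throttle counts are hypothetical values chosen for illustration.
type cpuTopology struct {
	cpu, pkg, core int
	throttles      uint64
}

// coreKey identifies a physical core uniquely across packages.
type coreKey struct{ pkg, core int }

// exampleTopology returns the 2-package, 2-core, 2-thread layout,
// ordered by Linux CPU number.
func exampleTopology() []cpuTopology {
	return []cpuTopology{
		{cpu: 0, pkg: 0, core: 0, throttles: 5},
		{cpu: 1, pkg: 0, core: 1, throttles: 0},
		{cpu: 2, pkg: 1, core: 0, throttles: 1},
		{cpu: 3, pkg: 1, core: 1, throttles: 2},
		{cpu: 4, pkg: 0, core: 0, throttles: 5},
		{cpu: 5, pkg: 0, core: 1, throttles: 0},
		{cpu: 6, pkg: 1, core: 0, throttles: 1},
		{cpu: 7, pkg: 1, core: 1, throttles: 2},
	}
}

// byCoreID mirrors the indexing in the merged code: core_id alone is the key,
// so a later package overwrites an earlier one.
func byCoreID(cpus []cpuTopology) map[int]uint64 {
	out := make(map[int]uint64)
	for _, c := range cpus {
		out[c.core] = c.throttles
	}
	return out
}

// byPackageAndCore keys on (package, core), so no value is lost.
func byPackageAndCore(cpus []cpuTopology) map[coreKey]uint64 {
	out := make(map[coreKey]uint64)
	for _, c := range cpus {
		out[coreKey{pkg: c.pkg, core: c.core}] = c.throttles
	}
	return out
}

func main() {
	cpus := exampleTopology()
	fmt.Println("entries keyed by core_id:", len(byCoreID(cpus)))
	fmt.Println("entries keyed by package+core:", len(byPackageAndCore(cpus)))
}
```

With these made-up counts, keying by core_id alone leaves 2 entries and the value 5 from package 0, core 0 is lost; keying by package and core keeps all 4 physical cores.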

Another example: Here's the metric output of one of my systems. It shows only 12 metrics although this system has 24 physical cores (also note the odd gap in core ids between 5 and 8):

# HELP node_cpu_core_throttles_total Number of times this cpu core has been throttled.
# TYPE node_cpu_core_throttles_total counter
node_cpu_core_throttles_total{core="0"} 0
node_cpu_core_throttles_total{core="1"} 0
node_cpu_core_throttles_total{core="10"} 0
node_cpu_core_throttles_total{core="11"} 0
node_cpu_core_throttles_total{core="12"} 0
node_cpu_core_throttles_total{core="13"} 0
node_cpu_core_throttles_total{core="2"} 0
node_cpu_core_throttles_total{core="3"} 0
node_cpu_core_throttles_total{core="4"} 0
node_cpu_core_throttles_total{core="5"} 0
node_cpu_core_throttles_total{core="8"} 0
node_cpu_core_throttles_total{core="9"} 0

This is the cpu:

# lscpu | grep 'CPU(s)'
CPU(s):                48
On-line CPU(s) list:   0-47
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

I.e. we only collect the metrics from CPUs 12-23 in this example and AFAICS the core_throttle_count values from CPUs 0-11 will get lost.

# grep . /sys/bus/cpu/devices/cpu*/thermal_throttle/core_throttle_count |wc -l
48

Q: Should the node_cpu_core_throttles_total metric get a 2nd label, e.g. package, processor, or node?

@discordianfish
Member

Related: #867 - That "fixes" the tests, hiding this problem. So yeah, I think we need a 2nd label or to concatenate core+package somehow.

@knweiss
Contributor

knweiss commented Mar 29, 2018

(The 2nd label would be sourced from /sys/devices/system/cpu/cpu*/topology/physical_package_id.)

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
* Only report core throttles per core, not per cpu

* Add topology/core_id to the cpu sysfs fixtures

* Add new cpu fixtures to ttar file

* Merge core_id reading and thermal throttle accounting

* Declare core_id