Only report core throttles per core, not per cpu #836

rtreffer · 2018-02-22T21:03:40Z

Fixes #659

SuperQ · 2018-02-23T07:54:01Z

collector/cpu_linux.go

 	// cpu loop
 	for _, cpu := range cpus {
 		_, cpuName := filepath.Split(cpu)
 		cpuNum := strings.TrimPrefix(cpuName, "cpu")

+		core_id := -1
+		if value, err := readUintFromFile(filepath.Join(cpu, "topology/core_id")); err != nil {


Since we only need core_id for thermal throttles, let's move this read to after we check for thermal throttles. Then we can avoid the -1 handling.

rtreffer · 2018-02-23

I moved it into an outer if, since there is no need to read the thermal throttles if we don't have the core_id.

CI is running....

SuperQ

LGTM

knweiss · 2018-03-27T07:54:41Z

I think this code is problematic on multi-socket systems:

The cpu loop iterates over all CPUs and writes each core_throttle_count value into the cpu_core_throttles array, indexed by core_id. However, the core_id numbers repeat for each processor/package, i.e. the 2nd processor's core_throttle_count overwrites the value of the 1st processor.

        // cpu loop
        for _, cpu := range cpus {
...
                if value, err := readUintFromFile(filepath.Join(cpu, "topology/core_id")); err != nil {
                        log.Debugf("CPU %v is missing topology/core_id", cpu)
                } else {
                        core_id := int(value)
                        if value, err = readUintFromFile(filepath.Join(cpu, "thermal_throttle", "core_throttle_count")); err != nil {
                                return err
                        }
                        cpu_core_throttles[core_id] = value
                }
        }

        // core throttles
        for core_id, value := range cpu_core_throttles {
                ch <- prometheus.MustNewConstMetric(c.cpuCoreThrottle, prometheus.CounterValue, float64(value), strconv.Itoa(core_id))
        }
}

Example (source):

    [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
                            -> [thread 1] -> Linux CPU 4
                -> [core 1] -> [thread 0] -> Linux CPU 1
                            -> [thread 1] -> Linux CPU 5

    [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
                            -> [thread 1] -> Linux CPU 6
                -> [core 1] -> [thread 0] -> Linux CPU 3
                            -> [thread 1] -> Linux CPU 7
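
To make the collision concrete, here is a minimal Go sketch that replays the two-package topology above. It is not the collector code: the types `cpuTopology` and `coreKey`, the helper names, and the throttle counts are all made up for illustration. It compares a map keyed by core_id alone with one keyed by (package, core).

```go
package main

import "fmt"

// cpuTopology describes one logical CPU from the example topology.
// The throttle counts are invented values, for illustration only.
type cpuTopology struct {
	cpu, pkg, core int
	throttles      uint64
}

// coreKey identifies a physical core uniquely across packages.
type coreKey struct {
	pkg, core int
}

// exampleTopology mirrors the 2-package, 2-core, 2-thread layout above.
func exampleTopology() []cpuTopology {
	return []cpuTopology{
		{cpu: 0, pkg: 0, core: 0, throttles: 5},
		{cpu: 1, pkg: 0, core: 1, throttles: 7},
		{cpu: 2, pkg: 1, core: 0, throttles: 0},
		{cpu: 3, pkg: 1, core: 1, throttles: 0},
		{cpu: 4, pkg: 0, core: 0, throttles: 5},
		{cpu: 5, pkg: 0, core: 1, throttles: 7},
		{cpu: 6, pkg: 1, core: 0, throttles: 0},
		{cpu: 7, pkg: 1, core: 1, throttles: 0},
	}
}

// byCoreID keys on core_id only, so cores with the same id in
// different packages overwrite each other.
func byCoreID(cpus []cpuTopology) map[int]uint64 {
	out := make(map[int]uint64)
	for _, c := range cpus {
		out[c.core] = c.throttles
	}
	return out
}

// byPackageAndCore keys on (package, core), keeping one entry per
// physical core.
func byPackageAndCore(cpus []cpuTopology) map[coreKey]uint64 {
	out := make(map[coreKey]uint64)
	for _, c := range cpus {
		out[coreKey{pkg: c.pkg, core: c.core}] = c.throttles
	}
	return out
}

func main() {
	cpus := exampleTopology()
	fmt.Println("entries keyed by core_id:", len(byCoreID(cpus)))
	fmt.Println("entries keyed by package+core:", len(byPackageAndCore(cpus)))
}
```

With these made-up counts, the core_id-only map ends up holding only the values of package 1, because its CPUs are visited last.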

Another example: Here's the metric output of one of my systems. It shows 12 metrics although this system has 24 physical cores (also notice that interesting and confusing core id gap between 5 and 8):

# HELP node_cpu_core_throttles_total Number of times this cpu core has been throttled.
# TYPE node_cpu_core_throttles_total counter
node_cpu_core_throttles_total{core="0"} 0
node_cpu_core_throttles_total{core="1"} 0
node_cpu_core_throttles_total{core="10"} 0
node_cpu_core_throttles_total{core="11"} 0
node_cpu_core_throttles_total{core="12"} 0
node_cpu_core_throttles_total{core="13"} 0
node_cpu_core_throttles_total{core="2"} 0
node_cpu_core_throttles_total{core="3"} 0
node_cpu_core_throttles_total{core="4"} 0
node_cpu_core_throttles_total{core="5"} 0
node_cpu_core_throttles_total{core="8"} 0
node_cpu_core_throttles_total{core="9"} 0

This is the cpu:

# lscpu | grep 'CPU(s)'
CPU(s):                48
On-line CPU(s) list:   0-47
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

I.e. we only collect the metrics from CPUs 12-23 in this example and AFAICS the core_throttle_count values from CPUs 0-11 will get lost.

# grep . /sys/bus/cpu/devices/cpu*/thermal_throttle/core_throttle_count |wc -l
48

Q: Should the node_cpu_core_throttles_total metric get a 2nd label package or processor or node?

discordianfish · 2018-03-27T15:54:52Z

Related: #867 - That "fixes" the tests, hiding this problem. So yeah, I think we need a 2nd label or to concatenate core+package somehow.

knweiss · 2018-03-29T07:56:50Z

(The 2nd label would be sourced from /sys/devices/system/cpu/cpu*/topology/physical_package_id.)

* Only report core throttles per core, not per cpu
* Add topology/core_id to the cpu sysfs fixtures
* Add new cpu fixtures to ttar file
* Merge core_id reading and thermal throttle accounting
* Declare core_id

rtreffer-sc added 3 commits February 22, 2018 22:02

Only report core throttles per core, not per cpu

3746140

Add topology/core_id to the cpu sysfs fixtures

c56cac9

Add new cpu fixtures to ttar file

f2f2a73

SuperQ reviewed Feb 23, 2018


rtreffer-sc added 2 commits February 23, 2018 20:46

Merge core_id reading and thermal throttle accounting

7c33b16

Declare core_id

2493482

SuperQ approved these changes Feb 24, 2018


SuperQ requested a review from discordianfish February 24, 2018 06:59

SuperQ merged commit c504c7e into prometheus:master Feb 27, 2018

knweiss mentioned this pull request Mar 29, 2018

cpu: Add a 2nd label 'package' to metric node_cpu_core_throttles_total #871

Merged
