Regression: Sysman engine metrics do not work any more #707

eero-t · 2024-02-22T15:11:08Z

Somewhere between these dates / versions:

2024-01-11: 23.48.27912.11
2024-01-25: 23.52.28202.14

Sysman engine metrics stopped working: https://spec.oneapi.io/level-zero/latest/sysman/api.html#engine

I've tested this with my own compute-runtime builds, but it's reported to happen also between e.g. following release packages:

This regression is visible both with the current Ubuntu 22.04 5.15 kernel and its HWE 6.5 kernel (on TGL-H iGPU), and with an older 5.15 internal BKC kernel (on ATS-M dGPU), so I assume it is a generic one, not related to any particular HW.

According to strace -f -e perf_event_open, the earlier version makes 36 successful calls like this:
perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x100007, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = 57

And the current version makes 136 failing calls like this:
perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x10000b, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_GROUP, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = -1 ENOENT (No such file or directory)

Comparing the arguments across all of those calls, there are two differences: the config values differ from the earlier ones in every call, and every call now also includes the PERF_FORMAT_GROUP option.
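To make that comparison repeatable, the two strace lines can be diffed mechanically. The following is only a minimal sketch: the helper names (parse_call, diff_calls) are made up for illustration, the read_format bit values are the ones from linux/perf_event.h, and it assumes strace printed read_format symbolically as in the lines above.

```python
import re

# read_format bit values as defined in linux/perf_event.h
PERF_FORMAT_BITS = {
    "PERF_FORMAT_TOTAL_TIME_ENABLED": 1 << 0,
    "PERF_FORMAT_TOTAL_TIME_RUNNING": 1 << 1,
    "PERF_FORMAT_ID": 1 << 2,
    "PERF_FORMAT_GROUP": 1 << 3,
}

# The two example lines quoted in this report (old version, current version).
OLD_LINE = ("perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, "
            "config=0x100007, sample_period=0, sample_type=0, "
            "read_format=PERF_FORMAT_TOTAL_TIME_ENABLED, "
            "precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = 57")
NEW_LINE = ("perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, "
            "config=0x10000b, sample_period=0, sample_type=0, "
            "read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_GROUP, "
            "precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) "
            "= -1 ENOENT (No such file or directory)")


def parse_call(line):
    """Pull type, config, read_format and the result out of one strace line.

    Hypothetical helper; assumes symbolic read_format names in the input.
    """
    pmu_type = int(re.search(r"type=(0x[0-9a-fA-F]+)", line).group(1), 16)
    config = int(re.search(r"config=(0x[0-9a-fA-F]+)", line).group(1), 16)
    names = sorted(re.search(r"read_format=([A-Z_|]+)", line).group(1).split("|"))
    result = re.search(r"\)\s*=\s*(-?\d+)(?:\s+(E[A-Z]+))?", line)
    return {
        "type": pmu_type,
        "config": config,
        "read_format": names,
        "read_format_mask": sum(PERF_FORMAT_BITS[n] for n in names),
        "ret": int(result.group(1)),
        "errno": result.group(2),  # None when the call succeeded
    }


def diff_calls(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    return {k: (old[k], new[k]) for k in old if old[k] != new[k]}


if __name__ == "__main__":
    for field, (before, after) in diff_calls(parse_call(OLD_LINE), parse_call(NEW_LINE)).items():
        print(f"{field}: {before!r} -> {after!r}")
```

Running it over full strace logs (one parse_call per line) would show whether any call kept its old config value; the sketch itself only compares the two lines quoted here.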


eero-t · 2024-02-22T15:41:58Z

Other metrics have not regressed, only engine ones.

Engine metrics fail also with latest zello_sysman, or one built from the same compute-runtime version:

# zello_sysman -e
...
Device Name = Intel(R) Data Center GPU Flex 170
UUID: 
% 4 � � x ) � �         
Device Name = Intel(R) Data Center GPU Flex 170
UUID: 
" � X � 9 N � �         
Sysman Initialization done via zeInit

 ----  Engine tests ---- 
Device UUID: 37 52 199 174 120 41 161 243 0 0 0 0 0 0 0 0 
Could not retrieve Engine domains

 ----  Engine tests ---- 
Device UUID: 34 158 88 160 57 78 239 207 0 0 0 0 0 0 0 0 
Could not retrieve Engine domains

eero-t · 2024-02-22T16:31:55Z

No idea whether initializing zello_sysman with zeInit or zesInit (using the ZELLO_SYSMAN_USE_ZESINIT=1 env var) affects this, as it segfaults with the latter when querying engine metrics (both with the i915 & xe drivers).

eero-t · 2024-02-29T12:37:02Z

The segfault when zesInit() is used is specific to zello_sysman. In my own program there is no segfault; engine metrics just do not work.

saik-intel · 2024-03-01T09:26:54Z

sure we will take a look internally

eero-t · 2024-05-15T19:28:16Z

Noticed (by accident) that engine metrics (now) work if the process has the SYS_ADMIN capability; the PERFMON capability alone is no longer enough.

PERFMON capability:

# docker run -it --net none --rm --env ZELLO_SYSMAN_USE_ZESINIT=1 --cap-drop all --cap-add PERFMON --user root --device /dev/dri:/dev/dri:rw <image> bash -c "grep ^Cap /proc/self/status && zello_sysman -e"
CapInh:	0000000000000000
CapPrm:	0000004000000000
CapEff:	0000004000000000
CapBnd:	0000004000000000
CapAmb:	0000000000000000
ZES_ENABLE_SYSMAN environment variable Not Set
Sysman Initialization done via zesInit

 ----  Engine tests ---- 
Device UUID: 
134 128 160 86 8 0 0 0 3 0 0 0 0 0 0 0 
Could not retrieve Engine domains

# capsh --decode=0000004000000000
0x0000004000000000=cap_perfmon

=> Correct capability, but no engine metrics.
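For checking the same thing from inside a monitoring process without capsh, the Cap* masks can be decoded by hand. This is a minimal sketch, not what zello_sysman or compute-runtime does: the function names are hypothetical, only the two capabilities discussed here are named, and the bit numbers (21 for CAP_SYS_ADMIN, 38 for CAP_PERFMON) are the ones from linux/capability.h.

```python
# Capability bit numbers from linux/capability.h (only the two relevant here).
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38

CAP_NAMES = {CAP_SYS_ADMIN: "cap_sys_admin", CAP_PERFMON: "cap_perfmon"}


def decode_caps(hex_mask):
    """Turn a hex capability mask (as in /proc/<pid>/status) into names.

    Hypothetical helper; unknown bits are reported as "cap_<bit number>".
    """
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def effective_caps(status_text):
    """Decode the CapEff line from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []


if __name__ == "__main__":
    print(decode_caps("0000004000000000"))  # mask from the PERFMON run
    print(decode_caps("0000000000200000"))  # mask from the SYS_ADMIN run
```

On Linux, effective_caps(open("/proc/self/status").read()) gives the capabilities of the calling process, which could be logged next to a failed engine query.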

SYS_ADMIN capability:

docker run -it --net none --rm --env ZELLO_SYSMAN_USE_ZESINIT=1 --cap-drop all --cap-add SYS_ADMIN --user root --device /dev/dri:/dev/dri:rw registry.fi.intel.com/dgpu-enabling/collectd-gpu-plugin:GIT-2024-05-10-collectd-6.0 bash -c "grep ^Cap /proc/self/status && zello_sysman -e"
CapInh:	0000000000000000
CapPrm:	0000000000200000
CapEff:	0000000000200000
CapBnd:	0000000000200000
CapAmb:	0000000000000000
ZES_ENABLE_SYSMAN environment variable Not Set
Sysman Initialization done via zesInit

 ----  Engine tests ---- 
Device UUID: 
134 128 160 86 8 0 0 0 3 0 0 0 0 0 0 0 
[0]
...
Engine Type = ZES_ENGINE_GROUP_RENDER_ALL || Active Time = 0 || Timestamp = 1204094
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[9]
Engine Type = ZES_ENGINE_GROUP_ALL || Active Time = 0 || Timestamp = 1354503
Engine Type = ZES_ENGINE_GROUP_COMPUTE_ALL || Active Time = 0 || Timestamp = 1354510
Engine Type = ZES_ENGINE_GROUP_MEDIA_ALL || Active Time = 0 || Timestamp = 1354519
Engine Type = ZES_ENGINE_GROUP_COPY_ALL || Active Time = 0 || Timestamp = 1354527
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354536
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354547
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354548
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354550
Engine Type = ZES_ENGINE_GROUP_RENDER_SINGLE || Active Time = 0 || Timestamp = 1354549
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 1354552
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 1354551
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 1354550
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 1354548
Engine Type = ZES_ENGINE_GROUP_COPY_SINGLE || Active Time = 0 || Timestamp = 1354547
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 1354545
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 1354545
Engine Type = ZES_ENGINE_GROUP_RENDER_ALL || Active Time = 0 || Timestamp = 1354543
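To turn such output into a utilization figure, two samples of the same engine group are needed. The sketch below parses the "Engine Type = ... || Active Time = ... || Timestamp = ..." lines and applies the ratio that, as far as I know, the Sysman spec describes for zes_engine_stats_t: (s2.activeTime - s1.activeTime) / (s2.timestamp - s1.timestamp). The names EngineSample, parse_engine_lines and utilization are made up for this example.

```python
import re
from collections import namedtuple

# One parsed output line; field names are illustrative, not from zello_sysman.
EngineSample = namedtuple("EngineSample", "engine_type active_time timestamp")

LINE_RE = re.compile(
    r"Engine Type = (\w+) \|\| Active Time = (\d+) \|\| Timestamp = (\d+)"
)

# The two ZES_ENGINE_GROUP_RENDER_ALL lines from the output above.
SAMPLE_OUTPUT = """\
Engine Type = ZES_ENGINE_GROUP_RENDER_ALL || Active Time = 0 || Timestamp = 1204094
Engine Type = ZES_ENGINE_GROUP_RENDER_ALL || Active Time = 0 || Timestamp = 1354543
"""


def parse_engine_lines(text):
    """Return an EngineSample for every engine line found in the text."""
    return [
        EngineSample(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in LINE_RE.finditer(text)
    ]


def utilization(first, second):
    """Fraction of time the engine was active between two samples.

    Returns None when the timestamps do not advance.
    """
    elapsed = second.timestamp - first.timestamp
    if elapsed <= 0:
        return None
    return (second.active_time - first.active_time) / elapsed


if __name__ == "__main__":
    first, second = parse_engine_lines(SAMPLE_OUTPUT)
    print(first.engine_type, utilization(first, second))
```

With the idle GPU in this run the active time stays at 0, so the ratio is 0.0; the point is only that a non-empty list of samples means the engine domains were retrieved at all.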

Same issue both with and without the Intel i915 DKMS on the Ubuntu 22.04 HWE 6.5 kernel:
intel-i915-dkms/1.24.1.11.240117.14, 6.5.0-35-generic, x86_64: installed
AUXILIARY_BUS is enabled for 6.5.0-35-generic.

=> This should be tagged as a security issue: requiring GPU monitoring containers to have the (way too wide) SYS_ADMIN capability, instead of the more targeted PERFMON capability intended for perf API usage, means that a (subverted) process can escape the containment.

eero-t · 2024-05-15T19:45:27Z

As far as I can see from the latest i915 KMD code, SYS_ADMIN capability is needed only for (early Gens) SECURE_BATCHES. For perf API usage, PERFMON capability should still be enough: https://cgit.freedesktop.org/drm-tip/tree/drivers/gpu/drm/i915/i915_perf.c#n3893

And indeed, with drm-tip v6.9 kernel i915 KMD, PERFMON capability is enough...

However, engine metrics are there only when ZELLO_SYSMAN_USE_ZESINIT=1 is used. With zeInit(), engine metrics are still missing for zello_sysman.

=> There are two regressions:

i915 KMD: engine (perf) metrics not working with PERFMON capability, only with too wide SYS_ADMIN (already fixed, at least in drm-tip)
compute-runtime (or just zello_sysman?): engine metrics not working any more when zeInit() is used

PS. As can be seen from the outputs above, the zello_sysman crash with zesInit() seems to be fixed in the latest compute-runtime versions.

joshuaranjan · 2024-05-16T07:24:27Z

@eero-t
Checking this

eero-t · 2024-07-17T14:22:08Z

@eero-t Checking this

@joshuaranjan Any results?

This was referenced Feb 23, 2024

Will ARC be supported? intel/xpumanager#74

Open

Assert with Xe KMD when using -DNEO_ENABLE_XE_DRM_DETECTION=TRUE #696

Closed

JablonskiMateusz added the L0 Sysman label (Issue related to L0 Sysman) on Mar 4, 2024
