Regression: Sysman engine metrics do not work any more #707

Open
eero-t opened this issue Feb 22, 2024 · 8 comments
Labels
L0 Sysman Issue related to L0 Sysman

Comments

@eero-t

eero-t commented Feb 22, 2024

Somewhere between these dates / versions:

  • 2024-01-11: 23.48.27912.11
  • 2024-01-25: 23.52.28202.14

Sysman engine metrics stopped working: https://spec.oneapi.io/level-zero/latest/sysman/api.html#engine

I've tested this with my own compute-runtime builds, but it's also reported to happen between e.g. the following release packages:

This regression is visible both with the current Ubuntu 22.04 5.15 kernel and its HWE 6.5 kernel (on TGL-H iGPU), and with an older internal 5.15 BKC kernel (on ATS-M dGPU), so I assume it is a generic one, not related to any particular HW.

According to strace -f -e perf_event_open, the earlier version makes 36 successful calls like this:
perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x100007, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = 57

And the current version makes 136 failing calls like this:
perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x10000b, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_GROUP, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = -1 ENOENT (No such file or directory)

Looking at the arguments of all those calls, there are two differences: the config values differ from the earlier ones in every call, and every call now includes the PERF_FORMAT_GROUP option.
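The comparison above can be scripted. The following is a minimal sketch, assuming strace output in the format quoted in this comment; `PerfCall`, `parse_calls` and `summarize` are names made up for illustration, and the two sample lines are copied from the traces above.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional

# Two lines copied from the strace output quoted in this comment.
SAMPLE = """\
perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x100007, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = 57
perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x10000b, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_GROUP, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = -1 ENOENT (No such file or directory)
"""

# Picks out the config value, the read_format flags and the return value.
CALL_RE = re.compile(
    r"perf_event_open\(\{.*?config=(?P<config>0x[0-9a-fA-F]+|\d+),"
    r".*?read_format=(?P<fmt>[A-Z_|0-9x]+),"
    r".*\)\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<errno>[A-Z]+))?"
)


class PerfCall(NamedTuple):
    config: int
    read_format: frozenset
    ret: int
    errno: Optional[str]

    @property
    def ok(self):
        # perf_event_open returns a file descriptor (>= 0) on success.
        return self.ret >= 0


def parse_calls(lines):
    """Return one PerfCall per perf_event_open line; other lines are skipped."""
    calls = []
    for line in lines:
        m = CALL_RE.search(line)
        if m is None:
            continue
        calls.append(PerfCall(
            config=int(m["config"], 0),
            read_format=frozenset(m["fmt"].split("|")),
            ret=int(m["ret"]),
            errno=m["errno"],
        ))
    return calls


def summarize(calls):
    """Count calls by (config, uses PERF_FORMAT_GROUP, succeeded)."""
    return Counter(
        (hex(c.config), "PERF_FORMAT_GROUP" in c.read_format, c.ok)
        for c in calls
    )


if __name__ == "__main__":
    for key, count in summarize(parse_calls(SAMPLE.splitlines())).items():
        print(key, count)
```

Feeding the full trace of each version through `summarize` would make it easy to confirm that all failing calls share the PERF_FORMAT_GROUP flag and the changed config values.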

@eero-t
Author

eero-t commented Feb 22, 2024

Other metrics have not regressed, only engine ones.

Engine metrics also fail with the latest zello_sysman, or one built from the same compute-runtime version:

# zello_sysman -e
...
Device Name = Intel(R) Data Center GPU Flex 170
UUID: (unprintable raw bytes; see the decimal Device UUID below)
Device Name = Intel(R) Data Center GPU Flex 170
UUID: (unprintable raw bytes; see the decimal Device UUID below)
Sysman Initialization done via zeInit

 ----  Engine tests ---- 
Device UUID: 37 52 199 174 120 41 161 243 0 0 0 0 0 0 0 0 
Could not retrieve Engine domains

 ----  Engine tests ---- 
Device UUID: 34 158 88 160 57 78 239 207 0 0 0 0 0 0 0 0 
Could not retrieve Engine domains

@eero-t
Author

eero-t commented Feb 22, 2024

No idea whether initializing zello_sysman with zeInit or zesInit (selected with the ZELLO_SYSMAN_USE_ZESINIT=1 env var) affects this, because it segfaults with the latter when querying engine metrics (with both the i915 and xe drivers).

@eero-t
Author

eero-t commented Feb 29, 2024

The segfault when zesInit() is used is specific to zello_sysman. In my own program there is no segfault; engine metrics just do not work.

@saik-intel
Contributor

Sure, we will take a look internally.

JablonskiMateusz added the "L0 Sysman" (Issue related to L0 Sysman) label on Mar 4, 2024
@eero-t
Author

eero-t commented May 15, 2024

Noticed (by accident) that engine metrics now work if the process has the SYS_ADMIN capability; the PERFMON capability alone is not enough any more.

PERFMON capability:

# docker run -it --net none --rm --env ZELLO_SYSMAN_USE_ZESINIT=1 --cap-drop all --cap-add PERFMON --user root --device /dev/dri:/dev/dri:rw <image> bash -c "grep ^Cap /proc/self/status && zello_sysman -e"
CapInh:	0000000000000000
CapPrm:	0000004000000000
CapEff:	0000004000000000
CapBnd:	0000004000000000
CapAmb:	0000000000000000
ZES_ENABLE_SYSMAN environment variable Not Set
Sysman Initialization done via zesInit

 ----  Engine tests ---- 
Device UUID: 
134 128 160 86 8 0 0 0 3 0 0 0 0 0 0 0 
Could not retrieve Engine domains

# capsh --decode=0000004000000000
0x0000004000000000=cap_perfmon

=> Correct capability, but no engine metrics.
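The Cap* masks in these outputs can also be decoded without capsh. The following is a minimal sketch, assuming only the two capability bit numbers relevant here (21 for CAP_SYS_ADMIN and 38 for CAP_PERFMON, as listed in the Linux capability headers); `decode_caps` and `effective_caps` are hypothetical helper names, and any other bit is reported by number only.

```python
# Bit numbers of the two capabilities discussed in this issue.
# Other capabilities are deliberately left out of this sketch.
CAP_BITS = {21: "cap_sys_admin", 38: "cap_perfmon"}


def decode_caps(mask_hex):
    """Turn a hex mask such as the CapEff value into a list of names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Unknown bits are shown as "bit_<n>" rather than guessed at.
            names.append(CAP_BITS.get(bit, f"bit_{bit}"))
        mask >>= 1
        bit += 1
    return names


def effective_caps(status_text):
    """Decode the CapEff line from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []


if __name__ == "__main__":
    # Masks copied from the two docker runs in this comment.
    print(decode_caps("0000004000000000"))  # PERFMON run
    print(decode_caps("0000000000200000"))  # SYS_ADMIN run
```

A monitoring program could call `effective_caps` on its own `/proc/self/status` at startup and log which of the two capabilities it actually holds before it tries to open engine counters.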

SYS_ADMIN capability:

docker run -it --net none --rm --env ZELLO_SYSMAN_USE_ZESINIT=1 --cap-drop all --cap-add SYS_ADMIN --user root --device /dev/dri:/dev/dri:rw registry.fi.intel.com/dgpu-enabling/collectd-gpu-plugin:GIT-2024-05-10-collectd-6.0 bash -c "grep ^Cap /proc/self/status && zello_sysman -e"
CapInh:	0000000000000000
CapPrm:	0000000000200000
CapEff:	0000000000200000
CapBnd:	0000000000200000
CapAmb:	0000000000000000
ZES_ENABLE_SYSMAN environment variable Not Set
Sysman Initialization done via zesInit

 ----  Engine tests ---- 
Device UUID: 
134 128 160 86 8 0 0 0 3 0 0 0 0 0 0 0 
[0]
...
Engine Type = ZES_ENGINE_GROUP_RENDER_ALL || Active Time = 0 || Timestamp = 1204094
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[9]
Engine Type = ZES_ENGINE_GROUP_ALL || Active Time = 0 || Timestamp = 1354503
Engine Type = ZES_ENGINE_GROUP_COMPUTE_ALL || Active Time = 0 || Timestamp = 1354510
Engine Type = ZES_ENGINE_GROUP_MEDIA_ALL || Active Time = 0 || Timestamp = 1354519
Engine Type = ZES_ENGINE_GROUP_COPY_ALL || Active Time = 0 || Timestamp = 1354527
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354536
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354547
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354548
Engine Type = ZES_ENGINE_GROUP_COMPUTE_SINGLE || Active Time = 0 || Timestamp = 1354550
Engine Type = ZES_ENGINE_GROUP_RENDER_SINGLE || Active Time = 0 || Timestamp = 1354549
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 1354552
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 1354551
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 1354550
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 1354548
Engine Type = ZES_ENGINE_GROUP_COPY_SINGLE || Active Time = 0 || Timestamp = 1354547
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 1354545
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 1354545
Engine Type = ZES_ENGINE_GROUP_RENDER_ALL || Active Time = 0 || Timestamp = 1354543
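For reference, the Active Time and Timestamp pairs printed above only become a utilization figure once two snapshots are compared, as the Sysman engine API documentation describes: the active-time delta divided by the timestamp delta. The following is a minimal sketch of that arithmetic; `EngineStats` and `utilization` are illustrative names, not part of zello_sysman, and the sample values are the two ZES_ENGINE_GROUP_RENDER_ALL lines from the output above.

```python
from typing import NamedTuple, Optional


class EngineStats(NamedTuple):
    """One snapshot, mirroring the two counters printed by zello_sysman."""
    active_time: int
    timestamp: int


def utilization(s1: EngineStats, s2: EngineStats) -> Optional[float]:
    """Fraction of time the engine group was busy between two snapshots."""
    elapsed = s2.timestamp - s1.timestamp
    if elapsed <= 0:
        # Same or out-of-order snapshots carry no usable information.
        return None
    return (s2.active_time - s1.active_time) / elapsed


if __name__ == "__main__":
    # RENDER_ALL values from iterations [0] and [9] in the output above.
    first = EngineStats(active_time=0, timestamp=1204094)
    last = EngineStats(active_time=0, timestamp=1354543)
    print(utilization(first, last))
```

With the values above the active-time delta is zero, so the result is 0.0, which is consistent with an idle GPU during the test.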

Same issue both with and without the Intel i915 DKMS on Ubuntu 22.04 HWE 6.5 kernel:
intel-i915-dkms/1.24.1.11.240117.14, 6.5.0-35-generic, x86_64: installed
AUXILIARY_BUS is enabled for 6.5.0-35-generic.

=> This should be tagged as a security issue, because requiring GPU monitoring containers to have the (way too wide) SYS_ADMIN capability, instead of the more targeted PERFMON capability intended for perf API usage, means that a subverted process can escape the containment.

@eero-t
Author

eero-t commented May 15, 2024

As far as I can see from the latest i915 KMD code, SYS_ADMIN capability is needed only for (early Gens) SECURE_BATCHES. For perf API usage, PERFMON capability should still be enough: https://cgit.freedesktop.org/drm-tip/tree/drivers/gpu/drm/i915/i915_perf.c#n3893

And indeed, with drm-tip v6.9 kernel i915 KMD, PERFMON capability is enough...

However, engine metrics are there only when ZELLO_SYSMAN_USE_ZESINIT=1 is used. With zeInit(), engine metrics are still missing for zello_sysman.

=> There are two regressions:

  • i915 KMD: engine (perf) metrics not working with PERFMON capability, only with too wide SYS_ADMIN (already fixed, at least in drm-tip)
  • compute-runtime (or just zello_sysman?): engine metrics not working any more when zeInit() is used

PS. As can be seen from the outputs above, the zello_sysman crash with zesInit() seems to be fixed in the latest compute-runtime versions.

@joshuaranjan
Contributor

@eero-t
Checking this

@eero-t
Author

eero-t commented Jul 17, 2024

> @eero-t Checking this

@joshuaranjan Any results?
