Regression: Sysman engine metrics do not work any more #707
Other metrics have not regressed, only the engine ones. Engine metrics also fail with the latest zello_sysman, or with one built from the same
No idea whether
Segfault when zesInit() is used, is specific
Sure, we will take a look internally.
Noticed (by accident) that engine metrics do work (now) if the process has the PERFMON capability:
=> Correct capability, but no engine metrics. SYS_ADMIN capability:
Same issue both with and without the Intel i915 DKMS driver on the Ubuntu 22.04 HWE 6.5 kernel:
=> This should be tagged as a security issue, because requiring GPU monitoring containers to have (way too wide)
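Since these reports hinge on which capabilities (PERFMON, SYS_ADMIN) the monitoring process actually holds, it can help to check the effective set directly. The `perf_event_open` calls in the strace output quoted in this issue open CPU-wide events (pid -1, cpu 0), which per perf_event_open(2) generally require `CAP_PERFMON` or `CAP_SYS_ADMIN` unless `kernel.perf_event_paranoid` is lowered. A minimal Python sketch, assuming a Linux `/proc` filesystem: it parses the `CapEff` hex mask from `/proc/self/status` and tests the bits for `CAP_PERFMON` (38, available since Linux 5.8) and `CAP_SYS_ADMIN` (21), numbered as in `linux/capability.h`. The helper names are hypothetical, and the sketch only reports capabilities; it says nothing about what the driver or Sysman itself requires.

```python
# Capability bit numbers as defined in linux/capability.h.
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38  # introduced in Linux 5.8


def parse_cap_mask(status_text: str, field: str = "CapEff") -> int:
    """Return the capability bitmask of `field` from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == field:
            return int(value.strip(), 16)  # the kernel prints the mask in hex
    raise KeyError(field)


def has_cap(mask: int, cap: int) -> bool:
    """True if bit number `cap` is set in the capability mask."""
    return bool((mask >> cap) & 1)


def current_caps(field: str = "CapEff") -> dict:
    """Report the two capabilities of interest for the calling process (Linux only)."""
    with open("/proc/self/status") as f:
        mask = parse_cap_mask(f.read(), field)
    return {
        "CAP_PERFMON": has_cap(mask, CAP_PERFMON),
        "CAP_SYS_ADMIN": has_cap(mask, CAP_SYS_ADMIN),
    }
```

Running `print(current_caps())` inside a monitoring container shows whether it was granted either capability; passing `"CapPrm"` or `"CapBnd"` as `field` inspects the permitted or bounding set instead.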
As far as I can see from the latest
And indeed, with
However, engine metrics are there only when
=> There are two regressions:
PS. As can be seen from the above outputs,
@eero-t
@joshuaranjan Any results?
Somewhere between these dates / versions:
Sysman engine metrics stopped working: https://spec.oneapi.io/level-zero/latest/sysman/api.html#engine
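For context, a Sysman client reads these metrics per engine group with `zesEngineGetActivity()`, which fills a `zes_engine_stats_t` holding two monotonic counters, `activeTime` and `timestamp`; utilization over an interval is the change in `activeTime` divided by the change in `timestamp`. A minimal Python sketch of that arithmetic, assuming the two samples have already been read through the C API (the `EngineStats` type and `engine_utilization()` helper are hypothetical, not part of any Level Zero binding):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineStats:
    """Hypothetical mirror of the two zes_engine_stats_t fields used here."""
    active_time: int  # activeTime: monotonic counter of time the engine group was busy
    timestamp: int    # timestamp: monotonic counter taken when active_time was sampled


def engine_utilization(s1: EngineStats, s2: EngineStats) -> float:
    """Percentage of the interval between s1 and s2 during which the engine group was busy."""
    elapsed = s2.timestamp - s1.timestamp
    if elapsed <= 0:
        raise ValueError("s2 must be sampled after s1")
    return 100.0 * (s2.active_time - s1.active_time) / elapsed
```

Both counters only need to share a unit for the ratio to be meaningful, which is why the sketch never converts them.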
I've tested this with my own compute-runtime builds, but it has also been reported to happen between, e.g., the following release packages:

This regression is visible both with the current Ubuntu 22.04 5.15 kernel and its 6.5 HWE kernel (on a TGL-H iGPU), and with an older 5.15 internal BKC kernel (on an ATS-M dGPU), so I assume it is a generic regression, not related to any particular HW.
According to `strace -f -e perf_event_open`, the earlier version makes 36 successful calls like this:

    perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x100007, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = 57
And the current version makes 136 failing calls like this:

    perf_event_open({type=0x58 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0x10000b, sample_period=0, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_GROUP, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, 0) = -1 ENOENT (No such file or directory)
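The `config` values in the two calls above can be placed against the i915 PMU event encoding from the upstream `include/uapi/drm/i915_drm.h`: per-engine events are `class << 12 | instance << 4 | sample`, and the first value past that space, `0x100000`, starts the non-engine ("other") counters. A small Python decoder, assuming the driver in use follows that upstream encoding (the DKMS driver may define additional counters whose names this sketch does not know, and it does not try to handle multi-tile, per-GT encodings; `decode_i915_config()` is a hypothetical helper):

```python
# i915 PMU config layout as in upstream include/uapi/drm/i915_drm.h.
I915_PMU_SAMPLE_BITS = 4
I915_PMU_SAMPLE_INSTANCE_BITS = 8
I915_PMU_CLASS_SHIFT = I915_PMU_SAMPLE_BITS + I915_PMU_SAMPLE_INSTANCE_BITS  # 12

# __I915_PMU_OTHER(0) == __I915_PMU_ENGINE(0xff, 0xff, 0xf) + 1
I915_PMU_OTHER_BASE = ((0xFF << I915_PMU_CLASS_SHIFT)
                       | (0xFF << I915_PMU_SAMPLE_BITS)
                       | 0xF) + 1

# enum drm_i915_pmu_engine_sample
SAMPLE_NAMES = {0: "busy", 1: "wait", 2: "sema"}


def decode_i915_config(config: int) -> dict:
    """Classify a perf config value using the upstream i915 encoding."""
    if config >= I915_PMU_OTHER_BASE:
        # Non-engine counter: only its index is recoverable without driver headers.
        return {"kind": "other", "index": config - I915_PMU_OTHER_BASE}
    sample = config & ((1 << I915_PMU_SAMPLE_BITS) - 1)
    return {
        "kind": "engine",
        "class": config >> I915_PMU_CLASS_SHIFT,
        "instance": (config >> I915_PMU_SAMPLE_BITS)
                    & ((1 << I915_PMU_SAMPLE_INSTANCE_BITS) - 1),
        "sample": SAMPLE_NAMES.get(sample, sample),
    }
```

Under that assumption, both `0x100007` and `0x10000b` decode as non-engine counters (indices 7 and 11) rather than as per-engine busy counters.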
Looking at the arguments of all those calls, the differences are that the `config` values differ from the earlier ones in every call, and that all of them now include the `PERF_FORMAT_GROUP` option.
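With 36 and 136 calls to compare, this field-by-field comparison is easy to script. The Python sketch below parses `perf_event_open` lines in the strace output format quoted in this issue (it assumes exactly that single-line layout, with `/* ... */` annotations and a trailing `...` inside the attribute struct) and reports which attribute fields differ between two calls. `parse_perf_event_open()` and `diff_attrs()` are hypothetical helpers. For background, per perf_event_open(2), `PERF_FORMAT_GROUP` makes a `read()` on the group leader return the values of all events in the group at once, instead of one counter per file descriptor.

```python
import re

# strace annotations such as /* PERF_TYPE_??? */ are removed before parsing.
_COMMENT = re.compile(r"/\*.*?\*/")
_CALL = re.compile(
    r"perf_event_open\(\{(?P<attr>.*?)\}, (?P<args>[^)]*)\)"
    r" = (?P<ret>-?\d+)(?: (?P<errno>E[A-Z]+))?"
)


def parse_perf_event_open(line: str) -> dict:
    """Split one strace perf_event_open line into attr fields, args and result."""
    m = _CALL.search(_COMMENT.sub("", line))
    if m is None:
        raise ValueError("not a perf_event_open line from strace")
    attr = {}
    for field in m.group("attr").split(","):
        key, sep, value = field.strip().partition("=")
        if sep:  # skips strace's trailing "..." (fields it did not print)
            attr[key] = value.strip()
    return {
        "attr": attr,
        "args": [a.strip() for a in m.group("args").split(",")],  # pid, cpu, group_fd, flags
        "ret": int(m.group("ret")),
        "errno": m.group("errno"),  # None for successful calls
    }


def diff_attrs(old: dict, new: dict) -> dict:
    """Map each differing attr field to its (old, new) pair of values."""
    keys = sorted(old["attr"].keys() | new["attr"].keys())
    return {k: (old["attr"].get(k), new["attr"].get(k))
            for k in keys if old["attr"].get(k) != new["attr"].get(k)}
```

For the sample pair quoted above, it reports that only `config` and `read_format` differ, while the remaining arguments (pid -1, cpu 0, group_fd -1, flags 0) are identical.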