Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zesDeviceGetProperties(): Don't crash miserably when Sysman wasn't initialized #389

Closed
wants to merge 1 commit into from

Conversation

bgoglin
Copy link

@bgoglin bgoglin commented Dec 8, 2020

This is an untested patch (don't know how to compile oneAPI) to provide a short-term workaround for segfaults reported in oneapi-src/level-zero#36

Signed-off-by: Brice Goglin [email protected]

@jandres742
Copy link

Thanks @bgoglin for the patch. So without the patch, L0 segfaults?

@bgoglin
Copy link
Author

bgoglin commented Dec 8, 2020

zesDeviceGetProperties() segfaults unless Sysman was initialized during zeInit().

Here's the code for reproducing this. It could be simpler but I basically just extracted what I was going to put in hwloc for listing L0 devices. We need a way to find out whether Sysman was enabled when zeInit() was first called in the process (hwloc may not always be the first one initializing L0). Returning an error from Sysman functions is an easy way to detect that case.

#include <ze_api.h>
#include <zes_api.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
  ze_result_t res;
  ze_driver_handle_t *drh;
  uint32_t nbdrivers, i;

  res = zeInit(0);
  if (res != ZE_RESULT_SUCCESS) {
    printf("zeInit failed\n");
    return 0;
  }

  nbdrivers = 0;
  res = zeDriverGet(&nbdrivers, NULL);
  if (res != ZE_RESULT_SUCCESS || !nbdrivers)
    return 0;
  drh = malloc(nbdrivers * sizeof(*drh));
  if (!drh)
    return 0;
  res = zeDriverGet(&nbdrivers, drh);
  if (res != ZE_RESULT_SUCCESS) {
    free(drh);
    return 0;
  }

  for(i=0; i<nbdrivers; i++) {
    uint32_t nbdevices, j;
    ze_device_handle_t *dvh;

    nbdevices = 0;
    res = zeDeviceGet(drh[i], &nbdevices, NULL);
    if (res != ZE_RESULT_SUCCESS || !nbdevices)
      continue;
    dvh = malloc(nbdevices * sizeof(*dvh));
    if (!dvh)
      continue;
    res = zeDeviceGet(drh[i], &nbdevices, dvh);
    if (res != ZE_RESULT_SUCCESS) {
      free(dvh);
      continue;
    }

    for(j=0; j<nbdevices; j++) {
      zes_device_properties_t prop;
      zes_device_handle_t sdvh = dvh[i];

      res = zesDeviceGetProperties(sdvh, &prop); /* SEGFAULT unless ZES_ENABLE_SYSMAN=1 */
      printf("zesDeviceGetProperties returned %d\n", res);
    }
  }
}

@jandres742
Copy link

@bgoglin Thanks. Taking a closer look internally and will get back to you.

@jandres742
Copy link

@bgoglin : we have merged a more complete solution in 86e2563, which avoids the segfaults. Could you try on your end?

@bgoglin
Copy link
Author

bgoglin commented Dec 9, 2020

@jandres742 Do you have a CI building deb packages? I tried to build from git manually in the past but couldn't get it to work.

Otherwise, a variant of my code above would be to do zeInit(); putenv("ZES_ENABLE_SYSMAN=1"); zeInit(); and then the rest of the code. That's what going to happen if some library initializes L0 without Sysman before hwloc tries initializing with Sysman.

Your commit seems to read ZES_ENABLE_SYSMAN multiple times from the environment. Do you know if all these readings will be made during the first zeInit()? If not, the application or some library may change the environment in the middle. Reading the environment only once and remembering isSysManEnabled might be more safe.

@jandres742
Copy link

@bgoglin

You are correct. However, the patch I did was meant to just fix the segfault problem, so the application fails gracefully when SysMan hasn't been initialized. A complete solution for oneapi-src/level-zero#36 is still under discussion.

@bgoglin
Copy link
Author

bgoglin commented Dec 9, 2020

I understand. What I am saying is that it would be better to look at ZES_ENABLE_SYSMAN only once and store the result in a variable so that all initialization everywhere is done consistently with respect to sysman being enabled or not.

@jandres742
Copy link

That's a good idea. As we move forward to have a more complete solution for oneapi-src/level-zero#36, we could add that. The patch above performs the checking of SYSMAN when the driver is being loaded by the loader, and at that moment there are no L0 objects created, so I preferred to avoid creating new objects for this checking, mainly considering that the final solution might change.

@bgoglin
Copy link
Author

bgoglin commented Dec 24, 2020

@jandres742 Just tested the Debian packages released a couple days ago with your patch (20.51.18762), no crash anymore, thanks.
I am getting ZE_RESULT_ERROR_UNSUPPORTED_VERSION from zesDeviceGetProperties(), can you confirm whether that's the error code you expected?

@jandres742
Copy link

@jandres742 Just tested the Debian packages released a couple days ago with your patch (20.51.18762), no crash anymore, thanks.
I am getting ZE_RESULT_ERROR_UNSUPPORTED_VERSION from zesDeviceGetProperties(), can you confirm whether that's the error code you expected?

@bgoglin Glad to hear that. For now, that's the error we are returning, but agree, it's not the more intuitive one. We should be actually returning "UNITIALIZED". So, in this first patch we took care of the segfault, but we need a second patch in the loader to return UNITIALIZED - we are working on this second patch.

@bgoglin
Copy link
Author

bgoglin commented Jan 5, 2021

Closing this issue, the crash doesn't occur anymore, further discussion may go to oneapi-src/level-zero#36

@bgoglin bgoglin closed this Jan 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants