Will ARC be supported? #74

Open
nathanodle opened this issue Feb 18, 2024 · 21 comments

Comments

@nathanodle

There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.

I'm working on a multi-GPU ARC system and it's hard to troubleshoot certain things without knowing what the GPUs are doing outside of code.

Thanks!

@fmiao2372

XPU Manager mainly targets Intel data center GPU. For some missing metrics, please refer to issue 26. Which metrics are supported depends on the underlying HW, its FW, and the kernel + user-space drivers. Not all metrics supported by XPU Manager are provided by every HW or driver stack.

@eero-t

eero-t commented Feb 23, 2024

XPU Manager mainly targets Intel data center GPU.

While XPUM is validated only for those, it uses LevelZero Sysman API to query the metrics: https://spec.oneapi.io/level-zero/latest/sysman/api.html

And Intel GPU L0 backend releases do list ARC (DG2) as having "production" level support: https://github.com/intel/compute-runtime/


PS. Release testing for the Sysman part of L0 still seems somewhat spotty: over the years I've noticed a couple of regressions, the latest one being intel/compute-runtime#707

The user-space driver trying to support 3 Intel kernel GPU driver uAPIs at the same time may have something to do with it:

  • i915-upstream (Linus' Linux kernel tree)
  • i915-prelim (out-of-tree / backport kernel driver for older enterprise/LTS kernels)
  • xe (rewrite of i915 for future HW, whose uAPI is a moving target as it's still being upstreamed)

Driver releases are currently built with support for the first two uAPIs, but it's possible that the changes to support the last one could regress them. So in addition to the latest driver, one could also try one or two older ones, especially for HW that's been out for a while, like ARC.
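
Since the available metrics depend on which kernel driver is in use, it can help to check which one is actually bound to each GPU. The following is a minimal sketch, assuming the usual Linux sysfs layout where /sys/class/drm/cardN/device/driver is a symlink to the bound driver; the function name bound_gpu_drivers is made up for this example. Note that it cannot tell i915-upstream from i915-prelim, since both register as i915.

```python
import os
from pathlib import Path


def bound_gpu_drivers(sysfs_drm="/sys/class/drm"):
    """Map each DRM card (card0, card1, ...) to the kernel driver bound to it.

    Hypothetical helper; assumes the standard sysfs layout.
    """
    drivers = {}
    root = Path(sysfs_drm)
    if not root.is_dir():
        return drivers
    for card in sorted(root.glob("card[0-9]*")):
        # Skip connector entries such as card0-DP-1.
        if "-" in card.name:
            continue
        link = card / "device" / "driver"
        if link.is_symlink():
            # The last path component of the symlink target is the
            # driver name, for example i915 or xe.
            drivers[card.name] = os.path.basename(os.readlink(link))
    return drivers
```

Calling bound_gpu_drivers() without arguments inspects the running system; passing another directory makes the helper testable without a GPU.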

@eero-t

eero-t commented Feb 23, 2024

There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.

On a quick test with an A770 (0x56a0) on a TGL-H host, with GuC 70.8.0 FW, using the "6.5.0-18-generic" HWE kernel (=upstream with Ubuntu patches) on Ubuntu 22.04.4 LTS, with compute-runtime "23.48.27912.11" (own build), I get the following GPU metrics from the driver:

  • Engine utilization
  • Frequency
  • Memory usage
  • Power usage

(There may be some kernel DKMS driver + user-space driver combo which would also provide GPU memory BW, temperature and maybe error counters, but at least one of those needs an out-of-band metrics kernel driver instead of the GPU one.)

PS. I'm checking these with the tester in the corresponding compute-runtime version (after installing level-zero frontend devel package):

$ DRIVER_TAG=23.48.27912.11
$ wget --no-verbose https://raw.githubusercontent.com/intel/compute-runtime/$DRIVER_TAG/level_zero/tools/test/black_box_tests/zello_sysman.cpp
$ g++ -O2 -Wall -o zello_sysman zello_sysman.cpp -lze_loader
$ ./zello_sysman --engine --frequency --memory --temperature --ras --power

(--power needs to be last option as it has optional args.)
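
When calling the tester from a script, the ordering constraint on --power is easy to forget. Here is a small sketch of a command-line builder that always moves --power to the end; the helper name zello_sysman_args is made up for illustration, and the only flags assumed are the ones shown in the command above.

```python
def zello_sysman_args(options, binary="./zello_sysman"):
    """Return an argv list with --power (if requested) moved to the end.

    Hypothetical helper; option names may be given with or without
    the leading dashes.
    """
    flags = [f"--{name.lstrip('-')}" for name in options]
    # Drop duplicates while keeping first-seen order.
    flags = list(dict.fromkeys(flags))
    rest = [flag for flag in flags if flag != "--power"]
    tail = ["--power"] if "--power" in flags else []
    return [binary, *rest, *tail]
```

The resulting list can be passed directly to subprocess.run().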

@QiXuanWang

This is a very much needed feature. The zello_sysman command provided is not that friendly.

@eero-t

eero-t commented May 13, 2024

This is a very much needed feature. The zello_sysman command provided is not that friendly.

@QiXuanWang Just use XPUM then?

If zello_sysman shows some metric for your HW/FW/KMD/UMD combo, I do not see any reason why XPUM would not show it too with the same HW/SW stack...

@qnixsynapse

I gave zello_sysman a try... It seems RAS and temperature are currently not supported. Temperature metrics are such a needed feature imo. I opened a report on i915's kernel driver repository.

$ sudo ./zello_sysman --engine --frequency --memory --temperature --ras --power
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN 
ZES_ENABLE_SYSMAN environment variable Set
Device Name = Intel(R) Arc(TM) A750 Graphics
UUID: 
134 128 161 86 8 0 0 0 3 0 0 0 0 0 0 0 
Sysman Initialization done via zeInit

 ----  Frequency tests ---- 
freqProperties.type = 0
freqProperties.canControl = 1
freqProperties.isThrottleEventSupported = 0
freqProperties.min = 300
freqProperties.max = 2400
freqState.currentVoltage = -1
freqState.request = 2400
freqState.tdp = -1
freqState.efficient = 600
freqState.actual = 2400
freqState.throttleReasons = 0
freqRange.min = 300
freqRange.max = 2400
 frequency = 300
...
...
 frequency = 2400
Setting Frequency Range . min 300
Setting Frequency Range . max 300
After Setting Getting Frequency Range . min 300
After Setting Getting Frequency Range . max 300
Setting Frequency Range . min 300
Setting Frequency Range . max 2400
After Setting Getting Frequency Range . min 300
After Setting Getting Frequency Range . max 2400

 ----  Engine tests ---- 
Device UUID: 
134 128 161 86 8 0 0 0 3 0 0 0 0 0 0 0 
[0]
Engine Type = ZES_ENGINE_GROUP_RENDER_SINGLE || Active Time = 0 || Timestamp = 427
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 417
Engine Type = ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE || Active Time = 0 || Timestamp = 401
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 372
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE || Active Time = 0 || Timestamp = 336
Engine Type = ZES_ENGINE_GROUP_COPY_SINGLE || Active Time = 0 || Timestamp = 293
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 241
Engine Type = ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE || Active Time = 0 || Timestamp = 178
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[1]
....
....

 ----  Temperature tests ---- 
Could not retrieve Temperature domains

 ----  Power tests ---- 
properties.onSubdevice = 0
properties.subdeviceId = 0
properties.canControl = 1
properties.isEnergyThresholdSupported= 0
properties.defaultLimit= -1
properties.maxLimit =-1
properties.minLimit =-1
CurrentPower = 9.45378 W forrootDevice
CurrentPower = 8.96061 W forrootDevice
CurrentPower = 9.37615 W forrootDevice
CurrentPower = 9.49795 W forrootDevice
CurrentPower = 8.42323 W forrootDevice
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesPowerGetLimits(handle, &sustainedGetDefault, nullptr, &peakGetDefault): testSysmanPower: 336
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesPowerSetLimits(handle, &sustainedGetDefault, nullptr, &peakGetDefault): testSysmanPower: 338
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesPowerGetLimitsExt(handle, &limitCount, nullptr): getPowerLimits: 201
powerLimitDesc.count = 0

 ----  Memory tests ---- 
Memory Type = ZES_MEM_TYPE_DDR
On Subdevice = 0
Subdevice Id = 0
Memory Size = 0
Number of channels = -1
Memory Health = ZES_MEM_HEALTH_OK
The total allocatable memory in bytes = 8522825728
The free memory in bytes = 1785864192
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE returned by zesMemoryGetBandwidth(handle, &memoryBandwidth): testSysmanMemory: 1061
Memory Read Counter = 0
Memory Write Counter = 0
Memory Maximum Bandwidth = 0
Memory Timestamp = 0

 ----  Ras tests ---- 
Could not retrieve Ras Error Sets

Also, power is a surprise here. Thankfully no more idle 30W power usage.
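
For anyone who finds the raw zello_sysman output unwieldy, a few lines of parsing go a long way. This is a rough sketch that extracts the power samples and the memory totals from output shaped like the log above; the line formats are taken from that log only and may differ in other compute-runtime versions, and the function name summarize is made up for this example.

```python
import re

# Patterns follow the output format shown in the log above.
POWER_RE = re.compile(r"^CurrentPower = ([0-9.]+) W")
TOTAL_RE = re.compile(r"^The total allocatable memory in bytes = (\d+)")
FREE_RE = re.compile(r"^The free memory in bytes = (\d+)")


def summarize(output):
    """Return mean power and memory usage found in zello_sysman output.

    If several memory modules are listed, only the last one is kept.
    """
    powers, total, free = [], None, None
    for line in output.splitlines():
        line = line.strip()
        if m := POWER_RE.match(line):
            powers.append(float(m.group(1)))
        elif m := TOTAL_RE.match(line):
            total = int(m.group(1))
        elif m := FREE_RE.match(line):
            free = int(m.group(1))
    summary = {}
    if powers:
        summary["mean_power_w"] = sum(powers) / len(powers)
    if total is not None and free is not None:
        summary["used_bytes"] = total - free
        summary["used_fraction"] = (total - free) / total if total else None
    return summary
```

Metrics the driver does not report (temperature and RAS in the run above) simply do not appear in the returned dict.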

@eero-t

eero-t commented Jun 17, 2024

I gave it a try... It seems RAS and temperature are currently not supported. Temperature metrics are such a needed feature imo. I opened a report on i915's kernel driver repository.

Those are OoB (Out of Band) metrics, i.e. not provided by i915 kernel (GPU) driver, but by intel_pmt (PMT) driver.

I get temperature metrics for A770 both with drm-tip 6.9 kernel [1] with latest (self-built) compute-runtime release, and when using Ubuntu 6.5 HWE kernel with kernel DKMS package(s) from Intel repo [2].

[1] https://cgit.freedesktop.org/drm-tip/
[2] https://dgpu-docs.intel.com/

@qnixsynapse

qnixsynapse commented Jun 18, 2024

Unfortunately, I still can't get temperature metrics even with the 6.9.5 kernel. I am using Arch Linux and I don't mind compiling a kernel with the patches which enable the metrics.

I get temperature metrics for A770 both with drm-tip 6.9 kernel [1] with latest (self-built) compute-runtime release, and when using Ubuntu 6.5 HWE kernel with kernel DKMS package(s) from Intel repo [2].

To me this feels like the support is in the (i915) DRM kernel driver rather than the intel_pmt driver (unless the DKMS driver from the Intel repo adds a new intel_pmt driver). I am trying to find the commit which enables it.

edit. And this is where it should have been but it isn't.

@eero-t

eero-t commented Jun 18, 2024

Unfortunately, I still can't get temperature metrics even with the 6.9.5 kernel. I am using Arch Linux and I don't mind compiling a kernel with the patches which enable the metrics.

Are these enabled in your kernel builds?

# grep PMT /boot/config-<kernelversion>
CONFIG_INTEL_PMT_CLASS=m
CONFIG_INTEL_PMT_TELEMETRY=m
CONFIG_INTEL_PMT_CRASHLOG=m
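
The same check can be scripted, which is handy on distros that do not ship /boot/config-* files. Below is a minimal sketch, assuming the kernel config is available either as a plain text file or gzip-compressed (for example /proc/config.gz on kernels that expose it); the helper names are made up for this example. As the replies in this thread show, having these options enabled is not by itself a guarantee that temperature metrics show up.

```python
import gzip
from pathlib import Path

# Option names as listed in the grep output above.
PMT_OPTIONS = (
    "CONFIG_INTEL_PMT_CLASS",
    "CONFIG_INTEL_PMT_TELEMETRY",
    "CONFIG_INTEL_PMT_CRASHLOG",
)


def pmt_config(config_text):
    """Return {option: 'y' | 'm' | None} for the PMT options."""
    found = dict.fromkeys(PMT_OPTIONS)
    for line in config_text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Lines like '# CONFIG_X is not set' contain no '=' and are skipped.
        if sep and key in found:
            found[key] = value
    return found


def read_kernel_config(path):
    """Read a kernel config file, handling gzip by file suffix."""
    p = Path(path)
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt") as fh:
        return fh.read()
```

A value of None means the option is unset or missing, in which case the PMT driver was not built for that kernel.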

I get temperature metrics for A770 both with drm-tip 6.9 kernel [1] with latest (self-built) compute-runtime release, and when using Ubuntu 6.5 HWE kernel with kernel DKMS package(s) from Intel repo [2].

...(Unless the dkms driver from Intel repo adds new intel_pmt driver)....

It does:

# dpkg -L intel-i915-dkms | grep pmt
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/mfd/intel_pmt.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/Kconfig
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/Makefile
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/class.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/class.h
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/crashlog.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_class.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_class.h
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_crashlog.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/intel_pmt_telemetry.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/telemetry.c
/usr/src/intel-i915-dkms-1.24.1.11.240117.14/drivers/platform/x86/intel/pmt/telemetry.h

I am trying to find the commit which enables it.

Note that drm-tip (which I'm using) is the upstream DRM infra integration tree. It gets DRM driver (i915 etc.) changes before they go to Linus' upstream tree.

@qnixsynapse

qnixsynapse commented Jun 18, 2024

Are these enabled in your kernel builds?

Yes:

$ grep PMT config
CONFIG_INTEL_PMT_CLASS=m
CONFIG_INTEL_PMT_TELEMETRY=m
CONFIG_INTEL_PMT_CRASHLOG=m

It does:

hmm... I will try to build an arch linux package later. Thank you for your help!

@sumseq

sumseq commented Aug 1, 2024

Just adding here that having ARC support would be greatly appreciated as I use an Arc card to develop on before trying to run on the MAX 1550. Or, maybe at least have plans to support BattleMage GPUs whenever they are released?

@eero-t

eero-t commented Aug 20, 2024

Just adding here that having ARC support would be greatly appreciated

As commented above, XPUM should work fine with Arc. What metrics are available depends on what FW / kernel / L0 driver versions are installed.

as I use an Arc card to develop on before trying to run on the MAX 1550.

For Max, you need to use kernel and user-space drivers from Intel's driver repository: https://dgpu-docs.intel.com/driver/installation.html

Or, maybe at least have plans to support BattleMage GPUs whenever they are released?

They should also work with XPUM, as long as you have the correct kernel + user-space drivers installed.

@Qubitium

Qubitium commented Dec 16, 2024

We recently got an Arc B580 and the xpu-smi situation is strange. xpu-smi requires intel-level-zero-gpu, but this breaks pytorch xpu support. intel-gpu-compute is required for pytorch to work but requires libze-intel-gpu1, and this prevents xpu-smi from even being installable.

What is this level-zero driver and libze-intel-gpu1 + intel-gpu-compute? This is super confusing.

Even if I am able to install xpu-smi via the Intel Ubuntu 24.04 repo, xpu-smi shows no GPU discovered even though clinfo shows the GPU is there.

Coming from a single cuda-toolkit and single nvidia-smi, the current state of xpu management needs some consolidation.

Without intel-gpu-compute which requires libze-intel-gpu1 and uninstalls intel-level-zero-gpu, pytorch doesn't work/see xpu.

On 6.12.4 kernel xanmod with Ubuntu 24.04 os.

@eero-t

eero-t commented Dec 16, 2024

What is this level-zero driver and libze-intel-gpu1 + intel-gpu-compute? This is super confusing.

@Qubitium I've never heard of package named intel-gpu-compute, but it sounds like you're mixing packages from different repositories, with conflicting packages and dependencies [1].

Where did you get pytorch from, and what does apt policy clinfo xpu-smi intel-level-zero-gpu intel-gpu-compute libze-intel-gpu1 report?

Coming from a single cuda-toolkit and single nvidia-smi, the current state of xpu management needs some consolidation.

XPU releases seem to be built against Intel repo packages. I added separate #89 bug about building them against distro driver packages.

But all related Intel SW is open source, so you could also file a bug against the distro you're using, so that it adds a package for the missing project (xpu-smi) to its repositories.


I'm not sure what would be the best workaround in the meanwhile:

  • XPUM container images at DockerHub seem to be too old (from last year) to support Xe kernel driver (needed for B580)
  • While one could manually do things necessary to get the Intel repo xpu-smi package working with distro driver stack, it's a bit too messy to describe/debug here

One temporary solution could be to install distro level-zero dev package (libze-dev), build collectd v6-rc and run it with gpu_sysman plugin enabled, as that's IMHO easier to build than XPUM:


[1] Background info

The reason you do not see this "mess" with Nvidia is that Nvidia (CUDA) drivers are proprietary, so distros won't include their own versions of them (meaning things won't work out of the box, but the user can install the driver from Nvidia after accepting its license).

Intel driver projects use their own names for the driver packages they released. However, when distros eventually packaged those drivers, they chose different names for their own packages, and in some cases also for the library binaries compiled from those sources (sadly it happened both with Debian/Ubuntu and Fedora/RHEL).

Packages in Intel repo naturally use Intel package names for their dependencies, and distro packages use distro specific names for their dependencies.

Additionally, distro package drivers are built against upstream kernel uAPI, whereas Intel repository drivers are (or at least have been) built against the (earlier and for a long time, more extensive, out-of-tree) uAPI, provided by the DKMS kernel driver in Intel repository.

While that kernel API difference is nowadays mostly relevant for media driver, as a general rule, you should not mix (kernel and user-space) packages from different repositories, as they might not be configured with support for uAPI used by kernel driver from another repository.

@Qubitium

Qubitium commented Dec 17, 2024

@eero-t Thanks for the quick reply. Here are my Intel repos in /etc/apt/sources.list.d

cuda-ubuntu2404-x86_64.list

# deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main
intel-for-pytorch-gpu-dev.list

# https://repositories.intel.com/gpu/ubuntu noble client
intel-gpu-noble.list

## deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main
oneAPI.list

ubuntu.sources

Info on intel-gpu-compute

dpkg -l | grep intel-gpu-compute
ii  intel-gpu-compute  2441.19.0-2~24.04 amd64  Install Intel GPU compute runtime packages
apt policy intel-gpu-compute
intel-gpu-compute:
  Installed: 2441.19.0-2~24.04
  Candidate: 2441.19.0-2~24.04
  Version table:
 *** 2441.19.0-2~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
        100 /var/lib/dpkg/status
     2437.26.0-2~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
     2423.31.0-2~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages

intel-gpu-compute is from the intel-gpu-noble repo and looks to have been added based on the official how-to in the Intel Linux driver guide for Arc.

output of: apt policy clinfo xpu-smi intel-level-zero-gpu intel-gpu-compute libze-intel-gpu1

clinfo:
  Installed: 3.0.23.01.25-1build1
  Candidate: 3.0.23.01.25-1build1
  Version table:
 *** 3.0.23.01.25-1build1 500
        500 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages
        100 /var/lib/dpkg/status
xpu-smi:
  Installed: 1.2.39-66~24.04
  Candidate: 1.2.39-66~24.04
  Version table:
 *** 1.2.39-66~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
        100 /var/lib/dpkg/status
     1.2.35-64~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
     1.2.35-56~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
intel-level-zero-gpu:
  Installed: (none)
  Candidate: 1.3.29735.27-914~24.04
  Version table:
     1.3.29735.27-914~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
intel-gpu-compute:
  Installed: 2441.19.0-2~24.04
  Candidate: 2441.19.0-2~24.04
  Version table:
 *** 2441.19.0-2~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
        100 /var/lib/dpkg/status
     2437.26.0-2~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
     2423.31.0-2~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
libze-intel-gpu1:
  Installed: 24.39.31294.20-1032~24.04
  Candidate: 24.39.31294.20-1032~24.04
  Version table:
 *** 24.39.31294.20-1032~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
        100 /var/lib/dpkg/status
     24.35.30872.31-996~24.04 500
        500 https://repositories.intel.com/gpu/ubuntu noble/client amd64 Packages
     23.43.27642.40-1ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages
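
To spot mixed repositories at a glance, output like the above can be reduced to "which host does each installed package come from". This is a sketch only, assuming the apt policy text layout shown above (installed version marked with ***, source lines indented by at least 8 spaces); the function name installed_origins is made up for this example.

```python
from urllib.parse import urlparse


def installed_origins(policy_output):
    """Map each installed package to the repo hosts offering its installed version."""
    origins = {}
    package = None
    in_installed = False
    for raw in policy_output.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        fields = raw.split()
        if indent == 0 and raw.rstrip().endswith(":"):
            # A new "package:" header starts a new section.
            package, in_installed = raw.rstrip()[:-1], False
        elif package is None:
            continue
        elif fields[0] == "***":
            # Entry for the currently installed version.
            in_installed = True
            origins[package] = set()
        elif indent >= 8:
            # A source line; only count it for the installed version.
            if in_installed and len(fields) >= 2:
                host = urlparse(fields[1]).netloc
                if host:  # the local /var/lib/dpkg/status line has no host
                    origins[package].add(host)
        elif indent >= 5:
            # A version entry that is not the installed one.
            in_installed = False
    return origins
```

If the union of all hosts contains more than one entry, packages from different repositories are installed side by side, which is the situation the background note earlier in this thread warns about.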

@Qubitium

Qubitium commented Dec 17, 2024

clinfo | grep "Device Name"
  Device Name                                     Intel(R) Graphics [0xe20b]
  Device Name                                     Intel(R) FPGA Emulation Device
  Device Name                                     NVIDIA PG506-230
    Device Name                                   Intel(R) Graphics [0xe20b]
    Device Name                                   Intel(R) Graphics [0xe20b]
    Device Name                                   Intel(R) Graphics [0xe20b]
xpu-smi discovery
No device discovered

Note that I am now able to install xpu-smi from the intel-noble repo without conflicts with pytorch+xpu. Strange. Also, installing xpu-smi from apt did not cause pytorch+xpu compat issues. I did not change anything except wake the machine from sleep, but as shown above, clinfo sees the B580 while xpu-smi does not.

dmesg output with relevant info for intel arc b580

sudo dmesg | grep "96:00"
[    4.440816] pci 0000:96:00.0: [8086:e20b] type 00 class 0x030000 PCIe Endpoint
[    4.440834] pci 0000:96:00.0: BAR 0 [mem 0xe4000000-0xe4ffffff 64bit]
[    4.440847] pci 0000:96:00.0: BAR 2 [mem 0xe13f800000000-0xe13fbffffffff 64bit pref]
[    4.440870] pci 0000:96:00.0: ROM [mem 0xe5000000-0xe51fffff pref]
[    4.440954] pci 0000:96:00.0: PME# supported from D0 D3hot
[    6.518118] pci 0000:96:00.0: vgaarb: bridge control possible
[    6.518118] pci 0000:96:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[   21.306002] xe 0000:96:00.0: [drm] Found BATTLEMAGE (device ID e20b) display version 14.01 stepping B0
[   21.307637] xe 0000:96:00.0: [drm] Using GuC firmware from xe/bmg_guc_70.bin version 70.29.2
[   21.322426] xe 0000:96:00.0: [drm] Using GuC firmware from xe/bmg_guc_70.bin version 70.29.2
[   21.325224] xe 0000:96:00.0: [drm] Using HuC firmware from xe/bmg_huc.bin version 8.2.10
[   21.369744] xe 0000:96:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   21.370530] xe 0000:96:00.0: [drm] VISIBLE VRAM: 0x000e13f800000000, 0x0000000400000000
[   21.371189] xe 0000:96:00.0: [drm] VRAM[0, 0]: Actual physical size 0x0000000300000000, usable size exclude stolen 0x00000002fb800000, CPU accessible size 0x00000002fb800000
[   21.371191] xe 0000:96:00.0: [drm] VRAM[0, 0]: DPA range: [0x0000000000000000-300000000], io range: [0x000e13f800000000-e13fafb800000]
[   21.371192] xe 0000:96:00.0: [drm] Total VRAM: 0x000e13f800000000, 0x0000000300000000
[   21.371193] xe 0000:96:00.0: [drm] Available VRAM: 0x000e13f800000000, 0x00000002fb800000
[   21.389740] xe 0000:96:00.0: [drm] Finished loading DMC firmware i915/bmg_dmc.bin (v2.6)
[   21.566082] xe 0000:96:00.0: [drm] ccs2 fused off
[   21.566084] xe 0000:96:00.0: [drm] ccs3 fused off
[   21.587146] xe 0000:96:00.0: [drm] vcs1 fused off
[   21.587149] xe 0000:96:00.0: [drm] vcs3 fused off
[   21.587150] xe 0000:96:00.0: [drm] vcs4 fused off
[   21.587150] xe 0000:96:00.0: [drm] vcs5 fused off
[   21.587150] xe 0000:96:00.0: [drm] vcs6 fused off
[   21.587151] xe 0000:96:00.0: [drm] vcs7 fused off
[   21.587151] xe 0000:96:00.0: [drm] vecs2 fused off
[   21.587152] xe 0000:96:00.0: [drm] vecs3 fused off
[   21.587155] xe 0000:96:00.0: [drm] gsccs disabled due to lack of FW
[   21.646506] [drm] Initialized xe 1.1.0 for 0000:96:00.0 on minor 13
[   21.701619] xe 0000:96:00.0: [drm] Cannot find any crtc or sizes
[   21.813685] xe 0000:96:00.0: [drm] Cannot find any crtc or sizes
[   21.869647] snd_hda_intel 0000:97:00.0: bound 0000:96:00.0 (ops i915_audio_component_bind_ops [xe])
[   21.869693] xe 0000:96:00.0: [drm] Cannot find any crtc or sizes

This line looks concerning. Lack of FW (firmware)?

[   21.587155] xe 0000:96:00.0: [drm] gsccs disabled due to lack of FW

@Qubitium

Qubitium commented Dec 17, 2024

Ok. Found the conflict. The issue was with the deb xpumanager_1.2.39_20240906.085820.11f3c29a.u24.04_amd64.deb provided by the repo:

dpkg -i xpumanager_1.2.39_20240906.085820.11f3c29a.u24.04_amd64.deb
(Reading database ... 110342 files and directories currently installed.)
Preparing to unpack xpumanager_1.2.39_20240906.085820.11f3c29a.u24.04_amd64.deb ...
Unpacking xpumanager (1.2.39-20240906.085820.11f3c29a~u24.04) ...
dpkg: dependency problems prevent configuration of xpumanager:
 xpumanager depends on intel-gsc (>= 0.8.4); however:
  Package intel-gsc is not installed.
  Version of intel-gsc on system, provided by libigsc0:amd64, is <none>.
 xpumanager depends on intel-level-zero-gpu (>= 1.3.23726); however:
  Package intel-level-zero-gpu is not installed.
  Version of intel-level-zero-gpu on system, provided by libze-intel-gpu1:amd64, is <none>.

Trying to run apt install intel-level-zero-gpu intel-gsc now creates conflicts, and the packages cannot be installed cleanly.

 apt install intel-level-zero-gpu intel-gsc
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
intel-level-zero-gpu is already the newest version (1.3.29735.27-914~24.04).
The following packages were automatically installed and are no longer required:
  intel-metrics-discovery intel-metrics-library libigsc0 libmetee4
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
  intel-gsc
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/60.9 kB of archives.
After this operation, 211 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
(Reading database ... 110342 files and directories currently installed.)
Preparing to unpack .../intel-gsc_0.8.16+88~u24.04_amd64.deb ...
Unpacking intel-gsc (0.8.16+88~u24.04) ...
dpkg: error processing archive /var/cache/apt/archives/intel-gsc_0.8.16+88~u24.04_amd64.deb (--unpack):
 trying to overwrite '/usr/lib/x86_64-linux-gnu/libigsc.so.0', which is also in package libigsc0 0.9.3-104~u24.04
Errors were encountered while processing:
 /var/cache/apt/archives/intel-gsc_0.8.16+88~u24.04_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
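
The key line in that log is the "trying to overwrite" one: two packages ship the same file. A small sketch for pulling such conflicts out of a longer dpkg/apt log follows; the regular expression is modeled only on the message shown above, and the function name file_conflicts is made up for this example.

```python
import re

# Modeled on the dpkg message in the log above.
OVERWRITE_RE = re.compile(
    r"trying to overwrite '(?P<path>[^']+)', "
    r"which is also in package (?P<package>\S+) (?P<version>\S+)"
)


def file_conflicts(dpkg_output):
    """Return a list of dicts describing file ownership conflicts."""
    return [m.groupdict() for m in OVERWRITE_RE.finditer(dpkg_output)]
```

Each entry names the contested file and the already-installed package that owns it, which is the information needed for a packaging bug report.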

@Qubitium

@eero-t Sorry for the multiple posts, I need to break down the long outputs and not have to edit super long single msg.

Pytorch+xpu provided by the intel pytorch repo does not work without intel-gpu-compute. As shown below:

# test.py
import torch

print("cuda", torch.cuda.is_available())
print("xpu",  torch.xpu.is_available())

Check pytorch + xpu

(base) root# python test.py
cuda False
/root/miniconda3/lib/python3.11/site-packages/torch/xpu/__init__.py:60: UserWarning: XPU device count is zero! (Triggered internally at /pytorch/c10/xpu/XPUFunctions.cpp:50.)
  return torch._C._xpu_getDeviceCount()
xpu False

Now install intel-gpu-compute

(base) root@gpu-xl:~# apt install intel-gpu-compute
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libze-intel-gpu1
The following packages will be REMOVED:
  intel-level-zero-gpu
The following NEW packages will be installed:
  intel-gpu-compute libze-intel-gpu1
0 upgraded, 2 newly installed, 1 to remove and 0 not upgraded.
Need to get 2335 kB of archives.
After this operation, 368 kB disk space will be freed.
Do you want to continue? [Y/n] y
Get:1 https://repositories.intel.com/gpu/ubuntu noble/client amd64 libze-intel-gpu1 amd64 24.39.31294.20-1032~24.04 [2333 kB]
Get:2 https://repositories.intel.com/gpu/ubuntu noble/client amd64 intel-gpu-compute amd64 2441.19.0-2~24.04 [2230 B]
Fetched 2335 kB in 2s (998 kB/s)       
(Reading database ... 110216 files and directories currently installed.)
Removing intel-level-zero-gpu (1.3.29735.27-914~24.04) ...
Selecting previously unselected package libze-intel-gpu1.
(Reading database ... 110211 files and directories currently installed.)
Preparing to unpack .../libze-intel-gpu1_24.39.31294.20-1032~24.04_amd64.deb ...
Unpacking libze-intel-gpu1 (24.39.31294.20-1032~24.04) ...
Selecting previously unselected package intel-gpu-compute.
Preparing to unpack .../intel-gpu-compute_2441.19.0-2~24.04_amd64.deb ...
Unpacking intel-gpu-compute (2441.19.0-2~24.04) ...
Setting up libze-intel-gpu1 (24.39.31294.20-1032~24.04) ...
Setting up intel-gpu-compute (2441.19.0-2~24.04) ...
Processing triggers for libc-bin (2.39-0ubuntu8.3) ...
Scanning processes...                                                                                                                                                           

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

Check again

(base) root# python test.py
cuda False
xpu True

I am as confused as you are about intel-gpu-compute and it is not even documented as required for pytorch+xpu but for me, it is mandatory.
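
When debugging this kind of setup, it can be useful to tell apart "this PyTorch build has no XPU backend" from "the backend is there but sees zero devices" (the warning in the first run above points at the latter). A minimal sketch, assuming a PyTorch build that exposes torch.xpu.is_available() and torch.xpu.device_count(); the function name xpu_status and the returned keys are made up for this example.

```python
import importlib


def xpu_status(torch_module=None):
    """Summarize XPU visibility as a dict instead of raising on older builds.

    Pass a module object for testing; by default the real torch is imported.
    """
    if torch_module is None:
        try:
            torch_module = importlib.import_module("torch")
        except ImportError:
            return {"torch": False, "xpu_backend": False, "devices": 0}
    xpu = getattr(torch_module, "xpu", None)
    if xpu is None or not hasattr(xpu, "is_available"):
        # This build does not ship the XPU backend at all.
        return {"torch": True, "xpu_backend": False, "devices": 0}
    # Backend present; zero devices usually points at the driver stack.
    count = xpu.device_count() if xpu.is_available() else 0
    return {"torch": True, "xpu_backend": True, "devices": count}
```

A result with xpu_backend True and devices 0 suggests looking at the Level Zero driver packages rather than at PyTorch itself.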

@eero-t

eero-t commented Dec 17, 2024

[ 21.587155] xe 0000:96:00.0: [drm] gsccs disabled due to lack of FW

GSC [1] FW is needed only for viewing protected content, i.e. media related thing, not compute or metrics.

Do you have FW package installed from Intel repo, or from Ubuntu?

Battlemage GSC FW is missing from upstream, so at least Ubuntu package cannot include it. I filed bug about that:
https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3866

[1] more info: https://lore.kernel.org/all/[email protected]/T/

@eero-t

eero-t commented Dec 17, 2024

The following additional packages will be installed:
  libze-intel-gpu1
The following packages will be REMOVED:
  intel-level-zero-gpu
The following NEW packages will be installed:
  intel-gpu-compute libze-intel-gpu1

Both intel-level-zero-gpu and libze-intel-gpu1 are Intel GPU backends for level-zero (L0), built from the https://github.com/intel/compute-runtime/ project. I.e. although they have different package and file names, they provide the same functionality.

In theory, they should be interchangeable as L0 frontend should abstract what backend is loaded, and be able to load either of them.

Higher level app and lib packages should not be depending on the L0 backend, only on frontend, so this seems like intel-gpu-compute packaging bug...

I am as confused as you are about intel-gpu-compute and it is not even documented as required for pytorch+xpu but for me, it is mandatory.

What does its package description state (dpkg -s intel-gpu-compute)?

@eero-t

eero-t commented Dec 17, 2024

Ok. Found the conflict. The issue was with the deb xpumanager_1.2.39_20240906.085820.11f3c29a.u24.04_amd64.deb:
...

Please add info from your comment to bug #89.

(I wonder what XPUM needs the GSC lib for, and why that also depends directly on L0 backend...)
