
Use NVIDIA's official pynvml binding #107

Merged · 2 commits · Jul 4, 2022

Conversation

@wookayin (Owner) commented Aug 5, 2021

Since 2021, NVIDIA provides an official Python binding pynvml (https://pypi.org/project/nvidia-ml-py/#history), which should replace the third-party community fork nvidia-ml-py3 that we have been using.

The main motivations are (1) to use an official library and (2) to add MIG support.
See #102 for more details.

Need to test whether:

/cc @XuehaiPan @Stonesjtu

Important Changes

  • The official Python bindings nvidia-ml-py need to be installed, not nvidia-ml-py3. If the legacy package is installed for some reason, an error will occur:

    ImportError: pynvml is missing or an outdated version is installed.

  • To fix this error, please uninstall nvidia-ml-py3 and install nvidia-ml-py<=11.495.46 (follow the instructions in the error message). Or you can [bypass] the validation if you really want. A quick way to check which package provides pynvml is sketched below.

  • For compatibility reasons, the NVIDIA driver version needs to be 450.66 or higher.
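
For anyone unsure which package is actually providing the `pynvml` module, here is a minimal sketch (not part of gpustat; assumes Python 3.8+ for `importlib.metadata`). If both packages are installed, which one's `pynvml.py` ends up being imported is effectively undefined, which is exactly the situation the validation guards against.

```python
# Minimal sketch (not gpustat code): check which distribution provides the
# pynvml module. Assumes Python 3.8+ for importlib.metadata.
from importlib.metadata import PackageNotFoundError, version

for dist in ("nvidia-ml-py", "nvidia-ml-py3"):
    try:
        print(f"{dist}: {version(dist)} installed")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```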

@wookayin added this to the 1.0 milestone on Aug 5, 2021
@wookayin (Owner, Author) commented Aug 5, 2021

From @XuehaiPan's comment #102 (comment):

| v1 API | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | works, but without CI ID / GI ID | works, but without CI ID / GI ID |
| nvidia-ml-py>=11.450.129 | no exceptions in Python, but gets wrong results (subscript out of range in the C library) | no exceptions in Python, but gets wrong results (subscript out of range in the C library) |

| v2 API | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | function not found | no exceptions in Python, but gets wrong results (subscript out of range in the C library) |
| nvidia-ml-py>=11.450.129 | function not found | works with correct CI ID / GI ID |

@wookayin (Owner, Author) commented Aug 5, 2021

> We should freeze this version as nvidia-ml-py==11.450.51 before we can find a way to patch nvmlProcessInfo_t and nvmlDeviceGetComputeRunningProcesses.

I don't like pinning an exact version because it will cause many other problems (e.g., an unexpected version might still be installed despite the pinned dependency specification).

Actually, nvidia-ml-py==11.460.79 (which is the latest and >=11.450.129) works fine for me. It gives me correct process results with the latest driver version (I tested with 470.57.02).

With old driver versions (e.g. 430.64), however, getting the function pointer to nvmlDeviceGetComputeRunningProcesses_v2 will fail according to your report. I'll need to add a patch in this PR; the details of how to do this elegantly are something we need to figure out.

Not sure what we need to patch in nvmlProcessInfo_t. Is it because the struct size has changed?

@XuehaiPan (Contributor) commented Aug 5, 2021

> Is it because the struct size has changed?

Yes, that causes wrong type casting. We would need to maintain nvmlDeviceGet{Compute,Graphics}RunningProcesses on our own to keep backward compatibility. I have tried to hack the function pointer cache, but that gives wrong results (wrong PIDs, memory usage, etc.). However, nvidia-ml-py==11.450.51 always works fine on both old and new drivers without any patching effort, just like the unofficial bindings do (though that indeed drops MIG support).
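
To illustrate the struct-size issue, here is an approximation of the two layouts (field names taken from the NVML headers; these are not the exact pynvml classes). The v2 process-info struct adds the MIG instance IDs, so it is larger than the v1 struct; if the driver fills a buffer with records of one size but the bindings read it back with the other layout, every record after the first is read at the wrong offset and the PIDs and memory usage come out garbled.

```python
# Illustrative sketch: approximate v1 vs v2 nvmlProcessInfo_t layouts.
# (Field layout approximated from NVML headers; not the exact pynvml classes.)
import ctypes

class ProcessInfoV1(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint),
        ("usedGpuMemory", ctypes.c_ulonglong),
    ]

class ProcessInfoV2(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint),
        ("usedGpuMemory", ctypes.c_ulonglong),
        ("gpuInstanceId", ctypes.c_uint),      # MIG GI ID (new in v2)
        ("computeInstanceId", ctypes.c_uint),  # MIG CI ID (new in v2)
    ]

# On a typical 64-bit platform: 16 vs 24 bytes, so an array of one layout
# cannot be reinterpreted as the other without corrupting the records.
print(ctypes.sizeof(ProcessInfoV1), ctypes.sizeof(ProcessInfoV2))
```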

@XuehaiPan (Contributor) commented:

I think NVIDIA/go-nvml#21 and NVIDIA/go-nvml#22 could be helpful.

Since 2021, NVIDIA provides an official python binding `pynvml`,
which should replace the existing (unmaintained) third-party one.
The APIs are mostly the same, but it ships with some recent features
such as MIG support.

To ensure a correct `pynvml` package is installed, we perform
some sanity check upon importing the library.
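
The sanity check mentioned in this commit message could look roughly like the following (a minimal sketch, not the actual code in gpustat): the official nvidia-ml-py (>= 11.450.129) exposes the versioned process APIs, while the legacy nvidia-ml-py3 module does not.

```python
# Rough sketch of an import-time sanity check (not gpustat's exact code):
# the official nvidia-ml-py (>= 11.450.129) exposes the _v2 process API,
# whereas the legacy nvidia-ml-py3 module does not.
import pynvml

if not hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
    raise ImportError(
        "pynvml is missing or an outdated version is installed. "
        "Please uninstall nvidia-ml-py3 and install nvidia-ml-py instead."
    )
```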
@wookayin (Owner, Author) commented Mar 15, 2022

Copied from https://forums.developer.nvidia.com/t/pypi-nvidia-ml-py-issue-reports-for-nvidia-ml-py/196506/2

There is another breaking change: nvidia-ml-py 11.515.0 (Jan 12, 2022) now introduces v3 APIs (nvmlDeviceGetComputeRunningProcesses_v3, etc.). I can confirm this breaks older driver versions, i.e., anything before Jan 2022 (for instance, 470.86, which was released only 4 months ago!), because the low-level function nvmlDeviceGetGraphicsRunningProcesses_v3 does not exist in those NVIDIA drivers.

I was thinking of dropping support for very old driver versions (430.64 for Ubuntu 16.04, no longer supported in 2022), but now this would be a big problem for us. A dirty workaround that falls back to the legacy functions for non-latest drivers must be added; otherwise no process information would be available in the vast majority of use cases.

Or I could simply pin nvidia-ml-py again (which I don't like): nvidia-ml-py<11.515.0, given that I no longer trust nvidia-ml-py to keep backward compatibility.

@wookayin (Owner, Author) commented Apr 30, 2022

I am going to drop support for old, legacy NVIDIA driver versions, so that we don't need a workaround to support both the v1 and v2 APIs, or to pin specific nvidia-ml-py versions.

The nvmlDeviceGetComputeRunningProcesses_v2 function was introduced in pynvml 11.450.129 and NVIDIA driver 450.66 (released August 2020). Without the workaround mentioned above, gpustat won't be able to display process information for NVIDIA drivers older than 450.66. I might add legacy driver support back in later versions, but since normal users are expected to use reasonably recent NVIDIA drivers (something newer than two years old), this should be fine.

There is yet another breaking change introduced by the v3 APIs (Jan 12, 2022) since nvidia-ml-py 11.515.0 (and 11.510.69), which breaks all drivers older than 510.39.01 (see the nvml.h diff).

Currently 11.515.0 is yanked from PyPI, so it may not cause an immediate problem, but we will have to pin nvidia-ml-py at <11.515.0 because it is very likely that the v3 API will break on existing NVIDIA driver versions. The latest nvidia-ml-py, 11.510.69 (Apr 18, 2022), has already broken the v2 API.

wookayin added a commit that referenced this pull request Apr 30, 2022
Requires nvidia-ml-py >= 11.450.129 and nvidia drivers >= 450.66.

We also pin the version of nvidia-ml-py < 11.515 against a breaking
change due to v3 APIs (from 510.39.01). See #107 for more details.

We no longer need to exclude 375.* versions as the wrong package
has been expelled from the pypi repository.
wookayin added a commit that referenced this pull request May 1, 2022
Requires nvidia-ml-py >= 11.450.129 and nvidia drivers >= 450.66.

We also pin the version of nvidia-ml-py < 11.510 against a breaking
change due to v3 APIs (from 510.39.01). The latest compatible version
would be nvidia-ml-py==11.495.46. See #107 for more details.

We no longer need to exclude 375.* versions as the wrong package
has been expelled from the pypi repository.
@wookayin (Owner, Author) commented Jul 4, 2022

At the moment, we will pin the nvidia-ml-py version at 11.495.46 due to the breaking changes (nvmlDeviceGetComputeRunningProcesses_v3) introduced in 11.510.x.

  • If an incompatible version such as pynvml>=11.510.69 is installed, process information will not be available, showing "(Not Supported)".
  • This also requires a graphics driver version no lower than 450.66 (released Aug 18, 2020) to correctly display process information.
  • All that said, with this combination (Driver >= 450.66, pynvml>=11.450.129,<=11.495.46), gpustat should have no problems in most situations; see the setup.py sketch below.
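
For reference, the pin described above would look roughly like this in setup.py (illustrative; gpustat's actual setup.py is the authoritative source for the exact specifier):

```python
# Sketch of the dependency pin described above (illustrative; gpustat's
# actual setup.py is the authoritative source).
from setuptools import setup

setup(
    name="gpustat",
    install_requires=[
        # Official NVIDIA bindings; the upper bound avoids the v3 APIs
        # (11.510+) that break drivers older than 510.39.01.
        "nvidia-ml-py>=11.450.129,<=11.495.46",
    ],
)
```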

For more modern GPU and feature support, we may need to depend on the latest pynvml version, but that will come in later gpustat versions (say 1.x) with some fallback mechanisms for backward compatibility.

@wookayin (Owner, Author) commented Jul 4, 2022

There is a version-conflict problem with other libraries that pin the pynvml version; see XuehaiPan/nvitop#23. (From my end, this needs to be resolved nicely before releasing.)

@wookayin mentioned this pull request on Jul 5, 2022
@xwjiang2010 commented Jul 5, 2022

Hi,
I am seeing the following after `pip install gpustat` (1.0.0rc1):

ImportError: pynvml is missing or an outdated version is installed. 

We require nvidia-ml-py>=11.450.129; see GH-107 for more details.
Your pynvml installation: <module 'pynvml' from '/opt/miniconda/lib/python3.7/site-packages/pynvml.py'>

-----------------------------------------------------------
Please reinstall `gpustat`:

$ pip install --force-reinstall gpustat

if it still does not fix the problem, manually fix nvidia-ml-py installation:

$ pip uninstall nvidia-ml-py3
$ pip install --force-reinstall 'nvidia-ml-py<=11.495.46'

If gpustat requires certain version of pynvml, should that requirement be included in requirements.txt?

@wookayin (Owner, Author) commented Jul 6, 2022

@xwjiang2010 It is already specified in setup.py, of course (I don't use requirements.txt): https://github.com/wookayin/gpustat/blob/master/setup.py#L79

Did you follow the instructions after you ran into the error? If you installed gpustat, a proper version of nvidia-ml-py should have been installed. Such an error can happen due to a dependency conflict with other packages.

@pcmoritz commented:

It seems like this is related to comet-ml/issue-tracking#481 -- I hope the two projects can align on one common pynvml to use, so they can be used together :)

@wookayin (Owner, Author) commented:

> There is a version conflict problem with other libraries that pin pynvml version, see XuehaiPan/nvitop#23. (From my end, this needs to be resolved nicely before releasing)

> It seems like this is related to comet-ml/issue-tracking#481 -- I hope the two projects can align on one common pynvml to use, so they can be used together :)

The conflict between nvidia-ml-py3 (the obsolete one) and NVIDIA's official one (nvidia-ml-py) can be problematic because there are still many packages and projects in the wild (not all of them actively maintained) that depend on nvidia-ml-py3. It is still possible for both packages to be installed, and in that case which version of the pynvml module ends up installed appears to be undefined.

https://github.com/nicolargo/nvidia-ml-py3/network/dependents?dependent_type=PACKAGE&package_id=UGFja2FnZS01MjM0NzAyMQ%3D%3D lists 75+ such packages.

In such cases, where nvidia-ml-py3 gets resolved in the dependency chain, users will need to uninstall nvidia-ml-py3 and install the correct package manually (as per the instructions in the error message):

$ pip uninstall nvidia-ml-py3
$ pip install --force-reinstall 'nvidia-ml-py<=11.495.46'

However, things can be complicated when it comes to non-interactive CI jobs (example: ray-project/ray#26295).

wookayin added a commit that referenced this pull request Aug 4, 2022
gpustat 1.0.0+ requires pynvml to be an official one (nvidia-ml-py)
rather than an obsolete third-party package nvidia-ml-py3 (see #107),
but this may cause some inconvenient conflict with other third-party
packages that depend on the legacy nvidia-ml-py3.

As a temporary workaround, we introduce an environment variable
`ALLOW_LEGACY_PYNVML` that, when set, bypasses the pynvml validation.
With this flag turned on, gpustat may work with the legacy pynvml
library, but it may produce wrong results for running-process
information.

e.g., ALLOW_LEGACY_PYNVML=1 gpustat
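
The bypass described in this commit message might be implemented along these lines (an illustrative sketch, not the actual code in 4400d64):

```python
# Illustrative sketch of the ALLOW_LEGACY_PYNVML bypass (not the actual
# implementation in gpustat): skip the validation error when the flag is set.
import os
import warnings

import pynvml

allow_legacy = os.getenv("ALLOW_LEGACY_PYNVML", "") != ""
has_official_api = hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2")

if not has_official_api:
    if not allow_legacy:
        raise ImportError(
            "pynvml is missing or an outdated version is installed; "
            "see GH-107, or set ALLOW_LEGACY_PYNVML=1 to bypass this check."
        )
    warnings.warn("Using a legacy pynvml; process information may be wrong.")
```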
@wookayin (Owner, Author) commented Aug 4, 2022

4400d64 introduces the environment variable ALLOW_LEGACY_PYNVML to bypass pynvml validation. I won't document this upfront, but more information can be found in the commit message. I believe @mattip may find this useful in case gpustat is broken due to a conflicting dependency on pynvml caused by third-party packages that are out of our control.

@wookayin deleted the pynvml-nvidia branch on October 16, 2022
wookayin added a commit that referenced this pull request Nov 27, 2022
This commit adds unit tests for #107, where legacy and supported
nvidia drivers would behave differently on process-related APIs (e.g.,
nvmlDeviceGetComputeRunningProcesses_v2).

Note: As already pointed out in #107, this test (and gpustat's process
information) fails with nvidia-ml-py > 11.495.46 breaking the backward
compatibility.
wookayin added a commit that referenced this pull request Nov 27, 2022
pynvml 11.510.69 has broken backward compatibility by removing
`nvml.nvmlDeviceGetComputeRunningProcesses_v2`, which is replaced by the
v3 APIs (`nvml.nvmlDeviceGetComputeRunningProcesses_v3`); but this
function does not exist in nvidia drivers older than 510.39.01.

Therefore we pinned the pynvml version at 11.495.46 in gpustat v1.0 (#107),
but we actually have to use recent pynvml versions for "latest" or modern
NVIDIA drivers. To make compute/graphics process information work
correctly when a combination of old nvidia drivers (`< 510.39`) AND
`pynvml >= 11.510.69` is used, we need to monkey-patch pynvml functions
in our custom manner such that, for instance, when the v3 API is
introduced but unsupported by the driver, we can simply fall back to the
v2 APIs to retrieve the process information.
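
The fallback idea could be sketched as follows (a simplified illustration, not the actual patch in gpustat, which also has to handle the differing nvmlProcessInfo_t struct layouts, not just the missing entry points):

```python
# Simplified sketch of the v3 -> v2 -> v1 fallback (illustration only; the
# real patch must also account for the different nvmlProcessInfo_t layouts).
import pynvml

def get_compute_processes(handle):
    """Try the newest process-listing API first, falling back on older ones."""
    for name in ("nvmlDeviceGetComputeRunningProcesses_v3",
                 "nvmlDeviceGetComputeRunningProcesses_v2",
                 "nvmlDeviceGetComputeRunningProcesses"):
        fn = getattr(pynvml, name, None)
        if fn is None:
            continue  # this pynvml version does not expose the binding
        try:
            return fn(handle)
        except pynvml.NVMLError_FunctionNotFound:
            continue  # the driver's libnvidia-ml.so lacks this entry point
    raise pynvml.NVMLError(pynvml.NVML_ERROR_FUNCTION_NOT_FOUND)
```

Note that in recent pynvml releases the unversioned name is just an alias of the newest versioned call, so a real fix likely has to go a level deeper and override how the underlying NVML function pointer is looked up, rather than only trying different Python-level names.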
@wookayin (Owner, Author) commented Mar 2, 2023

#143 relaxes the (maximum) version requirement; since v1.1 there will be only a minimum requirement.
