Use NVIDIA's official pynvml binding #107
From @XuehaiPan's comment #102 (comment):
|
I don't like pinning the exact version because it will cause many other problems (e.g., an unexpected version still might be installed despite the pinned dependency specification). Actually With old driver versions (e.g. 430.64), however, getting the function pointer to Not sure what we need to patch |
Yes, that will cause wrong type casting. And we will need to maintain |
I think NVIDIA/go-nvml#21 and NVIDIA/go-nvml#22 could be helpful. |
Since 2021, NVIDIA provides an official python binding `pynvml`, which should replace the existing third-party (unmaintained) one. The APIs are mostly the same, but it ships with some recent features such as MIG support. To ensure that the correct `pynvml` package is installed, we perform some sanity checks upon importing the library.
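For illustration, an import-time sanity check of this kind could look roughly like the sketch below. The helper names `looks_like_official_pynvml` and `validate_pynvml` are hypothetical, and the heuristic (probing for the `_v2` process API, which the old fork predates) is an assumption for this sketch rather than the exact check gpustat performs. It also only makes sense for the `nvidia-ml-py` versions that still ship the `_v2` function.

```python
import types


def looks_like_official_pynvml(module) -> bool:
    """Heuristic (assumption): the official binding exposes the *_v2
    process API, while the legacy nvidia-ml-py3 fork does not."""
    return hasattr(module, "nvmlDeviceGetComputeRunningProcesses_v2")


def validate_pynvml(module) -> None:
    """Raise ImportError if the given module looks like the legacy fork."""
    if not looks_like_official_pynvml(module):
        raise ImportError(
            "pynvml appears to be outdated or unofficial; "
            "please uninstall nvidia-ml-py3 and install nvidia-ml-py."
        )


# Stand-in objects so the sketch runs without a GPU or pynvml installed.
official_like = types.SimpleNamespace(
    nvmlDeviceGetComputeRunningProcesses_v2=lambda handle: []
)
legacy_like = types.SimpleNamespace()

validate_pynvml(official_like)  # passes silently
```

In real use one would pass the imported `pynvml` module instead of the stand-in objects.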
Copied from https://forums.developer.nvidia.com/t/pypi-nvidia-ml-py-issue-reports-for-nvidia-ml-py/196506/2 There is another breaking change: nvidia-ml-py 11.515.0 (Jan 12, 2022) now even introduces v3 ( I was thinking of dropping support for very old driver versions (430.64 for Ubuntu 16.04, no longer supported in 2022), but now this would be a big problem for us. A dirty workaround of falling back to legacy functions for non-latest drivers must be made; otherwise no process information would be available in the vast majority of use cases. Or I could simply pin nvidia-ml-py again (which I don't like, though): |
I am going to drop support for old, legacy nvidia driver versions, so that we don't need workarounds to support both v1 and v2 versions, or to pin specific nvidia-ml-py versions. The Regarding yet another breaking change introduced in v3 APIs (Jan 12, 2022) since nvidia-ml-py
|
Requires nvidia-ml-py >= 11.450.129 and nvidia drivers >= 450.66. We also pin the version of nvidia-ml-py < 11.515 against a breaking change due to v3 APIs (from 510.39.01). See #107 for more details. We no longer need to exclude 375.* versions as the wrong package has been expelled from the pypi repository.
Requires nvidia-ml-py >= 11.450.129 and nvidia drivers >= 450.66. We also pin the version of nvidia-ml-py < 11.510 against a breaking change due to v3 APIs (from 510.39.01). The latest compatible version would be nvidia-ml-py==11.495.46. See #107 for more details. We no longer need to exclude 375.* versions as the wrong package has been expelled from the pypi repository.
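To make the version window concrete, here is a minimal sketch that checks whether a given nvidia-ml-py version string falls into the range stated above (`>= 11.450.129` and `< 11.510`). The helpers `parse_version` and `is_compatible` are hypothetical names, and the parsing assumes plain dotted integers, which is all this sketch needs.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '11.495.46' into (11, 495, 46)."""
    return tuple(int(part) for part in text.split("."))


MIN_VERSION = parse_version("11.450.129")  # inclusive lower bound
MAX_VERSION = parse_version("11.510")      # exclusive upper bound


def is_compatible(version: str) -> bool:
    """True if the version lies within the pinned range described above."""
    return MIN_VERSION <= parse_version(version) < MAX_VERSION


for candidate in ("7.352.0", "11.450.129", "11.495.46", "11.510.69"):
    print(candidate, is_compatible(candidate))
```

Tuple comparison treats `(11, 510)` as smaller than `(11, 510, 69)`, so `11.510.69` is excluded by the upper bound, which is the same outcome as the `< 11.510` requirement specifier.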
At the moment, we will pin the nvidia-ml-py version at 11.495.46 due to the breaking changes (
For more modern GPU + feature support, we may need to depend on the latest pynvml version, but this will be supported in later gpustat versions (say 1.x), adding some fallback mechanisms for backward compatibility. |
There is a version conflict problem with other libraries that pin pynvml version, see XuehaiPan/nvitop#23. (From my end, this needs to be resolved nicely before releasing) |
Hi,
If gpustat requires a certain version of pynvml, should that requirement be included in |
@xwjiang2010 It is already specified in setup.py, of course (I don't use requirements.txt): https://github.com/wookayin/gpustat/blob/master/setup.py#L79 Did you follow the instruction after you ran into the error? If you installed gpustat, a proper version of |
It seems like this is related to comet-ml/issue-tracking#481 -- I hope the two projects can align on one common pynvml to use, so they can be used together :) |
The conflict between https://github.com/nicolargo/nvidia-ml-py3/network/dependents?dependent_type=PACKAGE&package_id=UGFja2FnZS01MjM0NzAyMQ%3D%3D shows at least 75+ such packages. In such cases where
However, things can be complicated when it comes to non-interactive CI jobs (example: ray-project/ray#26295). |
gpustat 1.0.0+ requires pynvml to be an official one (nvidia-ml-py) rather than an obsolete third-party package nvidia-ml-py3 (see #107), but this may cause some inconvenient conflict with other third-party packages that depend on the legacy nvidia-ml-py3. As a temporary workaround, we introduce an environment variable `ALLOW_LEGACY_PYNVML` whose use will result in bypassing pynvml validation. With this flag turned on, gpustat may work with the legacy pynvml library, but with a possibility that it may produce wrong results on running process information. e.g., ALLOW_LEGACY_PYNVML=1 gpustat
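A rough sketch of how such an environment-variable bypass might be wired in is shown below. Only the variable name `ALLOW_LEGACY_PYNVML` comes from the commit message; the function names and the rule that any non-empty value enables the bypass are assumptions made for this example.

```python
import os


def legacy_pynvml_allowed(environ=None) -> bool:
    """True if ALLOW_LEGACY_PYNVML is set to a non-empty value (assumed rule)."""
    environ = os.environ if environ is None else environ
    return environ.get("ALLOW_LEGACY_PYNVML", "") != ""


def check_pynvml(is_official: bool, environ=None) -> str:
    """Accept the official binding, tolerate the legacy one only on opt-in."""
    if is_official:
        return "ok"
    if legacy_pynvml_allowed(environ):
        # Validation is bypassed; process information may be wrong.
        return "legacy-allowed"
    raise ImportError(
        "Legacy pynvml detected; install nvidia-ml-py, "
        "or set ALLOW_LEGACY_PYNVML=1 to bypass this check."
    )
```

Passing `environ` explicitly keeps the check easy to exercise in unit tests without touching the real process environment.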
4400d64 introduces the use of environment variable |
This commit adds unit tests for #107, where legacy and supported nvidia-drivers would behave differently on process-related APIs (e.g., nvmlDeviceGetComputeRunningProcesses_v2). Note: As already pointed out in #107, this test (and gpustat's process information) fails with nvidia-ml-py > 11.495.46 breaking the backward compatibility.
pynvml 11.510.69 broke backward compatibility by removing `nvml.nvmlDeviceGetComputeRunningProcesses_v2`, which is replaced by the v3 API (`nvml.nvmlDeviceGetComputeRunningProcesses_v3`), but the v3 function does not exist in nvidia drivers older than 510.39.01. Therefore we pinned the pynvml version at 11.495.46 in gpustat v1.0 (#107), but we actually have to use recent pynvml versions for "latest" or modern NVIDIA drivers. To make compute/graphics process information work correctly when a combination of old nvidia drivers (`< 510.39`) AND `pynvml >= 11.510.69` is used, we need to monkey-patch pynvml functions in our custom manner such that, for instance, when the v3 API is introduced, we can simply fall back to the v2 APIs to retrieve the process information.
#143 relaxes (maximum) version requirement; there will be only minimum requirement since v1.1 |
Since 2021, NVIDIA provides an official python binding `pynvml`: https://pypi.org/project/nvidia-ml-py/#history, which should replace a third-party community fork nvidia-ml-py3 that we have been using.
The main motivations are (1) to use an official library and (2) to add MIG support.
See #102 for more details.
Need to test whether:
(see Add MIG support #102 (comment))
(see Faile to run ``gpustat --debug'': pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found #90)
/cc @XuehaiPan @Stonesjtu
Important Changes
The official python bindings `nvidia-ml-py` need to be installed, not `nvidia-ml-py3`. When the legacy one is installed for some reason, an error will occur. To fix this error, please uninstall `nvidia-ml-py3` and install `nvidia-ml-py<=11.495.46` (please follow the instruction in the error message). Or you can [bypass] the validation if you really want.
Due to compatibility reasons, NVIDIA Driver version needs to be 450.66 or higher.