
Use NVIDIA's official pynvml binding #107

Merged · 2 commits · Jul 4, 2022

Conversation

@wookayin (Owner) commented Aug 5, 2021

Since 2021, NVIDIA provides an official Python binding pynvml (https://pypi.org/project/nvidia-ml-py/#history), which should replace the third-party community fork nvidia-ml-py3 that we have been using.

The main motivations are (1) to use an official library and (2) to add MIG support.
See #102 for more details.

Need to test whether:

/cc @XuehaiPan @Stonesjtu

Important Changes

  • The official Python bindings nvidia-ml-py need to be installed, not nvidia-ml-py3. If the legacy package is installed for some reason, an error will occur:

    ImportError: pynvml is missing or an outdated version is installed.

  • To fix this error, please uninstall nvidia-ml-py3 and install nvidia-ml-py<=11.495.46 (follow the instructions in the error message). Or you can [bypass] the validation if you really want. A quick way to check which package provides pynvml is sketched below.

  • For compatibility reasons, the NVIDIA driver version needs to be 450.66 or higher.
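
For anyone unsure which package is actually providing the `pynvml` module, here is a minimal sketch (not part of gpustat; assumes Python 3.8+ for `importlib.metadata`). If both packages are installed, which one's `pynvml.py` ends up being imported is effectively undefined, which is exactly the situation the validation guards against.

```python
# Minimal sketch (not gpustat code): check which distribution provides the
# pynvml module. Assumes Python 3.8+ for importlib.metadata.
from importlib.metadata import PackageNotFoundError, version

for dist in ("nvidia-ml-py", "nvidia-ml-py3"):
    try:
        print(f"{dist}: {version(dist)} installed")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```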

@wookayin added this to the 1.0 milestone on Aug 5, 2021
@wookayin (Owner, Author) commented Aug 5, 2021

From @XuehaiPan's comment #102 (comment):

| v1 API | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | works, but without CI ID / GI ID | works, but without CI ID / GI ID |
| nvidia-ml-py>=11.450.129 | no exceptions in Python, but gets wrong results (subscript out of range in the C library) | no exceptions in Python, but gets wrong results (subscript out of range in the C library) |

| v2 API | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | function not found | no exceptions in Python, but gets wrong results (subscript out of range in the C library) |
| nvidia-ml-py>=11.450.129 | function not found | works with correct CI ID / GI ID |

@wookayin (Owner, Author) commented Aug 5, 2021

> We should freeze this version as nvidia-ml-py==11.450.51 before we can find a way to patch nvmlProcessInfo_t and nvmlDeviceGetComputeRunningProcesses.

I don't like pinning an exact version because it will cause many other problems (e.g., an unexpected version might still be installed despite the pinned dependency specification).

Actually, nvidia-ml-py==11.460.79 (which is the latest and >=11.450.129) works fine for me. It gives me correct process results with the latest driver version (I tested with 470.57.02).

With old driver versions (e.g. 430.64), however, getting the function pointer to nvmlDeviceGetComputeRunningProcesses_v2 will fail according to your report. I'll need to add a patch in this PR; the details of how to do this elegantly are something we need to figure out.

Not sure what we need to patch in nvmlProcessInfo_t. Is it because the struct size has changed?

@XuehaiPan (Contributor) commented Aug 5, 2021

> Is it because the struct size has changed?

Yes, that causes wrong type casting. We would need to maintain nvmlDeviceGet{Compute,Graphics}RunningProcesses on our own to keep backward compatibility. I have tried to hack the function pointer cache, but that gives wrong results (wrong PIDs, memory usage, etc.). However, nvidia-ml-py==11.450.51 always works fine on both old and new drivers without any patching effort, just like the unofficial bindings do (though that indeed drops MIG support).
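
To illustrate the struct-size issue, here is an approximation of the two layouts (field names taken from the NVML headers; these are not the exact pynvml classes). The v2 process-info struct adds the MIG instance IDs, so it is larger than the v1 struct; if the driver fills a buffer with records of one size but the bindings read it back with the other layout, every record after the first is read at the wrong offset and the PIDs and memory usage come out garbled.

```python
# Illustrative sketch: approximate v1 vs v2 nvmlProcessInfo_t layouts.
# (Field layout approximated from NVML headers; not the exact pynvml classes.)
import ctypes

class ProcessInfoV1(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint),
        ("usedGpuMemory", ctypes.c_ulonglong),
    ]

class ProcessInfoV2(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint),
        ("usedGpuMemory", ctypes.c_ulonglong),
        ("gpuInstanceId", ctypes.c_uint),      # MIG GI ID (new in v2)
        ("computeInstanceId", ctypes.c_uint),  # MIG CI ID (new in v2)
    ]

# On a typical 64-bit platform: 16 vs 24 bytes, so an array of one layout
# cannot be reinterpreted as the other without corrupting the records.
print(ctypes.sizeof(ProcessInfoV1), ctypes.sizeof(ProcessInfoV2))
```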

@XuehaiPan (Contributor) commented:

I think NVIDIA/go-nvml#21 and NVIDIA/go-nvml#22 could be helpful.

Since 2021, NVIDIA provides an official python binding `pynvml`,
which should replace the existing (unmaintained) third-party one.
The APIs are mostly the same, but it ships with some recent features
such as MIG support.

To ensure a correct `pynvml` package is installed, we perform
some sanity check upon importing the library.
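
The sanity check mentioned in this commit message could look roughly like the following (a minimal sketch, not the actual code in gpustat): the official nvidia-ml-py (>= 11.450.129) exposes the versioned process APIs, while the legacy nvidia-ml-py3 module does not.

```python
# Rough sketch of an import-time sanity check (not gpustat's exact code):
# the official nvidia-ml-py (>= 11.450.129) exposes the _v2 process API,
# whereas the legacy nvidia-ml-py3 module does not.
import pynvml

if not hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2"):
    raise ImportError(
        "pynvml is missing or an outdated version is installed. "
        "Please uninstall nvidia-ml-py3 and install nvidia-ml-py instead."
    )
```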
@wookayin (Owner, Author) commented Mar 15, 2022

Copied from https://forums.developer.nvidia.com/t/pypi-nvidia-ml-py-issue-reports-for-nvidia-ml-py/196506/2

There is another breaking change: nvidia-ml-py 11.515.0 (Jan 12, 2022) now introduces v3 APIs (nvmlDeviceGetComputeRunningProcesses_v3, etc.). I can confirm this breaks older driver versions, i.e., anything before Jan 2022 (for instance, 470.86, which was released only 4 months ago!), because the low-level function nvmlDeviceGetGraphicsRunningProcesses_v3 does not exist in those NVIDIA drivers.

I was thinking of dropping support for very old driver versions (430.64 for Ubuntu 16.04, no longer supported in 2022), but now this would be a big problem for us. A dirty workaround that falls back to the legacy functions for non-latest drivers must be added; otherwise no process information would be available in the vast majority of use cases.

Or I could simply pin nvidia-ml-py again (which I don't like): nvidia-ml-py<11.515.0, given that I no longer trust nvidia-ml-py to keep backward compatibility.

@wookayin (Owner, Author) commented Apr 30, 2022

I am going to drop support for old, legacy NVIDIA driver versions, so that we don't need a workaround to support both the v1 and v2 APIs, or to pin specific nvidia-ml-py versions.

The nvmlDeviceGetComputeRunningProcesses_v2 function was introduced in pynvml 11.450.129 and NVIDIA driver 450.66 (released August 2020). Without the workaround mentioned above, gpustat won't be able to display process information for NVIDIA drivers older than 450.66. I might add legacy driver support back in later versions, but since normal users are expected to use reasonably recent NVIDIA drivers (something newer than two years old), this should be fine.

There is yet another breaking change introduced by the v3 APIs (Jan 12, 2022) since nvidia-ml-py 11.515.0 (and 11.510.69), which breaks all drivers older than 510.39.01 (see the nvml.h diff).

Currently 11.515.0 is yanked from PyPI, so it may not cause an immediate problem, but we will have to pin nvidia-ml-py at <11.515.0 because it is very likely that the v3 API will break on existing NVIDIA driver versions. The latest nvidia-ml-py, 11.510.69 (Apr 18, 2022), has already broken the v2 API.

wookayin added a commit that referenced this pull request Apr 30, 2022
Requires nvidia-ml-py >= 11.450.129 and nvidia drivers >= 450.66.

We also pin the version of nvidia-ml-py < 11.515 against a breaking
change due to v3 APIs (from 510.39.01). See #107 for more details.

We no longer need to exclude 375.* versions as the wrong package
has been expelled from the pypi repository.
wookayin added a commit that referenced this pull request May 1, 2022
Requires nvidia-ml-py >= 11.450.129 and nvidia drivers >= 450.66.

We also pin the version of nvidia-ml-py < 11.510 against a breaking
change due to v3 APIs (from 510.39.01). The latest compatible version
would be nvidia-ml-py==11.495.46. See #107 for more details.

We no longer need to exclude 375.* versions as the wrong package
has been expelled from the pypi repository.
@wookayin (Owner, Author) commented Jul 4, 2022

At the moment, we will pin the nvidia-ml-py version at 11.495.46 due to the breaking changes (nvmlDeviceGetComputeRunningProcesses_v3) introduced in 11.510.x.

  • If an incompatible version such as pynvml>=11.510.69 is installed, process information will not be available, showing "(Not Supported)".
  • This also requires a graphics driver version no lower than 450.66 (released Aug 18, 2020) to correctly display process information.
  • All that said, with this combination (Driver >= 450.66, pynvml>=11.450.129,<=11.495.46), gpustat should have no problems in most situations; see the setup.py sketch below.
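
For reference, the pin described above would look roughly like this in setup.py (illustrative; gpustat's actual setup.py is the authoritative source for the exact specifier):

```python
# Sketch of the dependency pin described above (illustrative; gpustat's
# actual setup.py is the authoritative source).
from setuptools import setup

setup(
    name="gpustat",
    install_requires=[
        # Official NVIDIA bindings; the upper bound avoids the v3 APIs
        # (11.510+) that break drivers older than 510.39.01.
        "nvidia-ml-py>=11.450.129,<=11.495.46",
    ],
)
```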

For more modern GPU and feature support, we may need to depend on the latest pynvml version, but that will come in later gpustat versions (say 1.x) with some fallback mechanisms for backward compatibility.

@wookayin (Owner, Author) commented Jul 4, 2022

There is a version-conflict problem with other libraries that pin the pynvml version; see XuehaiPan/nvitop#23. (From my end, this needs to be resolved nicely before releasing.)

@wookayin mentioned this pull request on Jul 5, 2022
@xwjiang2010 commented Jul 5, 2022

Hi,
I am seeing the following after `pip install gpustat` (1.0.0rc1):

ImportError: pynvml is missing or an outdated version is installed. 

We require nvidia-ml-py>=11.450.129; see GH-107 for more details.
Your pynvml installation: <module 'pynvml' from '/opt/miniconda/lib/python3.7/site-packages/pynvml.py'>

-----------------------------------------------------------
Please reinstall `gpustat`:

$ pip install --force-reinstall gpustat

if it still does not fix the problem, manually fix nvidia-ml-py installation:

$ pip uninstall nvidia-ml-py3
$ pip install --force-reinstall 'nvidia-ml-py<=11.495.46'

If gpustat requires certain version of pynvml, should that requirement be included in requirements.txt?

@wookayin (Owner, Author) commented Jul 6, 2022

@xwjiang2010 It is already specified in setup.py, of course (I don't use requirements.txt): https://github.com/wookayin/gpustat/blob/master/setup.py#L79

Did you follow the instructions after you ran into the error? If you installed gpustat, a proper version of nvidia-ml-py should have been installed. Such an error can happen due to a dependency conflict with other packages.

@pcmoritz commented:

It seems like this is related to comet-ml/issue-tracking#481 -- I hope the two projects can align on one common pynvml to use, so they can be used together :)

@wookayin (Owner, Author) commented:

> There is a version conflict problem with other libraries that pin pynvml version, see XuehaiPan/nvitop#23. (From my end, this needs to be resolved nicely before releasing)

> It seems like this is related to comet-ml/issue-tracking#481 -- I hope the two projects can align on one common pynvml to use, so they can be used together :)

The conflict between nvidia-ml-py3 (the obsolete one) and NVIDIA's official one (nvidia-ml-py) can be problematic because there are still many packages and projects in the wild (not all of them actively maintained) that depend on nvidia-ml-py3. It is still possible for both packages to be installed, and in that case which version of the pynvml module ends up installed appears to be undefined.

https://github.com/nicolargo/nvidia-ml-py3/network/dependents?dependent_type=PACKAGE&package_id=UGFja2FnZS01MjM0NzAyMQ%3D%3D lists 75+ such packages.

In such cases, where nvidia-ml-py3 gets resolved in the dependency chain, users will need to uninstall nvidia-ml-py3 and install the correct package manually (as per the instructions in the error message):

$ pip uninstall nvidia-ml-py3
$ pip install --force-reinstall 'nvidia-ml-py<=11.495.46'

However, things can be complicated when it comes to non-interactive CI jobs (example: ray-project/ray#26295).

wookayin added a commit that referenced this pull request Aug 4, 2022
gpustat 1.0.0+ requires pynvml to be an official one (nvidia-ml-py)
rather than an obsolete third-party package nvidia-ml-py3 (see #107),
but this may cause some inconvenient conflict with other third-party
packages that depend on the legacy nvidia-ml-py3.

As a temporary workaround, we introduce an environment variable
`ALLOW_LEGACY_PYNVML` that, when set, bypasses the pynvml validation.
With this flag turned on, gpustat may work with the legacy pynvml
library, but it may produce wrong results for running-process
information.

e.g., ALLOW_LEGACY_PYNVML=1 gpustat
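
The bypass described in this commit message might be implemented along these lines (an illustrative sketch, not the actual code in 4400d64):

```python
# Illustrative sketch of the ALLOW_LEGACY_PYNVML bypass (not the actual
# implementation in gpustat): skip the validation error when the flag is set.
import os
import warnings

import pynvml

allow_legacy = os.getenv("ALLOW_LEGACY_PYNVML", "") != ""
has_official_api = hasattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2")

if not has_official_api:
    if not allow_legacy:
        raise ImportError(
            "pynvml is missing or an outdated version is installed; "
            "see GH-107, or set ALLOW_LEGACY_PYNVML=1 to bypass this check."
        )
    warnings.warn("Using a legacy pynvml; process information may be wrong.")
```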
@wookayin (Owner, Author) commented Aug 4, 2022

4400d64 introduces the environment variable ALLOW_LEGACY_PYNVML to bypass pynvml validation. I won't document this upfront, but more information can be found in the commit message. I believe @mattip may find this useful in case gpustat is broken due to a conflicting dependency on pynvml caused by third-party packages that are out of our control.

@wookayin deleted the pynvml-nvidia branch on October 16, 2022
wookayin added a commit that referenced this pull request Nov 27, 2022
This commit adds unit tests for #107, where legacy and supported
nvidia drivers would behave differently on process-related APIs (e.g.,
nvmlDeviceGetComputeRunningProcesses_v2).

Note: As already pointed out in #107, this test (and gpustat's process
information) fails with nvidia-ml-py > 11.495.46 breaking the backward
compatibility.
wookayin added a commit that referenced this pull request Nov 27, 2022
pynvml 11.510.69 has broken backward compatibility by removing
`nvml.nvmlDeviceGetComputeRunningProcesses_v2`, which is replaced by the
v3 APIs (`nvml.nvmlDeviceGetComputeRunningProcesses_v3`); but this
function does not exist in nvidia drivers older than 510.39.01.

Therefore we pinned the pynvml version at 11.495.46 in gpustat v1.0 (#107),
but we actually have to use recent pynvml versions for "latest" or modern
NVIDIA drivers. To make compute/graphics process information work
correctly when a combination of old nvidia drivers (`< 510.39`) AND
`pynvml >= 11.510.69` is used, we need to monkey-patch pynvml functions
in our custom manner such that, for instance, when the v3 API is
introduced but unsupported by the driver, we can simply fall back to the
v2 APIs to retrieve the process information.
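
The fallback idea could be sketched as follows (a simplified illustration, not the actual patch in gpustat, which also has to handle the differing nvmlProcessInfo_t struct layouts, not just the missing entry points):

```python
# Simplified sketch of the v3 -> v2 -> v1 fallback (illustration only; the
# real patch must also account for the different nvmlProcessInfo_t layouts).
import pynvml

def get_compute_processes(handle):
    """Try the newest process-listing API first, falling back on older ones."""
    for name in ("nvmlDeviceGetComputeRunningProcesses_v3",
                 "nvmlDeviceGetComputeRunningProcesses_v2",
                 "nvmlDeviceGetComputeRunningProcesses"):
        fn = getattr(pynvml, name, None)
        if fn is None:
            continue  # this pynvml version does not expose the binding
        try:
            return fn(handle)
        except pynvml.NVMLError_FunctionNotFound:
            continue  # the driver's libnvidia-ml.so lacks this entry point
    raise pynvml.NVMLError(pynvml.NVML_ERROR_FUNCTION_NOT_FOUND)
```

Note that in recent pynvml releases the unversioned name is just an alias of the newest versioned call, so a real fix likely has to go a level deeper and override how the underlying NVML function pointer is looked up, rather than only trying different Python-level names.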
@wookayin (Owner, Author) commented Mar 2, 2023

#143 relaxes the (maximum) version requirement; since v1.1 there will be only a minimum requirement.
