How to specify CUDA version in a conda package? #687
What you're seeing with `cuda100` and such is that people are creating packages as stand-ins for constraints. That's fine as a short-hand. It is important to understand why versions must be specified: ultimately, this is all about compatibility ranges with CUDA. If something is built against CUDA 9, will it work with CUDA 10 runtimes? I don't know. Conda-build has a way to help manage that (https://conda.io/docs/user-guide/tasks/build-packages/variants.html#customizing-compatibility). One other hard aspect of this CUDA stuff is that we can ship CUDA runtimes, but we can't alter the graphics driver that users have. This seriously hampers any flexibility we have in distributing newer runtimes, and requires that the user understand what their system is currently compatible with, in a way that is not generally a problem with other software. Adding hardware/driver version detection to conda itself, and having conda choose appropriate CUDA versions, would be helpful so that users don't need to figure these things out.
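As a rough illustration of the variant mechanism linked above: a recipe can be built once per CUDA version, with each build carrying its own pin. This is a minimal sketch; the variant key name (`cudatoolkit` here) is whatever the recipe's `conda_build_config.yaml` declares, not a fixed convention.

```sh
# Build the same recipe twice, once per CUDA version in the matrix.
# Each resulting package carries its own cudatoolkit pin.
conda build recipe/ --variants "{cudatoolkit: ['9.2', '10.0']}"
```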
Would it be possible in the short term to print some warning about driver compatibility when linking CUDA things? Bonus points if we can do some light inspection of the system state and give a more detailed message?
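For the "light inspection" idea, the driver version is already queryable from the command line wherever the NVIDIA driver is installed, so a warning hook could compare it against the minimum driver each CUDA release requires. A minimal sketch, assuming `nvidia-smi` is on the PATH:

```sh
# Print just the installed NVIDIA driver version (one line per GPU).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```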
CUDA drivers (the part that conda cannot install) are backward compatible with applications compiled with older versions of CUDA. So, for example, the CUDA 9.2 build of PyTorch would only require that CUDA >= 9.2 is present on the system. This backward compatibility also extends to the cudatoolkit (the userspace libraries supplied by NVIDIA, which Anaconda already packages), where a conda environment with cudatoolkit 8.0 would work just fine on a system that has the CUDA 9.2 drivers. So, on one hand, there is motivation (much like glibc) to pick an arbitrary old CUDA version, build everything with that, and rely on driver backward compatibility. On the other hand, aside from new CUDA language features (which a project may choose to ignore for compatibility reasons), building with newer CUDA versions can improve performance as well as add native support for newer hardware. A package compiled for CUDA 8 will not run on Volta GPUs without a lengthy JIT recompilation of all the CUDA functions in the project, which happens automatically, but can still be a bad user experience. As an example, TensorFlow compiled with CUDA 8 can take 10+ minutes to start up on a Volta GPU. These two conflicting desires, compatibility and performance, explain why it makes sense to compile packages for a range of CUDA versions (right now, I'd say 8.0 to 10.0 or 9.0 to 10.0 would be the best choice), but that still leaves the burden on the user to know which CUDA version they need. Because nearly all CUDA projects require the CUDA toolkit libraries, and Anaconda packages them, we use the `cudatoolkit` package to express this constraint, for example: `conda install pytorch cudatoolkit=8.0 -c pytorch`
And that will get you a PyTorch compiled with CUDA 8, rather than something else. The CUDA driver provides a C API to query the maximum version of CUDA supported by the driver, so a few months ago I wrote a self-contained Python function for detecting what version of CUDA (if any) is present on the system: https://gist.github.com/seibert/52a204395cdc84eeeaf0ce05464a636b This was for the conda team to potentially incorporate into conda as a "marker" (I think that is the right term), so that conda could include a package-like constraint reflecting the detected CUDA version in its solves. I don't know where this work is on the roadmap for conda (@msarahan?), but if there is additional work needed on the conda side to get this to the finish line, I'm happy to help. It would go a long way toward unifying the various approaches as well as improving the user experience.
@soumith can you (or someone else who works on PyTorch who might be more engaged on this topic) comment on whether depending on `cudatoolkit` would work for PyTorch? @mike-wendt and @kkraus14, does the approach above work for you for RAPIDS?
Depending on `cudatoolkit` sounds reasonable. Incidentally, when we started the feature-tracking hack, i.e. the `cuda80`-style packages, depending on `cudatoolkit` wasn't an option for us. I'll look to moving our packages to using this format / convention. Thanks for starting this conversation @mrocklin
@soumith Thanks for jumping in. We've been using labels in the RAPIDS project and they've been helpful so far. Should we consider those as well? I'm not sure conda should be the only way to install the CTK, even for conda users. With labels, people can pull the package they need for the right CUDA version, and if they also want to install the CTK from conda they have the option.
It's important to note that labels (I assume you mean things like this) are properties of a conda package in a particular channel, not intrinsic metadata of the package itself. Labels are a good way to separate packages for different purposes (for example, dev, qa, release), but they have no impact on the conda dependency solver. This means it would be possible for a user to mix the CUDA 9.2 build of one package with the CUDA 10.0 build of another in the same environment, and conda would not flag the conflict.
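To make the hazard concrete, here is a hypothetical pair of installs (channel and label names are illustrative, not RAPIDS' actual layout):

```sh
# Both commands succeed: a label only selects which view of the channel is
# searched, and nothing in the installed metadata records the CUDA version,
# so the solver has no way to object to the mismatch.
conda install -c rapidsai/label/cuda9.2 cudf
conda install -c rapidsai/label/cuda10.0 cuml
```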
Do RAPIDS libraries depend on the `cudatoolkit` conda package?
One thing we really need is for conda-forge/NumFOCUS to be able to redistribute cudatoolkit at all. @mrocklin @datametrician, any updates on this front?
My understanding of the CUDA toolkit EULA is that the libraries--which is what the `cudatoolkit` conda package contains--may be redistributed, provided the EULA's terms are followed. However, cuDNN (used by all the deep learning frameworks) is still shipped separately from the CUDA toolkit and technically requires an NVIDIA developer registration. Anaconda obtained a special license from NVIDIA to redistribute it in the Anaconda Distribution, but (IMHO) the registered developer requirement on cuDNN should be lifted so it can be redistributed on the same terms as the rest of the CUDA toolkit libraries.
@scopatz I believe Pramod Ramarao already responded to you in an email on 11/15, essentially saying what @seibert said. Stan, I agree with you on cuDNN, and will point Pramod and the cuDNN PMs to this thread. Let's see if that will change anything. @mrocklin I believe this is doable from our end, especially since everyone is essentially doing it their own way and this would allow some consolidation. I'll let @mike-wendt and @kkraus14 chime in though.
Also, I'm not sure what the status is of NCCL2. Is it also limited to registered developers?
NCCL2 is open source and free to redistribute. The source code is also on GitHub now.
Right now we do not rely on `cudatoolkit`.
I'm not against standardizing around `cudatoolkit`. Certain libraries that we depend on are distributed with the GPU driver, and on systems that do not have the driver installed, the RAPIDS libraries fail to build. So we have more of a combined CUDA and NVIDIA driver dependency. While it sounds like we're introducing more complexity, we handle this by restricting the level of CUDA we support in RAPIDS. We rely on this approach because each version of CUDA has a minimum NVIDIA driver version that it needs to operate. If that driver version is not satisfied, the installation fails and the user usually upgrades their driver to a compatible version. I want to be clear, though, that we should not make this a true dependency; that is, require a driver version that matches the `cudatoolkit` version exactly.
Thoughts
Questions
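The minimum-driver behavior described above can be checked up front rather than failing mid-install. A minimal sketch, assuming `nvidia-smi` is available and using 410.48, NVIDIA's documented minimum Linux driver for CUDA 10.0:

```sh
#!/bin/sh
# Compare the installed driver against the minimum required by CUDA 10.0.
required="410.48"
current=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# sort -V orders version strings; if the smallest is not $required,
# the installed driver is older than the minimum.
if [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)" != "$required" ]; then
    echo "Driver $current is too old for CUDA 10.0 (needs >= $required)" >&2
    exit 1
fi
echo "Driver $current satisfies CUDA 10.0"
```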
This is an interesting question. Presumably this happens in other disciplines today? If I'm working on a new BLAS implementation, is there some way for me to link a conda-installed numpy against my version of BLAS (or really any version other than OpenBLAS and MKL), or am I outside of conda-supported workloads and should be handling things on my own at this point? I'm curious about your questions regarding mirroring, @mike-wendt. What packages would you mirror, and how would this solve the problem of working with bleeding-edge `cudatoolkit` releases?
I think from our in-person discussion this is "outside of conda-supported workloads." The target users for RAPIDS, PyTorch, and others using CUDA are just that: users. They primarily want a way to get up and running quickly instead of trying to figure out dependencies. Standardizing around `cudatoolkit` serves them well. The rest are developers, which comes with some work. They need to be aware of the tools and how to use them, so they can verify and test approaches for users using conda, while not being totally dependent on conda for development. Your BLAS example is a good one, as is RAPIDS, which needs the full CUDA development install for compilers and includes. Not to mention our unique case where we are testing nightly builds and need a process outside of conda for that. Could we move that to conda and publish nightly packages privately? Sure, but I don't believe it will bring the value I thought it would previously. I'm in favor of using `cudatoolkit` to express the CUDA version dependency.
OK, it seems like everyone is on board with specifying CUDA version numbers by expressing dependencies on `cudatoolkit`. It also sounds like @mike-wendt is proposing a convention of including the CUDA version number in build strings. Is there any objection to this? @mike-wendt, would the next release of RAPIDS follow this convention, or is that too early for you? @soumith, what does this process look like on the PyTorch side? Is it easy for you all to change around your builds and your installation instructions? I can imagine that you would want some sort of smooth transition.
@mrocklin The blocker for us is the lack of a `cudatoolkit` 10.0 package. The major concern I have this week and next is the conda-forge plan for the gcc7 switchover that occurs on 1/15. So I think it is safe to say a lot of us will be busy that week dealing with the conversions and any necessary updates related to that. Right now we are scheduled to freeze for v0.5 on 1/16, so I think it will be hard to guarantee that it makes this release, but we might be able to do a hotfix release the week after.
There is no immediate stress on this. Happy to play a long game. So the first time RAPIDS would use this convention would be sometime in March?
@mrocklin the process on the PyTorch side is easy; we just have to change our build scripts. I'm inclined to make the change when a CUDA 10 cudatoolkit is available as well, because otherwise half of our install commands would still go via feature packages.
The CUDA 10.0 cudatoolkit recipe is live: https://github.com/numba/conda-recipe-cudatoolkit
I had a talk with @seibert yesterday about what he thinks he needs from conda to support this. I think we agreed that conda needs "virtual packages", which @kalefranz has been lumping in with "markers" but which I think are actually separate. A virtual package is something that represents some aspect of the system. Its version and build string can be determined dynamically by having conda run some code for that particular virtual package. It would then be considered by the solver as a package with a strict pinning. For CUDA, this means we need to decide what the package name should be. Then all packages would express their CUDA compatibility as normal dependencies on that package. A user's system might present something like a strictly pinned `cuda 10.0`, while packages such as pytorch would express normal version dependencies like `cuda >=9.0` (adjusted as appropriate for the actual compatibility expectations of CUDA). Conda could obviously never update cuda itself, but it would be nice to have it recognize ways outside of its control to update (i.e. tell the user that they can update their driver or upgrade their hardware). Depending on how long it takes for this cuda virtual package to compute its version, it may be something that we cache on disk and refresh with a dedicated command. @seibert volunteered some time towards getting this implemented in conda. We'll hope to have something ready soon - likely with the next minor release of conda, 4.7.0.
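A sketch of how this might look to a user, assuming the virtual package ends up being named `cuda` (the comment above notes the name is not yet decided):

```sh
# The solver would see a strictly pinned package derived from the driver,
# e.g. "cuda 10.0", that no channel actually serves. A package built for
# CUDA 9.x with a run dependency of "cuda >=9.0,<11" would then install
# cleanly here, while an incompatible one would fail with an ordinary
# conflict message instead of a runtime crash.
conda install pytorch
```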
@msarahan to be clear, it sounds like you're proposing this as an alternative to using `cudatoolkit` packages to express CUDA constraints?
Could be? I'm ambivalent on that. If you can ship runtimes that work with a variety of drivers, maybe they can be independent.
Or should I say: maybe cudatoolkit stays in use the same as it is now, but cudatoolkit itself grows a dependency on this new virtual package to establish driver requirements.
OK, so you think that it's still the right approach for downstream packages to depend on `cudatoolkit`?
Yep, it's definitely important to bridge the gap with cudatoolkit, since new conda versions may take a while to become available. Perhaps cudatoolkit can be dropped in the more distant future, once this new approach is proven and commonly available. Thankfully, I expect the CUDA-using community will be quick adopters of new conda versions, rather than laggards holding onto old ones.
@msarahan thanks!
Is there something we need to do to get this into defaults, or is it in the pipeline already?
I will work on getting cudatoolkit 10.0 into defaults next week.
On my side, the move to depending on `cudatoolkit` in our build scripts is underway. Thanks all for the thread.
Alternatively, you can use
Does Anaconda also handle the builds for `cupy`?
conda/conda#8267 will add support for a virtual `cuda` package that exposes the driver's maximum supported CUDA version to the solver.
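Once that lands, the detected value should be inspectable and, for machines without a GPU (such as build workers), overridable. A sketch assuming the names used in later conda releases (`__cuda` and the `CONDA_OVERRIDE_CUDA` environment variable; both are assumptions here, not confirmed by this thread):

```sh
# List the virtual packages conda detected on this machine (e.g. __cuda).
conda info

# Pretend the driver supports CUDA 10.0 when solving, e.g. on a GPU-less CI box.
CONDA_OVERRIDE_CUDA="10.0" conda install pytorch
```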
@jjhelmus builds the cupy packages, I think. They should already depend on the `cudatoolkit` package.
To help codify this a bit more, I've put up PRs conda-forge/docker-images#93 and conda-forge/staged-recipes#8229. These provide a Docker image (based off conda-forge's current Docker image) for compiling packages, and a shim package to get nvcc and conda-build to talk to each other. Please share your thoughts on these.
Something else worth mentioning here: I've noticed that CMake, when using the CUDA language feature, often likes to statically link the CUDA runtime library. There used to be a way to disable this (e.g. a cache variable in the older FindCUDA module).
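If the reference above is to FindCUDA's cache variable (an assumption; the stripped example isn't recoverable), the old switch looked like this, though with the native CUDA language support it may no longer take effect:

```sh
# FindCUDA-era switch for linking the shared CUDA runtime instead of the
# static one; projects using enable_language(CUDA) may ignore it.
cmake -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF ..
```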
How should a package maintainer specify a dependency on a specific CUDA version like 9.2 or 10.0?
As an example, here is how PyTorch does things today:
```sh
# CUDA 8.0
conda install pytorch torchvision cuda80 -c pytorch
# CUDA 9.0 (the current default build)
conda install pytorch torchvision -c pytorch
# CUDA 10.0
conda install pytorch torchvision cuda100 -c pytorch
# CPU-only
conda install pytorch-cpu torchvision-cpu -c pytorch
```
I believe that NVIDIA and Anaconda handle things differently. I have zero thoughts on which way is correct, but I thought it would be useful to start a conversation around this. My hope is that we can come to some consensus on packaging conventions that helps users avoid broken environments and provides a good pattern for future package maintainers to follow.
cc @jjhelmus @msarahan @nehaljwani @stuartarchibald @seibert @sklam @soumith @kkraus14 @mike-wendt @datametrician