
Add CuPy #9959

Merged
merged 4 commits into from
Oct 25, 2019
Conversation

jakirkham
Member

@jakirkham jakirkham commented Oct 23, 2019

Closes #8638

Based off PR ( #8638 ), this adds the cupy package. Tidied the PR a bit. Also dropped some changes to staged-recipes that are not needed.

Checklist

  • License file is packaged (see here for an example)
  • Source is from official source
  • Package does not vendor other packages. (If a package uses the source of another package, they should be separate packages or the licenses of all packages need to be packaged)
  • If static libraries are linked in, the license of the static library is packaged.
  • Build number is 0
  • A tarball (url) rather than a repo (e.g. git_url) is used in your recipe (see here for more details)
  • GitHub users listed in the maintainer section have posted a comment confirming they are willing to be listed there

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipes/cupy) and found it was in an excellent condition.

@jakirkham jakirkham mentioned this pull request Oct 23, 2019
@jakirkham
Member Author

@conda-forge/core, this is ready for review! 😄

@leofang
Member

leofang commented Oct 23, 2019

@jakirkham this is awesome! Thanks for working on it. Could you point me to where - {{ compiler("cuda") }} is documented? I think it's pretty new, and I wasn't aware how the EULA issue with Nvidia was resolved. Having this working would solve a lot of headaches for packaging CUDA-dependent projects.

@jakirkham
Member Author

@jakirkham this is awesome! Thanks for working on it.

Of course! Happy to help 😄

Could you point me to where - {{ compiler("cuda") }} is documented?

It needs to be. 😅 This is on my todo list. ( conda-forge/conda-forge.github.io#901 )

I think it's pretty new and I wasn't aware how the EULA issue with Nvidia was resolved.

Yeah it is pretty new. We plan to advertise it a bit more. Stay tuned.

The short answer is we are using NVIDIA-based Docker images that already have the toolchain to do the build.

Having this working would solve a lot of headaches for packaging CUDA-dependent projects.

Agreed!

On a different note, would you be interested in helping maintain the CuPy package? 😉

@isuruf
Member

isuruf commented Oct 23, 2019

Can you change the docker image and make the tests pass? We can revert the docker image change before merging. (This is what we did for openmm.)

@jakirkham
Member Author

Sure, I've pushed a commit to build the package on CUDA 9.2.

@@ -7,7 +7,7 @@ jobs:
   matrix:
     linux:
       CONFIG: azure-linux-64-comp7
-      IMAGE_NAME: condaforge/linux-anvil-comp7
+      IMAGE_NAME: condaforge/linux-anvil-cuda:9.2
Member

It would be good to not have to change this file in this PR

Member

Yes, we'll have to use docker-in-docker to do that. That would also help with testing the CentOS 7 packages that you wanted to add.

Member Author

This is just so we can test it here. We will revert it when we are done.

That said, we did start discussing how to do GPU tests in issue ( #9773 ). Have a rough start of a potential solution in PR ( #9804 ), but it could use help if anyone has time.

Member

I have a question: would it imply that all CUDA packages built this way for conda-forge, and the - {{ compiler("cuda") }} flag, will be pinned at CUDA 9.2?

Member Author

Right, this is why we need docs. 😅

So currently conda-forge builds against 3 CUDA versions. As some recipes have a CPU-only option, we support 1 CPU version + 3 CUDA versions in total. These can be seen here (None is the CPU version). These versions map one-to-one to these Docker images, where the build occurs. Each package then gains a dependency on the cudatoolkit package with the same major.minor version.

Now, what does this mean for CuPy? It will be built against all of the CUDA versions, but no CPU version. Each build of CuPy will pin to a major.minor version of CUDA (so 9.2, 10.0, and 10.1). Users can then constrain which build they get by selecting the version of cudatoolkit they want.
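To make the scheme concrete, here is a small illustrative sketch (not conda-build internals) of the variant matrix described above: one CPU variant plus three CUDA variants, where each CUDA build carries a run dependency pinned to the matching major.minor cudatoolkit. The function name and list are hypothetical.

```python
# Illustrative sketch of the variant/pinning scheme described above.
# None represents the CPU-only variant; each CUDA variant gains a run
# dependency on cudatoolkit pinned at the same major.minor version.

CUDA_VARIANTS = [None, "9.2", "10.0", "10.1"]

def run_requirements(cuda_compiler_version):
    """Extra run dependencies a build of this variant would carry."""
    if cuda_compiler_version is None:
        return []  # CPU-only build: no cudatoolkit dependency
    return [f"cudatoolkit {cuda_compiler_version}.*"]

for variant in CUDA_VARIANTS:
    print(variant, run_requirements(variant))
```

So a user asking for `cudatoolkit=10.0` would resolve to the CuPy build carrying the `cudatoolkit 10.0.*` pin.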

Member

@leofang leofang Oct 25, 2019

Thanks a lot for the detailed explanation and the links, @jakirkham! These are very useful to know for building CUDA packages, not just CuPy.

Yes, it'd be great to have a central place documenting all of this. I've been trying to follow the progress of cudatoolkit for a while, but I completely missed these new developments ☹️

Do you know which CUDA 10.1 exactly is in the Docker image? You probably have noticed this new issue: cupy/cupy#2569. CuPy doesn't support 10.1 Update 2 currently.

btw, Nvidia is awful in their versioning -- why not just call it 10.1.2, or even simpler 10.3 (there's already an Update 1, which should have been 10.2)... These versioning semantics might need to be taken care of in conda-forge's CUDA stack...

Member Author

Glad that helps. Thanks for the feedback. At some point it would be good if you can share your thoughts on what has been confusing and useful as well as what would be valuable to have in the docs. Have been collecting feedback from others in this doc issue ( conda-forge/conda-forge.github.io#901 ). 🙂

Sorry we have not done a great job in publicizing it. Have been working through rough edges with individual contributors to try and smooth out the overall experience. That said, it would be useful to write up (even an incomplete) blogpost about the state of things from a high-level. Thanks for the nudge. 😉

FWICT the image is using CUDA 10.1 Update 2 (nvcc output below). Thanks for letting me know. IIUC that just affects master and the latest RC, right? (Asked on the issue too.) In either case, will follow that issue closely.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Good point. Have also been thinking about this. Could you please raise an issue about more detailed versioning on the nvcc feedstock?
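Since the `release 10.1` line alone doesn't distinguish 10.1 from its updates, a small (hypothetical) helper can pull the full `V10.1.243`-style version out of `nvcc --version` output:

```python
import re

# Hypothetical helper: extract the full version string from
# `nvcc --version` output. The patch component (e.g. 243) is what
# distinguishes 10.1 Update 2 from the other 10.1 releases.

def parse_nvcc_version(output):
    match = re.search(r"V(\d+\.\d+\.\d+)", output)
    return match.group(1) if match else None

sample = """nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 10.1, V10.1.243"""

print(parse_nvcc_version(sample))  # -> 10.1.243
```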

Member

OK will do. Also listing version strings here for future ref:
10.1: 10.1.105
10.1 Update 1: 10.1.168
10.1 Update 2: 10.1.243
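For future reference, the mapping above can be captured in a small lookup (a sketch; the names are just the marketing labels for each patch level):

```python
# CUDA 10.1 releases share a major.minor version; only the patch level
# tells the updates apart.

CUDA_10_1_RELEASES = {
    "10.1.105": "10.1",
    "10.1.168": "10.1 Update 1",
    "10.1.243": "10.1 Update 2",
}

def cuda_10_1_name(version):
    """Return the marketing name for a CUDA 10.1 version string."""
    return CUDA_10_1_RELEASES.get(version, "unknown")

print(cuda_10_1_name("10.1.243"))  # -> 10.1 Update 2
```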

@jakirkham jakirkham force-pushed the add_cupy_final branch 4 times, most recently from b8a3061 to a10b883 Compare October 23, 2019 22:03
run:
- python
- setuptools
- {{ pin_compatible("cudnn") }}
Member

Maybe I misunderstood: why is just cudnn pinned but not nccl? There are incompatibilities even within the v2 series, and I am not sure the built cupy.cuda.nccl module would be happy with a different nccl version discovered at runtime.

Member

FYI, the CuPy team bundles a specific NCCL version in the CuPy wheel.

Member

Yeah, not pinning NCCL is dangerous. The NCCL version is determined at build time:
https://github.com/cupy/cupy/blob/b9db374f4d0ccee358a998d7b283248f77cef351/cupy/cuda/cupy_nccl.h#L131-L133

Member

I think the plan is to pin that globally conda-forge/conda-forge-pinning-feedstock#309 (comment)

Member

@leofang leofang Oct 24, 2019

Thanks, @CJ-Wright. I wasn't aware of that discussion.

But still, don't we need a {{ pin_compatible("nccl") }} here? Once a CuPy package is built, it is supposed to be pinned to the precise NCCL version it was built against. I feel the global pin might be a bit fragile when a newer NCCL comes out.

Member Author

Well, the compiler might warn that the symbol is unused, but I don't think it will just get rid of the symbol altogether. There may be flags to enable this kind of elimination, but I'm pretty sure it is opt-in. I could be wrong about this though.

Member Author

Anyways we should probably move this discussion to CuPy issue ( cupy/cupy#2568 ).

Member Author

Well, currently we are using the latest major.minor version of NCCL with CuPy. Now that we are aware of this issue, I think we just need to stay on top of it.

Member

Until that issue is resolved, shouldn't you pin exactly?

Member Author

@jakirkham jakirkham Oct 25, 2019

The NCCL issue? We are using the latest major.minor version of NCCL, and the current release of CuPy does too. So they are not adding any symbols at the moment that will clash with NCCL. Also, they've acknowledged the issue upstream and seem willing to work on it. We certainly need to keep an eye on things as new releases come out and adjust our strategy accordingly. That said, I think we are doing all we can with this issue ATM.

@jakirkham
Member Author

@conda-forge/core, would it be possible to get another review of this? I think all comments from the last round were addressed. 🙂

@leofang
Member

leofang commented Oct 24, 2019

See #9959 (comment).

@jakirkham
Member Author

(responding in that thread :)

@leofang
Member

leofang commented Oct 24, 2019

Oh, almost forgot:

On a different note, would you be interested in helping maintain the CuPy package? 😉

I am more than happy to help if needed. I think I know CuPy's internals well enough.

But I am a bit hesitant to be listed as a maintainer. Last time I worked on a package with non-Python dependencies (conda-forge/pyqt-feedstock#61), it was super painful. I explored in the dark without understanding how conda-build really works (lots of things are either undocumented or have special treatments that are impossible for a newcomer to know), nor was I able to successfully debug issues myself (lack of resources and knowledge) when seeing failures in CI. @isuruf eventually took over and helped finish it (thanks again btw). The time investment is a bit daunting from that experience.

@jakirkham
Member Author

I think the Qt stack is one of the hardest pieces of our ecosystem. So props for diving into the deep end. Most of the tricky bits that I found from building CuPy were not specific to CuPy, but were more general GPU building things that have been upstreamed into the tooling in conda-forge. At this point, I would argue that building CuPy is no harder than building your average Python C extension (not PyQt 😉).

Should add I'm happy to help here too. Am using CuPy regularly in my day-to-day and expect that to continue. So will feel some pressure to keep things working here.

Though I think it would be good if some people that are not me become familiar with the CuPy recipe and updating it. Am happy to handhold as we move in that direction. In the end, I think you and other CuPy developers know way more about how CuPy works than I do. So it will be good to transition to having you and other CuPy developers directing how the package looks going forward. We've found this model successful with several packages. 🙂

@leofang
Member

leofang commented Oct 24, 2019

FYI, the CuPy team pins at NCCL 2.4.2.1 for CUDA 9.2:
https://github.com/cupy/cupy-release-tools/blob/d542e5436cf1380913924babfb3306357ef5dac7/download_nccl.sh#L27-L28
Would it be nice to coordinate with / follow them?

@CJ-Wright
Member

Would it be nice to coordinate with / follow them?

It would be good to coordinate, although we also need to think of other downstream consumers of nccl.

@jakirkham
Member Author

We don't currently have a build of NCCL 2.4.2.1. It could be added though.

@jakirkham
Member Author

Do you know why they chose NCCL 2.4.2.1?

@leofang
Member

leofang commented Oct 24, 2019

although we also need to think of other downstream consumers of nccl.

This sounds against a global pin that was planned...

Do you know why they chose NCCL 2.4.2.1?

It wasn't explained here: cupy/cupy-release-tools#29. Probably it was the latest version at that time?

@leofang
Member

leofang commented Oct 24, 2019

@kmaehashi would you have time to comment / review on the NCCL pinning issue and the overall CuPy recipe here? Would be great to have an extra pair of eyes, especially someone from the CuPy/Chainer team!

@CJ-Wright
Member

This sounds against a global pin that was planned...

This is the reason to have a global pin, so that all consumers of nccl can have the same version. That way the downstream consumers don't cause errors for each other.

@leofang
Member

leofang commented Oct 24, 2019

This sounds against a global pin that was planned...

This is the reason to have a global pin, so that all consumers of nccl can have the same version. That way the downstream consumers don't cause errors for each other.

Oh I see. So keep the pain at the coordination stage.

@CJ-Wright
Member

CJ-Wright commented Oct 24, 2019

Yes, this is also why we're so interested in the exact level of pinning needed. If the pin is very strict then we will need to rebuild all consumers every time a new bug fix release is produced, which could be a lot of overhead for maintainers.
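The tradeoff here can be made concrete with a simplified sketch of how pin strictness translates into a version range (this mimics conda-build's pin expressions in spirit; the function and its exact bounds are illustrative, not the real implementation). The stricter the pin, the narrower the range, and the more downstream rebuilds each new nccl release forces.

```python
# Simplified sketch: turn a version plus a max_pin expression into a
# version range. More "x" components means a stricter pin and thus more
# rebuilds of downstream packages per upstream release.

def pin(version, max_pin):
    parts = version.split(".")
    n = max_pin.count("x")          # how many components to hold fixed
    upper = parts[:n]
    upper[-1] = str(int(upper[-1]) + 1)  # bump the last fixed component
    return f">={version},<{'.'.join(upper)}"

print(pin("2.4.2.1", "x"))      # -> >=2.4.2.1,<3
print(pin("2.4.2.1", "x.x"))    # -> >=2.4.2.1,<2.5
print(pin("2.4.2.1", "x.x.x"))  # -> >=2.4.2.1,<2.4.3
```

With an `x.x.x`-level pin, every NCCL bug-fix release would invalidate existing CuPy builds, which is the overhead being weighed above.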

Changes the image to use the CUDA 9.2 Docker image and constrain the
recipe to only build for CUDA 9.2.
@leofang
Member

leofang commented Oct 25, 2019

Note this cupy/cupy#2568 (comment):

Thanks for pointing this issue out. The same thing can be said of cuDNN, as we're exposing symbols whose names are the same as cuDNN's depending on the build-time cuDNN version, and cudnn is linked as libcudnn.so.7 (not libcudnn.so.7.5).

@jakirkham
Member Author

@conda-forge/core, I think things are basically ready to go. Can revert the CI hack once we are ready to merge. What do people think?

@jakirkham
Member Author

Note this cupy/cupy#2568 (comment)

Oh sorry, I thought I posted that comment, but I guess that didn't happen. Thanks for raising it, @leofang. Currently we are just using the latest version of cuDNN at all times. So this shouldn't cause the same problem.

@jakirkham
Member Author

Can revert the CI hack once we are ready to merge.

Have gone ahead and done this. Can re-add if further iteration is needed.

@scopatz
Member

scopatz commented Oct 25, 2019

Seems like this is failing on Linux.

@jakirkham
Member Author

Right, that's because we don't have GPU builds on staged-recipes, as discussed earlier in the thread. Also, as noted, I've removed the hack that enabled one GPU build. If you are interested in the CI status of the recipe, please look one commit prior. 🙂

@leofang
Member

leofang commented Oct 25, 2019

@jakirkham add me to the maintainers. I'd feel less nervous if I could work with someone knowledgeable like you 🙂

@leofang
Member

leofang commented Oct 25, 2019

@jakirkham btw see my new question in #9959 (comment):

Do you know which CUDA 10.1 exactly is in the Docker image? You probably have noticed this new issue: cupy/cupy#2569. CuPy doesn't support 10.1 Update 2 currently.

@jakirkham
Member Author

@jakirkham add me to the maintainers. I'd feel less nervous if I could work with someone knowledgeable like you 🙂

Thanks Leo! Would be great to have you. Added you in the latest commit. 🙂

@jakirkham btw see my new question in #9959 (comment):

Ah sorry! Missed that one before. Have responded in ( #9959 (comment) ).

@isuruf isuruf merged commit b71cb49 into conda-forge:master Oct 25, 2019
@jakirkham jakirkham deleted the add_cupy_final branch October 25, 2019 21:51
@jakirkham
Member Author

Thanks all for the reviews and feedback here! Am really happy with how much nicer this recipe turned out as a result. 🎉 😄


build:
number: 0
skip: True # [not (linux64 and cuda_compiler_version != "None")]
Member

Using De Morgan's Laws, this skip statement could be made easier to read:

[(not linux64) or cuda_compiler_version == "None"]
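The equivalence can be checked exhaustively (here `a` stands for `linux64` and `b` for `cuda_compiler_version != "None"`):

```python
from itertools import product

# Verify De Morgan's law behind the selector rewrite:
# not (a and b)  ==  (not a) or (not b)  for every truth assignment.

for a, b in product([True, False], repeat=2):
    assert (not (a and b)) == ((not a) or (not b))

print("equivalent for all cases")
```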

Member Author

A PR to the feedstock would be welcome. 🙂

ref: https://github.com/conda-forge/cupy-feedstock

Member

😎
