Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

torchvision: fix #217878; migrate to cudaPackages #218035

Merged

Conversation

ConnorBaker
Copy link
Contributor

@ConnorBaker ConnorBaker commented Feb 24, 2023

Description of changes
  • when building with CUDA, we must make sure to use the same C/C++ compiler as NVCC to avoid symbol errors when linking
  • move to cudaPackages to reduce closure size

Currently, torchvision doesn't even build on master because the default GCC is 12, which is unsupported by NVCC until CUDA 12.0.

This fixes #217878.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@ConnorBaker
Copy link
Contributor Author

Result of nixpkgs-review pr 218035 run on x86_64-linux 1

2 packages marked as broken and skipped:
  • python310Packages.bambi
  • python310Packages.pywick
3 packages failed to build:
  • python310Packages.elegy
  • python310Packages.torchinfo
  • tts
20 packages built:
  • easyocr (python310Packages.easyocr)
  • khoj
  • larynx-train (python310Packages.larynx-train)
  • python310Packages.arviz
  • python310Packages.boxx
  • python310Packages.bpycv
  • python310Packages.clip
  • python310Packages.ffcv
  • python310Packages.grad-cam
  • python310Packages.pymc
  • python310Packages.pytorch-lightning
  • python310Packages.pytorch-metric-learning
  • python310Packages.pytorch-pfn-extras
  • python310Packages.sentence-transformers
  • python310Packages.tensorboardx
  • python310Packages.timm
  • python310Packages.torch-tb-profiler
  • python310Packages.torchvision
  • python310Packages.trainer
  • python310Packages.zcs

@ConnorBaker ConnorBaker force-pushed the fix/torchvision-set-cuda-compilers branch from fdec768 to a38527c Compare February 24, 2023 20:29
@ConnorBaker
Copy link
Contributor Author

Result of nixpkgs-review pr 218035 run on x86_64-linux 1

1 package marked as broken and skipped:
  • python310Packages.pywick
2 packages failed to build:
  • python310Packages.elegy
  • python310Packages.torchinfo
21 packages built:
  • easyocr (python310Packages.easyocr)
  • khoj
  • larynx-train (python310Packages.larynx-train)
  • python310Packages.arviz
  • python310Packages.boxx
  • python310Packages.bpycv
  • python310Packages.clip
  • python310Packages.ffcv
  • python310Packages.grad-cam
  • python310Packages.pymc
  • python310Packages.pytorch-lightning
  • python310Packages.pytorch-metric-learning
  • python310Packages.pytorch-pfn-extras
  • python310Packages.sentence-transformers
  • python310Packages.tensorboardx
  • python310Packages.timm
  • python310Packages.torch-tb-profiler
  • python310Packages.torchvision
  • python310Packages.trainer
  • python310Packages.zcs
  • tts

@ConnorBaker
Copy link
Contributor Author

ConnorBaker commented Feb 24, 2023

Failed derivations

@ConnorBaker ConnorBaker force-pushed the fix/torchvision-set-cuda-compilers branch from a38527c to d397194 Compare February 25, 2023 02:34
@ConnorBaker
Copy link
Contributor Author

Running again after rebasing on master

@ConnorBaker
Copy link
Contributor Author

Result of nixpkgs-review pr 218035 run on x86_64-linux 1

1 package marked as broken and skipped:
  • python310Packages.pywick
2 packages failed to build:
  • python310Packages.elegy
  • python310Packages.torchinfo
21 packages built:
  • easyocr (python310Packages.easyocr)
  • khoj
  • larynx-train (python310Packages.larynx-train)
  • python310Packages.arviz
  • python310Packages.boxx
  • python310Packages.bpycv
  • python310Packages.clip
  • python310Packages.ffcv
  • python310Packages.grad-cam
  • python310Packages.pymc
  • python310Packages.pytorch-lightning
  • python310Packages.pytorch-metric-learning
  • python310Packages.pytorch-pfn-extras
  • python310Packages.sentence-transformers
  • python310Packages.tensorboardx
  • python310Packages.timm
  • python310Packages.torch-tb-profiler
  • python310Packages.torchvision
  • python310Packages.trainer
  • python310Packages.zcs
  • tts

@ConnorBaker
Copy link
Contributor Author

ConnorBaker commented Feb 25, 2023

Failed derivations

@ConnorBaker ConnorBaker marked this pull request as ready for review February 25, 2023 05:45
@ConnorBaker
Copy link
Contributor Author

cc @NixOS/cuda-maintainers

@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Feb 25, 2023
Copy link
Contributor

@SomeoneSerge SomeoneSerge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Builds and runs (some random snippet)

@FRidh
Copy link
Member

FRidh commented Feb 28, 2023

For wrapping the compiler, we could probably do something like

  gcc11Stdenv = overrideCC gccStdenv buildPackages.gcc11;

and use wrapCC on top of the cuda compiler. That way we could get a cudaStdenv.

@samuela
Copy link
Member

samuela commented Feb 28, 2023

Actually, I think we should just upgrade the cudatoolkit_12 gcc to gcc12 to match stdenv: https://cs.github.com/NixOS/nixpkgs/blob/4a6cd14f37b22c9d2f90e3774b8ea9601025ad6b/pkgs/development/compilers/cudatoolkit/versions.toml#L79

It was only ever held back to gcc11 because stdenv was stuck on gcc11. now that stdenv is on gcc12 IIUC that should resolve these issues?

@ConnorBaker ConnorBaker self-assigned this Mar 8, 2023
@ConnorBaker
Copy link
Contributor Author

ConnorBaker commented Mar 9, 2023

Actually, I think we should just upgrade the cudatoolkit_12 gcc to gcc12 to match stdenv: https://cs.github.com/NixOS/nixpkgs/blob/4a6cd14f37b22c9d2f90e3774b8ea9601025ad6b/pkgs/development/compilers/cudatoolkit/versions.toml#L79

It was only ever held back to gcc11 because stdenv was stuck on gcc11. now that stdenv is on gcc12 IIUC that should resolve these issues?

I'm in support of that upgrade, though it might be problematic because some packages (cough torch cough) have explicit requirements for CUDA versions:

assert !cudaSupport || (let majorIs = lib.versions.major cudatoolkit.version;
in majorIs == "9" || majorIs == "10" || majorIs == "11");

On a different note, is it reasonable to have asserts like that in packages? I don't know how one would be able to circumvent that if they wanted to build against HEAD (which mostly supports CUDA 12) by overriding the src attribute. (That's excluding the fact the attribute set is recursive, which is perhaps why I've been unsuccessful so far in trying to do exactly that and had to write my own derivation.)

@samuela
Copy link
Member

samuela commented Mar 9, 2023

Do we know that pytorch doesn't support CUDA 12, or is that assert overly cautious? We can always pin pytorch to a CUDA 11.x version until they get up to speed.

@ConnorBaker
Copy link
Contributor Author

Do we know that pytorch doesn't support CUDA 12, or is that assert overly cautious? We can always pin pytorch to a CUDA 11.x version until they get up to speed.

iirc you can build with whatever, and a quick once-over of cpp_extension.py didn't reveal any specific matching on CUDA version to prohibit builds: https://github.com/pytorch/pytorch/blob/fe05266fda4f908130dea7cbac37e9264c0429a2/torch/utils/cpp_extension.py#L382-L420

However, they do have a tracking issue for CUDA 12 support: pytorch/pytorch#91122

In my experience, when built against HEAD it mostly just works, though there are a few outstanding issues.

@SomeoneSerge
Copy link
Contributor

I think it's safe to say that we should prefer setting meta.broken or showing warnings to just killing Nix eval. The particular piece of code is legacy

@samuela
Copy link
Member

samuela commented Mar 9, 2023

Ok, yeah in that case we should wait until pytorch/pytorch#91122 is resolved. Curious what this all has to do with the gcc upgrade though?

@ConnorBaker
Copy link
Contributor Author

Ok, yeah in that case we should wait until pytorch/pytorch#91122 is resolved. Curious what this all has to do with the gcc upgrade though?

That's my bad -- I misinterpreted your comment about cudaPackages_12 and GCC 12 (#218035 (comment)) to mean "we should make that the default." Sorry!

@ConnorBaker ConnorBaker force-pushed the fix/torchvision-set-cuda-compilers branch from d397194 to 16dbe78 Compare March 9, 2023 19:27
@ConnorBaker ConnorBaker requested a review from samuela March 9, 2023 19:29
@ConnorBaker ConnorBaker changed the title torchvision: fix compilers when using CUDA; migrate to cudaPackages torchvision: fix #217878; migrate to cudaPackages Mar 9, 2023
…ages

- when building with CUDA, we must make sure to use the same C/C++ compiler as NVCC to avoid symbol errors when linking
- move to cudaPackages to reduce closure size
- separate build-time and run-time CUDA dependencies
@ConnorBaker ConnorBaker force-pushed the fix/torchvision-set-cuda-compilers branch from 16dbe78 to ed00c38 Compare March 9, 2023 21:27
@samuela
Copy link
Member

samuela commented Mar 9, 2023

Result of nixpkgs-review pr 218035 run on x86_64-linux 1

1 package marked as broken and skipped:
  • python310Packages.pywick
2 packages failed to build:
  • python310Packages.elegy
  • python310Packages.torchinfo
23 packages built:
  • easyocr (python310Packages.easyocr)
  • khoj
  • larynx-train (python310Packages.larynx-train)
  • python310Packages.arviz
  • python310Packages.bambi
  • python310Packages.boxx
  • python310Packages.bpycv
  • python310Packages.clip
  • python310Packages.fastai
  • python310Packages.ffcv
  • python310Packages.grad-cam
  • python310Packages.pymc
  • python310Packages.pytorch-lightning
  • python310Packages.pytorch-metric-learning
  • python310Packages.pytorch-pfn-extras
  • python310Packages.sentence-transformers
  • python310Packages.tensorboardx
  • python310Packages.timm
  • python310Packages.torch-tb-profiler
  • python310Packages.torchvision
  • python310Packages.trainer
  • python310Packages.zcs
  • tts

@samuela
Copy link
Member

samuela commented Mar 9, 2023

Failures:

error: builder for '/nix/store/9hrvr6cfrjcm41829sanav6r1y2p5fhv-python3.10-torchinfo-1.64.drv' failed with exit code 1;
       last 10 log lines:
       >                 "Failed to run torchinfo. See above stack traces for more details. "
       >                 f"Executed layers up to: {executed_layers}"
       >             ) from e
       > E           RuntimeError: Failed to run torchinfo. See above stack traces for more details. Executed layers up to: []
       >
       > torchinfo/torchinfo.py:303: RuntimeError
       > =========================== short test summary info ============================
       > FAILED tests/exceptions_test.py::test_input_size_half_precision - RuntimeError: Failed to run torchinfo. See above stack traces for more deta...
       > ============ 1 failed, 55 passed, 6 skipped, 2 deselected in 7.04s =============
       > /nix/store/c3f4jdwzn8fm9lp72m91ffw524bakp6v-stdenv-linux/setup: line 1593: pop_var_context: head of shell_variables not a function context
       For full logs, run 'nix log /nix/store/9hrvr6cfrjcm41829sanav6r1y2p5fhv-python3.10-torchinfo-1.64.drv'.
error: builder for '/nix/store/n06x3d9zs61a2m0cw1nmmxsd42n5n93l-python3.10-treex-0.6.11.drv' failed with exit code 1;
       last 10 log lines:
       >   /build/source/tests/nn/test_sequential.py:38: FutureWarning: jax.tree_leaves is deprecated, and will be removed in a future release. Use jax.tree_util.tree_leaves instead.
       >     leaves = jax.tree_leaves(mlp)
       >
       > -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
       > =========================== short test summary info ============================
       > FAILED tests/test_treex.py::TestTreex::test_is_tree - assert False
       > FAILED tests/test_treex.py::TestTreex::test_list - AssertionError: assert False
       > FAILED tests/test_treex.py::TestTreex::test_treelist - assert False
       > ============ 3 failed, 147 passed, 101 warnings in 98.92s (0:01:38) ============
       > /nix/store/c3f4jdwzn8fm9lp72m91ffw524bakp6v-stdenv-linux/setup: line 1593: pop_var_context: head of shell_variables not a function context
       For full logs, run 'nix log /nix/store/n06x3d9zs61a2m0cw1nmmxsd42n5n93l-python3.10-treex-0.6.11.drv'.
error: 1 dependencies of derivation '/nix/store/gizrf302d6d0ips7ngigzk4wy6nmvrzi-python3.10-elegy-0.8.6.drv' failed to build
error: 2 dependencies of derivation '/nix/store/l9ljdzkw64025nwhjbc7382pinb4c5w3-review-shell.drv' failed to build

are not related.

@samuela samuela merged commit 508cf08 into NixOS:master Mar 9, 2023
@samuela
Copy link
Member

samuela commented Mar 9, 2023

LGTM thanks @ConnorBaker !

@SomeoneSerge
Copy link
Contributor

(We already had bumped cuda12's gcc in the -ccbin pr iirc)

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-45/26397/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Python 3.10 closure contains incompatible versions of CUDA and g++, causing build failures
5 participants