ucx: enableCuda in an overlay causes infinite recursion #239182

Closed
SomeoneSerge opened this issue Jun 22, 2023 · 8 comments
Labels
0.kind: bug Something is broken

Comments

@SomeoneSerge
Contributor

SomeoneSerge commented Jun 22, 2023

Describe the bug

A regression in the config.cudaSupport = true package set. Affected attributes:

  • blender
  • colmapWithCuda
  • cudaPackages.cuda_compat
  • cudaPackages.cutensor
  • cudaPackages.libcudla
  • python3Packages.jax
  • python3Packages.jaxlib
  • python3Packages.tensorflowWithCuda
  • python3Packages.torch
  • python3Packages.torchvision
  • tts

Cf. https://hercules-ci.com/github/SomeoneSerge/nixpkgs-cuda-ci/jobs/4832 for logs

Probably caused by the 11.7 -> 11.8 update

Notify maintainers

@NixOS/cuda-maintainers

@SomeoneSerge SomeoneSerge added the 0.kind: bug Something is broken label Jun 22, 2023
@ConnorBaker
Contributor

Will take a look shortly -- that's odd!

I was able to build blender just now, will try the others soon.

@ConnorBaker
Contributor

ConnorBaker commented Jun 22, 2023

Looks like cuda_compat is an ARM-only library:

"cuda_compat": {
  "name": "CUDA compat L4T",
  "license": "CUDA Toolkit",
  "version": "11.8.31339915",
  "linux-aarch64": {
    "relative_path": "cuda_compat/linux-aarch64/cuda_compat-linux-aarch64-11.8.31339915-archive.tar.xz",
    "sha256": "7aa1b62da35b52eaa13e254d1072aff10c907416604e5e5cc1ddcebbfe341dc7",
    "md5": "41cba7b241724ad04234dc3f20526525",
    "size": "15780868"
  }
},

It’s also worth noting that it’s new to 11.8, maybe to help support older CUDA code on the ARM-based Grace Hopper chip? That’s why we didn’t see this failure before.

@ConnorBaker
Contributor

ConnorBaker commented Jun 22, 2023

Will be updated as I continue to build them.

For packages which aren't available on x86_64-linux, should they be accessible through cudaPackages on that system?

  • blender
    • Built without errors: /nix/store/klxbl8jk0b2wabn8q52qxqaljaf2s45c-blender-3.5.1
  • colmapWithCuda
    • Built without errors: /nix/store/xdy04l10k6y4qyz025iz039x6s5w9mpd-colmap-3.7
  • cudaPackages.cuda_compat
    • Only available on aarch64-linux

      "cuda_compat": {
        "name": "CUDA compat L4T",
        "license": "CUDA Toolkit",
        "version": "11.8.31339915",
        "linux-aarch64": {
          "relative_path": "cuda_compat/linux-aarch64/cuda_compat-linux-aarch64-11.8.31339915-archive.tar.xz",
          "sha256": "7aa1b62da35b52eaa13e254d1072aff10c907416604e5e5cc1ddcebbfe341dc7",
          "md5": "41cba7b241724ad04234dc3f20526525",
          "size": "15780868"
        }
      },

  • cudaPackages.cutensor
    • Built without error: /nix/store/616xl2ldszxnlampg2jc8vxv0lm7ba6q-cudatoolkit-11.8-cutensor-1.5.0.3
  • cudaPackages.libcudla
    • Only available on aarch64-linux

      "libcudla": {
        "name": "cuDLA",
        "license": "CUDA Toolkit",
        "version": "11.8.86",
        "linux-aarch64": {
          "relative_path": "libcudla/linux-aarch64/libcudla-linux-aarch64-11.8.86-archive.tar.xz",
          "sha256": "2fedefe9ebd567767e0079e168155f643100b7bf4ff6331c14f791290c932614",
          "md5": "14b0a2506fa1377d54b5fefe3acf5420",
          "size": "65508"
        }
      },

  • python3Packages.jax
    • Built without errors: /nix/store/spklzl860idj6awrsfdzznk3490afrbr-python3.10-jax-0.4.5
  • python3Packages.jaxlib
    • Built without errors: /nix/store/ym0sk71xlkhi0l3c7yqwj49ykwz5v7fc-python3.10-jaxlib-0.4.4
  • python3Packages.tensorflowWithCuda
    • Built without errors: /nix/store/zm00wmzzwazi5hvqhd357fpr9vi5f3jr-python3.10-tensorflow-gpu-2.11.1
  • python3Packages.torch
    • Perhaps it was built without using the static version of magma?
      • I do see that the dynamic version is too large (over 2 GB) to be linked with the default code model (which should be small).
      • I'm trying a separate build of dynamic Magma with "-DCMAKE_C_FLAGS=-mcmodel=medium" "-DCMAKE_CXX_FLAGS=-mcmodel=medium" to see whether that allows it to build.
  • python3Packages.torchvision
  • tts
    • Built without errors: /nix/store/29jmgl1rgmfcxmcfwgpfpzdq8kx0f38c-tts-0.14.0
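
For the python3Packages.torch item above, the Magma experiment could be expressed roughly as follows. This is only a sketch: it assumes the package is reachable as pkgs.magma and that appending to cmakeFlags is enough to pass the flags through; whether the medium code model actually lets torch link is exactly what is still being tested.

```nix
# Sketch: rebuild magma with the medium code model.
# Assumption: `magma` is the nixpkgs attribute and is built with CMake.
let
  pkgs = import <nixpkgs> {
    config = {
      allowUnfree = true;
      cudaSupport = true;
    };
  };
in
  pkgs.magma.overrideAttrs (old: {
    # Append the two flags quoted in the list item; keep any existing flags.
    cmakeFlags = (old.cmakeFlags or [ ]) ++ [
      "-DCMAKE_C_FLAGS=-mcmodel=medium"
      "-DCMAKE_CXX_FLAGS=-mcmodel=medium"
    ];
  })
```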

Flake reproducer (to be used with nix flake check):

{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/e5e5c5e2035f07dd73d0da1afe16a3ee22e35d6e";

  nixConfig = {
    extra-substituters = [
      "https://cantcache.me"
      "https://cuda-maintainers.cachix.org"
    ];
    extra-trusted-substituters = [
      "https://cantcache.me"
      "https://cuda-maintainers.cachix.org"
    ];
    extra-trusted-public-keys = [
      "cantcache.me:Y+FHAKfx7S0pBkBMKpNMQtGKpILAfhmqUSnr5oNwNMs="
      "cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
    ];
  };

  outputs = inputs: let
    system = "x86_64-linux";
    config = {
      allowUnfree = true;
      cudaSupport = true;
    };
    pkgs = import inputs.nixpkgs {inherit system config;};
  in {
    checks.${system} = inputs.self.packages.${system};
    packages.${system} = {
      inherit
        (pkgs)
        blender
        colmapWithCuda
        tts
        ;

      # Both cuda_compat and libcudla are only available on `aarch64-linux`.
      inherit
        (pkgs.cudaPackages)
        cutensor
        ;

      inherit
        (pkgs.python3Packages)
        jax
        jaxlib
        tensorflowWithCuda
        torch
        torchvision
        ;
    };
    formatter.${system} = pkgs.alejandra;
  };
}
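
If the reproducer were extended to aarch64-linux, the ARM-only attributes could be gated on the system. The following is a sketch under that assumption, not something that was run for this issue: packagesFor and systems are names made up for illustration, and whether cutensor, cuda_compat and libcudla all build on aarch64-linux at this revision is not verified here.

```nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/e5e5c5e2035f07dd73d0da1afe16a3ee22e35d6e";

  outputs = inputs: let
    inherit (inputs.nixpkgs) lib;

    # Hypothetical list of systems to cover.
    systems = ["x86_64-linux" "aarch64-linux"];

    # Hypothetical helper: the CUDA package set for one system.
    packagesFor = system: let
      pkgs = import inputs.nixpkgs {
        inherit system;
        config = {
          allowUnfree = true;
          cudaSupport = true;
        };
      };
    in
      {
        inherit (pkgs.cudaPackages) cutensor;
      }
      # Only expose the ARM-only redist packages where they exist.
      // lib.optionalAttrs (system == "aarch64-linux") {
        inherit (pkgs.cudaPackages) cuda_compat libcudla;
      };
  in {
    packages = lib.genAttrs systems packagesFor;
  };
}
```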

@ConnorBaker
Contributor

@SomeoneSerge I think I figured out the PyTorch failure. Not sure about the others/unable to reproduce.

I suspect that your CI managed to grab a commit in-between when CUDA 11.8 was made the default (which broke dynamically linked Magma due to the binary size increase) and when I made Magma default to static builds for CUDA.

@SomeoneSerge
Contributor Author

Ok, it seems that the original issue does not belong in nixpkgs, but I don't know how to move issues between repos.
What is a nixpkgs issue is that (import <nixpkgs> { overlays = [ (final: prev: { ucx = prev.ucx.override { enableCuda = true; }; }) ]; }).ucx fails with an infinite recursion, because cudatoolkit now depends on ucx:

] ++ lib.optionals (lib.versionAtLeast version "11.8") [
  (lib.getLib libtiff)
  qt6Packages.qtwayland
  rdma-core
  ucx

This is exactly what happens at
https://github.com/SomeoneSerge/nixpkgs-cuda-ci/blob/8b461ff67bae99323d1b8913f89f387aa92bc020/nix/overlays.nix#L32-L34
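
The shape of the cycle can be shown without nixpkgs at all. The sketch below is a toy model: ucx and cudatoolkit are plain attribute sets standing in for the real derivations, and fix and closure are illustrative names, not nixpkgs internals.

```nix
# Toy model of the cycle; nothing here touches the real packages.
let
  # Minimal fixed point, the same idea used to tie a package set together.
  fix = f: let self = f self; in self;

  pkgs = fix (self: {
    # With enableCuda = true, ucx needs the CUDA toolkit.
    ucx = {
      name = "ucx";
      closure = [ "ucx" ] ++ self.cudatoolkit.closure;
    };

    # From 11.8 on, the toolkit in turn lists ucx among its inputs.
    cudatoolkit = {
      name = "cudatoolkit";
      closure = [ "cudatoolkit" ] ++ self.ucx.closure;
    };
  });
in
  # Forcing either closure has to force the other one first.
  pkgs.ucx.closure
```

Evaluating this with nix-instantiate --eval is expected to abort with an infinite recursion error, while pkgs.ucx.name alone evaluates fine, since Nix only fails once the cyclic value is actually forced.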

@SomeoneSerge SomeoneSerge changed the title blender, colmapWithCuda, cudaPackages: evaluation error, infinite recursion ucx: enableCuda in an overlay causes infinite recursion Jun 26, 2023
@SomeoneSerge
Contributor Author

Temporary work-around is to update the overlays like so:

        cudaPackages = prev.cudaPackages.overrideScope' (fin: pre: {
          cudatoolkit = pre.cudatoolkit.override { ucx = final.ucx.override { enableCuda = false; }; };
        });


        ucx = prev.ucx.override {
          enableCuda = true;
        };

Mid-term, we can update ucx to use the redist packages.
The longer-term solution is to deprecate cudaPackages.cudatoolkit, since it's a bootstrap nightmare.
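
For reference, the workaround above placed in a complete expression might look like the sketch below. It assumes a nixpkgs revision where cudatoolkit takes a ucx argument and overrideScope' is still available; the config values mirror the reproducer in this thread.

```nix
# Sketch: the two overrides from the workaround inside one overlay.
import <nixpkgs> {
  config = {
    allowUnfree = true;
    cudaSupport = true;
  };
  overlays = [
    (final: prev: {
      # Hand cudatoolkit a ucx built without CUDA, which breaks the cycle.
      cudaPackages = prev.cudaPackages.overrideScope' (fin: pre: {
        cudatoolkit = pre.cudatoolkit.override {
          ucx = final.ucx.override { enableCuda = false; };
        };
      });

      # Everything else sees the CUDA-enabled ucx.
      ucx = prev.ucx.override {
        enableCuda = true;
      };
    })
  ];
}
```

The dependency chain then runs from the CUDA-enabled ucx to cudatoolkit to a plain ucx, so it terminates instead of looping back.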

@SomeoneSerge
Contributor Author

@ConnorBaker thank you a lot for all the tests you've run!

@SomeoneSerge SomeoneSerge moved this from New to 🏗 In progress in CUDA Team Jul 20, 2023
@SomeoneSerge
Contributor Author

This must've been addressed by #271078 or #272063

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in CUDA Team Dec 10, 2023