ucx: enableCuda in an overlay causes infinite recursion #239182

SomeoneSerge · 2023-06-22T11:59:39Z

Describe the bug

A regression in the config.cudaSupport = true package set. Affected attributes:

0: "blender"
1: "colmapWithCuda"
2: "cudaPackages.cuda_compat"
3: "cudaPackages.cutensor"
4: "cudaPackages.libcudla"
5: "python3Packages.jax"
6: "python3Packages.jaxlib"
7: "python3Packages.tensorflowWithCuda"
8: "python3Packages.torch"
9: "python3Packages.torchvision"
10: "tts"

Cf. https://hercules-ci.com/github/SomeoneSerge/nixpkgs-cuda-ci/jobs/4832 for logs

Probably caused by the 11.7 -> 11.8 update

Notify maintainers

@NixOS/cuda-maintainers


ConnorBaker · 2023-06-22T13:02:57Z

Will take a look shortly -- that's odd!

I was able to build blender just now, will try the others soon.

ConnorBaker · 2023-06-22T13:27:39Z

Looks like cuda_compat is an ARM only library:

nixpkgs/pkgs/development/compilers/cudatoolkit/redist/manifests/redistrib_11.8.0.json

Lines 38 to 48 in 437d2a8

    
    "cuda_compat": {
        "name": "CUDA compat L4T",
        "license": "CUDA Toolkit",
        "version": "11.8.31339915",
        "linux-aarch64": {
            "relative_path": "cuda_compat/linux-aarch64/cuda_compat-linux-aarch64-11.8.31339915-archive.tar.xz",
            "sha256": "7aa1b62da35b52eaa13e254d1072aff10c907416604e5e5cc1ddcebbfe341dc7",
            "md5": "41cba7b241724ad04234dc3f20526525",
            "size": "15780868"
        }
    },

It’s also worth noting that it’s new to 11.8, maybe to help support older CUDA code on the ARM-based Grace Hopper chip? That’s why we didn’t see this failure before.

ConnorBaker · 2023-06-22T14:00:06Z

Will be updated as I continue to build them.

For packages which aren't available on x86_64-linux, should they be accessible through cudaPackages on that system?

Flake reproducer (to be used with nix flake check):

{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/e5e5c5e2035f07dd73d0da1afe16a3ee22e35d6e";

  nixConfig = {
    extra-substituters = [
      "https://cantcache.me"
      "https://cuda-maintainers.cachix.org"
    ];
    extra-trusted-substituters = [
      "https://cantcache.me"
      "https://cuda-maintainers.cachix.org"
    ];
    extra-trusted-public-keys = [
      "cantcache.me:Y+FHAKfx7S0pBkBMKpNMQtGKpILAfhmqUSnr5oNwNMs="
      "cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
    ];
  };

  outputs = inputs: let
    system = "x86_64-linux";
    config = {
      allowUnfree = true;
      cudaSupport = true;
    };
    pkgs = import inputs.nixpkgs {inherit system config;};
  in {
    checks.${system} = inputs.self.packages.${system};
    packages.${system} = {
      inherit
        (pkgs)
        blender
        colmapWithCuda
        tts
        ;

      # Both cuda_compat and libcudla are only available on `aarch64-linux`.
      inherit
        (pkgs.cudaPackages)
        cutensor
        ;

      inherit
        (pkgs.python3Packages)
        jax
        jaxlib
        tensorflowWithCuda
        torch
        torchvision
        ;
    };
    formatter.${system} = pkgs.alejandra;
  };
}

ConnorBaker · 2023-06-22T18:48:20Z

@SomeoneSerge I think I figured out the PyTorch failure. Not sure about the others/unable to reproduce.

I suspect that your CI managed to grab a commit in-between when CUDA 11.8 was made the default (which broke dynamically linked Magma due to the binary size increase) and when I made Magma default to static builds for CUDA.

SomeoneSerge · 2023-06-26T21:15:21Z

Ok, it seems that the original issue does not belong in nixpkgs, but I don't know how to move issues between repos.
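What is a nixpkgs issue is that (import <nixpkgs> { overlays = [ (final: prev: { ucx = prev.ucx.override { enableCuda = true; }; }) ]; }).ucx fails with infinite recursion: ucx built with CUDA depends on cudatoolkit, and since 11.8 cudatoolkit in turn depends on ucx: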

nixpkgs/pkgs/development/compilers/cudatoolkit/common.nix

Lines 127 to 131 in d409d42

    
    ] ++ lib.optionals (lib.versionAtLeast version "11.8") [
      (lib.getLib libtiff)
      qt6Packages.qtwayland
      rdma-core
      ucx

This is exactly what happens at
https://github.com/SomeoneSerge/nixpkgs-cuda-ci/blob/8b461ff67bae99323d1b8913f89f387aa92bc020/nix/overlays.nix#L32-L34

SomeoneSerge · 2023-06-26T21:59:36Z

Temporary work-around is to update the overlays like so:

        cudaPackages = prev.cudaPackages.overrideScope' (fin: pre: {
          cudatoolkit = pre.cudatoolkit.override { ucx = final.ucx.override { enableCuda = false; }; };
        });


        ucx = prev.ucx.override {
          enableCuda = true;
        };

Mid-term, we can update ucx to use the redist packages.
The longer-term solution is to deprecate cudaPackages.cudatoolkit, since it's a bootstrap nightmare.
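
For anyone who wants to see the cycle in isolation, here is a rough toy model that needs no nixpkgs checkout. It is only a sketch: plain strings stand in for derivations, and every name in it (mk, packageSet, ucxNoCuda, ucxCuda, patchToolkit) is made up for illustration and does not exist in nixpkgs. It assumes, as described above, that the CUDA-enabled ucx takes cudatoolkit as an input and that cudatoolkit >= 11.8 takes ucx as an input.

```nix
# Toy model only: strings stand in for derivations, and all names here
# (mk, packageSet, ucxNoCuda, ucxCuda, patchToolkit) are hypothetical.
let
  # A fake "derivation" whose value embeds its inputs, so evaluating it
  # forces every input, much like instantiating a real derivation does.
  mk = name: inputs: name + "(" + builtins.concatStringsSep "," inputs + ")";

  packageSet = { ucxWithCuda, patchToolkit }:
    let
      self = {
        ucxNoCuda = mk "ucx" [ ];
        ucxCuda = mk "ucx-cuda" [ self.cudatoolkit ];
        # The overlay from this issue flips the top-level ucx to the CUDA build.
        ucx = if ucxWithCuda then self.ucxCuda else self.ucxNoCuda;
        # cudatoolkit >= 11.8 takes ucx as an input; the work-around hands it
        # a CUDA-less ucx instead of the top-level one.
        cudatoolkit = mk "cudatoolkit"
          [ (if patchToolkit then self.ucxNoCuda else self.ucx) ];
      };
    in
    self;
in
{
  # No overlay: evaluates to "cudatoolkit(ucx())".
  plain = (packageSet { ucxWithCuda = false; patchToolkit = false; }).cudatoolkit;

  # Overlay plus work-around: evaluates to "ucx-cuda(cudatoolkit(ucx()))".
  fixed = (packageSet { ucxWithCuda = true; patchToolkit = true; }).ucx;

  # Overlay alone: uncommenting this makes evaluation abort with
  # "infinite recursion encountered", because ucx -> cudatoolkit -> ucx.
  # broken = (packageSet { ucxWithCuda = true; patchToolkit = false; }).ucx;
}
```

Saved under any file name (say toy.nix), this can be evaluated with nix-instantiate --eval --strict toy.nix. The patchToolkit switch plays the role of the cudatoolkit.override in the work-around: the toolkit stops looking at the top-level ucx, so the CUDA-enabled ucx can depend on it without closing the loop.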

SomeoneSerge · 2023-06-26T22:00:19Z

@ConnorBaker thank you a lot for all the tests you've run!

SomeoneSerge · 2023-12-10T01:30:02Z

This must've been addressed by #271078 or #272063

SomeoneSerge added the 0.kind: bug Something is broken label Jun 22, 2023

SomeoneSerge added this to CUDA Team Jun 22, 2023

github-project-automation bot moved this to New in CUDA Team Jun 22, 2023

SomeoneSerge changed the title ~~blender, colmapWithCuda, cudaPackages: evaluation error, infinite recursion~~ ucx: enableCuda in an overlay causes infinite recursion Jun 26, 2023

SomeoneSerge mentioned this issue Jul 20, 2023

Respect global config.cudaSupport #224068

Merged

12 tasks

SomeoneSerge moved this from New to 🏗 In progress in CUDA Team Jul 20, 2023

SomeoneSerge closed this as completed Dec 10, 2023

github-project-automation bot moved this from 🏗 In progress to ✅ Done in CUDA Team Dec 10, 2023
