Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport release-23.11] cuda-modules #272784

Conversation

ConnorBaker
Copy link
Contributor

@ConnorBaker ConnorBaker commented Dec 7, 2023

Important

This PR must be rebased to target staging-23.11 after the following PRs are merged:

Description of changes

Things done

Backport of #256324.

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@ConnorBaker ConnorBaker added the 6.topic: cuda Parallel computing platform and API label Dec 7, 2023
@ConnorBaker ConnorBaker self-assigned this Dec 7, 2023
@github-actions github-actions bot added 6.topic: python 8.has: documentation This PR adds or changes documentation labels Dec 7, 2023
@ofborg ofborg bot added 8.has: clean-up 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 101-500 labels Dec 8, 2023
@ConnorBaker ConnorBaker changed the title Backport 256324 to release 23.11 [Backport release-23.11] #256324 Dec 12, 2023
@ConnorBaker ConnorBaker marked this pull request as draft December 12, 2023 14:27
Connor Baker added 7 commits December 12, 2023 14:35
cudaPackages.cuda_compat: ignore missing libs provided at runtime

cudaPackages.gpus: Jetson should never build by default

cudaPackages.flags: don't build Jetson capabilities by default

cudaPackages: re-introduce filter for pre-existing CUDA redist packages in overrides

cudaPackages: only recurseIntoAttrs for the latest of each major version

cudaPackages.nvccCompatabilities: use GCC 10 through CUDA 11.5 to avoid a GLIBC incompatability

cudaPackages.cutensor: acquire libcublas through cudatoolkit prior to 11.4

cudaPackages.cuda_compat: mark as broken on aarch64-linux if not targeting Jetson

cudaPackages.cutensor_1_4: fix build

cudaPackages: adjust use of autoPatchelfIgnoreMissingDeps

cudaPackages.cuda_nvprof: remove unecessary override to add addOpenGLRunpath

cudaPackages: use getExe' to avoid patchelf warning about missing meta.mainProgram

cudaPackages: fix evaluation with Nix 2.3

cudaPackages: fix platform detection for Jetson/non-Jetson aarch64-linux

python3Packages.tensorrt: mark as broken if required packages are missing

Note: evaluating the name of the derivation will fail if tensorrt is not present,
which is why we wrap the value in `lib.optionalString`.

cudaPackages.flags.getNixSystem: add guard based on jetsonTargets

cudaPackages.cudnn: use explicit path to patchelf

cudaPackages.tensorrt: use explicit path to patchelf

(cherry picked from commit 8e800ce)
(cherry picked from commit bfaefd0)
(cherry picked from commit 0a7dacf)
@ConnorBaker ConnorBaker changed the title [Backport release-23.11] #256324 [Backport release-23.11] cuda-modules Dec 12, 2023
After testing on a Jetson device, it turns out `cuda_compat` requires libnvdla_runtime.so which can't be satisfied by autoPatchElf, as it is provided by the runtime driver. This commit simply adds this library to the list of dependency to be ignored by autoPatchElf.

(cherry picked from commit a3ac436)
@ConnorBaker ConnorBaker force-pushed the backport-256324-to-release-23.11 branch from c5a131e to d883056 Compare December 12, 2023 14:40
@RaitoBezarius
Copy link
Member

I'm not sure I understand whether this fits the backport policy, what is the rationale here?

@SomeoneSerge
Copy link
Contributor

SomeoneSerge commented Dec 12, 2023

Before this gets into a release, the following need to be addressed:

(cherry picked from commit a39043f)
Some nvidia devices, such as the Jetson family, support the Nvidia compatibility package (nvidia_compat) which allows to run executables built against a higher CUDA major version on a system with an older CUDA driver. On such platforms, the consensus among CUDA maintainers is that there is no downside in always enabling it by default.

This commit links to the relevant cuda_compat shared libraries by patching the CUDA core packages' runpaths when cuda_compat is available, in the same way as we do for OpenGL drivers currently.

(cherry picked from commit d6c198a)
@jonringer
Copy link
Contributor

I'm not sure I understand whether this fits the backport policy, what is the rationale here?

It does if there's no regressions. A lot of this work is about improving Nixpkgs + Cuda support, which can be viewed as an additive feature. It should be tested well to ensure that it doesn't introduce any new failures.

@jonringer
Copy link
Contributor

cc @NixOS/cuda-maintainers , who would have a good idea of potential regressions

@ConnorBaker
Copy link
Contributor Author

@jonringer excellent point about regression tests -- doubly so for runtime behavior. We don't have anything in the way of tracking performance over time... we're still working on making sure we have enough builders to keep our cache somewhat populated, let alone automated testing or regression detection.

The only runtime tests I have are manual: cudaPackages.saxpy, which is really just meant to check that the driver works, and nix-cuda-test (https://github.com/ConnorBaker/nix-cuda-test), which has a dinky little ViT model.

@graham33 would you happen to have anything you'd be able to use for regression testing on the 23.11 branch? You're one of the handful of people I know using Nix+CUDA and I'd like to make sure this doesn't cause fires or headaches :)

@gbpdt
Copy link
Contributor

gbpdt commented Dec 13, 2023

@graham33 would you happen to have anything you'd be able to use for regression testing on the 23.11 branch? You're one of the handful of people I know using Nix+CUDA and I'd like to make sure this doesn't cause fires or headaches :)

(Switching to my other Github ID)

We've started our internal upgrade process to adopt 23.11, which would at least allow us to do some basic testing. We could definitely try to build against your branch and make sure we can get some basic CUDA validation working.

This resolves crashes in nsys-ui

(cherry picked from commit a5b8caa)
@SomeoneSerge
Copy link
Contributor

I mentioned this in the matrix chat, but I'll reiterate here: I don't see a particular rush for backporting these changes to 23.11, unless there's a specific user/customer interested in the new functionality (meaning mostly jetsons) but otherwise limited to using the release branches. I haven't any objections either

@IlyaNiklyaev
Copy link
Contributor

IlyaNiklyaev commented Feb 26, 2024

Hi 👋
Our team is interested in this backport to land stable channel 🙂 Yes, we target both jetsons and x86 and we're not ready to switch to unstable.
At the moment we have this PR as a patch over nixos-23.11, and it works just fine as we can see (for both x86 and jetson). The only issue is cross-build (buildPlatform x86_64, hostPlatform - aarch64, both linux) : I guess we need some additional fixes from master, right now I'm trying to figure out the right set of commits.

to @ConnorBaker we can definitely help with testing if you have some specific tests to check before this PR can be merged.

@SomeoneSerge
Copy link
Contributor

Hi @IlyaNiklyaev,

I think backporting new features and breaking interface changes is far from the optimal way to go here:

  • it's two months until the next release;
  • the PR had (as anticipated) caused a bit of fall out on the unstable which we've been slowly addressing, that it won't be clear what else to backport;
  • cross-compilation support actually requires more work anyway (I think @ConnorBaker is iterating on that rn in https://github.com/ConnorBaker/cuda-modules).

I too would like to see a model more like rolling releases with merge trains, but right now it's easier to accept that "new features" take up to 6 months to "stabilize". Wdyt?

@IlyaNiklyaev
Copy link
Contributor

IlyaNiklyaev commented Mar 7, 2024

Hi @SomeoneSerge,
I think we can't wait until the next release, but I totally understand that there is a lot of moving around this functionality in the unstable which prevents us from just backporting all of it. So I'll continue to fix cross-build in our current state (stable 23.11 + the patch from this PR). I think a couple of our fixes would be useful in unstable too :)

@SomeoneSerge
Copy link
Contributor

@IlyaNiklyaev I wonder, do you actually "patch" anything, or is it enough to use overlays? E.g.

inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";
inptus.nixpkgs-unstable.url = "...";
outputs = { nixpkgs, nixpkgs-unstable, ... }: {
  packages.x86_64-linux = let pkgs = import nixpkgs { ...; overlays = [(final: prev: {
    cudaPackages = final.callPackage "${nixpkgs-unstable}/pkgs/top-level/cuda-packages.nix" { };
    cudaPackages_XX_Y = ...;
    # ...
  })];
};

...might suffice to use the current cudaPackages expressions with the older nixpkgs release

@IlyaNiklyaev
Copy link
Contributor

IlyaNiklyaev commented Mar 11, 2024

@SomeoneSerge hmm I even haven't tried that. I guess this is a tree-wide change, so it would be much more difficult to apply it as an overlay?
For example, I just tried what you suggested, and got this

evaluation aborted with the following error message: 'Function called without required argument "addDriverRunpath" at /nix/store/zafmv50n5r12j0v3cpazsmj2s7v8f5wc-source/pkgs/development/cuda-modules/cuda/overrides.nix:1'

So nixos-23.11 is not compatible with this cfa0701. Ofc I could bisect and try to find some point in unstable's history that will work but it's just much simpler to take this PR as a patch for now.

@SomeoneSerge
Copy link
Contributor

I believe that most of the changes were actually contained to the cudaPackages* attributes and to the respective file paths... OOC, I tested cudaPackages.saxpy here and it worked: https://github.com/SomeoneSerge/nixos-2311-unstable-cuda/blob/master/mk-overlay.nix

@IlyaNiklyaev
Copy link
Contributor

@SomeoneSerge wow that really works :) Thanks, I'll try to follow this approach. It doesn't help much with cross-build though but it's not directly related to this backport I suppose.

@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Mar 20, 2024
@wegank
Copy link
Member

wegank commented Jun 30, 2024

As NixOS 23.11 reaches end-of-life today, this PR is closed as part of the EOL cleanup.

@wegank wegank closed this Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.status: merge conflict This PR has merge conflicts with the target branch 6.topic: cuda Parallel computing platform and API 6.topic: python 8.has: clean-up 8.has: documentation This PR adds or changes documentation 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 101-500
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

9 participants