
rPackages.torch: Fix torch #273536

Draft · wants to merge 1 commit into base: r-updates

Conversation

@PhDyellow (Author)

Description of changes

The latest commit on NixOS r-updates allows the R package torch to install. However, calling library(torch) causes the package to attempt to download and install binaries if it does not find them in $TORCH_PATH. This PR allows torch to actually run: I have been using torch with an RTX 3070 laptop card and it is working well.

torch requires a range of CUDA-related packages and libtorch. Python is not needed; liblantern is a wrapper around libtorch that interfaces with R.

Ideally, libtorch would be built from source. However, I struggled to get libtorch and liblantern to compile; I did succeed in getting the binaries to download, install, and run correctly. I used libtorch-bin from nixpkgs as an example.
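
For readers unfamiliar with the libtorch-bin pattern, here is a minimal, hedged sketch of what such a prebuilt-binary derivation can look like in Nix. Every name, the version, the URL, and the hash are illustrative placeholders, not the actual code in this PR; autoPatchelfHook rewrites the downloaded shared objects so they resolve their dependencies from the Nix store.

{ stdenv, fetchzip, autoPatchelfHook, cudaPackages }:

stdenv.mkDerivation {
  # Hypothetical sketch, not the PR's actual code.
  pname = "liblantern-bin";
  version = "0.0.0";                                  # placeholder version
  src = fetchzip {
    url = "https://example.org/liblantern-cuda.zip";  # placeholder URL
    hash = "";                                        # fill in after the first fetch fails, TOFU style
  };
  nativeBuildInputs = [ autoPatchelfHook ];
  # Libraries the prebuilt .so files are expected to link against.
  buildInputs = [ cudaPackages.cudatoolkit stdenv.cc.cc.lib ];
  installPhase = ''
    mkdir -p $out/lib
    cp -r lib/* $out/lib/
  '';
}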

The PR is a draft. I don't like that it pulls binaries, or how the code is organised. r-modules/default.nix doesn't seem like the right place to put all the torch overrides. I also have very little free time for the next few months, so I don't think I can figure out how to make it fit nixpkgs better without assistance.

@jbedo

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

Torch requires a number of CUDA-related packages.

Using the binary supplied by R torch.

Precedent is in libtorch-bin in nixpkgs.
@ofborg bot added labels 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin) and 10.rebuild-linux: 0 (This PR does not cause any packages to rebuild on Linux) — Dec 11, 2023
@PhDyellow (Author)

I picked the wrong time to put this PR up for review.

When I tested this PR rebased onto master, it didn't run: torch checks for CUDA 11.7.0, and master now packages 11.7.1.

Looks like CUDA is getting a major overhaul in nixpkgs right now as well. https://discourse.nixos.org/t/cuda-team-roadmap-and-call-for-sponsors/29495.

Building from source may be easier once CUDA has been reworked.

@wegank added the 2.status: merge conflict label (This PR has merge conflicts with the target branch) — Apr 5, 2024
@wegank added the 2.status: stale label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) — Jul 4, 2024
@jbedo (Contributor) commented Jul 23, 2024

I had a go at compiling liblantern but encountered torch version compatibility issues. I guess we could use the binary torch and compile lantern, but this doesn't seem to improve things much. I'm not a CUDA expert, so I'm unsure how to improve this.

@stale bot removed the 2.status: stale label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) — Jul 23, 2024
@PhDyellow (Author)

Thanks for taking a look.

Ideally, compiling from source would be an option, but I couldn't figure it out, and I spent a good amount of time trying to debug it. Maybe someone from the CUDA team could help though.

Worst case scenario, we just keep downloading the binaries.

If #328980 just works with the GPU, it would make the Nix code a lot cleaner, as we would just download one binary.
However, I won't assume it does work with the GPU until someone tests it.

@b-rodrigues (Contributor) commented Jul 25, 2024

So I don't have CUDA on my system, and I tried the binary of the GPU version. torch::cuda_memory_stats() returns the message cuda not available!.

Trying the same binary file on a non-Nix installation of R (installed through openSUSE's package manager, still with no CUDA installed), I do get a list of memory statistics from torch::cuda_memory_stats(). As the documentation explains, the pre-compiled binary bundles everything, so there is no need to install CUDA. However, a nix-shell doesn't "see" my GPU, so we would need this PR as a base to get it working with the GPU.

Or we just provide the CPU version and let users deal with the GPU version, because it requires quite a lot of dependencies and is also hardware-specific. If you don't have an NVIDIA GPU, you can't use the GPU version anyway, but you can still take advantage of the CPU version... Or we make two packages, a torch-cpu and a torch-gpu, and users install whatever they need. This way, we can have a smaller package (torch-cpu) and still provide torch-gpu for NVIDIA users.

I will also close my PR, so we can keep the discussion in one place.

@PhDyellow (Author)

@b-rodrigues, thanks for testing the GPU version; I had hoped that your simple fix would also work for the GPU, as it would have made maintenance much easier. Which NVIDIA GPU did you test with? I had access to an RTX 3070, and ran it on a range of GPUs in an HPC cluster: A100, H100, and maybe an L45 if I recall correctly.

Could you try out one more thing before abandoning your approach?

It sounds like you were using Nix inside openSUSE, and not NixOS. Is that correct? If so, please try again, but with nix-gl-host, e.g. nixglhost R -e "torch::cuda_memory_stats()". If you need more support using nix-gl-host, please ask; I can provide more details.

Torch bundled by this PR only worked on an HPC running CentOS when I added nix-gl-host before my call to R. The reason is that Nix outside of NixOS does not really provide access to the system graphics drivers, and software bundled by Nix is modified not to look outside of Nix for external dependencies. It took some troubleshooting to figure that out, as I didn't have to use nix-gl-host on my local dev machine running NixOS.

@PhDyellow (Author)

Regarding two packages versus one package, I think the way to approach it is to check for CUDA. If the user has

config.cudaSupport = true;

then provide the GPU-accelerated version; otherwise provide the CPU-only version.

We may need some documentation or a warning if cudaSupport = false (a sketch of both follows below); I think most people using torch expect to use a GPU, but may not realise that they have to enable CUDA in Nix globally.

A wiki entry is probably important, as issues like requiring nix-gl-host for running CUDA software installed by Nix on non-NixOS systems are surprising and not obvious. I got lucky: I stumbled across the nixGL webpage by chance and was able to use it to get CUDA working on the HPC.
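
A hedged sketch of what that conditional could look like; torch-cpu-bin and torch-gpu-bin are hypothetical package names, not anything defined in this PR. config.cudaSupport is read from the nixpkgs configuration, and lib.warnIf covers the documentation/warning concern above:

{ lib, config, cudaSupport ? config.cudaSupport or false
, torch-cpu-bin, torch-gpu-bin }:  # hypothetical CPU/GPU variants

# Warn at evaluation time when the CPU-only variant is selected,
# so users know the GPU build exists behind config.cudaSupport.
lib.warnIf (!cudaSupport)
  "rPackages.torch: built without CUDA; set config.cudaSupport = true for GPU support"
  (if cudaSupport then torch-gpu-bin else torch-cpu-bin)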

@PhDyellow (Author)

Another thought on having two packages: the GPU binaries fall back to CPU operation anyway. The only advantage I see for a CPU-only binary is saving space.

@b-rodrigues (Contributor)

> @b-rodrigues, thanks for testing the GPU version; I had hoped that your simple fix would also work for the GPU, as it would have made maintenance much easier. Which NVIDIA GPU did you test with? I had access to an RTX 3070, and ran it on a range of GPUs in an HPC cluster: A100, H100, and maybe an L45 if I recall correctly.
>
> Could you try out one more thing before abandoning your approach?
>
> It sounds like you were using Nix inside openSUSE, and not NixOS. Is that correct? If so, please try again, but with nix-gl-host, e.g. nixglhost R -e "torch::cuda_memory_stats()". If you need more support using nix-gl-host, please ask; I can provide more details.
>
> Torch bundled by this PR only worked on an HPC running CentOS when I added nix-gl-host before my call to R. The reason is that Nix outside of NixOS does not really provide access to the system graphics drivers, and software bundled by Nix is modified not to look outside of Nix for external dependencies. It took some troubleshooting to figure that out, as I didn't have to use nix-gl-host on my local dev machine running NixOS.

Amazing, that seems to work! I wrote the following shell.nix:

{ pkgs ? import "/home/b-rodrigues/Documents/github_repos/nixpkgs" { }, lib ? pkgs.lib }:

let
  nixglhost-sources = pkgs.fetchFromGitHub {
    owner = "numtide";
    repo = "nix-gl-host";
    # "main" is a moving target; pinning a specific commit keeps the hash stable.
    rev = "main";
    # Replace this with the hash Nix will complain about, TOFU style.
    hash = "sha256-e3EAKVsrmuC4NwEEpAU6nS4LgNoGH/7fgax1eUtxz18=";
  };

  nixglhost = pkgs.callPackage "${nixglhost-sources}/default.nix" { };

in pkgs.mkShell {
  buildInputs = [
    nixglhost
    pkgs.rPackages.torch
    pkgs.R
  ];
}

and then used nix-shell shell.nix. Once in the shell, I ran nixglhost R, and torch::cuda_memory_stats() returned:

List of 13
 $ num_alloc_retries   : num 0
 $ num_ooms            : num 0
 $ max_split_size      : num -1
 $ oversize_allocations:List of 4
  ..$ current  : num 0
  ..$ peak     : num 0
  ..$ allocated: num 0
  ..$ freed    : num 0
 $ oversize_segments   :List of 4
  ..$ current  : num 0
  ..$ peak     : num 0
  ..$ allocated: num 0
  ..$ freed    : num 0
 $ allocation          :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ segment             :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ active              :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ inactive_split      :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ allocated_bytes     :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ reserved_bytes      :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 2097152
  .. ..$ peak     : num 2097152
  .. ..$ allocated: num 2097152
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 2097152
  .. ..$ peak     : num 2097152
  .. ..$ allocated: num 2097152
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ active_bytes        :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ inactive_split_bytes:List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 2096640
  .. ..$ peak     : num 2096640
  .. ..$ allocated: num 2096640
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 2096640
  .. ..$ peak     : num 2096640
  .. ..$ allocated: num 2096640
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 - attr(*, "class")= chr "cuda_memory_stats"

Trying to run torch::cuda_memory_stats() after launching plain R instead of nixglhost R results in the following:

Error in `torch::cuda_memory_stats()`:
! CUDA is not available.

That's pretty neat. Should we go with the binary route, then?

> Regarding two packages versus one package, I think the way to approach it is to check for CUDA. If the user has
>
> config.cudaSupport = true;
>
> then provide the GPU-accelerated version; otherwise provide the CPU-only version.

But that config is only for NixOS, right? How could we do it for users of the Nix package manager on non-NixOS systems?

@PhDyellow (Author)

That's promising! This PR doesn't actually build the C code anyway; it just downloads binaries too. Most of the code in this PR is for getting the precompiled liblantern and libtorch binaries to talk to the correct Nix CUDA installation (sketched below).

Getting the correct version of CUDA bundled along with the R package will simplify the Nix code.
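
For context, a hedged illustration of the kind of glue that involves: an attribute one might add to a derivation like the earlier sketch, pointing the prebuilt shared objects at Nix's CUDA libraries via their rpath. This is illustrative, not the PR's exact code, and assumes patchelf is available in nativeBuildInputs:

  # Illustrative postFixup: make the downloaded .so files resolve CUDA
  # libraries from the Nix store instead of from system paths.
  postFixup = ''
    for so in $out/lib/*.so; do
      patchelf --set-rpath "${lib.makeLibraryPath [ cudaPackages.cudatoolkit ]}" "$so"
    done
  '';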

@PhDyellow (Author)

Regarding the option: it is a nixpkgs option, and I used it in a nix-shell on CentOS, so it isn't exclusive to NixOS. However, the code pattern I often see in nixpkgs is something like this:

r-packages.nix

{ ...
, withCuda ? false
, ... }:

if withCuda ...

Then, by default, CUDA is off, but users can override it for the R package set, and NixOS can override it according to config.cudaSupport. (This is non-functional pseudo-code, of course.)

If config.cudaSupport is part of nixpkgs, not NixOS, then we shouldn't even need that.
