
rPackages.torch: Fix torch #273536

Draft · wants to merge 1 commit into base: r-updates

Conversation

@PhDyellow (Author)

Description of changes

The latest commit on NixOS r-updates allows the R package torch to install. However, calling library(torch) causes the package to attempt to download and install binaries if it does not find them in $TORCH_PATH. This PR allows torch to actually run: I have been using torch with an RTX 3070 laptop card and it is working well.

torch requires a range of CUDA-related packages and libtorch. Python is not needed; liblantern is a wrapper around libtorch that interfaces with R.

Ideally, libtorch would be built from source. However, I struggled to get libtorch and liblantern to compile; I did succeed in getting the binaries to download, install, and run correctly. I used libtorch-bin from nixpkgs as an example.
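
For readers unfamiliar with the libtorch-bin pattern, here is a minimal, hedged sketch of what such a prebuilt-binary derivation can look like in Nix. Every name, the version, the URL, and the hash are illustrative placeholders, not the actual code in this PR; autoPatchelfHook rewrites the downloaded shared objects so they resolve their dependencies from the Nix store.

{ stdenv, fetchzip, autoPatchelfHook, cudaPackages }:

stdenv.mkDerivation {
  # Hypothetical sketch, not the PR's actual code.
  pname = "liblantern-bin";
  version = "0.0.0";                                  # placeholder version
  src = fetchzip {
    url = "https://example.org/liblantern-cuda.zip";  # placeholder URL
    hash = "";                                        # fill in after the first fetch fails, TOFU style
  };
  nativeBuildInputs = [ autoPatchelfHook ];
  # Libraries the prebuilt .so files are expected to link against.
  buildInputs = [ cudaPackages.cudatoolkit stdenv.cc.cc.lib ];
  installPhase = ''
    mkdir -p $out/lib
    cp -r lib/* $out/lib/
  '';
}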

The PR is a draft. I don't like that it pulls binaries, or how the code is organised. r-modules/default.nix doesn't seem like the right place to put all the torch overrides. I also have very little free time for the next few months, so I don't think I can figure out how to make it fit nixpkgs better without assistance.

@jbedo

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

Torch requires a number of CUDA-related packages.

Using the binary supplied by R torch.

Precedent is in libtorch-bin in nixpkgs.
@ofborg bot added labels 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin) and 10.rebuild-linux: 0 (This PR does not cause any packages to rebuild on Linux) — Dec 11, 2023
@PhDyellow (Author)

I picked the wrong time to put this PR up for review.

When I tested this PR rebased onto master, it didn't run: torch checks for CUDA 11.7.0, and master now packages 11.7.1.

Looks like CUDA is getting a major overhaul in nixpkgs right now as well. https://discourse.nixos.org/t/cuda-team-roadmap-and-call-for-sponsors/29495.

Building from source may be easier once CUDA has been reworked.

@wegank added the 2.status: merge conflict label (This PR has merge conflicts with the target branch) — Apr 5, 2024
@wegank added the 2.status: stale label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) — Jul 4, 2024
@jbedo (Contributor) commented Jul 23, 2024

I had a go at compiling liblantern but encountered torch version compatibility issues. I guess we could use the binary torch and compile lantern, but this doesn't seem to improve things much. I'm not a CUDA expert, so I'm unsure how to improve this.

@stale bot removed the 2.status: stale label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) — Jul 23, 2024
@PhDyellow (Author)

Thanks for taking a look.

Ideally, compiling from source would be an option, but I couldn't figure it out, and I spent a good amount of time trying to debug it. Maybe someone from the CUDA team could help though.

Worst case scenario, we just keep downloading the binaries.

If #328980 just works with the GPU, it would make the Nix code a lot cleaner, as we would just download one binary.
However, I won't assume it does work with the GPU until someone tests it.

@b-rodrigues (Contributor) commented Jul 25, 2024

So I don't have CUDA on my system, and I tried the binary of the GPU version. torch::cuda_memory_stats() returns the message cuda not available!.

Trying the same binary file on a non-Nix installation of R (installed through openSUSE's package manager, still with no CUDA installed), I do get a list of memory statistics from torch::cuda_memory_stats(). As the documentation explains, the pre-compiled binary bundles everything, so there is no need to install CUDA. However, a nix-shell doesn't "see" my GPU, so we would need this PR as a base to get it working with the GPU.

Or we just provide the CPU version and let users deal with the GPU version, because it requires quite a lot of dependencies and is also hardware-specific. If you don't have an NVIDIA GPU, you can't use the GPU version anyway, but you can still take advantage of the CPU version... Or we make two packages, a torch-cpu and a torch-gpu, and users install whatever they need. This way, we can have a smaller package (torch-cpu) and still provide torch-gpu for NVIDIA users.

I will also close my PR, so we can keep the discussion in one place.

@PhDyellow (Author)

@b-rodrigues, thanks for testing the GPU version; I had hoped that your simple fix would also work for the GPU, as it would have made maintenance much easier. Which NVIDIA GPU did you test with? I had access to an RTX 3070, and ran it on a range of GPUs in an HPC cluster: A100, H100, and maybe an L45 if I recall correctly.

Could you try out one more thing before abandoning your approach?

It sounds like you were using Nix inside openSUSE, and not NixOS. Is that correct? If so, please try again, but with nix-gl-host, e.g. nixglhost R -e "torch::cuda_memory_stats()". If you need more support using nix-gl-host, please ask; I can provide more details.

Torch bundled by this PR only worked on an HPC running CentOS when I added nix-gl-host before my call to R. The reason is that Nix outside of NixOS does not really provide access to the system graphics drivers, and software bundled by Nix is modified not to look outside of Nix for external dependencies. It took some troubleshooting to figure that out, as I didn't have to use nix-gl-host on my local dev machine running NixOS.

@PhDyellow (Author)

Regarding two packages versus one package, I think the way to approach it is to check for CUDA. If the user has

config.cudaSupport = true;

then provide the GPU-accelerated version; otherwise provide the CPU-only version.

We may need some documentation or a warning if cudaSupport = false (a sketch of both follows below); I think most people using torch expect to use a GPU, but may not realise that they have to enable CUDA in Nix globally.

A wiki entry is probably important, as issues like requiring nix-gl-host for running CUDA software installed by Nix on non-NixOS systems are surprising and not obvious. I got lucky: I stumbled across the nixGL webpage by chance and was able to use it to get CUDA working on the HPC.
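
A hedged sketch of what that conditional could look like; torch-cpu-bin and torch-gpu-bin are hypothetical package names, not anything defined in this PR. config.cudaSupport is read from the nixpkgs configuration, and lib.warnIf covers the documentation/warning concern above:

{ lib, config, cudaSupport ? config.cudaSupport or false
, torch-cpu-bin, torch-gpu-bin }:  # hypothetical CPU/GPU variants

# Warn at evaluation time when the CPU-only variant is selected,
# so users know the GPU build exists behind config.cudaSupport.
lib.warnIf (!cudaSupport)
  "rPackages.torch: built without CUDA; set config.cudaSupport = true for GPU support"
  (if cudaSupport then torch-gpu-bin else torch-cpu-bin)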

@PhDyellow (Author)

Another thought on having two packages: the GPU binaries fall back to CPU operation anyway. The only advantage I see for a CPU-only binary is saving space.

@b-rodrigues (Contributor)

> @b-rodrigues, thanks for testing the GPU version; I had hoped that your simple fix would also work for the GPU, as it would have made maintenance much easier. Which NVIDIA GPU did you test with? I had access to an RTX 3070, and ran it on a range of GPUs in an HPC cluster: A100, H100, and maybe an L45 if I recall correctly.
>
> Could you try out one more thing before abandoning your approach?
>
> It sounds like you were using Nix inside openSUSE, and not NixOS. Is that correct? If so, please try again, but with nix-gl-host, e.g. nixglhost R -e "torch::cuda_memory_stats()". If you need more support using nix-gl-host, please ask; I can provide more details.
>
> Torch bundled by this PR only worked on an HPC running CentOS when I added nix-gl-host before my call to R. The reason is that Nix outside of NixOS does not really provide access to the system graphics drivers, and software bundled by Nix is modified not to look outside of Nix for external dependencies. It took some troubleshooting to figure that out, as I didn't have to use nix-gl-host on my local dev machine running NixOS.

Amazing, that seems to work! I wrote the following shell.nix:

{ pkgs ? import "/home/b-rodrigues/Documents/github_repos/nixpkgs" { }, lib ? pkgs.lib }:

let
  nixglhost-sources = pkgs.fetchFromGitHub {
    owner = "numtide";
    repo = "nix-gl-host";
    # "main" is a moving target; pinning a specific commit keeps the hash stable.
    rev = "main";
    # Replace this with the hash Nix will complain about, TOFU style.
    hash = "sha256-e3EAKVsrmuC4NwEEpAU6nS4LgNoGH/7fgax1eUtxz18=";
  };

  nixglhost = pkgs.callPackage "${nixglhost-sources}/default.nix" { };

in pkgs.mkShell {
  buildInputs = [
    nixglhost
    pkgs.rPackages.torch
    pkgs.R
  ];
}

and then used nix-shell shell.nix. Once in the shell, I ran nixglhost R, and torch::cuda_memory_stats() returned:

List of 13
 $ num_alloc_retries   : num 0
 $ num_ooms            : num 0
 $ max_split_size      : num -1
 $ oversize_allocations:List of 4
  ..$ current  : num 0
  ..$ peak     : num 0
  ..$ allocated: num 0
  ..$ freed    : num 0
 $ oversize_segments   :List of 4
  ..$ current  : num 0
  ..$ peak     : num 0
  ..$ allocated: num 0
  ..$ freed    : num 0
 $ allocation          :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ segment             :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ active              :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ inactive_split      :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 1
  .. ..$ peak     : num 1
  .. ..$ allocated: num 1
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ allocated_bytes     :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ reserved_bytes      :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 2097152
  .. ..$ peak     : num 2097152
  .. ..$ allocated: num 2097152
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 2097152
  .. ..$ peak     : num 2097152
  .. ..$ allocated: num 2097152
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ active_bytes        :List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 512
  .. ..$ peak     : num 512
  .. ..$ allocated: num 512
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 $ inactive_split_bytes:List of 3
  ..$ all       :List of 4
  .. ..$ current  : num 2096640
  .. ..$ peak     : num 2096640
  .. ..$ allocated: num 2096640
  .. ..$ freed    : num 0
  ..$ small_pool:List of 4
  .. ..$ current  : num 2096640
  .. ..$ peak     : num 2096640
  .. ..$ allocated: num 2096640
  .. ..$ freed    : num 0
  ..$ large_pool:List of 4
  .. ..$ current  : num 0
  .. ..$ peak     : num 0
  .. ..$ allocated: num 0
  .. ..$ freed    : num 0
 - attr(*, "class")= chr "cuda_memory_stats"

Trying to run torch::cuda_memory_stats() after launching plain R instead of nixglhost R results in the following:

Error in `torch::cuda_memory_stats()`:
! CUDA is not available.

That's pretty neat. Should we go with the binary route, then?

> Regarding two packages versus one package, I think the way to approach it is to check for CUDA. If the user has
>
> config.cudaSupport = true;
>
> then provide the GPU-accelerated version; otherwise provide the CPU-only version.

But that config is only for NixOS, right? How could we do it for users of the Nix package manager on non-NixOS systems?

@PhDyellow (Author)

That's promising! This PR doesn't actually build the C code anyway; it just downloads binaries too. Most of the code in this PR is for getting the precompiled liblantern and libtorch binaries to talk to the correct Nix CUDA installation (sketched below).

Getting the correct version of CUDA bundled along with the R package will simplify the Nix code.
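
For context, a hedged illustration of the kind of glue that involves: an attribute one might add to a derivation like the earlier sketch, pointing the prebuilt shared objects at Nix's CUDA libraries via their rpath. This is illustrative, not the PR's exact code, and assumes patchelf is available in nativeBuildInputs:

  # Illustrative postFixup: make the downloaded .so files resolve CUDA
  # libraries from the Nix store instead of from system paths.
  postFixup = ''
    for so in $out/lib/*.so; do
      patchelf --set-rpath "${lib.makeLibraryPath [ cudaPackages.cudatoolkit ]}" "$so"
    done
  '';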

@PhDyellow (Author)

Regarding the option: it is a nixpkgs option, and I used it in a nix-shell on CentOS, so it isn't exclusive to NixOS. However, the code pattern I often see in nixpkgs is something like this:

r-packages.nix

{ ...
, withCuda ? false
, ... }:

if withCuda ...

Then, by default, CUDA is off, but users can override it for the R package set, and NixOS can override it according to config.cudaSupport. (This is non-functional pseudo-code, of course.)

If config.cudaSupport is part of nixpkgs, not NixOS, then we shouldn't even need that.
