
rPackages: compile r-torch from source #344593

Draft · wants to merge 3 commits into master

Conversation


@PhDyellow PhDyellow commented Sep 26, 2024

Description of changes

The R package torch provides an R interface to libtorch directly, without going through intermediate Python code, for speed. torch can be compiled with CUDA support, allowing R users to work on GPUs. Currently, torch is broken in master but not marked as such.

torch has a relatively complex setup for an R package. The R code includes an interface to a compiled library, liblantern, which in turn wraps the API of libtorch. CUDA support can be compiled in. During installation, the R code is built; if the environment variable BUILD_LANTERN is set to 1, torch will also attempt to compile liblantern. Otherwise, on first use of the package, torch will attempt to either build or fetch binaries for liblantern and libtorch. The fetched binaries need to be patched to work with Nix and NixOS, and the default installation directory is probably in the nix store, so the binaries have to be present during the nix build.
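
As a rough illustration of the BUILD_LANTERN path, something like the following override could force liblantern to be compiled during the Nix build rather than fetched at first use (a minimal sketch; the attribute names and inputs are assumptions, not the exact code in this PR):

```nix
# Hedged sketch: set BUILD_LANTERN=1 so torch's install step compiles
# liblantern instead of trying to download binaries on first use.
rPackages.torch.overrideAttrs (old: {
  BUILD_LANTERN = "1";   # exported as an environment variable by mkDerivation
  nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ pkgs.cmake ];
})
```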

In #271342, I succeeded in downloading the binaries during the build phase and patching them, and I was able to get work done on my local machine with GPU acceleration. I also managed to use the same PR code to run torch on an HPC cluster using nix-gl-host.

More recently, #328980 took another approach to downloading the binaries. When torch is downloaded as a precompiled binary from CRAN, rather than as source code, liblantern and libtorch are included. The binaries still need patching, but the nix code is cleaner, as the correct versions are already bundled together.
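
The patching step for such bundled binaries typically looks something like this sketch (assumptions: the CRAN binary tarball bundles liblantern/libtorch as ELF shared objects, and autoPatchelfHook can rewrite their interpreter and RUNPATHs to nix-store libraries):

```nix
# Sketch of the precompiled-binary approach, not the PR's actual code.
rPackages.torch.overrideAttrs (old: {
  nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [
    pkgs.autoPatchelfHook        # rewrites RUNPATHs of bundled .so files
  ];
  buildInputs = (old.buildInputs or [ ]) ++ [
    pkgs.stdenv.cc.cc.lib        # libstdc++ for the bundled binaries
  ];
})
```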

Using precompiled binaries in Nixpkgs is not ideal: it goes against the spirit of nixpkgs and duplicates libtorch and related CUDA libraries in the nix store. However, torch is packaged on CRAN, and therefore it is provided by nixpkgs. It would be a confusing user experience for one CRAN package to require an unusual installation, such as setting up an overlay. I recommend merging one of the binary solutions when they are ready.

This PR aims to compile torch and liblantern from source, making use of the existing CUDA libraries and libtorch in nixpkgs. It would make #328980 and #271342 redundant if it succeeds. This PR is a draft because liblantern fails to build, and I would like to work with @SomeoneSerge to fix it.

The current issues:

  • liblantern fails to build. Build logs below.
  • Nixpkgs does not provide libtorch version 2.0.1, which liblantern expects
  • CMake warns that the kineto library is not found. As kineto is a monitoring library that liblantern does not appear to use, I have not addressed this yet.

Some surprises I encountered while getting this PR to its current state:

  • liblantern seems to have a bug in the CMake configuration files. Fixed by substituteInPlace
  • The source code on CRAN has the liblantern source stripped out. Fixed by overriding the src attribute to point to the git repo.
  • libtorch-bin in nixpkgs was missing a header file. Fixed by switching to torchWithCuda.dev
  • CUDA v11 complains about using GCC > 11. Apparently #218265 (cudaPackages: point nvcc at a compatible -ccbin) fixes the GCC version used with CUDA, but gcc11 still needed to be included as a build input.
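
Taken together, the fixes above might look roughly like this (a sketch under stated assumptions: the substituted CMake string and file path are placeholders, and the input names are illustrative; only the mlverse/torch repository is taken from upstream):

```nix
# Hedged sketch combining the fixes above; not the PR's actual values.
rPackages.torch.overrideAttrs (old: {
  # CRAN strips the liblantern sources, so build from the upstream git repo
  src = pkgs.fetchFromGitHub {
    owner = "mlverse";
    repo = "torch";
    rev = "v${old.version}";
    hash = pkgs.lib.fakeHash;    # fill in after the first build attempt
  };
  postPatch = ''
    # hypothetical fix for the bug in liblantern's CMake configuration
    substituteInPlace src/lantern/CMakeLists.txt \
      --replace "broken-value" "fixed-value"
  '';
  nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ pkgs.gcc11 ];
  buildInputs = (old.buildInputs or [ ]) ++ [
    pkgs.python3Packages.torchWithCuda.dev   # provides the missing header
  ];
})
```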

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

Issues:

- Nixpkgs doesn't have pytorch 2.0.1
- liblantern fails to compile
@PhDyellow PhDyellow requested a review from jbedo as a code owner September 26, 2024 07:07
@PhDyellow PhDyellow marked this pull request as draft September 26, 2024 07:39
@PhDyellow (Author)

Pinging @SomeoneSerge as you suggested you are willing to help with this.

@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Sep 26, 2024
@PhDyellow (Author)

build_log.txt

@PhDyellow (Author)

After posting this PR, I continued to test out different approaches to getting it to build. I am documenting what I have tried here, and what failed.

  1. Modify the pytorch version to 2.0.1 and adjust the hash
    • Pytorch build breaks because 9.0a is no longer a recognised CUDA capability.
  2. Keep (1) and comment out 9.0a as a cuda capability.
    • fbgemm build breaks with std::int64_t not defined
  3. Keep (1,2) and globally replace CUDA version by setting cudaPackages = cudaPackages_11_8 in all-packages.nix
    • fbgemm build breaks with std::int64_t not defined. Is this because of mismatches between GCC11 and GCC13?
  4. Keep (1,2,3) and globally replace gcc13 with gcc11 by setting default-gcc-version = 11 in all-packages.nix
    • nghttp2 fails to build. Can't just replace GCC everywhere.
  5. Drop (3,4), keep (1,2), override CUDA version for pytorch that is only seen by R torch
    • pytorch refuses to build because magma CUDA does not match pytorch CUDA
  6. Keep (1,2,5) and add an overridden Magma derivation for the overridden pytorch derivation
    • fbgemm build breaks with std::int64_t not defined
  7. Keep (1,2,5,6), but override stdenv in pytorch with stdenv = pkgs.gcc11Stdenv
    • pytorch kept using GCC13
  8. Keep (1,2,5,6) but override stdenv for the entire python package set with python312 = pythonInterpreters.python312.override {stdenv = gcc11Stdenv;};
    • pytorch now used GCC11, but tensorboard failed to build

Searching online, it looks like the fbgemm failure is a known pytorch issue in version 2.0.1.
Patches are discussed at spack/spack#41222

Added the patch, tweaked the patch to suit nixpkgs. Now kineto is complaining with a similar error. Trialling a patch for Kineto that closely matches the patch for fbgemm.
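
Carrying such a tweaked patch alongside the derivation could look like this (a sketch; the patch file name is a hypothetical local copy of the fix discussed in spack/spack#41222, not a file in this PR):

```nix
# Hedged sketch: append a locally tweaked patch to pytorch's patch list.
python3Packages.torch.overrideAttrs (old: {
  patches = (old.patches or [ ]) ++ [
    ./fbgemm-cstdint.patch   # hypothetical local copy of the upstream fix
  ];
})
```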

That's where this PR is at for now. Hopefully the kineto patch gets it to build.

@ofborg ofborg bot requested review from teh, thoughtpolice and tscholak October 3, 2024 06:08
@ofborg ofborg bot added 10.rebuild-darwin: 101-500 10.rebuild-linux: 101-500 and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Oct 3, 2024
@PhDyellow (Author)

After patching Kineto, further errors were thrown by ATen, then by c10.

Tensorboard failed to build if I overrode stdenv for the entire python package set. It seems like Tensorboard is pulled in as a binary wheel, so it looks for GCC13 libraries even if stdenv = gcc11Stdenv. The error was more specifically with protobuf, but I couldn't tell whether the python package pkgs.pythonPackages.protobuf or the library pkgs.protobuf were failing to link to GLIBCXX.

The current code in this PR does not override stdenv for the python package set, and just sets cudaPackages to 11.8 for pytorch and magma. I have added patches to pytorch to deal with compile errors as they came up.

The current issue is that the build for pytorch 2.0.1 breaks, but I can't find any more build errors. It fails while building ATen.

UPDATE: while re-running the build to get a build log to attach, I saw the errors this time. It appeared that <cstdint> was again not being included in a file, so uint8_t was not defined, but because it was part of an enum declaration, the error message was different. The file causing trouble was quantization_type.h.

@PhDyellow (Author)

Adding #include <cstdint> fixed the build, and now the build breaks at another file for a similar reason. I'll keep patching, but if anyone can explain why the std headers are not being handled correctly, I would appreciate the help.

@PhDyellow (Author)

I see it is a GCC 13 change: some standard library headers, such as <cstdint>, are now required to be included explicitly.
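
Each of these failures can be patched the same way; a minimal sketch of the kind of fix (written as a postPatch inside an overrideAttrs; the file path is hypothetical, only quantization_type.h is taken from the errors above):

```nix
# Hedged sketch: GCC 13 stopped including <cstdint> transitively, so files
# using uint8_t/int64_t must include it themselves.
postPatch = (old.postPatch or "") + ''
  sed -i '1i #include <cstdint>' path/to/quantization_type.h
'';
```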

@PhDyellow (Author)

torch for R builds now! Have not tested if it actually runs yet.

@PhDyellow (Author)

It does not run correctly: r-torch asks to download the binaries. I suspect liblantern is not being copied to the nix store, and maybe an environment variable needs to be set to point to the compiled binaries.
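
If r-torch resolves its binaries through the installed package's deps/ directory (an assumption about how the package locates liblantern, not verified here), symlinking the nix-built libraries into place might avoid the download prompt. A sketch, where `liblantern` and `torchWithCuda` are assumed in-scope derivations:

```nix
# Hedged sketch (inside an overrideAttrs): place the nix-built libraries
# where r-torch expects its bundled binaries.
postInstall = (old.postInstall or "") + ''
  mkdir -p $out/library/torch/deps
  ln -s ${liblantern}/lib/liblantern.so $out/library/torch/deps/
  ln -s ${torchWithCuda}/lib/*.so* $out/library/torch/deps/
'';
```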

@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Nov 1, 2024