
rPackages: compile r-torch from source #344593

Draft · wants to merge 3 commits into master

Conversation


@PhDyellow PhDyellow commented Sep 26, 2024

Description of changes

The R package torch provides an R interface to libtorch directly, without going through intermediate Python code, for speed. torch can be compiled with CUDA support, allowing R users to work on GPUs. Currently, torch is broken in master but not marked as such.

torch has a relatively complex setup for an R package. The R code includes an interface to a compiled library, liblantern, which in turn wraps the API of libtorch. CUDA support can be compiled in. During installation, the R code is built; if the environment variable BUILD_LANTERN is set to 1, torch will also attempt to compile liblantern. Otherwise, on first use of the package, torch will attempt to either build or fetch binaries for liblantern and libtorch. The fetched binaries need to be patched to work with Nix and NixOS, and the default installation directory is probably in the nix store, so the binaries have to be present during the nix build.
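
As a rough illustration of the BUILD_LANTERN path, something like the following override could force liblantern to be compiled during the Nix build rather than fetched at first use (a minimal sketch; the attribute names and inputs are assumptions, not the exact code in this PR):

```nix
# Hedged sketch: set BUILD_LANTERN=1 so torch's install step compiles
# liblantern instead of trying to download binaries on first use.
rPackages.torch.overrideAttrs (old: {
  BUILD_LANTERN = "1";   # exported as an environment variable by mkDerivation
  nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ pkgs.cmake ];
})
```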

In #271342, I succeeded in downloading the binaries during the build phase and patching them, and I was able to get work done on my local machine with GPU acceleration. I also managed to use the same PR code to run torch on an HPC cluster using nix-gl-host.

More recently, #328980 took another approach to downloading the binaries. When torch is downloaded as a precompiled binary from CRAN, rather than as source code, liblantern and libtorch are included. The binaries still need patching, but the nix code is cleaner, as the correct versions are already bundled together.
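
The patching step for such bundled binaries typically looks something like this sketch (assumptions: the CRAN binary tarball bundles liblantern/libtorch as ELF shared objects, and autoPatchelfHook can rewrite their interpreter and RUNPATHs to nix-store libraries):

```nix
# Sketch of the precompiled-binary approach, not the PR's actual code.
rPackages.torch.overrideAttrs (old: {
  nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [
    pkgs.autoPatchelfHook        # rewrites RUNPATHs of bundled .so files
  ];
  buildInputs = (old.buildInputs or [ ]) ++ [
    pkgs.stdenv.cc.cc.lib        # libstdc++ for the bundled binaries
  ];
})
```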

Using precompiled binaries in Nixpkgs is not ideal: it goes against the spirit of nixpkgs and duplicates libtorch and related CUDA libraries in the nix store. However, torch is packaged on CRAN, and therefore it is provided by nixpkgs. It would be a confusing user experience for one CRAN package to require an unusual installation, such as setting up an overlay. I recommend merging one of the binary solutions when they are ready.

This PR aims to compile torch and liblantern from source, making use of the existing CUDA libraries and libtorch in nixpkgs. It would make #328980 and #271342 redundant if it succeeds. This PR is a draft because liblantern fails to build, and I would like to work with @SomeoneSerge to fix it.

The current issues:

  • liblantern fails to build. Build logs below.
  • Nixpkgs does not provide libtorch version 2.0.1, which liblantern expects
  • CMake warns that the kineto library is not found. As kineto is a monitoring library that liblantern does not appear to use, I have not addressed this yet.

Some surprises I encountered while getting this PR to its current state:

  • liblantern seems to have a bug in the CMake configuration files. Fixed by substituteInPlace
  • The source code on CRAN has the liblantern source stripped out. Fixed by overriding the src attribute to point to the git repo.
  • libtorch-bin in nixpkgs was missing a header file. Fixed by switching to torchWithCuda.dev
  • CUDA v11 complains about using GCC > 11. Apparently #218265 (cudaPackages: point nvcc at a compatible -ccbin) fixes the GCC version used with CUDA, but gcc11 still needed to be included as a build input.
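
Taken together, the fixes above might look roughly like this (a sketch under stated assumptions: the substituted CMake string and file path are placeholders, and the input names are illustrative; only the mlverse/torch repository is taken from upstream):

```nix
# Hedged sketch combining the fixes above; not the PR's actual values.
rPackages.torch.overrideAttrs (old: {
  # CRAN strips the liblantern sources, so build from the upstream git repo
  src = pkgs.fetchFromGitHub {
    owner = "mlverse";
    repo = "torch";
    rev = "v${old.version}";
    hash = pkgs.lib.fakeHash;    # fill in after the first build attempt
  };
  postPatch = ''
    # hypothetical fix for the bug in liblantern's CMake configuration
    substituteInPlace src/lantern/CMakeLists.txt \
      --replace "broken-value" "fixed-value"
  '';
  nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ pkgs.gcc11 ];
  buildInputs = (old.buildInputs or [ ]) ++ [
    pkgs.python3Packages.torchWithCuda.dev   # provides the missing header
  ];
})
```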

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

Issues:

- Nixpkgs doesn't have pytorch 2.0.1
- liblantern fails to compile
@PhDyellow PhDyellow requested a review from jbedo as a code owner September 26, 2024 07:07
@PhDyellow PhDyellow marked this pull request as draft September 26, 2024 07:39
@PhDyellow (Author)

Pinging @SomeoneSerge as you suggested you are willing to help with this.

@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Sep 26, 2024
@PhDyellow (Author)

build_log.txt

@PhDyellow (Author)

After posting this PR, I continued to test out different approaches to getting it to build. I am documenting what I have tried here, and what failed.

  1. Modify the pytorch version to 2.0.1 and adjust the hash
    • Pytorch build breaks because 9.0a is no longer a recognised CUDA capability.
  2. Keep (1) and comment out 9.0a as a cuda capability.
    • fbgemm build breaks with std::int64_t not defined
  3. Keep (1,2) and globally replace CUDA version by setting cudaPackages = cudaPackages_11_8 in all-packages.nix
    • fbgemm build breaks with std::int64_t not defined. Is this because of mismatches between GCC11 and GCC13?
  4. Keep (1,2,3) and globally replace gcc13 with gcc11 by setting default-gcc-version = 11 in all-packages.nix
    • nghttp2 fails to build. Can't just replace GCC everywhere.
  5. Drop (3,4), keep (1,2), override CUDA version for pytorch that is only seen by R torch
    • pytorch refuses to build because magma CUDA does not match pytorch CUDA
  6. Keep (1,2,5) and add an overridden Magma derivation for the overridden pytorch derivation
    • fbgemm build breaks with std::int64_t not defined
  7. Keep (1,2,5,6), but override stdenv in pytorch with stdenv = pkgs.gcc11Stdenv
    • pytorch kept using GCC13
  8. Keep (1,2,5,6) but override stdenv for the entire python package set with python312 = pythonInterpreters.python312.override {stdenv = gcc11Stdenv;};
    • pytorch now used GCC11, but tensorboard failed to build

Searching online, it looks like the fbgemm failure is a known pytorch issue in version 2.0.1.
Patches are discussed at spack/spack#41222

Added the patch, tweaked the patch to suit nixpkgs. Now kineto is complaining with a similar error. Trialling a patch for Kineto that closely matches the patch for fbgemm.
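
Carrying such a tweaked patch alongside the derivation could look like this (a sketch; the patch file name is a hypothetical local copy of the fix discussed in spack/spack#41222, not a file in this PR):

```nix
# Hedged sketch: append a locally tweaked patch to pytorch's patch list.
python3Packages.torch.overrideAttrs (old: {
  patches = (old.patches or [ ]) ++ [
    ./fbgemm-cstdint.patch   # hypothetical local copy of the upstream fix
  ];
})
```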

That's where this PR is at for now. Hopefully the kineto patch gets it to build.

@ofborg ofborg bot requested review from teh, thoughtpolice and tscholak October 3, 2024 06:08
@ofborg ofborg bot added 10.rebuild-darwin: 101-500 10.rebuild-linux: 101-500 and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Oct 3, 2024
@PhDyellow (Author)

After patching Kineto, further errors were thrown by ATen, then by c10.

Tensorboard failed to build if I overrode stdenv for the entire python package set. It seems like Tensorboard is pulled in as a binary wheel, so it looks for GCC13 libraries even if stdenv = gcc11Stdenv. The error was more specifically with protobuf, but I couldn't tell whether the python package pkgs.pythonPackages.protobuf or the library pkgs.protobuf were failing to link to GLIBCXX.

The current code in this PR does not override stdenv for the python package set, and just sets cudaPackages to 11.8 for pytorch and magma. I have added patches to pytorch to deal with compile errors as they came up.

The current issue is that the build for pytorch 2.0.1 breaks, but I can't find any more build errors. It fails while building ATen.

UPDATE: while re-running the build to get a build log to attach, I saw the errors this time. It appeared that <cstdint> was again not being included in a file, so uint8_t was not defined, but because it was part of an enum declaration, the error message was different. The file causing trouble was quantization_type.h.

@PhDyellow (Author)

Adding #include <cstdint> fixed the build, and now the build breaks at another file for a similar reason. I'll keep patching, but if anyone can explain why the std headers are not being handled correctly, I would appreciate the help.

@PhDyellow (Author)

I see it is a GCC 13 change: some standard library headers, such as <cstdint>, are now required to be included explicitly.
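
Each of these failures can be patched the same way; a minimal sketch of the kind of fix (written as a postPatch inside an overrideAttrs; the file path is hypothetical, only quantization_type.h is taken from the errors above):

```nix
# Hedged sketch: GCC 13 stopped including <cstdint> transitively, so files
# using uint8_t/int64_t must include it themselves.
postPatch = (old.postPatch or "") + ''
  sed -i '1i #include <cstdint>' path/to/quantization_type.h
'';
```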

@PhDyellow (Author)

torch for R builds now! Have not tested if it actually runs yet.

@PhDyellow (Author)

It does not run correctly: r-torch asks to download the binaries. I suspect liblantern is not being copied to the nix store, and maybe an environment variable needs to be set to point to the compiled binaries.
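
If r-torch resolves its binaries through the installed package's deps/ directory (an assumption about how the package locates liblantern, not verified here), symlinking the nix-built libraries into place might avoid the download prompt. A sketch, where `liblantern` and `torchWithCuda` are assumed in-scope derivations:

```nix
# Hedged sketch (inside an overrideAttrs): place the nix-built libraries
# where r-torch expects its bundled binaries.
postInstall = (old.postInstall or "") + ''
  mkdir -p $out/library/torch/deps
  ln -s ${liblantern}/lib/liblantern.so $out/library/torch/deps/
  ln -s ${torchWithCuda}/lib/*.so* $out/library/torch/deps/
'';
```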

@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Nov 1, 2024