
rPackages.torch: use binary package for installation #328980

Open · wants to merge 3 commits into master from r-torch-cpu

Conversation

b-rodrigues
Contributor

@b-rodrigues b-rodrigues commented Jul 21, 2024

Description of changes

With this PR, I suggest that we use the prebuilt binary supplied by the mlverse team. Currently, the {torch} package installs, but on first run it tries to download and install the libtorch and liblantern libraries, which fails. Instead, we could use the prebuilt binaries of the {torch} package, which simplifies the whole process, as described in their documentation: https://torch.mlverse.org/docs/articles/installation#pre-built

This PR handles both the CPU and GPU versions for Linux. I have also updated the documentation to explain how to use it in a second commit.
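For orientation, here is a minimal sketch of how the resulting package might be consumed. `rPackages.torch` and `config.cudaSupport` come from this PR; the overall `shell.nix` shape is illustrative, not the PR's exact documentation:

```nix
# Illustrative shell.nix: with cudaSupport set, the GPU binary is selected;
# without it, the CPU-only binary is installed.
let
  pkgs = import <nixpkgs> {
    config.cudaSupport = true; # drop this line for the CPU-only binary
  };
in
pkgs.mkShell {
  buildInputs = [
    (pkgs.rWrapper.override {
      packages = [ pkgs.rPackages.torch ];
    })
  ];
}
```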

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@b-rodrigues

This comment was marked as outdated.

@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Jul 21, 2024
@PhDyellow

When you say "building the package from source", are you referring to the R package torch or to the C libraries libtorch and liblantern? They are separate build processes. Building the R package is trivial, and the precompiled R package binary does not bundle the C libraries either. If you look at the R source code of torch, you will find that the C libraries (or their sources) are downloaded by the R package either during installation or after the first run.

Please compare the code between this PR and #273536. Also have a look at my other attempts to make later versions of torch work on NixOS at https://github.com/PhDyellow/nixpkgs/tree/r-torch-master and https://github.com/PhDyellow/nixpkgs/tree/r-torch-master-12. I had a major project that benefited massively from GPU acceleration, so I took the time to get it working; as a result, my fork of nixpkgs has a functioning torch package that uses a GPU. The hard part is getting all the CUDA libraries lined up with the liblantern and libtorch binaries and making sure the precompiled binaries search in the right places.

I could not successfully build the GPU-capable C binaries from source, but I eventually managed to get the GPU-capable pre-compiled binaries to work, following the example of libtorch-bin. For non-NixOS systems, you also need https://github.com/numtide/nix-gl-host.

I felt the code I have written in #273536 is very much a hack, and was not ready to push the code to nixpkgs without a few more people looking over it to make improvements. No-one commented on the draft PR until now.

You shouldn't need a separate torch-gpu package, as it falls back to CPU-only by default, and nixpkgs already has flags for enabling CUDA, which we can examine to give the correct version of torch. Moving forward, I suggest taking #273536, or making a new PR from one of my newer branches, and giving it a proper test and code review. It will save you a few weeks of work unless you are already a Nix CUDA master.

Apologies if something is unclear, I don't have a lot of time at the moment to proofread, but happy to answer questions when I am able to get around to it.

@b-rodrigues

This comment was marked as outdated.

@PhDyellow

If you can get the GPU version to run without all the messy code I added, I will be happy to close my PR and support this one.
I do not remember whether I tried to download the whole R package as a precompiled binary with GPU capabilities, but I don't think I did.

@jbedo
Contributor

jbedo commented Jul 23, 2024

This does seem cleaner if GPU can be made to run.

@PhDyellow PhDyellow mentioned this pull request Jul 24, 2024
@b-rodrigues

This comment was marked as outdated.

@b-rodrigues b-rodrigues reopened this Jul 26, 2024
@b-rodrigues

This comment was marked as outdated.

@PhDyellow

Great! 6 lines of code changed is much nicer than 70, and much more maintainable.

I suggest making a wiki entry or adding very visible doc strings as part of this PR if possible, as using nixglhost is neither obvious nor necessary in all cases, and I don't feel comfortable silently wrapping every call to R with nixglhost.

The user might also want to know how to download the CPU version, if hard drive space is a concern.

I can't test on NixOS right now as my dev machine is in for service. @jbedo, are you able to test on NixOS?

There might be some interest in compiling liblantern and libtorch from source, but I think that is a separate PR. We would need help from the CUDA team, because @jbedo and I both had trouble getting them to compile.

@ofborg ofborg bot added 10.rebuild-linux: 1-10 10.rebuild-linux: 1 and removed 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Jul 27, 2024
@github-actions github-actions bot added the 8.has: documentation This PR adds or changes documentation label Jul 28, 2024
@b-rodrigues b-rodrigues force-pushed the r-torch-cpu branch 2 times, most recently from 6790962 to e69d654 Compare July 28, 2024 07:46
@b-rodrigues
Contributor Author

thanks for your help @PhDyellow and @jbedo this PR is ready to be reviewed!
There are also prebuilt binaries for macOS, which we could provide as well by using "cpu-intel" or "cpu-m1", but I don't have the hardware to test it.

@b-rodrigues b-rodrigues marked this pull request as ready for review July 28, 2024 07:50
@b-rodrigues b-rodrigues requested a review from jbedo as a code owner July 28, 2024 07:50
@b-rodrigues b-rodrigues marked this pull request as draft July 28, 2024 07:52
@b-rodrigues

This comment was marked as outdated.

@b-rodrigues b-rodrigues marked this pull request as ready for review July 28, 2024 08:00
The `torch` package included in `nixpkgs` is based on the binary packages by the
mlverse team. By default, the CPU version of the package is installed. To enable
GPU acceleration, set `config.cudaSupport` to true when importing `nixpkgs` in a
`shell.nix`. You will also need to use `nixglhost`. Here is an example of such a
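The quoted paragraph ends where its example would begin. A sketch of such a `shell.nix`, assembled from the fragments of the PR's example quoted later in this review; the `callPackage` line is an assumption about nix-gl-host's layout, and the empty hash is left for TOFU-style filling:

```nix
# Sketch: GPU-enabled rPackages.torch on non-NixOS Linux via nixglhost.
let
  pkgs = import <nixpkgs> { config.cudaSupport = true; };

  nixglhost-sources = pkgs.fetchFromGitHub {
    owner = "numtide";
    repo = "nix-gl-host";
    rev = "main";
    # Replace this with the hash Nix will complain about, TOFU style.
    hash = "";
  };

  # Assumption: nix-gl-host exposes a default.nix at its top level.
  nixglhost = pkgs.callPackage "${nixglhost-sources}/default.nix" { };

  wrapped_pkgs = pkgs.rWrapper.override {
    packages = [ pkgs.rPackages.torch ];
  };
in
pkgs.mkShell {
  buildInputs = [ nixglhost wrapped_pkgs ];
}
```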


My understanding is that you only need nixglhost if you are not on NixOS.

Contributor


This should be the case, but I don't see any patching happening in this PR, so it likely won't work on NixOS either. It should be possible to patch the binaries to work without tricks on NixOS.


Yes, it is possible to patch the liblantern and libtorch binaries, I did it successfully in #273536. After patching, the binaries ran on both NixOS and nix in CentOS. I didn't try patching the full bundle being installed in this PR though.

Whether nixglhost works on NixOS is another thing to test.

Contributor Author


@Kupac tested on NixOS and confirmed he needed nixglhost as well. @Kupac could you perchance leave a review and describe the process? Is the description from the documentation aligned with what you had to do?

Contributor

@SomeoneSerge SomeoneSerge Aug 4, 2024


tested on NixOS and confirmed he needed nixglhost as well.

This is a bug, then; the appropriate patchelfing is necessary. nixglhost should not be required on NixOS.

You can use `LD_DEBUG=libs R` to find out which binaries are loading CUDA and need the patching.

Contributor Author


I have amended the docs to explain that this is only for non-NixOS Linux, and that we are looking for people to fix torch for NixOS and Darwin.


@PhDyellow PhDyellow left a comment


Couldn't test functionality right now, but apart from comments I have left, everything looks like it is in order.

Documentation looks good too. It made sense to me, but I am familiar with running torch in R on a GPU, so I can't say whether it is clear to someone who has never tried before.

pkgs/development/r-modules/default.nix (outdated, resolved)
@PhDyellow

thanks for your help @PhDyellow and @jbedo this PR is ready to be reviewed! There are also prebuilt binaries for macOS, which we could provide as well by using "cpu-intel" or "cpu-m1", but I don't have the hardware to test it.

Maybe we leave a note in the language docs for someone to open a PR or issue if they have the hardware and can test it? I don't think it would be hard to add more logic to the accel variable you have set up.

Contributor


Is there much value in keeping a binary-only package in Nixpkgs rather than out-of-tree? Is it a common dependency?

it wants to download and install the libtorch and liblantern libraries which doesn't work

Our python3Packages.torch can be (and is) consumed in C++/CMake projects, even though we haven't split libtorch out yet. Is there a way to unvendor libtorch in rPackages.torch? I'm not familiar with liblantern, though.

Contributor Author


I'll have to study python3Packages.torch but I'm quite a beginner with Nix and doubt I'll be able to do better than this PR 😅


liblantern is an intermediate API between libtorch and R, maintained by the developers of R torch. I think it exists because calling libtorch from R directly was too difficult, and calling libtorch from R via Python was too slow.

It would be good to get everything to compile from source. I would support that effort, but I would need a Nix CUDA expert to help because I kept getting issues with CUDA version compatibility. I was not able to successfully compile liblantern and libtorch despite spending days on the problem.

This PR bypasses all the complexity of trying to compile liblantern and libtorch from source, as well as the complexity of ensuring the correct CUDA libraries are present on the system. Consuming libtorch from python3Packages.torch may be possible, but compiling liblantern against libtorch didn't work easily for @jbedo over at #273536 (comment).

I feel the value of having this binary-only package in nixpkgs is for people who are R users first and Nix users second. torch is on CRAN, and all CRAN packages are automatically included in pkgs.rPackages, so by the principle of least surprise, it should just work. The alternative is to have it fail in a way that directs the user to the Nix R wiki or something similar, with detailed instructions for setting up their first overlay.

Contributor


Compiling liblantern doesn't seem that difficult in itself; the main issue is that it needs an old version of torch and doesn't work with python3Packages.torch. Having it work out of the box, even if we need a binary, is a goal, since we want to maximise CRAN/BioC functionality. We do need to patch these binaries, though, so that they at least work on NixOS out of the box without tricks.

Contributor Author


We do need to patch these binaries though to at least work with NixOS out of the box without tricks.

Would a wrapper do the trick? Wrap the binary to use nixglhost ootb?


Would a wrapper do the trick? Wrap the binary to use nixglhost ootb?

If wrapping liblantern and libtorch worked, that sounds fine to me, but I am not sure we should wrap R in nixglhost by default on NixOS. I don't know whether it would cause any problems, it just seems excessive.

You could have a look at how I patched the binaries over at #273536. The important parts are the various patchelf commands and pkgs.cudaPackages.autoAddOpenGLRunpathHook.

Contributor


It would be good to get everything to compile from source. I would support that effort, but I would need a Nix CUDA expert to help because I kept getting issues with CUDA version compatibility

We can try working together

Would a wrapper do the trick? Wrap the binary to use nixglhost ootb?

There shouldn't be any need; you just autoPatchelf and/or addDriverRunpath to link the relevant libraries directly

If wrapping liblantern and libtorch worked, that sounds fine to me, but I am not sure we should wrap R in nixglhost by default on NixOS

Correct, we shouldn't

pkgs.cudaPackages.autoAddOpenGLRunpathHook.

Note it's pkgs.autoAddDriverRunpath now
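A minimal sketch of that approach; both hooks are real nixpkgs attributes, while the surrounding override shape is illustrative (it mirrors the hunk quoted further down in this review):

```nix
# Sketch: patch the prebuilt libraries at build time instead of wrapping
# R in nixglhost at run time.
torch = old.torch.overrideAttrs (attrs: {
  nativeBuildInputs = attrs.nativeBuildInputs or [ ] ++ [
    pkgs.autoPatchelfHook     # rewrite library lookups to Nix store paths
    pkgs.autoAddDriverRunpath # add the NixOS graphics-driver runpath
  ];
});
```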

Contributor Author


I have amended the docs to explain that this is only for non-NixOS Linux, and that we are looking for people to fix torch for NixOS and Darwin.

For fixing this for NixOS I would need help, as I'm not running NixOS and don't have experience with patching binaries. Should we merge this one and then continue with another PR, or try to fix it as part of this PR?

@b-rodrigues b-rodrigues force-pushed the r-torch-cpu branch 2 times, most recently from ced2478 to 7c2dcd4 Compare August 1, 2024 19:08
Comment on lines 122 to 123
pkgs.rPackages.torch
pkgs.R
Contributor


Shouldn't this be an rWrapper.override instead?

Contributor Author


If we drop into a shell this is not strictly required, but it would be needed when calling R non-interactively, afaik

Contributor


I see. This confuses people, let's not advertise this and use rWrapper instead: https://mastodon.acm.org/@[email protected]/112745541961800402

Contributor Author


thanks, done with latest push!

Comment on lines -1806 to -1809
torch = old.torch.overrideAttrs (attrs: {
preConfigure = ''
patchShebangs configure
'';
Contributor


Am I to understand that the original package was source-built? If so, I think we should keep a source-built torch and add a separate torch-bin attribute for the binary package?

Contributor Author


The original R torch package was indeed source-built, but it didn't provide the torch library; it prompted users to download and install it on first run

Contributor


was indeed source built but didn't provide the torch library, it suggested users to download and install it on first run

oh

Comment on lines +1819 to +1828
nativeBuildInputs = attrs.nativeBuildInputs or [ ] ++ [
pkgs.autoPatchelfHook
pkgs.autoAddDriverRunpath
];
buildInputs = attrs.buildInputs or [ ] ++ [
"${lib.getLib R}/lib/R" # libR
pkgs.zlib
] ++ lib.optionals pkgs.config.cudaSupport [
pkgs.cudaPackages.cuda_cudart
];
Contributor


These changes are sufficient to do something like

> library(torch)
> torch_sum(torch_randn(c(10, 10), device="cuda"))

However, libtorch_cuda normally links e.g. libcublas etc., which I didn't have to mention explicitly. I suppose that means they're not listed in DT_NEEDED and are loaded via dlopen instead. Either way, I expect all related functionality to be broken, and I can't allocate the time to do more tests, unfortunately

I also see there's a lot of garbage in the outputs, e.g. a vendored copy of libcudart.so. You could link those by telling autoPatchelf to add $ORIGIN, but I'd advise against it and recommend that you delete the vendored libraries. Cuda libraries tend to need ad hoc patchelfing as well, we've had issues with that before. Cuda libraries that originate from PyPi are usually broken, which is why we link the ones from cudaPackages instead
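A hedged sketch of the suggested cleanup; the `torch/lib` path inside the output is an assumption about the installed package layout, not taken from the PR:

```nix
# Hypothetical postFixup step: drop the vendored CUDA runtime copies so
# the patched binaries resolve the cudaPackages ones instead.
postFixup = (attrs.postFixup or "") + ''
  rm -f $out/library/torch/lib/libcudart*
'';
```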

Comment on lines +109 to +115
nixglhost-sources = pkgs.fetchFromGitHub {
owner = "numtide";
repo = "nix-gl-host";
rev = "main";
# Replace this with the hash Nix will complain about, TOFU style.
hash = "";
};
Contributor


I think this should be documented in its own chapter rather than in the R section: #264433 (has been stale for a while, didn't get the chance to follow up) but maybe it's ok to temporarily merge this section and then replace it with a short reference when the chapter is added

Comment on lines +119 to +126
rpkgs = builtins.attrValues {
inherit (pkgs.rPackages)
torch;
};

wrapped_pkgs = pkgs.rWrapper.override {
packages = [ rpkgs ];
};
Contributor


Why not just packages = with rPackages; ...?
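That is, a sketch of the suggested simplification:

```nix
# Inline the package list instead of binding a separate rpkgs value.
wrapped_pkgs = pkgs.rWrapper.override {
  packages = with pkgs.rPackages; [ torch ];
};
```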

torch;
};

wrapped_pkgs = pkgs.rWrapper.override {
Contributor


Hmm, personally I read rWrapper.override as a "wrapped interpreter" (wrapped so it knows where to find the packages), not as "wrapped packages"; e.g. I'd write python = python3.withPackages ...; I've also seen people write env = python3.withPackages ...

pkgs makes me think of a package set, which this value is not

@FliegendeWurst FliegendeWurst added the awaiting_changes (old Marvin label, do not use) label Dec 4, 2024