
Use upstream's new split build #272

Open · hmaarrfk opened this issue Oct 6, 2024 · 18 comments
Labels: help wanted (Extra attention is needed)

@hmaarrfk (Contributor) commented Oct 6, 2024


It seems upstream introduced two new parameters to their build system:

  • BUILD_LIBTORCH_WHL
  • BUILD_PYTHON_ONLY

see: pytorch/pytorch@a25b28a

It seems that these made it into 2.4.0 and 2.4.1. It may be worth using them to alleviate some of the complexities of our existing build system.

However, these parameters seem very "PyPI"-centric, with shared libraries simply shoved into the site-packages directory. We may want to massage things a little to make them work in a "conda" context.
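Roughly, the intended flow appears to be a two-phase wheel build, sketched below (untested here; the torch_no_python wheel name is what upstream's setup.py produces in this mode):

# Phase 1: build only the C++ parts, as a "libtorch" wheel
export BUILD_LIBTORCH_WHL=1
python -m pip wheel . --no-deps -v    # -> torch_no_python-*.whl

# Phase 2: build only the Python bindings against the phase-1 output
unset BUILD_LIBTORCH_WHL
export BUILD_PYTHON_ONLY=1
python -m pip wheel . --no-deps -v    # -> torch-*.whl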

cc: @baszalmstra

@hmaarrfk added the "help wanted" label Oct 6, 2024
@baszalmstra (Member) commented

Appreciate you creating this issue. I'll dive into the issues with protobuf on Windows first, but after that I'd love to take a stab at this.

@hmaarrfk (Contributor, Author) commented Oct 6, 2024

Yeah, no problem. The fact that we have CIs that can run most of the Linux + macOS stuff might help somebody else jump in too!

@rgommers commented

Didn't @isuruf already implement a split build in this repo (xref gh-210)? I think the upstream split build came later. Is it a matter of simplifying @isuruf's original approach by relying on upstream support that's present now?

@hmaarrfk (Contributor, Author) commented

> relying on upstream support that's present now?

Yes.

@mgorny commented Nov 13, 2024

I'd like to give this one a shot. For a start, I'm going to do it in place — but ideally, I think we would end up splitting the feedstocks, in order to be able to build them entirely separately. Is anyone opposed to that?

@baszalmstra (Member) commented

I wonder what the advantage of that is? Is it to decouple the two builds completely? I would assume most of the time they are quite coupled.

@mgorny commented Nov 13, 2024

> I wonder what the advantage of that is? Is it to decouple the two builds completely? I would assume most of the time they are quite coupled.

It's primarily about build time (and space usage). Currently, building a single variant involves spending a lot of time building the common stuff ("libtorch"), and then a similar amount of time building PyTorch for all Python versions. From what I understand, splitting it would mean you can build libtorch first, and then use it to build PyTorch variants separately — effectively halving the time of each build.
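Schematically, the difference is something like this (an illustrative sketch with hypothetical helper names, not actual recipe code):

# today: a single CI job pays for both halves
build_libtorch                    # roughly half the job time
for py in 3.10 3.11 3.12; do
  build_pytorch_for "$py"         # the other half, per Python version
done

# split feedstocks: two roughly half-length job types
build_libtorch                    # job A -> libtorch package, once per platform
build_pytorch_for "$py"           # job B -> pytorch package, depends on libtorch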

@baszalmstra (Member) commented

But can't you achieve the same thing with the multi-output recipe? From what I understand, that is already the case currently.

@h-vetinari (Member) commented

If there are reasonable intermediate quantities (like libtorch) to build, then absolutely we can split them off into a separate feedstock. Before the recent upstream changes there was no canonical way to do it, and since the builds were manual anyway, there wasn't a big benefit for a split.

But if we can (for example) only build libtorch once per platform (even if it needs a big server), and then build pytorch on top of that in regular CI, that would be a big win.

We discussed this (at some length) in #108, which I see you've discovered already. ;-)

@hmaarrfk (Contributor, Author) commented

You can test and validate many theories in staged-recipes.

Don't let the current state stop you from exploring!

The megabuilds already avoid much of the build time by building things once.

@mgorny commented Nov 15, 2024

Ok, so after spending two days trying and hacking, it seems that upstream's split build either does not really work as of 2.5.1, or is incompatible with something our recipe does.

The main issue is this:

  • if we reconfigure after switching from BUILD_LIBTORCH_WHL=1 to BUILD_PYTHON_ONLY=1 (e.g. by passing --config-settings=--build-option=--cmake), all the common bits (vendored dependencies) are recompiled
  • and if we don't, then Python bits aren't built (i.e. we get another "libtorch" build)

On top of that, it seems that torch_shm_manager is not built as part of "libtorch" (I have added a patch for this).

I don't think we can actually utilize this without further changes to the CMake files, and (at least for someone as unaccustomed to the upstream build system as I am) that seems like a lot of effort for minimal gain.

BUILD_LIBTORCH_WHL=1 clearly works and is useful (insofar as it avoids building libtorch_python). However, I can't think of a way of building libtorch_python afterwards that wouldn't trigger a full rebuild (for some reason I don't really understand, the build system considers these objects outdated).
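Concretely, the sequence described above is roughly the following (simplified from the diff below; the --config-settings=--build-option=--cmake flag is what forces the CMake re-run):

# first pass: libtorch only
export BUILD_LIBTORCH_WHL=1
python -m pip wheel . --no-deps -vvv

# second pass: Python bindings only
unset BUILD_LIBTORCH_WHL
export BUILD_PYTHON_ONLY=1
# without --cmake, the cached configuration is reused and libtorch_python
# is never built; with it, the vendored dependencies reconfigure and rebuild
python -m pip install . --no-deps -vvv --config-settings=--build-option=--cmake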

My "cleanest" approach was:

diff --git a/recipe/build.sh b/recipe/build.sh
index f36cd2b..f7cad5e 100644
--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -106,15 +106,17 @@ else
 fi
 
 if [[ "$PKG_NAME" == "pytorch" ]]; then
-  PIP_ACTION=install
+  PIP_ACTION="install --config-settings=--build-option=--cmake"
   # Trick Cmake into thinking python hasn't changed
   sed "s/3\.12/$PY_VER/g" build/CMakeCache.txt.orig > build/CMakeCache.txt
   sed -i.bak "s/3;12/${PY_VER%.*};${PY_VER#*.}/g" build/CMakeCache.txt
   sed -i.bak "s/cpython-312/cpython-${PY_VER%.*}${PY_VER#*.}/g" build/CMakeCache.txt
+  export BUILD_PYTHON_ONLY=1
 else
   # For the main script we just build a wheel for so that the C++/CUDA
   # parts are built. Then they are reused in each python version.
   PIP_ACTION=wheel
+  export BUILD_LIBTORCH_WHL=1
 fi
 
 # MacOS build is simple, and will not be for CUDA
@@ -217,15 +219,14 @@ $PREFIX/bin/python -m pip $PIP_ACTION . --no-deps -vvv --no-clean \
 if [[ "$PKG_NAME" == "libtorch" ]]; then
   mkdir -p $SRC_DIR/dist
   pushd $SRC_DIR/dist
-  wheel unpack ../torch-*.whl
-  pushd torch-*
+  wheel unpack ../torch_no_python-*.whl
+  pushd torch_no_python-*
   mv torch/bin/* ${PREFIX}/bin
   mv torch/lib/* ${PREFIX}/lib
   mv torch/share/* ${PREFIX}/share
   for f in ATen caffe2 tensorpipe torch c10; do
     mv torch/include/$f ${PREFIX}/include/$f
   done
-  rm ${PREFIX}/lib/libtorch_python.*
   popd
   popd
 
diff --git a/recipe/meta.yaml b/recipe/meta.yaml
index 4b2b655..6863f1e 100644
--- a/recipe/meta.yaml
+++ b/recipe/meta.yaml
@@ -1,5 +1,5 @@
 {% set version = "2.5.1" %}
-{% set build = 3 %}
+{% set build = 4 %}
 
 {% if cuda_compiler_version != "None" %}
 {% set build = build + 200 %}
@@ -36,6 +36,7 @@ source:
     - patches/0008-Fix-pickler-error.patch
     # https://github.com/pytorch/pytorch/pull/137331
     - patches/137331.patch
+    - patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
 
 build:
   number: {{ build }}
diff --git a/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch b/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
new file mode 100644
index 0000000..182770e
--- /dev/null
+++ b/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
@@ -0,0 +1,35 @@
+From 1206a5d7299740ddf9ccac5828d62f29bd59fa98 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Micha=C5=82=20G=C3=B3rny?= <[email protected]>
+Date: Thu, 14 Nov 2024 16:41:11 +0100
+Subject: [PATCH] Build torch_shm_manager as part of "libtorch" wheel
+
+---
+ torch/CMakeLists.txt | 8 ++++----
+ 1 file changed, 4 insertions(+), 4 deletions(-)
+
+diff --git a/torch/CMakeLists.txt b/torch/CMakeLists.txt
+index c74b4543..2f94d2e1 100644
+--- a/torch/CMakeLists.txt
++++ b/torch/CMakeLists.txt
+@@ -9,10 +9,6 @@ if(NOT CAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO)
+   set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ endif()
+ 
+-if(NOT BUILD_PYTHON)
+-  return()
+-endif()
+-
+ set(TORCH_SRC_DIR "${CMAKE_CURRENT_SOURCE_DIR}")
+ set(TORCH_ROOT "${TORCH_SRC_DIR}/..")
+ 
+@@ -29,6 +25,10 @@ endif()
+ set(LIBSHM_SRCDIR ${TORCH_SRC_DIR}/lib/${LIBSHM_SUBDIR})
+ add_subdirectory(${LIBSHM_SRCDIR})
+ 
++if(NOT BUILD_PYTHON)
++  return()
++endif()
++
+ 
+ # Generate files
+ set(TOOLS_PATH "${TORCH_ROOT}/tools")

@mgorny commented Nov 15, 2024

I was trying to find a build log on upstream CI, but apparently the only job actually testing the split build is disabled right now, per pytorch/pytorch#138750.

@mgorny commented Nov 16, 2024

Okay, I've done some more testing today, and if I build PyTorch manually from a git checkout, the split build kind of works. Still, a few vendored libraries seem to be rebuilt, but we're talking about 390 small ninja targets vs. ~6000 (IIRC). I've tested by calling setup.py install and pip install with minimal changes to the CMake settings, so it's either something about how Conda works or how we build it. I suppose the next step in investigating would be to perform both builds in a single Conda subpackage.
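(One rough way to count the outstanding work, assuming the default build/ tree: a ninja dry run prints the commands that would still be executed.)

# dry-run the build and count the remaining targets (~390 here vs. ~6000 cold)
ninja -C build -n | wc -l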

That said, it works only as long as you preserve the build tree (otherwise everything gets rebuilt) and preserve or install libtorch into the torch directory, so it won't allow for fully split Conda packages anyway.

@hmaarrfk (Contributor, Author) commented

> 390 small ninja targets vs. ~6000 (IIRC)

This is oddly similar to what our logs report with our existing hacks.

With upstream abandoning their conda packages, I wonder if this strategy should simply be abandoned?

They haven't really reviewed my shoo-in one-liner PR allowing one to override ld, so I don't really have hope for anything more complicated...

@mgorny commented Nov 16, 2024

So, according to some quick testing, if I do the "libtorch wheel" and "python-only" builds in a single "subpackage" (i.e. call one after the other), things work roughly as expected (~170 objects get built). However, I couldn't find a way to make it work across different subpackages.

To be honest, the only (potential) advantage I see in the upstream split logic is that it allows us to build "libtorch" without building the Python modules, i.e. save a little time during the first stage. I suppose we could gain a similar effect by "micro-optimizing" the build order, i.e. building pytorch for Python 3.12 first so we wouldn't be building that variant twice (but I haven't verified that).

@hmaarrfk (Contributor, Author) commented

I'm a little concerned about making the recipe so complicated that other maintainers can't join.

@mgorny commented Nov 17, 2024

That's a fair point (though TBH I think we're beyond that point already xP). By my "optimization", I merely meant reordering Python versions — but I've just noticed we're not setting these explicitly in the recipe, so that's probably not an option.

@h-vetinari (Member) commented

> I'm a little concerned about making the recipe so complicated that other maintainers can't join.

I think the only thing we really need to avoid is massive patches (and other rebase hazards). A couple of lines of CMake modification to unvendor something should not be an issue. We can document the workflow for updating the patches so that contributors aren't stuck on that (here's an example for a feedstock where we did that; we might want to move that to the knowledge base), but in any case: build-time reductions make things more maintainable on the margin, IMO.
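For reference, a typical patch-refresh workflow is sketched below (tags and paths are placeholders, not the feedstock's actual layout):

# rebase the recipe patches onto a new upstream tag
git clone https://github.com/pytorch/pytorch && cd pytorch
git checkout -b patches v2.5.0                    # tag the patches currently apply to
git am ../feedstock/recipe/patches/*.patch        # replay the existing patches
git rebase --onto v2.5.1 v2.5.0                   # move them to the new tag; fix conflicts
git format-patch v2.5.1 -o ../feedstock/recipe/patches/   # re-export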
