Use upstream's new split build #272

hmaarrfk · 2024-10-06T19:51:34Z

Comment:

It seems upstream introduced two new parameters to their build system:

BUILD_LIBTORCH_WHL
BUILD_PYTHON_ONLY

see: pytorch/pytorch@a25b28a

It seems that these made it to 2.4.0 and 2.4.1. It may be interesting to use these to alleviate some of the complexities of our existing build system.

However, these parameters seem very "pypi" centric where shared libraries are just shoved in the site-packages directory. We may want to massage things a little bit to make them work in a "conda" context.

cc: @baszalmstra

The text was updated successfully, but these errors were encountered:

baszalmstra · 2024-10-06T19:57:03Z

Appreciate you creating this issue. Ill dive into the issues with protobuf on Windows first but after that Id love to take a stab at this.

hmaarrfk · 2024-10-06T20:48:47Z

yeah no problem. the fact that we have CIs that can run most of the linux + osx stuff might help somebody else jump in too!

rgommers · 2024-10-24T17:53:20Z

Didn't @isuruf already implement a split build in this repo (xref gh-210)? I think the upstream split build came later. Is it a matter of simplifying @isuruf's original approach by relying on upstream support that's present now?

hmaarrfk · 2024-10-24T18:03:06Z

relying on upstream support that's present now?

yes.

mgorny · 2024-11-13T13:36:45Z

I'd like to give this one a shot. For a start, I'm going to do it in place — but ideally, I think we would end up splitting the feedstocks, in order to be able to build them entirely separately. Is anyone opposed to that?

baszalmstra · 2024-11-13T14:02:01Z

I wonder what the advantage of that is? Is it to decouple the two builds completely? I would assume most of the time they are quite coupled.

mgorny · 2024-11-13T14:23:02Z

I wonder what the advantage of that is? Is it to decouple the two builds completely? I would assume most of the time they are quite coupled.

It's primarily about build time (and space usage). Currently, building a single variant involves spending a lot of time building the common stuff ("libtorch"), and then a similar amount of time building PyTorch for all Python versions. From what I understand, splitting it would mean you can build libtorch first, and then use it to build PyTorch variants separately — effectively halving the time of each build.

baszalmstra · 2024-11-13T14:27:58Z

But cant you achieve the same thing with the multi output recipe? Because from what I understand that is currently already the case

h-vetinari · 2024-11-14T09:50:59Z

If there are reasonable intermediate quantities (like libtorch) to build, then absolutely we can split them off into a separate feedstock. Before the recent upstream changes there was no canonical way to do it, and since the builds were manual anyway, there wasn't a big benefit for a split.

But if we can (for example) only build libtorch once per platform (even if it needs a big server), and then build pytorch on top of that in regular CI, that would be a big win.

We discussed this (at some length) in #108, which I see you've discovered already. ;-)

hmaarrfk · 2024-11-14T12:00:21Z

you can test and validate many theories in staged-recipes.

Don't let the current state stop you from exploring!

The megabuilds already avoid much of the build time by building things once.

mgorny · 2024-11-15T15:35:12Z

Ok, so after spending two days trying and hacking, it seems that upstream's split build either does not really work as of 2.5.1, or is incompatible with something our recipe does.

The main issue is this:

if we reconfigure after switching from BUILD_LIBTORCH_WHL=1 to BUILD_PYTHON_ONLY=1 (e.g. by passing --config-settings=--build-option=--cmake), all the common bits (vendored dependencies) are recompiled
and if we don't, then Python bits aren't built (i.e. we get another "libtorch" build)

On top of that it seems that torch_shm_manager is not built as part of "libtorch" (I have patched for this).

I don't think we can actually utilize this without further changes to CMake files, and (at least for someone as unaccustomed to the upstream build system) that seems like a lot of effort for a minimal gain.

BUILD_LIBTORCH_WHL=1 clearly works and is useful (as far as it avoids building libtorch_python). However, I can't think of a way of building libtorch_python afterwards that wouldn't trigger full rebuild (for some reason — I don't really understand why the build system considers these objects outdated).

My "cleanest" approach was:

diff --git a/recipe/build.sh b/recipe/build.sh
index f36cd2b..f7cad5e 100644
--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -106,15 +106,17 @@ else
 fi
 
 if [[ "$PKG_NAME" == "pytorch" ]]; then
-  PIP_ACTION=install
+  PIP_ACTION="install --config-settings=--build-option=--cmake"
   # Trick Cmake into thinking python hasn't changed
   sed "s/3\.12/$PY_VER/g" build/CMakeCache.txt.orig > build/CMakeCache.txt
   sed -i.bak "s/3;12/${PY_VER%.*};${PY_VER#*.}/g" build/CMakeCache.txt
   sed -i.bak "s/cpython-312/cpython-${PY_VER%.*}${PY_VER#*.}/g" build/CMakeCache.txt
+  export BUILD_PYTHON_ONLY=1
 else
   # For the main script we just build a wheel for so that the C++/CUDA
   # parts are built. Then they are reused in each python version.
   PIP_ACTION=wheel
+  export BUILD_LIBTORCH_WHL=1
 fi
 
 # MacOS build is simple, and will not be for CUDA
@@ -217,15 +219,14 @@ $PREFIX/bin/python -m pip $PIP_ACTION . --no-deps -vvv --no-clean \
 if [[ "$PKG_NAME" == "libtorch" ]]; then
   mkdir -p $SRC_DIR/dist
   pushd $SRC_DIR/dist
-  wheel unpack ../torch-*.whl
-  pushd torch-*
+  wheel unpack ../torch_no_python-*.whl
+  pushd torch_no_python-*
   mv torch/bin/* ${PREFIX}/bin
   mv torch/lib/* ${PREFIX}/lib
   mv torch/share/* ${PREFIX}/share
   for f in ATen caffe2 tensorpipe torch c10; do
     mv torch/include/$f ${PREFIX}/include/$f
   done
-  rm ${PREFIX}/lib/libtorch_python.*
   popd
   popd
 
diff --git a/recipe/meta.yaml b/recipe/meta.yaml
index 4b2b655..6863f1e 100644
--- a/recipe/meta.yaml
+++ b/recipe/meta.yaml
@@ -1,5 +1,5 @@
 {% set version = "2.5.1" %}
-{% set build = 3 %}
+{% set build = 4 %}
 
 {% if cuda_compiler_version != "None" %}
 {% set build = build + 200 %}
@@ -36,6 +36,7 @@ source:
     - patches/0008-Fix-pickler-error.patch
     # https://github.com/pytorch/pytorch/pull/137331
     - patches/137331.patch
+    - patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
 
 build:
   number: {{ build }}
diff --git a/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch b/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
new file mode 100644
index 0000000..182770e
--- /dev/null
+++ b/recipe/patches/0009-Build-torch_shm_manager-as-part-of-libtorch-wheel.patch
@@ -0,0 +1,35 @@
+From 1206a5d7299740ddf9ccac5828d62f29bd59fa98 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Micha=C5=82=20G=C3=B3rny?= <[email protected]>
+Date: Thu, 14 Nov 2024 16:41:11 +0100
+Subject: [PATCH] Build torch_shm_manager as part of "libtorch" wheel
+
+---
+ torch/CMakeLists.txt | 8 ++++----
+ 1 file changed, 4 insertions(+), 4 deletions(-)
+
+diff --git a/torch/CMakeLists.txt b/torch/CMakeLists.txt
+index c74b4543..2f94d2e1 100644
+--- a/torch/CMakeLists.txt
++++ b/torch/CMakeLists.txt
+@@ -9,10 +9,6 @@ if(NOT CAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO)
+   set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+ endif()
+ 
+-if(NOT BUILD_PYTHON)
+-  return()
+-endif()
+-
+ set(TORCH_SRC_DIR "${CMAKE_CURRENT_SOURCE_DIR}")
+ set(TORCH_ROOT "${TORCH_SRC_DIR}/..")
+ 
+@@ -29,6 +25,10 @@ endif()
+ set(LIBSHM_SRCDIR ${TORCH_SRC_DIR}/lib/${LIBSHM_SUBDIR})
+ add_subdirectory(${LIBSHM_SRCDIR})
+ 
++if(NOT BUILD_PYTHON)
++  return()
++endif()
++
+ 
+ # Generate files
+ set(TOOLS_PATH "${TORCH_ROOT}/tools")

mgorny · 2024-11-15T16:15:26Z

I was trying to find a build log on upstream CI, but apparently the only job actually testing split build is disabled right now per pytorch/pytorch#138750.

mgorny · 2024-11-16T18:01:49Z

Okay, I've done some more testing today, and if I build PyTorch manually from a git checkout, split build kinda works. Still a few vendored libraries seem to be rebuilt, but we're talking about 390 small ninja targets vs. ~6000 (IIRC). I've tested with calling setup.py install and pip install with minimal changes to CMake settings, so it's either something about how Conda works or how we build it. I suppose the next step in investigating would be to perform both builds in a single Conda subpackage.

That said, it works only as long as you preserve the build tree (otherwise everything gets rebuilt) and preserve or install libtorch in torch directory, so it won't allow for fully split Conda packages anyway.

hmaarrfk · 2024-11-16T19:22:00Z

390 small ninja targets vs. ~6000 (IIRC)

This is oddly similar to what our logs report with our existing hacks.

With upstream abandoning their conda packages i wonder if this strategy should simply be abandonned?

They haven't really reviewed my shoe in 1 liner PR allowing one to override ld so I don't really have hope for anything more complicated....

mgorny · 2024-11-16T20:04:42Z

So, according to some quick testing, if I do "libwheel" and "python-only" build in a single "subpackage" (i.e. call one after the other), things work roughly as expected (~170 objects get built). However, I couldn't find a way to make work across different subpackages

To be honest, the only (potential) advantage I see from upstream split logic is that it allows us to build "libtorch" without building the Python modules, i.e. save a little time during the first stage. I suppose we could gain a similar effect by "microoptimizing" the build order — i.e. by having pytorch built for Python 3.12 first, so we wouldn't be building that variant twice (but I haven't verified that).

hmaarrfk · 2024-11-16T20:20:47Z

I'm a little concerned about making the recipe so complicated that other maintains can't join.

mgorny · 2024-11-17T02:41:45Z

That's a fair point (though TBH I think we're beyond that point already xP). By my "optimization", I merely meant reordering Python versions — but I've just noticed we're not setting these explicitly in the recipe, so that's probably not an option.

h-vetinari · 2024-11-18T13:04:11Z

I'm a little concerned about making the recipe so complicated that other maintains can't join.

I think the only thing we really need to avoid are massive patches (and other rebase-hazards). A couple lines of CMake modification to unvendor something should not be an issue. We can document the workflow for updating the patches so that contributors aren't stuck on that (here's an example for a feedstock where we did that; might want to move that to the knowledge base), but in any case: build time reductions make things more maintainable on the margin IMO.

hmaarrfk added the help wanted Extra attention is needed label Oct 6, 2024

hmaarrfk mentioned this issue Oct 8, 2024

Help-Wanted Priority List #273

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use upstream's new split build #272

Use upstream's new split build #272

hmaarrfk commented Oct 6, 2024

baszalmstra commented Oct 6, 2024

hmaarrfk commented Oct 6, 2024

rgommers commented Oct 24, 2024

hmaarrfk commented Oct 24, 2024

mgorny commented Nov 13, 2024

baszalmstra commented Nov 13, 2024

mgorny commented Nov 13, 2024

baszalmstra commented Nov 13, 2024

h-vetinari commented Nov 14, 2024

hmaarrfk commented Nov 14, 2024

mgorny commented Nov 15, 2024

mgorny commented Nov 15, 2024

mgorny commented Nov 16, 2024

hmaarrfk commented Nov 16, 2024

mgorny commented Nov 16, 2024

hmaarrfk commented Nov 16, 2024

mgorny commented Nov 17, 2024

h-vetinari commented Nov 18, 2024

Use upstream's new split build #272

Use upstream's new split build #272

Comments

hmaarrfk commented Oct 6, 2024

Comment:

baszalmstra commented Oct 6, 2024

hmaarrfk commented Oct 6, 2024

rgommers commented Oct 24, 2024

hmaarrfk commented Oct 24, 2024

mgorny commented Nov 13, 2024

baszalmstra commented Nov 13, 2024

mgorny commented Nov 13, 2024

baszalmstra commented Nov 13, 2024

h-vetinari commented Nov 14, 2024

hmaarrfk commented Nov 14, 2024

mgorny commented Nov 15, 2024

mgorny commented Nov 15, 2024

mgorny commented Nov 16, 2024

hmaarrfk commented Nov 16, 2024

mgorny commented Nov 16, 2024

hmaarrfk commented Nov 16, 2024

mgorny commented Nov 17, 2024

h-vetinari commented Nov 18, 2024