
Building manylinux wheels #19

Merged (20 commits, Jun 22, 2024)

Conversation

@mwydmuch commented Jun 8, 2024

This PR adds support for building manylinux wheels using the cibuildwheel tool, with manylinux2014 compatibility for both x86_64 and aarch64 (ARM) architectures. This will allow installation of NLE on almost any Linux distro, as well as on Google Colab and similar platforms that don't support building packages from source.

Changes

  • cibuildwheel config added to pyproject.toml
  • New jobs responsible for building wheels added to test_and_deploy.yml
  • CMakeLists.txt now searches for the bzip2/bz2 lib using find_package
  • Because bzip2 is a 3rd-party dependency, auditwheel places it under nle.libs and ships it inside the wheels, to support environments without bzip2 installed.
    Because nle/nethack/nethack.py makes an unlinked copy of libnethack.so, patchelf is installed and used to fix the rpath of the temporary copy so that it links properly to the libbz2.so shipped with the wheel (see the sketch after this list).
    I'm not a fan of this solution, but making libnethack.so thread-safe is not a simple change, and the alternative of modifying the environment seems even less elegant.
    Unfortunately, this solution does not work with memfd_create or O_TMPFILE.
    Alternatively, bzip2 could be linked statically, but libnethack is not the only target linking against it in the project. <- We went with this one.
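
For illustration, the rpath fix described above amounts to something like the following (a minimal sketch with hypothetical paths, not the actual nethack.py code; assumes patchelf is on PATH):

import shutil
import subprocess
import tempfile

# Hypothetical locations; auditwheel puts the bundled libbz2 under nle.libs.
LIBNETHACK = "/path/to/site-packages/nle/libnethack.so"
BUNDLED_LIBS = "/path/to/site-packages/nle.libs"

# Make the unlinked per-instance copy of libnethack.so (as nethack.py does),
# then point its rpath at the wheel's bundled libraries so the dynamic
# loader can resolve libbz2.so from there.
tmp = tempfile.NamedTemporaryFile(suffix=".so", delete=False)
tmp.close()
shutil.copyfile(LIBNETHACK, tmp.name)
subprocess.check_call(["patchelf", "--set-rpath", BUNDLED_LIBS, tmp.name])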

How to check the changes locally

To check the changes locally on Linux, run the following from the repo's root (requires Docker and cibuildwheel to be installed):

export CIBW_ENVIRONMENT="NLE_RELEASE_BUILD=1"  # For release build
cibuildwheel --platform linux --arch $(uname -m)
pip uninstall -y nle
pip install wheelhouse/nle-*3$(python3 -c "import sys; print(sys.version_info.minor)")*.whl

# Run tests from a scratch dir so `import nle` picks up the installed wheel,
# not the repo's nle/ source dir
mkdir -p tmp
cd tmp
python3 -c 'import nle; import gymnasium as gym; e = gym.make("NetHack-v0"); e.reset(); e.step(0)'
python3 -m pytest --import-mode=append -svx ../nle/tests

What was tested

The following tests:

python3 -c 'import nle; import gymnasium as gym; e = gym.make("NetHack-v0"); e.reset(); e.step(0)'
python3 -m pytest --import-mode=append -svx ../nle/tests

were run on images of different Linux distros (Alma, Fedora, Rocky, Debian, Ubuntu, in different versions) using a script similar to this one:
https://github.com/Farama-Foundation/stable-retro/blob/master/tests/test_cibuildwheel/test_cibuildwheel_linux.sh
All tests passed on all these distros.

Current TODOs

  • verify that these changes don't break anything not covered by the tests [DONE]

@BartekCupial will help with that.

Possible extensions

A small change to test_and_deploy.yml would allow macOS wheels to be built in the same way, but currently GH does not provide macOS ARM runners for free accounts.

@BartekCupial

Tested on my local machine and in colab, LGTM!

@BartekCupial

@heiner @mklissa can you take a look?

@mwydmuch changed the title from "[WIP] Building manylinux wheels" to "Building manylinux wheels" on Jun 9, 2024
@mwydmuch (Author) commented Jun 9, 2024

Tested on my local machine and in colab, LGTM!

Thanks @BartekCupial!

@heiner requested a review from StephenOman, Jun 9, 2024 12:16
@heiner (Owner) commented Jun 9, 2024

Adding @StephenOman to get a second opinion.

@heiner (Owner) commented Jun 9, 2024

Thanks a bunch for adding this! Looks very useful: installation woes were the biggest issue for NLE users.

Could you add some technical explanation, here or in a code comment, what this does and how?

Re: the memfd_create hack: its purpose is to avoid physically copying files (e.g., not writing to disk). Using sendfile under the hood, it should manage to not even duplicate memory, but instead only reference the same pages (the DATA section of which will likely be copied via CoW when used).
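
For context, the memfd_create trick mentioned here looks roughly like this (a sketch, assuming Linux and Python 3.8+; the actual nethack.py logic differs):

import ctypes
import os

def load_private_copy(path):
    # Create an anonymous in-memory file; it never touches the disk and
    # disappears automatically once the last fd referring to it is closed.
    fd = os.memfd_create("libnethack-copy")
    with open(path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        # sendfile copies kernel-side, without a userspace buffer; pages
        # can be shared until written to (CoW).
        os.sendfile(fd, src.fileno(), 0, size)
    # dlopen the anonymous file via its /proc path; each call yields an
    # independent copy of the library.
    return ctypes.CDLL(f"/proc/self/fd/{fd}")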

@mwydmuch (Author) commented Jun 9, 2024

Could you add some technical explanation, here or in a code comment, what this does and how?

Sure, I've added some more comments that I hope explain the changes.

@StephenOman (Collaborator)

This is a great idea, removing friction for people getting NLE running in their environments.

We also need to check the total binary sizes that are generated, as I think PyPI has limits on both the size of a single binary and the total project size allowed.

@mwydmuch (Author) commented Jun 9, 2024

We also need to check the total binary sizes that are generated as I think PyPI has some limits on both the size of the binary and the total project size allowed.

Actually, NLE wheels are very small; a single wheel is ~3.0 MB (here is my GH Actions run that already built all of them: https://github.com/mwydmuch/nle/actions/runs/9437511056; you can download the results and check). I'm not sure if this is up-to-date, but in the past the default limit was 100 MB for a single file and 10 GB for the whole project.

@heiner (Owner) commented Jun 9, 2024

As I said above, I'm very much in favor of this.

However, I'm still a bit confused on two counts:

  1. Why do we need to call the patchelf tool? Isn't finding the .so a matter of LD_LIBRARY_PATH, or some other way of telling the linker where to look for dynamic libraries?
  2. I understand that what patchelf does is read the .so file and write it back to the same path with certain changes (right?). If so, we should rewrite it once and copy the rewritten version thereafter. I don't think we should call an external process every time we open a new environment.

@mwydmuch (Author) commented Jun 9, 2024

  1. Why do we need to call the patchelf tool? Isn't finding the .so a matter of LD_LIBRARY_PATH, or some other way of telling the linker where to look for dynamic libraries?

It is related. As stated here: https://en.wikipedia.org/wiki/Rpath, the linker first looks in the places specified in the rpath/runpath of a file, then checks the LD_LIBRARY_PATH env variable, then ld.so.cache, and finally default locations like /usr/lib, /lib, etc.

So why do I use patchelf? Because I believe we don't want to ask users to set the LD_LIBRARY_PATH env variable to a specific value before running a script that uses NLE. Setting LD_LIBRARY_PATH inside the running script doesn't work: doing something like os.environ['LD_LIBRARY_PATH'] = <some path> only modifies the environment for subprocesses, not for the process itself. We could restart the process to apply the change, but in some cases that can have bad consequences for the user. Alternatively, we could spawn a subprocess with NetHack, but that requires more code modifications. Both solutions seemed less elegant to me than just patching the rpath.
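
To illustrate that point about LD_LIBRARY_PATH (a small stand-alone demo, not NLE code; the path is hypothetical):

import os
import subprocess
import sys

os.environ["LD_LIBRARY_PATH"] = "/tmp/nle.libs"  # hypothetical path

# The dynamic loader read LD_LIBRARY_PATH once, at process startup, so the
# assignment above does not change where dlopen() searches in this process.
# Child processes, however, do inherit the new value:
subprocess.run([sys.executable, "-c",
                "import os; print(os.environ.get('LD_LIBRARY_PATH'))"])

# One workaround is re-executing the interpreter so the loader starts fresh:
# os.execv(sys.executable, [sys.executable] + sys.argv)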

  2. I understand that what patchelf does is read the .so file and write it back to the same path with certain changes (right?). If so, we should rewrite it once and copy the rewritten version thereafter. I don't think we should call an external process every time we open a new environment.

Yes, you are right, and we can do that. But I wasn't sure of the best way to do it, because:

  • Creating such a file in site-packages/nle may not be possible, as it may not be writable (for example, if NLE was installed with sudo to be used system-wide).
  • Also, in some rare cases, the location of the site-packages dir may change, so I believe it would be good practice to check from time to time whether this modified .so file still uses the right rpath, which again requires a call to readelf/patchelf or a similar tool. Sure, this can be done once per script start.
  • So the only possibility I see is to create one patched copy in tmp when nethack.py is imported (we duplicate the risk of leaving it there in case of an unclean exit, but maybe that's fine) and reuse it for all created NetHack objects; see the sketch after this list. What do you think about it, @heiner? Would that be better? Still, I think that calling patchelf every time is a simple and pretty good solution. I assume the cost of calling the process is small compared to the cost of the later usage of the environment instance. But maybe I'm wrong here. @BartekCupial, can you tell us if this version has a noticeable performance drop in a real use case, when multiple instances of the environment are used?
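
A sketch of that patch-once variant (hypothetical names; assumes patchelf is on PATH):

import shutil
import subprocess
import tempfile

_PATCHED_SO = None  # patched once per process, then reused

def get_patched_libnethack(src, bundled_libs_dir):
    global _PATCHED_SO
    if _PATCHED_SO is None:
        tmp = tempfile.NamedTemporaryFile(prefix="libnethack-",
                                          suffix=".so", delete=False)
        tmp.close()
        shutil.copyfile(src, tmp.name)
        subprocess.check_call(
            ["patchelf", "--set-rpath", bundled_libs_dir, tmp.name])
        _PATCHED_SO = tmp.name
    # Every NetHack instance copies this already-patched file; the copies
    # inherit the fixed rpath, so no further patchelf calls are needed.
    return _PATCHED_SO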

@mwydmuch (Author) commented Jun 9, 2024

Ok, so I ran a quick benchmark on my machine: calling patchelf every time a new NetHack object is created increases the construction time from ~0.003s to ~0.012s, so it is relatively costly. For comparison, a single env.reset takes ~0.002s, and a single env.step ~0.00004s (it's fast!). So yeah, maybe it's a good idea to reduce the number of patchelf calls.
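
For reference, a comparison along these lines could be reproduced with something like this (a hedged sketch; the constructor name nle.nethack.Nethack is assumed, and numbers are machine-dependent):

import timeit

t = timeit.timeit("nle.nethack.Nethack()",
                  setup="import nle.nethack", number=100) / 100
print(f"avg construction time: {t:.4f}s")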

@heiner (Owner) commented Jun 10, 2024

Thanks for the explanation. One more question: which specific file do we require the linker to find here? It wouldn't be the .so we patch with patchelf itself; instead, it would be a dependency loaded when that .so is loaded, right?

Now, I don't understand why such a dependency would reside in a special directory, but regardless, these dependencies are not reloaded a second time anyway, are they?

Re: how to modify it -- we could "simply" change the code to take a specific src instead of only a dest. At first, the src would be the system-installed copy, which we copy once (e.g., via the memfd_create hack) and patch; afterwards, we can copy that copy and no longer require the patching. That should leave no risk of leftover files either (that is the point of the memfd_create hack).

@heiner (Owner) commented Jun 12, 2024

I tried using linker namespaces for this back in ~2019 and moved away from it; I couldn't quite get it to work back then. A colleague at FB wrote an ELF interpreter that allowed isolation plus resetting the data section (for restarts). We got that to work, including on macOS via cross-compiling, but it's quite the machinery, and the current system works and is efficient.

The other alternative of course is that we could enforce the singleton nature of NetHack by simply not allowing multiple versions of NLE to run on the same machine.

I'd not do that.

@mwydmuch (Author)

Hi @heiner and @StephenOman, thank you for your comments!

I believe I overthought the initial solution to the problem. I should just go with static linking of bzip2: it's a ~100 KB library (while libnethack.so is over 3 MB), and it's linked in two places, libnethack and _pyconverter, so this way we duplicate the code. But after all this discussion, I think this is a small price to pay and maybe an overall better solution.

This way, nothing changes in nethack.py, as libnethack.so no longer links dynamically to 3rd-party libraries. I think it is unlikely that NetHack will start to require some other 3rd-party library in the future, so there is no need for a general solution for linking libraries bundled with binary wheels.

So now, this PR is only about the build changes. I added the newest version of bzip2 1.0.X to the source, as the static version of the library is not available as an RPM package, and I also had to write a simple CMakeLists.txt for it.

I hope you agree with that solution.

@heiner (Owner) commented Jun 12, 2024

If it's at all helpful, we could drop bzip2. It's not a requirement from NetHack; we added it because I thought compressing all the ttyrecs was a good idea. We could move that to Python or whatever.
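
Moving the compression to Python would be straightforward with the standard bz2 module (a sketch of the idea, with hypothetical file names):

import bz2
import shutil

# Compress a finished ttyrec after the fact, instead of having libnethack
# link against libbz2 and compress on the fly.
with open("episode.ttyrec", "rb") as src, \
        bz2.open("episode.ttyrec.bz2", "wb") as dst:
    shutil.copyfileobj(src, dst)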

@heiner (Owner) commented Jun 12, 2024

Anyway, this looks great. 0 objections or questions for this diff.

@heiner (Owner) commented Jun 12, 2024

Looks like in the case of Debian/Ubuntu, bzip2 comes with a libbz2.a. That might not be true on other distros, though. Given that it's not the first third_party dependency, and a simple one at that, I like this solution.

Great work Marek!

@heiner (Owner) commented Jun 12, 2024

It just occurred to me that the files added here might as well be an attack vector, as in the xz hack. How hard would it be to make this a git submodule instead? We could probably create the CMakeLists.txt on the fly?

@heiner (Owner) commented Jun 12, 2024

E.g., we could use https://github.com/heiner/bzip2 (a clone of git://sourceware.org/git/bzip2.git, which we could also try to use directly).

@mwydmuch (Author)

Looks like in the case of Debian/Ubuntu, bzip2 comes with a libbz2.a. That might not be true on other distros, though. Given that it's not the first third_party dependency, and a simple one at that, I like this solution.

Great work Marek!

Thank you, @heiner!

Yeah, I find it a bit odd, but CentOS (and, it seems, other distros from the RedHat family) doesn't provide a static version of bzip2. The manylinux2014 image for building these wheels is based on CentOS, and this, unfortunately, cannot be changed; this manylinux version is the most commonly used due to its high compatibility with even older distros.

It just occurred to me that the files added here might as well be an attack vector, as in the xz hack. How hard would it be to make this a git submodule instead? We could probably create the CMakeLists.txt on the fly?

I think a lot of people add the source of bzip2 directly to their projects, as it's very lightweight and it doesn't really need updating.

E.g., we could use https://github.com/heiner/bzip2 (a clone of git://sourceware.org/git/bzip2.git, which we could also try to use directly).

If you prefer it this way, sure. But I think creating CMakeLists.txt on the fly is not very elegant, so maybe we should add a proper CMakeLists.txt to your clone?

@heiner (Owner) commented Jun 12, 2024

As you prefer. I'm not against including it manually, but I'll compare the SHA-1s at some point to make sure that, wherever they are from, they are the bzip2 we are expecting.

@heiner (Owner) commented Jun 12, 2024

The other option, I guess, is that we include the bzip2 build logic in the main CMakeLists.txt?

@mwydmuch (Author)

@heiner I've replaced the local version with a submodule pointing to the original bzip2 repo at git://sourceware.org/git/bzip2.git, and I've added the build logic to the main CMakeLists.txt as you suggested. I think this is overall a better solution than somehow injecting a file into the directory. However, some CMake purists may not like it ;)

@heiner (Owner) left a review comment

Amazing! :shipit:

@StephenOman (Collaborator)

The tests and build-wheels actions all passed successfully this morning, although the aarch64 builds are quite slow (and there are four of them).

Should we restrict the wheel build & test to just the latest Python and x86_64 during the PR test suite, leaving the rest of them for when we're targeting a release to PyPI?

@mwydmuch (Author)

This is up to you. Indeed, building aarch64 wheels on QEMU is not very efficient. But maybe it is worth testing for at least one Python version, since C++ changes may break the build on one architecture but not the other.

Should I change that?

@mwydmuch (Author) commented Jun 14, 2024

Also, you may consider adding something like this at the beginning of the workflow file:

on:
  push:
    paths-ignore:
      - 'DEVEL/**'
      - 'dat/**'
      - 'doc/**'
      - 'docker/**'
      - '**.md'
      - '**.nh'
    branches: [main]
  release:
    types: [released]

This will trigger the workflow only when files related to the building of the package are changed (I am not sure if I listed the right paths).

@StephenOman (Collaborator)

This is up to you. Indeed, building aarch64 wheels on QEMU is not very efficient. But maybe it is worth testing for at least one Python version, since C++ changes may break the build on one architecture but not the other.

Ok, that's a good compromise. The x86 build is very fast, and the Python 3.11 aarch64 build takes ten minutes, which isn't too bad. I just want to avoid the situation where you'd have nearly 50-minute builds after each push.

@StephenOman added the "enhancement" (New feature or request) label on Jun 18, 2024
@StephenOman (Collaborator)

@mwydmuch In the interests of getting our version 1.0.0 out as soon as possible, I'm going to merge this as is (if you have no objections). We can put the suggested build workflow enhancements into a separate issue to be worked on afterwards.

@mwydmuch (Author)

@StephenOman Sorry for my lack of activity; I've been quite busy this week. I can add the change to build only 3.11 wheels when it's not a release now.

@mwydmuch (Author)

Ok, that should do it. I think it is a good idea to do a separate PR for updating the workflows a bit. E.g., test_package.yml runs the same tests as test_and_deploy.yml on macOS runners, for which one minute of running time costs as much as 10 minutes of Ubuntu runners on GH. There you could also save some resources.

@StephenOman merged commit d4b7da6 into heiner:main on Jun 22, 2024
17 checks passed