Distributing Pygrackle (and Grackle, itself) #220

Open
mabruzzo opened this issue Jul 8, 2024 · 7 comments

Comments

@mabruzzo
Collaborator

mabruzzo commented Jul 8, 2024

Once #182 and #208 are merged, I think we will be in a good position to consider packaging (to let people install things without manually building from source).
This issue primarily exists to get the discussion going.

This issue primarily focuses on packaging/distributing Pygrackle. It also touches on the idea of packaging Grackle itself. The latter is mostly discussed as a hypothetical -- it is mentioned primarily to highlight that some challenges are common to both scenarios (and that we may want to pick solutions that would help in both).

The biggest challenge in most scenarios is the fact that Grackle requires the hdf5 library. A lesser issue, common to both scenarios, is the convenient distribution of datafiles (as acknowledged in #210).

Packaging/Distributing Pygrackle

I think there are 2 primary things we want to do here:

  1. Distribute through conda-forge.
  • The hdf5 dependency should not be an issue in this scenario (one of the reasons conda exists is to help address this sort of dependency)
  • Two questions exist:
    1. Is there any advantage to creating separate conda packages for
    2. Is there any advantage to waiting until we start distributing on PyPI
  2. Distribute through PyPI
  • The hdf5 dependency is an issue in this scenario. The proper way to distribute Grackle in its current state is to distribute it with a copy of the hdf5 library on every platform -- this is exactly what the likes of h5py and PyTables do
  • The way I see it, we have 3 options:
    1. Suck it up and distribute a copy of hdf5 with pygrackle
    2. We could refactor Grackle to support a mode where we use dlopen to dynamically load the hdf5 library at runtime. This is what the hdf5plugin package does with h5py
      • For the uninitiated, this is the mechanism that Python itself uses to load extension modules. It is also the mechanism used by the ctypes module
      • What I envision (a rough sketch follows this list):
        • we would refactor all hdf5 function calls scattered throughout Grackle to use function pointers stored in some kind of internal struct.
        • Then, during initialization, we would provide Grackle with the location of the hdf5 library. The initialization routine would use dlopen with this information to open the hdf5 library at runtime and would initialize the function pointers in that internal struct
        • Pygrackle, in particular, would use the version of hdf5 that ships with h5py
          - The main disadvantage: I think that dlopen has some weird platform-dependent behavior
    3. We could do something extremely similar to option 2, but make the Cython layer responsible for determining the function pointers (or providing wrappers around them). In other words, we wouldn't need to invoke dlopen directly
  • It's worth mentioning that options 1 and 2 may be relevant if we decide we want to distribute precompiled Grackle binaries
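To make option 2 more concrete, here is a rough sketch of the kind of function-pointer table I have in mind (every name here is a hypothetical placeholder, not an actual Grackle API):

/* hypothetical sketch: a table of HDF5 function pointers filled in via dlopen */
#include <dlfcn.h>  /* dlopen, dlsym (POSIX) */
#include <hdf5.h>   /* for hid_t/herr_t; see the ABI caveat in a later comment */

struct grackle_h5_fns {
  hid_t  (*fopen)(const char *name, unsigned flags, hid_t fapl_id);
  herr_t (*fclose)(hid_t file_id);
  /* ... one entry per HDF5 function that Grackle's internals call ... */
};

/* called during Grackle initialization with the path supplied by the caller
 * (e.g. Pygrackle would pass the location of the libhdf5 bundled with h5py) */
static int init_h5_fns(struct grackle_h5_fns *fns, const char *libhdf5_path)
{
  void *handle = dlopen(libhdf5_path, RTLD_NOW | RTLD_LOCAL);
  if (handle == NULL) return 0;
  fns->fopen  = (hid_t (*)(const char *, unsigned, hid_t))dlsym(handle, "H5Fopen");
  fns->fclose = (herr_t (*)(hid_t))dlsym(handle, "H5Fclose");
  return (fns->fopen != NULL) && (fns->fclose != NULL);
}

Grackle's internals would then call fns->fopen(...), fns->fclose(...), etc. instead of referencing the HDF5 symbols directly.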

Packaging/Distributing Grackle

This is all a much lower priority than shipping Pygrackle, but it's worth briefly mentioning that we could also distribute precompiled Grackle binaries. I've seen this sort of thing done alongside new software releases for other open-source projects on GitHub.

Since we will already be doing a lot of work to ship Grackle as a precompiled binary alongside Pygrackle, I actually think this wouldn't be hard to do. Moreover, the new CMake build-system should be compatible with the CPack tool.

  • I don't know a ton about CPack, but it is essentially a module that ships with CMake and makes it possible to easily package CMake projects in commonly used formats (like .tar.gz, .deb, .rpm, etc.). A snippet showing basic usage follows this list.
  • Aside: If we went down this road, I don't think the Grackle project should necessarily be in the business of distributing anything other than a .tar.gz file for simplicity... (I think that would minimize any maintenance burden)
  • And all of the established best-practices that I observed while designing Grackle's CMake build-system were recommended by the CMake developers with the goal of making it easy to package software with CPack.
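For reference, enabling a bare-bones CPack tarball only takes a few lines in a top-level CMakeLists.txt. A sketch (the variable values are illustrative placeholders, not Grackle's actual configuration):

# minimal CPack usage sketch -- values are illustrative placeholders
set(CPACK_PACKAGE_NAME "grackle")
set(CPACK_PACKAGE_VERSION "${PROJECT_VERSION}")
set(CPACK_GENERATOR "TGZ")  # only produce a .tar.gz, per the aside above
include(CPack)              # must come after the CPACK_* variables are set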

Again, this is a hypothetical, low-priority idea. But there are 2 reasons I bring it up:

  1. The hdf5-dependency problem crops up here. In light of that, it might make sense to pick a solution that addresses this case AND the Pygrackle case (instead of requiring 2 separate solutions).
  2. Data-files are also an issue here. Consequently, it might make sense to write the python code that we would ship in Pygrackle for fetching the datafiles in a portable way that could be reused later (maybe we could implement the functionality in a self-contained file with an if __name__ == "__main__" section, so that the file could be installed alongside Grackle as a self-contained, executable script serving as a command-line tool for fetching data files?)1

Footnotes

  1. @matthewturk previously suggested using pooch. Using that module probably wouldn't be ideal for the case of distributing Grackle (without Pygrackle). For portability, it might be better to rely upon functionality from the python standard-library, rather than the pooch package. Or, since distributing Grackle itself is mostly a hypothetical, maybe we should use pooch and limit ourselves to a subset of pooch-functionality that we could re-implement later in terms of standard-library functionality (if/when it becomes an issue).

@mabruzzo
Collaborator Author

mabruzzo commented Jul 8, 2024

Upon reflection, I don't think the dlopen approach would work in a robust, forwards-compatible manner. The problem is that invoking the HDF5 functions requires a number of macros defined in the headers, such as H5F_ACC_RDONLY, H5P_DEFAULT, H5S_ALL, H5T_C_S1, H5T_STD_I32BE, H5T_STD_I64BE, H5T_IEEE_F32BE, H5T_IEEE_F64BE, H5T_STD_B8BE, H5T_NATIVE_INT, H5T_NATIVE_LLONG, H5T_NATIVE_FLOAT, H5T_NATIVE_DOUBLE, and H5T_NATIVE_LDOUBLE.
These are all API-stable, but not ABI-stable -- the associated underlying values are technically allowed to change from one version to the next.1

Consequently, the hdf5-distribution problem actually has the following possible solutions:

  1. We don't try to resolve the issue. Instead we simply upload an sdist (source distribution) to PyPI (pip install pygrackle will download the sdist and will try to build grackle and pygrackle locally)
    • if we wanted to be able to ship Grackle itself without resolving the issue, we would probably need to move all hdf5-related code out of the source files and put it into inline functions defined in the header files (consequently, compiling Grackle itself wouldn't involve any HDF5 headers or linking)
  2. We simply bite the bullet and package hdf5 as part of pygrackle (and Grackle if we ever choose to distribute it)
  3. We adopt the solution from the original post where we pass in structs with function pointers (but don't use dlopen):
    • Pygrackle would query h5py to determine these function pointers (I believe h5py is written in a way that we could dynamically query the values associated with the various macro quantities)
    • We could ship a public header-file that includes the hdf5 header and provides an inline function that constructs the struct of function pointers for the downstream application (a sketch follows this list)
    • EDIT: upon reflection, this approach introduces some challenges with using Fortran bindings. To support them, we would probably need to ship a separate shared library (that directly links to hdf5) for this purpose. I'm very hesitant to do this (it seems like more maintenance), but it actually isn't as crazy as it might sound -- if we want to easily support the "dynamic api" in Fortran, we would probably need to ship a separate shared library anyway
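To illustrate what the header from option 3 might look like, a sketch (every name here is a hypothetical placeholder, and the real struct would cover many more functions and macro values):

/* grackle_h5_bridge.h -- hypothetical sketch, not an actual Grackle header */
#include <hdf5.h>

typedef struct {
  hid_t  (*fopen)(const char *, unsigned, hid_t);
  herr_t (*fclose)(hid_t);
  hid_t    h5p_default;     /* captures the H5P_DEFAULT macro value ... */
  unsigned h5f_acc_rdonly;  /* ... and H5F_ACC_RDONLY, at the caller's compile time */
} grackle_h5_bridge;

/* defined inline so the HDF5 symbols and macro values get resolved in the
 * downstream application's compilation unit, not inside libgrackle */
static inline grackle_h5_bridge grackle_make_h5_bridge(void)
{
  grackle_h5_bridge b;
  b.fopen = &H5Fopen;
  b.fclose = &H5Fclose;
  b.h5p_default = H5P_DEFAULT;
  b.h5f_acc_rdonly = H5F_ACC_RDONLY;
  return b;
}

The downstream application (or Pygrackle's Cython layer) would call this and hand the resulting struct to Grackle's initialization routine.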

There are some other crazier solutions too (probably not worth going into)

For now, I say we ship an sdist and punt on the rest of the hdf5 issues.

Footnotes

  1. In principle, we could maintain a small database where we record the values of each of these macros for each of hdf5's minor versions (e.g. 1.6, 1.8, 1.10, 1.12, 1.14, etc.), but I don't really think we want to do that...

@mabruzzo
Collaborator Author

Some additional questions:

  • What Grackle artifacts do we ship with Pygrackle beyond a shared library?
  • In general, it doesn't seem very idiomatic to ship more than is necessary from Pygrackle (via PyPI).
    • The main modules that include headers seem to be cases where C extensions are hand-coded and the headers are used to link against the extension (basically the numpy scenario). But this presents challenges when it comes time to build
    • On the one hand, if treating Pygrackle as a distribution mechanism for Grackle helps, then I'm not sure it's a bad thing...
    • On the other hand, I'm hesitant to wind up in a scenario where people come to depend on Pygrackle as a distribution method for Grackle -- and for us to later decide we want to do things "right" with shipping Pygrackle. At that point, we would be stuck distributing more than necessary
  • I guess one question is whether we distribute grackle_defs.pxd?
    • If yes, then we probably need to ship all of Grackle's public headers.
    • I tend to think we should be conservative and initially avoid distributing grackle_defs.pxd (if it's something people want, we can revisit).
  • If we're already distributing the shared library and headers, a piece of me thinks we should also provide grackle.pc files and GrackleConfig.cmake files (but then you encounter issues with the fact that different ways of building Pygrackle wouldn't necessarily provide this metadata). A sketch of what a minimal grackle.pc might contain follows this list.
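For reference, a pkg-config file for Grackle would be quite small. A purely illustrative sketch (the paths, version number, and the hdf5 requirement line are placeholders, not values taken from the actual build system):

# grackle.pc -- illustrative sketch only
prefix=/usr/local
libdir=${prefix}/lib
includedir=${prefix}/include

Name: grackle
Description: chemistry and radiative cooling library for astrophysical simulations
Version: 0.0.0
Libs: -L${libdir} -lgrackle
Cflags: -I${includedir}
Requires.private: hdf5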

As I talk this out, I posit we should ship as little as possible in a normal wheel. To experiment with shipping more, we should distribute Grackle and pygrackle separately on conda-forge

@matthewturk
Contributor

I'm late to this, and it seems that you have this largely under control. My main comment to make is that I agree with your assertion that dlopen is not a workable solution.

I believe that we should ship grackle_defs.pxd, but I am willing to concede that this may not be universal. I have in the past found myself wanting .pxd files and I think I remember actually using the grackle pxd's in a separate project. However, that is likely to be the extreme end and I have no problem not using grackle wheels in that unlikely event.

@mabruzzo
Collaborator Author

mabruzzo commented Jul 12, 2024

(Note: this comment was heavily revised for clarity)

I believe that we should ship grackle_defs.pxd, but I am willing to concede that this may not be universal. I have in the past found myself wanting .pxd files and I think I remember actually using the grackle pxd's in a separate project. However, that is likely to be the extreme end and I have no problem not using grackle wheels in that unlikely event.

I'm happy to do that. That makes some other decisions easier.

As I understand it (and please correct me if I'm wrong), the cythonize functionality needs to know where to find grackle_defs.pxd and any headers referenced therein. I think we would need to provide a function like numpy.get_include1 or mpi4py.get_include.

There is a wrinkle here. There are actually 2(ish) ways to build pygrackle with the new build-system, and the location of the Grackle headers depends on which way is used. These approaches are:

  1. an approach where we freshly compile Grackle and embed it inside the pygrackle repository. This is the default approach (if we ever upload pre-compiled binary wheels to PyPI, they would also be built this way)
  2. an approach where we link to an existing grackle-build (mostly for testing purposes). There are a couple of flavors to this approach: (i) link against a classic style build or (ii) linking against Grackle in a cmake-build-directory or (iii) linking against a Grackle installation produced by cmake. (From pygrackle's perspective, there is no difference between ii and iii)

The problem is that the Grackle headers will be in different locations for both cases (we can't just hardcode them into Pygrackle).

  • Since the build-system needs to infer these locations to compile Grackle, anyway, this is a surmountable problem.
  • An important distinction from the 2 linked cases (where get_include only returns a string) is that Pygrackle's analogous function would need to provide a list of header directories: the directory containing the .pxd file and the directory(ies) containing Grackle's public headers.

For concreteness, I'm talking about adding functionality that would allow an Extension Module compiled against Pygrackle to have a setup.py file that looks like:

from setuptools import Extension

import pygrackle
#...
Extension('my-extension-name', #...
          include_dirs=pygrackle.get_include_list())

At the moment, the best way to do this sort of thing isn't clear to me... If I generated a toml or ini file, do you know whether this is generally a use case for importlib.resources?

EDIT: it turns out scikit-build-core has some trouble right now with editable installs and importlib.resources. To implement this functionality, I would instead make a template-file called __config__.py.in and have the build-system inject build-data into this file. numpy and scipy do the exact same thing (they use meson functionality with a direct CMake counterpart).
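For concreteness, the template could look something like this (the file layout, variable names, and substitution tokens are all hypothetical; CMake's configure_file, or the scikit-build-core equivalent, would expand the @...@ tokens at build time):

# __config__.py.in -- hypothetical sketch of the template described above
# (the build system substitutes the @...@ tokens when generating __config__.py)
_PXD_DIR = "@PYGRACKLE_PXD_DIR@"
_GRACKLE_INCLUDE_DIRS = "@GRACKLE_INCLUDE_DIRS@".split(";")

def get_include_list():
    """Return the directories needed to cythonize/compile against pygrackle."""
    return [_PXD_DIR, *_GRACKLE_INCLUDE_DIRS]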

Footnotes

  1. Aside: that webpage references the fact that they ship a pkg-config file for their numpy headers (i.e. numpy.pc). From inspecting the file, it looks like that doesn't provide much info. But, maybe we should also provide a pygrackle.pc file (especially since Introducing Experimental Supplementary CMake Build System #182 exposes machinery for a grackle.pc file).

@mabruzzo
Collaborator Author

I'm late to this, and it seems that you have this largely under control. My main comment to make is that I agree with your assertion that dlopen is not a workable solution.

I believe that we should ship grackle_defs.pxd, but I am willing to concede that this may not be universal. I have in the past found myself wanting .pxd files and I think I remember actually using the grackle pxd's in a separate project. However, that is likely to be the extreme end and I have no problem not using grackle wheels in that unlikely event.

@matthewturk I've thought about this some more, and now I think I can articulate my concerns a little better.

Some Examples:

Off the top of my head, I can name 3 popular python packages that include .pxd files inside their source directory and are linked against standalone software. I think they correspond to 3 representative cases:

First, there is numpy:

  • under the hood, it privately links against one of multiple different linear algebra libraries (e.g. MKL, OpenBLAS, etc.). When building from source, these can be free-standing libraries. When you download a binary wheel from pypi, at least one of these is presumably bundled into numpy
  • it distributes .pxd files with the goal of supporting numpy's C-API. The .pxd files don't reference any of the header files from the external private dependencies. Instead they only reference the headers associated with numpy's C-API and Python's headers.
  • Consequently, numpy.get_include only needs to provide the path to the directory with numpy's C-API headers.

Next, there is mpi4py:

  • it obviously links against mpi. To my knowledge, mpi4py NEVER ships a precompiled wheel that internally includes an mpi installation.
  • The wheel distributes .pxd and .pxi files that definitely reference the <mpi.h> header (as well as mpi4py's internal C headers)
  • mpi4py.get_include only provides the path to the internal headers. End users trying to compile an extension module are entirely on their own with making sure that the appropriate paths for the mpi-headers are known to the build-system

Finally, there is h5py:

  • it obviously links against hdf5. If you build from source, you can link against an external copy of hdf5. If you download a wheel from PyPI, a bundled copy of hdf5 comes with it
  • while .pxd files are internally used to implement h5py, they are explicitly excluded from the wheel

Relating this to pygrackle

If the contents of Pygrackle's .pxd file(s) only referenced internal extension types, without mention of any grackle headers, I would have no concern with including the .pxd file(s) in the wheel. In that case, we would effectively be in the numpy scenario.

In reality, the contents of Pygrackle's .pxd file(s) directly wrap the grackle headers.

  • For that reason, we actually have a scenario more like the mpi4py or h5py cases.
  • If this were exactly like the mpi4py case (i.e. the external library wrapped by the .pxd files was always externally provided -- and consequently the header location was clearly the installer's responsibility), I would be less concerned

However, Pygrackle firmly lies in the h5py case. For that reason, we are in this weird in-between:

  • when grackle is embedded within the Pygrackle wheel, we the developers are responsible for maintaining all headers within Pygrackle.
  • when Pygrackle is built against an external Grackle installation, we can provide the locations where the headers were located during installation, but it's ultimately the installer's responsibility to make sure that the headers don't get deleted or moved.

I guess I'm just worried about providing an inconsistent experience (when it comes to python-installation stuff I would like things to "just work"). I'm also a little worried that I may be missing other obvious cases (related to python packaging) where things could go wrong.

What do we do?

I would feel a lot more comfortable with all of this if we knew of a popular package in a scenario similar to h5py's that does distribute its .pxd files in the wheel. Is anyone aware of one?

I'm also willing to accept that these concerns are overblown. One might argue that anyone who is actually linking an Extension Module should be able to understand all of these issues.1 Furthermore, unless someone is linking against an external classic-build, they need to go out of their way to delete/move the headers, while ensuring that Pygrackle continues to work.2

A few other thoughts:

  • we could reframe the documentation to state that the only officially supported way to consume Pygrackle is with Grackle embedded inside it. In other words, the support for linking against an external Grackle copy is only intended for internal use (i.e. for running integration tests)
  • crazy idea: since grackle_defs.pxd ONLY references grackle headers (and doesn't reference any Pygrackle functions/types), what if we ship it as a part of grackle, inside the include directory? I'm not totally sure this really solves our problems... (we might also want to rename it grackle_defs.pxi to avoid problems).
  • we could write a slightly more sophisticated pygrackle.get_include_dirs function that tries to provide informative errors if things go wrong. Maybe we have the function throw an error by default if grackle isn't embedded within the wheel, with an informative error message that explains the potential ways things can go wrong and how to override this behavior (maybe via a default function argument)?3 But I might just be over-complicating things. (A sketch follows this list.)
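A minimal sketch of that last idea (everything here is hypothetical, including the flag name and where the build-time values come from):

# hypothetical sketch of a more defensive pygrackle.get_include_dirs
from .__config__ import _GRACKLE_IS_EMBEDDED, _HEADER_DIRS  # recorded at build time

def get_include_dirs(allow_external_grackle=False):
    if not _GRACKLE_IS_EMBEDDED and not allow_external_grackle:
        raise RuntimeError(
            "This pygrackle build links against an external Grackle, so the "
            "header locations recorded at build time may have moved or been "
            "deleted. Pass allow_external_grackle=True to use them anyway."
        )
    return list(_HEADER_DIRS)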

Footnotes

  1. with that said, this doesn't really help an arbitrary end-user of that Extension Module.

  2. unless you really know what you're doing (in which case, you probably can understand these challenges), it seems extremely unlikely that you would want to build an external-grackle-copy, link that to pygrackle, and then delete everything other than pygrackle and the libgrackle.so.

  3. For any standard package-manager situation where pygrackle and grackle are distributed as separate packages (e.g. perhaps the conda packages?), we would probably want a way to avoid raising these errors.

@matthewturk
Contributor

I think you make a compelling case that this is too complex to ship the pxds. I think that I am uncertain about the actual difficulties of distributing grackle via pypi, but I think it is worthwhile to say that linking requires extra effort on the part of the person doing the linking if they want to use the pypi package. I think going ahead with the simplest solution with the lowest probability of errors is the appropriate course.

@mabruzzo
Collaborator Author

mabruzzo commented Aug 2, 2024

I wanted to record a few thoughts that I've had about this topic over the past few weeks (hopefully this is semi-coherent)

Area I: Distributing binary wheels from PyPI

Distributing binary wheels from PyPI may be a lot more doable than I originally thought! My primary concern has always been with cross-compiling hdf5 (I also have a much better understanding of the cross-compilation process than I used to)

I was concerned because the scripts used by h5py and pytables for this purpose are fairly complex (things seemed tricky in the julia packaging ecosystem as well)

With that said, it looks like a lot of effort has gone into making hdf5 version 1.14.y.z easier to cross-compile (e.g. see HDFGroup/hdf5#1203 and other issues linked therein)

Area II. What to include in a wheel

In light of the fact that we don't initially plan to ship the cython pxds and the general issues described in the next section related to using a precompiled grackle library for most simulations, I'm inclined to just ship the bare minimum in the wheels (at least to start). I'm definitely happy to revisit this (especially if somebody has a particular application in mind), but I think it makes sense to initially be conservative on this choice.

Area III. Obstacles to distributing precompiled libgrackle.[so|dylib|a] for general-use (e.g. for compiling a simulation code)

These obstacles are more obvious in hindsight, and I have already talked about some of this elsewhere in this thread. But I think it would be helpful to restate some of this for concreteness (especially now that I appreciate the full-scope of the issue).

General-use refers to using the pre-compiled library for compiling a simulation code.

The primary obstacle

The primary obstacle relates to the fact that a precompiled version of grackle needs to be linked against an ABI-compatible version of its runtime dependency, hdf5 -- if you compile against hdf5 1.x.y.z, problems arise if you link against a library where some version digit (other than z) has changed.1 (This is also true for the Fortran runtime library, used by the internal routines, but that is less of an issue2)

Let's consider the simplest general-purpose grackle distribution mechanism that would work on the largest number of systems: simply distribute a tarball containing:

  • the pre-compiled grackle library
  • the public grackle-headers (if you want to support general-usage of this distribution)
  • Since we can't know which version (if any) the end user has of libhdf5.so (or the fortran runtime library) on their system, we would also need to include libhdf5.so (and libgfortran.so) and any runtime-libraries that libhdf5.so (libgfortran.so) depends upon inside the tarball

In fact, this distribution mechanism would work for developing any pure-C/C++ downstream application that doesn't directly link to hdf5 or its dependencies (or link to dependencies that directly link to hdf5/hdf5's dependencies). It can also be used with a Fortran code that is compiled with a recent-ish gfortran version. In other words, it works with the examples shipped with Grackle or simulation codes that don't use an hdf5 output format.

Problems arise if you try to compile a downstream application that links against the precompiled grackle binary and also links against an incompatible hdf5 version. In more detail, these problems arise because Linux and most (all?) mainstream unix-like operating systems3 construct a global table of symbol names, and it's problematic to link against 2 libraries that provide conflicting definitions of a symbol with a given name.

Overcoming these obstacles

This issue is hardly unique to Grackle. There are 2 options that don't require refactoring (but neither is particularly attractive):

  1. Include in the tarball everything needed to have downstream applications compile against the included version of libhdf5. There are 2 issues:

    • I don't think we should be in the business of providing a full HDF5 distribution -- a lot could go wrong.
    • A lot of custom behavior may need to be injected into the makefile build-systems of downstream applications (it would be straightforward if downstream applications use cmake, but there are convenient alternatives in those cases).
  2. Distribute Grackle via package managers like apt/dpkg, brew, conda4, dnf, etc. (i.e. we don't provide a tarball). In this scenario, we precompile grackle against the version of hdf5 provided by the package manager, and we would require that downstream simulations compile against that same hdf5 version. Disadvantages include:

    • it would be too much work for us to maintain a bunch of individual packages (realistically, we would probably just pick 1 or 2).
    • issues might arise with respect to parallel-hdf55

If we ultimately decide this is something worth doing, we could refactor the source code to better support this scenario. Currently, Grackle only has 2 functions that directly reference the HDF5 API: initialize_UVbackground_data and initialize_cloudy_data.6 Basically, we could move the definitions to one of two different locations:

  1. We could move these 2 function definitions into a separate library -- let's call it libgrackle-h5tools.so. We would always ship libgrackle-h5tools.so alongside libgrackle.[so|a] and we would have libgrackle.so access these 2 functions by using dlopen, which lets us avoid any symbol collisions.7

    • this is a solution that would be fairly portable to any POSIX-compliant system (on Windows there is alternative, equivalent functionality).
    • Unlike the other use of dlopen described earlier in this thread, this usage would be more robust since we always know the relative path of the shared library it is used to open (a sketch of the path-resolution logic follows this list)
    • Advantage: The build-systems of simulation codes wouldn't need any modifications
    • EDIT: Actually, I think this gets a little messy if you compile grackle as a static library (related to specifying the path to the dlopened library)
    • Disadvantage: The way to specify the relative paths between the 2 directories may be somewhat platform-dependent, but I don't think addressing this would be terribly difficult.
    • Disadvantage: The naive/straightforward implementation would introduce an extra filesystem access for each process in a simulation that loads grackle (this may be meaningful on certain fragile parallel filesystems)...
  2. We could move these 2 function definitions into the public headers and make them inline functions (so that they are defined within the binary of the simulation code rather than in grackle). To be able to support Fortran bindings, we would need to ship a second version of the grackle library alongside the main one, which Fortran codes would explicitly link to (since Fortran codes can't make use of the inline function definitions).

    • Advantage: This is guaranteed to be more portable across systems
    • Disadvantage: This requires modifications to the build-systems of most simulation codes that aren't built with cmake (if pkg-config is used, modifications may not be necessary). A Fortran simulation code will need to change the -L linker flag. A C/C++ code that doesn't use hdf5 outside of grackle will need to pass new -I flags to specify the hdf5 header locations and pass -lhdf5. Other C/C++ codes may also require similar changes
    • Disadvantage: The naive/straightforward implementation would force us to keep certain implementation details exposed (and would undermine efforts to hide these details and achieve a stable ABI).
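To illustrate how option 1 could always know where to find the helper library on a POSIX system, the core of the lookup might look roughly like this (a sketch only; the library name, function name, and buffer sizes are placeholders):

/* hypothetical sketch: locate libgrackle-h5tools.so next to libgrackle.so */
#define _GNU_SOURCE         /* for dladdr on glibc */
#include <dlfcn.h>
#include <libgen.h>         /* dirname */
#include <stdio.h>
#include <string.h>

static void *open_h5tools(void)
{
  Dl_info info;
  /* ask the dynamic linker which file the currently-executing code lives in */
  if (dladdr((void *)&open_h5tools, &info) == 0 || info.dli_fname == NULL)
    return NULL;
  char dir[4096];
  strncpy(dir, info.dli_fname, sizeof(dir) - 1);
  dir[sizeof(dir) - 1] = '\0';
  char path[4352];
  snprintf(path, sizeof(path), "%s/libgrackle-h5tools.so", dirname(dir));
  return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}

Note that when grackle is built as a static library, dli_fname points at the simulation executable rather than libgrackle.so, which is exactly the messiness the EDIT in the list above alludes to.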

Footnotes

  1. This is true whether you pre-compile grackle as a shared or static library. I have been

  2. At one time, I was more concerned about linking libgfortran.so. But, I have since learned that libgfortran.so employs symbol versioning, like the standard c runtime library on linux (glibc), in order to mitigate ABI compatibility issues. If a downstream application contains Fortran-code, that Fortran code would need to be compiled with gfortran 8 or newer (released 6 years ago), which is reasonable. It's worth mentioning that openmp runtime libraries pose a larger problem than hdf5, but that is a different topic that we can ignore right now.

  3. As I understand it, this isn't actually an issue on Windows

  4. In this case, there would be separate grackle and pygrackle conda packages

  5. I don't know enough about the internals of hdf5, but since grackle doesn't depend on mpi, it's plausible this isn't a major issue.

  6. This pending debug-tool introduces another one.

  7. The CPython interpreter uses dlopen to open extension modules. This is why we can currently use versions of h5py and pygrackle together that were compiled against different (incompatible) hdf5 versions.
