
feat(bzlmod): support patching 'whl' distributions #1393

Merged (3 commits) on Oct 20, 2023

Conversation

@aignas (Collaborator) commented Aug 29, 2023

Before this change, users had to patch the actual wheel files and upload
them as different versions to internal artifact stores if they needed to
modify a wheel's dependencies. This is very common when breaking dependency
cycles in pytorch or apache-airflow packages. With this feature we can
support patching external PyPI dependencies via the pip.override tag class
to fix package dependencies and/or a broken RECORD metadata file.

Overall design:

  • Split the whl_installer CLI into two parts - downloading and extracting.
    Merged in refactor(whl_library): split wheel downloading and extraction into separate executions #1487.
  • Add a Starlark function which extracts the downloaded wheel, applies patches,
    and repackages the wheel (so that the extraction part works as before).
  • Add an override tag_class to the pip extension and allow users to pass patches
    to be applied to specific wheel files.
  • Only the root module is allowed to apply patches. This avoids far-away modules
    modifying the code of other modules, and conflicts between modules and their patches.

Patches have to be in unified-diff format.
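
For illustration, a dependency-pruning patch of the kind described above can be generated with Python's difflib; the package name and METADATA contents below are hypothetical stand-ins, not taken from a real wheel:

```python
import difflib

# Hypothetical METADATA contents: drop a dependency that creates a cycle.
original = """\
Metadata-Version: 2.1
Name: apache-airflow
Version: 2.6.0
Requires-Dist: attrs (>=22.1.0)
Requires-Dist: opentelemetry-exporter-otlp
"""

patched = """\
Metadata-Version: 2.1
Name: apache-airflow
Version: 2.6.0
Requires-Dist: attrs (>=22.1.0)
"""

# Emit a unified diff; the a/ and b/ prefixes make it applicable with patch_strip = 1.
patch = "".join(difflib.unified_diff(
    original.splitlines(keepends=True),
    patched.splitlines(keepends=True),
    fromfile="a/apache_airflow-2.6.0.dist-info/METADATA",
    tofile="b/apache_airflow-2.6.0.dist-info/METADATA",
))
print(patch)
```

The resulting text is exactly the unified-diff format the tag class expects; note that patching METADATA alone leaves the RECORD checksum stale, which is why the feature also regenerates RECORD.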

Related #1076, #1166, #1120

@aignas (Collaborator, Author) commented Aug 29, 2023

NOTE: This PR is to initiate the discussion of what the API for patching whl_library could look like. Tests and better examples with real patches can follow once the direction is agreed upon.

@groodt, @rickeylev, could you please take a look?

@aignas (Collaborator, Author) commented Aug 30, 2023

So I decided not to use the annotation API in bzlmod, to make it better and more similar to how bazel-gazelle does it. This is obviously still missing tests and hardening, but I wanted to see how examples/bzlmod/MODULE.bazel would look with this.

I am looking for feedback about which option would be better. @fmeum could you also please chime in on designing this extension/API here?

  • Using the old package_annotations API to pass in patches from the outside looks like this.
    • Drawbacks:
      • The user has to supply patches as an argument to each pip.parse extension call they would do.
      • We need to serialize the patches to JSON as we cannot route the labels to the whl_library.
    • Benefits:
      • We reuse the same API
  • Using a new pip.module_override API to pass in patches from the outside looks like this.
    • Drawbacks:
      • A new API that has to be documented and designed.
    • Benefits:
      • With a single call we can make overrides for any combination of {all Python versions, a single version} and {all hub repos, a single hub repo}.
      • We can add further validation that only the root module is calling the API.
      • We don't need a call to use_repo(pip, "whl_mods") because we aren't generating any extra files.
      • We can potentially pass patch files as labels from an external repo if needed.

@aignas aignas marked this pull request as draft August 31, 2023 00:25
@fmeum (Contributor) commented Sep 1, 2023

I don't know much about Python, but the pip.module_override API sounds much nicer to me!

Is module the correct term here or should this rather be called wheel_override?

@groodt (Collaborator) commented Sep 4, 2023

I've not had time to look in depth, but unified-diff patch files are a similar UX to what I've been advocating for, and similar to what we use at my $dayjob. They're far more flexible and likely to support more use-cases, with well-understood semantics, than a series of custom annotations for every independent use-case like we've been adding in the past.

Are you applying the patches before or after wheels are unpacked / installed?

@aignas (Collaborator, Author) commented Sep 5, 2023

@groodt, thanks for the comment, the patches are applied after the wheels are unpacked so that we can also patch the BUILD.bazel files. Do you think doing the patching before unpacking would make more sense?

@DavidZbarsky-at commented:

It can be nice to apply the patch before install time in case you are modifying the setup.py

@groodt (Collaborator) commented Sep 5, 2023

There’s generally benefit to being able to patch the following:

  1. bdist contents (wheels)
  2. sdist contents (tarballs of source)
  3. The third-party dependency closure itself (optional)
  4. The generated BUILD.bazel files (optional)

If there is a capability to patch as early as possible, then there’s almost no need for option 3 and 4, because if one is able to modify all source and metadata from the foreign artifacts, then one is able to influence the structure of the dependency closure and generated BUILD files as well (albeit a little awkwardly).

At $dayjob, we optimise for patching the wheels, because most of the ecosystem distributes wheels (and it’s easy enough to populate a private wheel repository where required).

I personally would prefer to patch earlier, but maybe we need a lifecycle of patching? Sounds complicated though. Patches for pre-install and post-install of the artifacts? I know what my preference is, but would love to hear opinions.

And then maybe this is slightly controversial. For BUILD files, I would prefer the semantics of http_archive. I would prefer to specify “build_file” or “build_file_content” to fully replace the generated default. Hear me out. 😂 The defaults should be simpler and often work well enough so that it’s rare to swap them out. Also, with the capability to patch the foreign artifacts, it’s also possible to influence the generated BUILD.bazel. So, in a world where we simplify and remove a lot of the code generated in repository rules (move the convenience items to user land), we often won’t need to patch BUILD.bazel files. I think having to reason about patching generated code is just too complicated otherwise.

Thoughts?

@aignas (Collaborator, Author) commented Sep 6, 2023

My thinking is:

  1. bdist: Patching this would be ideal, but we would have to heavily refactor wheel_installer.py and its usage; maybe it is worth it.
  2. sdist: Patching would be good, but we would have to refactor the pip wheel call, which fetches the sdist and builds it, into separate steps that fetch the sdist, extract it, patch it, and then build it. This is quite involved and I would love to keep it out of rules_python until there is a better, more hermetic way to build from sdist. People can always build the sdist themselves, upload the wheel somewhere, and we are back to 1.
  3. The dependency closure itself should be managed via unified diffs; however, this could also handle a case where you decide to use bazel to build from sdist and need to do some dependency rewiring.
  4. Generated BUILD.bazel files might need to be patched to expose extra things, e.g. numpy headers, which are used by pytorch if I remember correctly. Right now users can employ package_annotations for this and supply additional_content.

I agree that having 1. and 3. would be ideal, but right now the PR implements only 3. because of the complexity involved in getting 1. to work. I think you have hinted at a great name for a possible API:

# extract the `whl`, modify its contents, regenerate the `RECORD` metadata file, and zip everything back together
pip.whl_override(
    hub_name = "pip",
    patch_strip = 1,
    patches = [
        "@//patches:METADATA.patch", # patching dependencies
    ],
    python_version = "3.10",
    whl_name = "requests",
    # extra parameters that may help us match the wheel filename, defaults could be:
    # not sure about the API here yet...
    # platform = "*",
)

# modify BUILD.bazel contents and the extracted `whl`
pip.whl_library_override(
    hub_name = "pip",
    patch_strip = 1,
    patches = [
        "@//patches:empty.patch",
        "@//patches:requests.patch",
    ],
    python_version = "3.10",
    whl_name = "requests",
)

For bdist_patching I am thinking that we could employ the hack outlined in pytorch/pytorch#99622 (comment). This could be scripted and could be compatible with the work that @philsc is doing as hinted by #1403.

@aignas aignas force-pushed the feat/1076/pip-parse-patch branch from 165c3f9 to 142230f Compare September 8, 2023 06:16
@aignas aignas changed the title feat(pip_parse): support patching 'whl_library' feat(pip_parse): support patching 'whl' and 'whl_library' Sep 8, 2023
@aignas aignas marked this pull request as ready for review September 8, 2023 08:00
@aignas (Collaborator, Author) commented Sep 8, 2023

Ready for initial review. Thanks @groodt for the initial comments on the API. I think the current state of the PR, which provides bdist patching and whl_library patching, is enough to cover most needs for modifying things. However, there are a few things I am not sure about, so I am looking forward to comments on the API of the pip.whl_override tag class.

@rickeylev (Collaborator) commented:

My main concern is ending up having 5 different apis to do patching:

  1. pip.whl_mods (exists today; note that the whl_mods API implements "annotations", but was also loosely designed to support arbitrary patch files; I'm not suggesting to continue the a-special-arg-for-each-use-case pattern)
  2. pip.whl_override (this pr)
  3. pip.whl_library_override (this pr)
  4. Another API to patch the extracted bdist/sdist (what Greg wants)
  5. Two more patch APIs as part of Phillip's work (one to patch dependencies, another to patch the possible downloaded files from http_file)

Also, part of the goal with pip.whl_mods was to have less configuration directly in the MODULE file and move it to a separate file (hence the JSON config file). The rationale was twofold:

  1. Large MODULE files are difficult to manage (modules can't be decomposed into smaller units like a bzl/BUILD file can).
  2. API changes to module_extensions are very painful, and the MODULE language and its capabilities are purposefully limited, so I'd like to minimize that API footprint. For example, figuring out which patch should apply requires matching some set of (python version, library version, library name, "life cycle phase"?, etc.) to a patch file, which doesn't have a concise expression in MODULE-language.

re: naming:

  • -1 on the "override" name -- override implies replacing, while patching is a more selective action.
  • I agree "module" is an overloaded and ambiguous term in this context, so let's avoid that.

maybe we need a lifecycle of patching? Sounds complicated though

I think so, at least for discussion purposes. It'd certainly help me construct a mental model of when patches could be applied, which might make it easier to figure out how to express "apply patch P to X if Y". I think the 4-point list and the "5 APIs" capture most cases? A missing case is patching the "alias" repo that glues together the platform-specific repos (e.g. if you patch numpy bdists to expose headers, you have to patch the alias repo to expose the platform-specific build targets).

@aignas (Collaborator, Author) commented Sep 14, 2023

So the last implementation we have here was modelled after bazel-gazelle, which has 3 types of override tag classes that allow overriding different parts of the dependency closure (e.g. the source of the package, applying patches to the package, etc.). I have found that having a tag_class per concern makes the API more maintainable, because the API definition becomes much simpler.

In my limited experience with the whl_mods tag_class I have the following observations:

  1. The user needs to repeat a lot of things: they first create a modification, then they need to pass that modification to the pip.parse rule, and the size of MODULE.bazel thus increases unless we use list comprehensions and/or define static variables, which decreases readability.
  2. The API that accepts a list of patches should ideally accept it as an attr.label_list. However, when we do str(label), serialize to JSON in whl_mods, and then deserialize it in the context of whl_library repository-rule execution, the semantics of what that label points to may be lost due to repository remapping. The rules_python repo does not have access to the _main repository files, and when you try to pass such a label to the repository_ctx.patch function it fails with an error that the file could not be found. This means the patch file has to live in a place accessible to rules_python, and the only way I could think of to do that was to read the patch file via module_ctx.read in the whl_mods extension, put the contents of the patch into annotation.json, and then create the file just before using it in repository_ctx.patch. This extra indirection was not great in terms of maintainability.
  3. The fact that as a user I need to use_repo(whl_mods) just to pass some patch-file labels to whl_library looks like an API design smell.

I agree with the sentiment that large MODULE.bazel files are hard to maintain, but I do think that a JSON file, which has no concept of a Label that we can pass around, may be even harder to work with due to the repo mapping in bzlmod; but maybe I have the wrong picture here.

Regarding naming - I am open to change the _override to something better, but I just took the same names as bazel-gazelle uses for now.

Regarding apply patch P to X if Y: this is something I found would be good to have, but the combinatorial explosion may be annoying here. For each set of patches we may need something like:

pip.whl_patch(
    hub_name = "pip", # if Y: hub_name
    version = "3.10", # if Y: python version
    platform = "manylinux_2014", # if Y: platform
    patches = ["@//:foo.patch"],
    strip = 1,
    apply_to = "bdist" # if Y: one of [bdist, sdist, library]
)

However, I am wondering if this is better than having multiple tag classes, where we are solving the polymorphism of applying patches not via function parameters but via different tag_classes instead.

I guess the interesting part is that creating patches should be reasonably easy for the API to be useful, and I am not sure yet how we could achieve that.

@fmeum (Contributor) commented Sep 14, 2023

The foo_override name pattern in Gazelle is directly borrowed from Bazel itself and its archive_override, git_override, single_version_override, ... All of those are named such that foo_override overrides a thing (in this case Bazel modules) with a "foo", which is a convention I tried to follow in Gazelle.

Regarding managing large MODULE.bazel files, please assume a solution to bazelbuild/bazel#17880 and design with that in mind. Whatever the result is, it will be here to stay and basing it on non-Starlark helper files will haunt us (JSON doesn't have comments, requires separate formatting, etc.). I am pretty sure that we can resolve that issue quickly if needed, so please comment on it with your requirements.

CC @Wyverald

@rickeylev (Collaborator) commented:

You make some fair points @aignas. It sounds like we're all facing about the same direction, at least :).

So the last implementation that we have here was modelled after bazel-gazelle

Can you link me to an example? I briefly looked at bazel-gazelle, but didn't see an example, and digging through the source wasn't productive.

Regarding apply patch P to X if Y, ... the combinatorial explosion may be an annoying thing here.

Yeah, totally agree; this is the crux of the issue.

The pip.whl_patch() thing you proposed looks like it has some promise.

Maybe pip.patch() or pip.patch_<step>? After all, we're patching the pip integration in general, not necessarily wheels. There's a bug/FR that requires being able to patch the "alias" build files, for example.

I'm pretty mixed about tag-class-per-step vs arg-per-step. As a user, I think "I want to patch numpy", not "I want to patch the generate-build-file step". The latter assumes I know the internals and already know which part I need to modify, which seems unlikely. Similarly, patching a package might cross multiple steps; having that split up between different tag classes seems a chore (my immediate thought is: "I need to go comment out a mishmash of lines" vs "here's a single contiguous block of numpy stuff"). This makes me prefer e.g. pip.patch(package="numpy", ...).

Also: part of the design goal of pip.parse(whl_modifications=...) was to make it clear and explicit what patches a particular pip.parse() has applied to it, hence why pip.parse() has the whl_modifications arg. Otherwise, you have to look through the whole MODULE config to identify what is being applied. With the mapping between patch files and when/what they're applied to being non-trivial, this seems important. The pip hub namespacing issues also complicate this: another module could call e.g. pip.do_patch(hub_name = "yourhubname"), and now you're getting a patch from some dependency, much to your surprise.

Stepping back, if we can make the mapping trivial, then maybe a lot of this goes away. I have trouble seeing how to do that, though. For example, in conversations with Greg and Phillip, they want to have distribution-level granularity of patches (i.e. the specific foo-1.0-cp10-abi3-linux_x86.whl file), which basically equates to having an arg for each portion of the wheel file name format, and probably some way to express ranges or multiple values -- it starts to feel like a requirements.txt level of constraints expression ("foo =>1.0 !=1.2 <4.1; python_version>3.10" etc).

In my limited experience with the whl_mods tag_class I have the following observations:

Sorry, I should clarify and correct what I said. I thought pip.whl_mods() processed config files, but that's actually done by pip.parse(whl_modifications=...). That's the part of interest. pip.whl_mods is a way to generate a "modifications to make" config file, and is largely just to provide a migration path for pip_parse annotations for bzlmod.

The whl_modifications dict is a mapping of a file to the package the file applies to. I purposefully say the vague term "file" because we can decide what that file is and how it's interpreted.

The general idea is something like:

# MODULE.bazel
pip.parse(
  hub_name = "mypip",
  whl_modifications = {
    "@//patches/numpy:patches.json": "numpy",
  }
)

# File patches/numpy/patches.json
{
  "patches": [
    {"patch": "//patches/numpy:expose_headers_linux.patch",
     "version": "==1.0", "platform_matches": "linux*"},
    ...
  ]
}

And then the module or repo processes the patches.json file.
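
A sketch of how such a config file might be interpreted (the keys, the "==" version pin, and the fnmatch-style platform matching are all assumptions taken from the sketch above, not an implemented rules_python API):

```python
import fnmatch
import json

# Hypothetical config mirroring the patches.json sketch above.
config = json.loads("""
{
  "patches": [
    {"patch": "//patches/numpy:expose_headers_linux.patch",
     "version": "==1.0", "platform_matches": "linux*"}
  ]
}
""")

def matching_patches(config, version, platform):
    """Return patch labels whose constraints match the given wheel."""
    out = []
    for entry in config["patches"]:
        # Only exact '==' pins are handled in this sketch.
        wanted = entry.get("version", "")
        if wanted.startswith("=="):
            wanted = wanted[2:]
        if wanted and wanted != version:
            continue
        if not fnmatch.fnmatch(platform, entry.get("platform_matches", "*")):
            continue
        out.append(entry["patch"])
    return out

print(matching_patches(config, "1.0", "linux_x86_64"))
```

The point of the sketch is that the matching logic lives in ordinary code reading a file, keeping the MODULE.bazel surface small.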

Whether you have 1 patch or 100, your MODULE file is about the same size. In the above scheme, it grows about linearly with the number of patched packages, but it'd be easy to modify it to do e.g. whl_modifications = {"@//patches:all_patches.json": "*"} or some-such.

Issues with resolving labels

How much of a problem is this in practice, though? My thinking is, given whl_modification = {"@something//:patches.json": "..."}, we have the mapping of what @something refers to. It seems reasonable to say that the patches the config file references must be within the config file's repo, or to not use label notation and say the paths are relative to the config file.

(on a side note, this makes me wonder if Philip's proposed way to express patching would run into this same issue under bzlmod)

@rickeylev (Collaborator) commented:

please assume a solution to bazelbuild/bazel#17880 and design with that in mind.

Oh, that would obviate much of my concern!

@aignas (Collaborator, Author) commented Sep 14, 2023

Gazelle code is here and it is important to note that they also have a list of default overrides that are applied by default here.

I personally wonder if, instead of asking all airflow 2.6 users to patch the wheels themselves to remove the dependency, we should have a similar file that allows us to do this for them. Similarly for the pytorch/triton circular dependency, or even the numpy headers. Just exposing headers for numpy by default, in both the aliases and the other targets, would make rules_python more usable out of the box. However, that is a big scope creep in terms of the project's responsibilities, and we should discuss this topic outside the PR and the related ticket.

As for the rest of your comment, I'll try to come up with a set of proposals and potentially document them as a Markdown file in this PR, so that we can discuss in the PR as review comments.

@rickeylev (Collaborator) commented:

Ignas and I met to discuss design a bit. The highlights:

Start with a whl-file level of specificity. This allows for the most fine-grained level of patching, something Greg, Phillip, and Ignas all agree is necessary. This also avoids having to figure out an API to conditionally match patches based on other criteria (e.g. platform, distribution, etc). The gist of the API is something like:

pip.patch(
    hub_name = "mypip",
    file="triton-2.0-py3-cp39-manylinux_foo.whl",
    patches=["@//foo:my.patch"],
    ...
)

I think this might even handle source distributions? You just use e.g. foo-1.2.3.tar.gz for the file name?
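
For reference, the matching criteria live in the wheel filename itself, whose format is fixed by PEP 427 (distribution-version[-build]-python_tag-abi_tag-platform_tag.whl); a minimal parser sketch, ignoring compressed-tag-set and escaping details:

```python
def parse_whl_name(filename):
    """Split a wheel filename into its PEP 427 components (no tag validation)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        # distribution-version-build-python-abi-platform
        dist, version, build, py, abi, plat = parts
    elif len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    else:
        raise ValueError("unexpected wheel name: " + filename)
    return {"distribution": dist, "version": version, "build": build,
            "python_tag": py, "abi_tag": abi, "platform_tag": plat}

print(parse_whl_name("triton-2.0-py3-cp39-manylinux_foo.whl")["platform_tag"])
```

An sdist name like foo-1.2.3.tar.gz does not fit this shape, which is one reason treating "file" as an opaque string to match against may be simpler than parsing.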

The downside of this API is you might have to repeat the same patch multiple times (e.g., once for each platform), which is a hassle if they're the same patch. This can be mitigated by using list-comprehensions in the MODULE file. Or maybe we change file to files and allow multiple files to be specified.
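
The list-comprehension mitigation behaves the same in MODULE.bazel, since it is Starlark; a Python sketch of fanning one logical patch out into per-platform argument sets (hub name, wheel file names, and patch label are all hypothetical):

```python
# One logical patch, expanded to each platform-specific wheel of the same release.
platforms = ["manylinux_x86_64", "manylinux_aarch64", "macosx_11_0_arm64"]

patch_calls = [
    {
        "hub_name": "mypip",
        "file": "numpy-1.25.0-cp39-cp39-{}.whl".format(plat),
        "patches": ["@//patches:numpy_headers.patch"],
    }
    for plat in platforms
]

for call in patch_calls:
    print(call["file"])
```

Each dict stands in for one hypothetical pip.patch(...) invocation; the patch label is written once, so the repetition cost is one line per platform.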

We decided to defer on how to specify the different "steps" of patching for now (tag class per step vs arg per step); we couldn't identify a technical argument that favors either.

We decided to defer on how to patch the generated build files (the distribution build file and the alias build file). A combination of patching distributions and pip.whl_mods should, at the least, provide a workaround for now. At this point, I think the only thing blocking deprecation of the annotations API in favor of a patching API is figuring out how to specify patches for the different steps of the pip process.

Ignas also said he was going to try and include something to help generate patch files (the basic idea being to give the user two directories (the original and desired), so they can modify code and run diff between the dirs to generate patches).
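
The two-directory idea can be sketched with the standard library: copy the extracted wheel, edit the copy, then diff the trees. The helper name and demo contents below are illustrative, not the tooling Ignas proposed:

```python
import difflib
import os
import tempfile

def make_patch(original_dir, desired_dir):
    """Emit a unified diff turning original_dir's files into desired_dir's."""
    chunks = []
    for root, _, files in os.walk(desired_dir):
        for name in sorted(files):
            new_path = os.path.join(root, name)
            rel = os.path.relpath(new_path, desired_dir)
            old_path = os.path.join(original_dir, rel)
            old_lines = []
            if os.path.exists(old_path):
                with open(old_path) as f:
                    old_lines = f.readlines()
            with open(new_path) as f:
                new_lines = f.readlines()
            # a/ and b/ prefixes match the patch_strip = 1 convention.
            chunks.extend(difflib.unified_diff(
                old_lines, new_lines,
                fromfile="a/" + rel, tofile="b/" + rel))
    return "".join(chunks)

# Demo on two throwaway trees standing in for "original" and "desired".
with tempfile.TemporaryDirectory() as orig, tempfile.TemporaryDirectory() as want:
    with open(os.path.join(orig, "setup.py"), "w") as f:
        f.write("install_requires = ['torch']\n")
    with open(os.path.join(want, "setup.py"), "w") as f:
        f.write("install_requires = []\n")
    patch = make_patch(orig, want)

print(patch)
```

Deleted files are not handled in this sketch (the walk only covers the desired tree), which is the kind of edge case a real helper would need to address.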

@aignas aignas force-pushed the feat/1076/pip-parse-patch branch from 3507d5f to 2d3eedc Compare September 25, 2023 13:13
@aignas aignas changed the title feat(pip_parse): support patching 'whl' and 'whl_library' feat(bzlmod): support patching 'whl' distributions Sep 25, 2023
@aignas aignas marked this pull request as draft September 25, 2023 13:14
@aignas (Collaborator, Author) commented Sep 25, 2023

Marking this as a draft, because it still needs a little bit of cleanup.

@aignas aignas force-pushed the feat/1076/pip-parse-patch branch 2 times, most recently from c2dee2c to d755952 Compare September 27, 2023 23:47
@aignas (Collaborator, Author) commented Oct 12, 2023

Marking as draft because there are PRs that should be merged before this.

@aignas aignas marked this pull request as draft October 12, 2023 15:54
@@ -0,0 +1,180 @@
"""
Regenerate a whl file after patching and cleanup the patched contents.
A reviewer (Collaborator) commented:

Is there a reason to reimplement this ourselves rather than using something like:

@aignas (Collaborator, Author) replied:

I will explain my reasons, though some of them may no longer be valid.

  1. We need to move files from one directory to another because the repository_ctx.patch can only apply patches to the root directory. I could not find a cross-platform way to list all files and move particular ones to a separate dir.
  2. Having a zero-dependency script is nice for this helper, because we are dealing with repository rules. Maybe this is actually a smell arising from the fact that we are doing it in the repository_ctx.
  3. I opted to reuse the wheelmaker.py that we already have in the repo; this way we can have consistent output.

I think I could use the function you've linked, but I would prefer to retain the RECORD diffing part for bookkeeping as to what actually is patched. This script has a nice property: it does not change anything that exists in the directory and just zips all of the files, ensuring that the RECORD file is consistent.
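
For context on that RECORD bookkeeping: per the wheel spec, RECORD is a CSV with one row per archive member containing the path, "sha256=" plus the urlsafe-base64 digest with padding stripped, and the size in bytes. A minimal sketch of rebuilding one entry (file names are illustrative):

```python
import base64
import hashlib
import os
import tempfile

def record_entry(path, arcname):
    """Build one wheel RECORD row: (archive path, sha256=<b64 digest>, size)."""
    with open(path, "rb") as f:
        data = f.read()
    # Wheel RECORD uses urlsafe base64 of the raw digest, with '=' padding removed.
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=")
    return [arcname, "sha256=" + digest.decode(), str(len(data))]

# Demo: hash a small stand-in for a patched module file.
fd, tmp = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"print('patched')\n")
entry = record_entry(tmp, "mypkg/__init__.py")
os.unlink(tmp)
print(entry)
```

A full implementation would also append the RECORD file's own row with empty hash and size fields, as the spec requires.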

@groodt (Collaborator) left a review:

First pass of the draft is LGTM

Looking forward to this landing!

@aignas aignas force-pushed the feat/1076/pip-parse-patch branch 5 times, most recently from e364f68 to 83b1a02 Compare October 17, 2023 07:43
@aignas aignas marked this pull request as ready for review October 17, 2023 07:44
@groodt (Collaborator) left a review:

LGTM

@aignas aignas requested a review from rickeylev October 19, 2023 02:12
Before that the users had to rely on patching the actual wheel files and
uploading them as different versions to internal artifact stores if they
needed to modify the wheel dependencies. This is very common when
breaking dependency cycles in `pytorch` or `apache-airflow` packages.
With this feature we can support patching external PyPI dependencies via
pip.whl_override tag class to fix package dependencies and/or a broken
`RECORD` metadata file.

add the wheelmaker to the tool list that is used by whl_library

use whlmaker to repack a wheel
@aignas aignas force-pushed the feat/1076/pip-parse-patch branch from 42ed332 to 1e853cc Compare October 20, 2023 00:16
@aignas aignas enabled auto-merge October 20, 2023 00:17
@aignas aignas added this pull request to the merge queue Oct 20, 2023
Merged via the queue into bazelbuild:main with commit c0e18ed Oct 20, 2023
1 check passed
@aignas aignas deleted the feat/1076/pip-parse-patch branch May 13, 2024 06:48
@ph03 commented Jul 16, 2024

General question: is there a way to expose / use the functionality of pip.override() for the non-bzlmod / WORKSPACE APIs of rules_python?

@philsc (Contributor) commented Jul 17, 2024

@ph03 , this hack seems to work for me for now:

diff --git a/python/private/pypi/requirements.bzl.tmpl.workspace b/python/private/pypi/requirements.bzl.tmpl.workspace
index 2f4bcd69..0d6c13a0 100644
--- a/python/private/pypi/requirements.bzl.tmpl.workspace
+++ b/python/private/pypi/requirements.bzl.tmpl.workspace
@@ -35,7 +35,11 @@ def _get_annotation(requirement):
     name = requirement.split(" ")[0].split("=")[0].split("[")[0]
     return _annotations.get(name)

-def install_deps(**whl_library_kwargs):
+def _get_patches(patch_spec, requirement):
+    name = requirement.split(" ")[0].split("=")[0].split("[")[0]
+    return patch_spec.get(name)
+
+def install_deps(patch_spec = {}, **whl_library_kwargs):
     """Repository rule macro. Install dependencies from `pip_parse`.

     Args:
@@ -68,5 +72,6 @@ def install_deps(**whl_library_kwargs):
             group_name = group_name,
             group_deps = group_deps,
             annotation = _get_annotation(requirement),
+            whl_patches = _get_patches(patch_spec, requirement),
             **whl_config
         )
diff --git a/python/private/pypi/whl_library.bzl b/python/private/pypi/whl_library.bzl
index 77cbd4e2..98bacf2d 100644
--- a/python/private/pypi/whl_library.bzl
+++ b/python/private/pypi/whl_library.bzl
@@ -308,8 +308,7 @@ def _whl_library_impl(rctx):
         patches = {}
         for patch_file, json_args in rctx.attr.whl_patches.items():
             patch_dst = struct(**json.decode(json_args))
-            if whl_path.basename in patch_dst.whls:
-                patches[patch_file] = patch_dst.patch_strip
+            patches[patch_file] = patch_dst.patch_strip

         whl_path = patch_whl(
             rctx,

Then in WORKSPACE it looks like so:

load(
    "@pip_deps//:requirements.bzl",
    install_pip_deps = "install_deps",
)

install_pip_deps(
    patch_spec =  {
        "matplotlib": {
            "//third_party:python/matplotlib/init.patch": {
                "patch_strip": 1,
            },
        },
        "pygobject": {
            "//third_party:python/pygobject/init.patch": {
                "patch_strip": 1,
            },
        }
    }
)
