feat(bzlmod): support patching 'whl' distributions #1393
NOTE: This PR is to initiate the discussion of the API for patching. @groodt, @rickeylev, could you please take a look?
Force-pushed from f1481ba to 9454b0a.
So I decided to not use the I am looking for feedback about which option would be better. @fmeum could you also please chime in on designing this extension/API here?
I don't know much about Python, but the Is |
I've not had time to look in depth, but unified diff format patch files are a similar UX to what I've been advocating for and similar to what we use at my $dayjob. They're far more flexible and likely to support more use-cases with well understood semantics than a series of custom annotations for every independent use-case like we've been doing in the past. Are you applying the patches before or after wheels are unpacked / installed?
@groodt, thanks for the comment, the patches are applied after the wheels are unpacked so that we can also patch the BUILD.bazel files. Do you think doing the patching before unpacking would make more sense?
It can be nice to apply the patch before install time in case you are modifying the setup.py.
There’s generally benefit to being able to patch the following:
If there is a capability to patch as early as possible, then there’s almost no need for option 3 and 4, because if one is able to modify all source and metadata from the foreign artifacts, then one is able to influence the structure of the dependency closure and generated BUILD files as well (albeit a little awkwardly).

At $dayjob, we optimise for patching the wheels, because most of the ecosystem distributes wheels (and it’s easy enough to populate a private wheel repository where required). I personally would prefer to patch earlier, but maybe we need a lifecycle of patching? Sounds complicated though. Patches for pre-install and post-install of the artifacts? I know what my preference is, but would love to hear opinions.

And then maybe this is slightly controversial. For BUILD files, I would prefer the semantics of http_archive. I would prefer to specify “build_file” or “build_file_content” to fully replace the generated default. Hear me out. 😂 The defaults should be simpler and often work well enough so that it’s rare to swap them out. Also, with the capability to patch the foreign artifacts, it’s also possible to influence the generated BUILD.bazel. So, in a world where we simplify and remove a lot of the code generated in repository rules (move the convenience items to user land), we often won’t need to patch BUILD.bazel files. I think having to reason about patching generated code is just too complicated otherwise. Thoughts?
My thinking is:
I agree that having
For |
Force-pushed from 165c3f9 to 142230f.
Ready for initial review. Thanks @groodt for the initial comments on the API. I think the current state of the PR, which provides bdist patching and whl_library patching, is enough to cover most of the needs of modifying things. However, there are a few things that I am not sure about, so I am looking forward to comments on the API of the pip.whl_override tag class.
My main concern is ending up having 5 different apis to do patching:
Also, part of the goal with pip.whl_mods was to have less configuration directly in the MODULE file and move it to a separate file (hence the json config file). The rationale was two fold:
re: naming:
I think so, at least for discussion purposes. It'd certainly help me with constructing a mental model of when patches could be applied, which might make it easier to figure out how to express "apply patch P to X if Y". I think the 4-point list and the "5 apis" captures most cases? I think a missing case is patching the "alias" repo that glues together the platform-specific repos (e.g. if you patch numpy bdists to expose headers, you have to patch the alias repo to expose the platform-specific build targets) |
So the last implementation that we have here was modelled after In my limited experience with the
I agree with the sentiment that large Regarding naming - I am open to change the Regarding
However, I am wondering if this is better than having multiple tag classes, where we are solving the polymorphism of applying patches not via function parameters but via different I guess the interesting part here is that creating patches should be reasonably easy in order to make the API useful and I am not sure yet how we could achieve that. |
The Regarding managing large MODULE.bazel files, please assume a solution to bazelbuild/bazel#17880 and design with that in mind. Whatever the result is, it will be here to stay and basing it on non-Starlark helper files will haunt us (JSON doesn't have comments, requires separate formatting, etc.). I am pretty sure that we can resolve that issue quickly if needed, so please comment on it with your requirements. CC @Wyverald |
You make some fair points @aignas. It sounds like we're all facing about the same direction, at least :).
Can you link me to an example? I briefly looked at bazel-gazelle, but didn't see an example, and digging through the source wasn't productive.
Yeah, totally agree; this is the crux of the issue. The Maybe I'm pretty mixed about tag-class-per-step vs arg-per-step. As a user, I think, "I want to patch numpy", not "I want to patch the generate build file step". The latter assumes I know the internals and already know what part I need to modify, which seems unlikely. Similarly, patching a package might cross multiple steps -- having that split up between different tag classes seems a chore (my immediate thought is: "i need to go comment out a mishmash of lines" vs "here's a single contiguous block of numpy stuff"). This makes me prefer e.g. Also: Part of the design goal of Stepping back, if we can make the mapping trivial, then maybe a lot of this goes away. I have trouble seeing how to do that, though. For example, in conversations with Greg and Phillip, they want to have distribution-level granularity of patches (i.e. the specific
Sorry, I should clarify and correct what I said. I thought The The general idea is something like:
And then the module or repo processes the Whether you have 1 patch or 100, your MODULE file is about the same size. In the above scheme, it grows about linearly with the number of patched packages, but it'd be easy to modify it to do e.g.
How much of a problem is this in practice, though? My thinking is, given (on a side note, this makes me wonder if Philip's proposed way to express patching would run into this same issue under bzlmod) |
Oh, that would obviate much of my concern!
Gazelle code is here and it is important to note that they also have a list of default overrides that are applied by default here. I personally wonder if instead of asking all of the As for the rest of your comment, I'll try to come up with a set of proposals and potentially document them as a Markdown file in this PR, so that we can discuss in the PR as review comments. |
Ignas and I met to discuss design a bit. The highlights: Start with a whl-file level of specificity. This allows for the most fine-grained level of patching, something Greg, Phillip, and Ignas all agree is necessary. This also avoids having to figure out an API to conditionally match patches based on other criteria (e.g. platform, distribution, etc). The gist of the API is something like:
I think this might even handle source distributions? You just use e.g.

The downside of this API is you might have to repeat the same patch multiple times (e.g., once for each platform), which is a hassle if they're the same patch. This can be mitigated by using list-comprehensions in the MODULE file. Or maybe we change

We decided to defer on how to specify the different "steps" of patching for now (tag class per step vs arg per step); we couldn't identify a technical argument that favors either.

We decided to defer on how to patch the generated build files (the distribution build file and alias build file). A combination of patching distributions and pip.whl_mods should, at the least, provide a workaround for now.

At this point, I think the only thing blocking deprecating the annotations API in favor of a patching API is figuring out how to specify patches to the different steps of the pip process.

Ignas also said he was going to try and include something to help generate patch files (the basic idea being to give the user two directories (the original and desired), so they can modify code and run diff between the dirs to generate patches).
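The "two directories" idea for generating patch files can be sketched in plain Python. This is only an illustration and not the helper that the PR ships: the function names `diff_trees` and `_read_lines` are hypothetical, the `a/` and `b/` path prefixes are an assumption chosen so that the result would be applied with a strip level of 1, and only text files that end with a newline are handled.

```python
import difflib
from pathlib import Path


def _read_lines(path: Path) -> list[str]:
    # A file missing on one side is treated as empty, so additions and
    # deletions show up as all-plus or all-minus hunks.
    return path.read_text().splitlines(keepends=True) if path.is_file() else []


def diff_trees(original: Path, desired: Path) -> str:
    """Return one unified diff covering every file that differs between two trees."""
    rel_paths = sorted(
        {p.relative_to(original).as_posix() for p in original.rglob("*") if p.is_file()}
        | {p.relative_to(desired).as_posix() for p in desired.rglob("*") if p.is_file()}
    )
    chunks = []
    for rel in rel_paths:
        # unified_diff yields nothing when both sides are identical.
        chunks.extend(
            difflib.unified_diff(
                _read_lines(original / rel),
                _read_lines(desired / rel),
                fromfile=f"a/{rel}",
                tofile=f"b/{rel}",
            )
        )
    return "".join(chunks)
```

A user would unpack the wheel twice, edit the second copy, and write the returned string to a `.patch` file. Running `diff -ur` between the two directories gives a comparable result without any Python.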
Force-pushed from 3507d5f to 2d3eedc.
Marking this as a draft, because it still needs a little bit of cleanup.
Force-pushed from c2dee2c to d755952.
Marking as draft because there are PRs that should be merged before this.
python/private/patch_whl.py
@@ -0,0 +1,180 @@
"""
Regenerate a whl file after patching and cleanup the patched contents.
Is there a reason to reimplement this ourselves rather than using something like:
I will explain my reasons, but some of them might no longer be valid.
- We need to move files from one directory to another because `repository_ctx.patch` can only apply patches to the root directory. I could not find a cross-platform way to list all files and move particular ones to a separate dir.
- Having a zero-dependency script is a nice thing for writing this helper, because we are dealing with repository rules. Maybe this is actually a smell arising from the fact that we are doing it in the `repository_ctx`.
- I opted to reuse the `wheelmaker.py` that we already have in the repo; this way we can have consistent output.

I think I could use the function you've linked, but I would prefer to retain this `RECORD` diffing part for bookkeeping as to what actually is patched. This script has a nice property that it does not change anything that exists in the directory and just zips all of the files, ensuring that the `RECORD` file is consistent.
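To make the `RECORD` bookkeeping concrete, here is a minimal sketch of how entries for an unpacked wheel could be recomputed and compared. It is not the code in `patch_whl.py`: the helper names `record_entry`, `record_lines`, and `changed_entries` are hypothetical, and the sketch assumes paths without commas (real `RECORD` files are CSV and quote such paths). The entry format itself (`path,sha256=<urlsafe base64 digest without padding>,size`) follows the wheel specification.

```python
import base64
import hashlib
from pathlib import Path


def record_entry(root: Path, path: Path) -> str:
    """Build one RECORD line for a file inside the unpacked wheel at `root`."""
    data = path.read_bytes()
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
    digest_text = digest.rstrip(b"=").decode("ascii")  # the spec drops '=' padding
    return f"{path.relative_to(root).as_posix()},sha256={digest_text},{len(data)}"


def record_lines(root: Path, record_rel: str) -> list[str]:
    """Recompute every RECORD line; RECORD itself is listed without hash or size."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        lines.append(f"{rel},," if rel == record_rel else record_entry(root, path))
    return lines


def changed_entries(before: list[str], after: list[str]) -> list[str]:
    """Entries present after patching that were not present before."""
    return sorted(set(after) - set(before))
```

Comparing the lines computed before and after applying the patches gives a simple list of what the patch touched, which is the bookkeeping property described above.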
First pass of the draft is LGTM
Looking forward to this landing!
Force-pushed from e364f68 to 83b1a02.
LGTM
Before that the users had to rely on patching the actual wheel files and uploading them as different versions to internal artifact stores if they needed to modify the wheel dependencies. This is very common when breaking dependency cycles in `pytorch` or `apache-airflow` packages.

With this feature we can support patching external PyPI dependencies via pip.whl_override tag class to fix package dependencies and/or a broken `RECORD` metadata file.

- add the wheelmaker to the tool list that is used by whl_library
- use whlmaker to repack a wheel
Force-pushed from 42ed332 to 1e853cc.
General question: is there a way to expose / use the functionality of |
@ph03, this hack seems to work for me for now:

diff --git a/python/private/pypi/requirements.bzl.tmpl.workspace b/python/private/pypi/requirements.bzl.tmpl.workspace
index 2f4bcd69..0d6c13a0 100644
--- a/python/private/pypi/requirements.bzl.tmpl.workspace
+++ b/python/private/pypi/requirements.bzl.tmpl.workspace
@@ -35,7 +35,11 @@ def _get_annotation(requirement):
     name = requirement.split(" ")[0].split("=")[0].split("[")[0]
     return _annotations.get(name)

-def install_deps(**whl_library_kwargs):
+def _get_patches(patch_spec, requirement):
+    name = requirement.split(" ")[0].split("=")[0].split("[")[0]
+    return patch_spec.get(name)
+
+def install_deps(patch_spec = {}, **whl_library_kwargs):
     """Repository rule macro. Install dependencies from `pip_parse`.

     Args:
@@ -68,5 +72,6 @@ def install_deps(**whl_library_kwargs):
             group_name = group_name,
             group_deps = group_deps,
             annotation = _get_annotation(requirement),
+            whl_patches = _get_patches(patch_spec, requirement),
             **whl_config
         )
diff --git a/python/private/pypi/whl_library.bzl b/python/private/pypi/whl_library.bzl
index 77cbd4e2..98bacf2d 100644
--- a/python/private/pypi/whl_library.bzl
+++ b/python/private/pypi/whl_library.bzl
@@ -308,8 +308,7 @@ def _whl_library_impl(rctx):
         patches = {}
         for patch_file, json_args in rctx.attr.whl_patches.items():
             patch_dst = struct(**json.decode(json_args))
-            if whl_path.basename in patch_dst.whls:
-                patches[patch_file] = patch_dst.patch_strip
+            patches[patch_file] = patch_dst.patch_strip

         whl_path = patch_whl(
             rctx,

Then in WORKSPACE it looks like so:

load(
    "@pip_deps//:requirements.bzl",
    install_pip_deps = "install_deps",
)
install_pip_deps(
    patch_spec = {
        "matplotlib": {
            "//third_party:python/matplotlib/init.patch": {
                "patch_strip": 1,
            },
        },
        "pygobject": {
            "//third_party:python/pygobject/init.patch": {
                "patch_strip": 1,
            },
        },
    },
)
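The name lookup in `_get_patches` is short enough to check by hand. The same expression also runs as plain Python, so the following sketch shows what it returns for a few requirement strings. The function names and the sample requirement lines are hypothetical. The lookup assumes fully pinned `==` lines of the kind a lock file contains, and it does no name normalization, so the keys in `patch_spec` have to match the spelling used in the requirements file.

```python
def requirement_name(requirement: str) -> str:
    # Same chain of splits as in the diff above: drop pip options after the
    # first space, then the version pin, then any extras in brackets.
    return requirement.split(" ")[0].split("=")[0].split("[")[0]


def get_patches(patch_spec: dict, requirement: str):
    # Returns None when no patches are registered for the package.
    return patch_spec.get(requirement_name(requirement))


patch_spec = {
    "matplotlib": {"//third_party:python/matplotlib/init.patch": {"patch_strip": 1}},
}

print(requirement_name("matplotlib==3.8.0 --hash=sha256:abc"))
print(requirement_name("requests[security]==2.31.0"))
print(get_patches(patch_spec, "numpy==1.26.0"))
```

One limitation of splitting on `=` is that a range specifier such as `foo>=1.0` would yield `foo>` and therefore never match a key.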
Before that the users had to rely on patching the actual wheel files and uploading them as different versions to internal artifact stores if they needed to modify the wheel dependencies. This is very common when breaking dependency cycles in `pytorch` or `apache-airflow` packages.

With this feature we can support patching external PyPI dependencies via pip.override tag class to fix package dependencies and/or a broken `RECORD` metadata file.

Overall design:
- Split the `whl_installer` CLI into two parts - downloading and extracting. Merged in refactor(whl_library): split wheel downloading and extraction into separate executions #1487.
- and repackages a wheel (so that the extraction part works as before).
- Add an `override` tag_class to the `pip` extension and allow users to pass patches to be applied to specific wheel files.
- modifying the code of other modules and conflicts between modules and their patches.

Patches have to be in `unified-diff` format.

Related #1076, #1166, #1120