NEW: Better package splitting for multi-output recipes with negative glob in outputs/files #5216

Merged
44 commits merged on Jun 7, 2024

@carterbox carterbox commented Mar 6, 2024

Description

Resurrects #4197. Closes #4196.

The purpose of this PR is to make it easier to split packages into multiple outputs using glob expressions. This is accomplished in two ways:

  1. A negative pattern match is provided. This makes it easier to use fewer glob expressions if you need to include an entire directory tree except for a single file (type). For example, you might want to glob lib/**/libfoo*, but not lib/**/*.a
  2. Only the files installed to the PREFIX in the top-level build are considered. This removes the need to craft your glob expressions to avoid the artifacts installed to the PREFIX by host dependencies. For example, you can now glob include/**/* without pulling in the headers of other packages. It is already the behavior of a single-output recipe to ignore files added to the prefix by host dependencies, so I'm not sure why this feature didn't make it into multi-output recipes.

These new behaviors apply only when the include/exclude keywords are used under the files key. The previous behavior is retained when files is a plain list.

outputs:
  - name: foo
    files:
      - bin/*  # old behavior; includes artifacts from other packages
  - name: bar
    files:
      include:
        - bin/*  # new behavior; only matches artifacts from this recipe
      exclude:  # optional
        - bin/*.exe # new behavior
  - name: zee
    script: install.py  # old behavior
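
The second point above (only files installed by the top-level build are considered) can be sketched with two snapshots of the prefix. This is a minimal illustration, not conda-build's implementation: `snapshot_prefix`, `match_new_files`, and `_touch` are hypothetical helpers, and `fnmatch` stands in for conda-build's own glob expansion.

```python
import fnmatch
import os
import tempfile


def snapshot_prefix(prefix: str) -> set[str]:
    """Return every file under prefix as a '/'-separated relative path."""
    found = set()
    for root, _dirs, names in os.walk(prefix):
        for name in names:
            rel = os.path.relpath(os.path.join(root, name), prefix)
            found.add(rel.replace(os.sep, "/"))
    return found


def match_new_files(patterns, before, after):
    """Match globs only against files that appeared between the two snapshots."""
    new_files = after - before
    return {
        f for f in new_files
        if any(fnmatch.fnmatchcase(f, pattern) for pattern in patterns)
    }


def _touch(prefix, rel):
    """Create an empty file at prefix/rel (hypothetical test helper)."""
    path = os.path.join(prefix, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


with tempfile.TemporaryDirectory() as prefix:
    _touch(prefix, "include/dep.h")   # pretend a host dependency installed this
    before = snapshot_prefix(prefix)
    _touch(prefix, "include/foo.h")   # pretend this recipe's build installed this
    after = snapshot_prefix(prefix)
    matched = match_new_files(["include/*"], before, after)
```

Here `include/*` matches only the header that appeared after the first snapshot, which is the behavior the description attributes to the include/exclude form.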

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@conda-bot
Contributor

We require contributors to sign our Contributor License Agreement and we don't have one on file for @carterbox.

In order for us to review and merge your code, please e-sign the Contributor License Agreement PDF. We then need to manually verify your signature, merge the PR (conda/infrastructure#891), and ping the bot to refresh the PR.

@carterbox carterbox marked this pull request as ready for review March 8, 2024 19:21
@carterbox carterbox requested a review from a team as a code owner March 8, 2024 19:21
@carterbox carterbox changed the title NEW: Negative matching 'files' for outputs NEW: Better package splitting for multi-output recipes with negative glob in outputs/files Mar 8, 2024
@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Mar 20, 2024

codspeed-hq bot commented Mar 20, 2024

CodSpeed Performance Report

Merging #5216 will not alter performance

Comparing carterbox:files-exclude (c92f1a1) with main (cdca0b4)

Summary

✅ 3 untouched benchmarks

@carterbox
Contributor Author

I'm not sure that the failure for linux 3.8 23.5.0 serial is related to this PR.

----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/conda/exception_handler.py", line 16, in __call__
  File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/conda/cli/main.py", line 66, in main_subshell
  File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/conda/cli/conda_argparse.py", line 31, in <module>
  File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/conda/base/context.py", line 24, in <module>
ModuleNotFoundError: No module named 'conda._vendor.appdirs'

Contributor

@jaimergp jaimergp left a comment

I like this a lot. The PR is super clear and the documentation is on-point. I only added a couple comments to clarify my own doubts. The new positional argument could be considered API breaking, so let's see how the rest of the team feels about it. I don't have strong feelings, just applying an abundance of caution Just In Case ✨ .

Once we have discussed that, I'll be super happy to approve. 👍

metadata: MetaData,
env,
stats,
new_prefix_files: set[str],
Contributor

Since this adds a new positional argument to a (let's assume) public signature, what about making it a keyword argument that defaults to None? That way downstream users won't run into number-of-arguments errors:

Suggested change
- new_prefix_files: set[str],
+ new_prefix_files: set[str] = None,

Contributor Author

This is a good point; it is API breaking assuming these functions are part of the public API. I need to mark the new argument as typing.Optional and/or set appropriate defaults.
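
The backward-compatible signature discussed here can be sketched as follows. `bundle_files` is a hypothetical stand-in, not one of the actual functions in conda_build/build.py, and treating a missing argument as an empty set is an assumption made only for the illustration.

```python
from typing import Optional


def bundle_files(
    metadata,
    env,
    stats,
    new_prefix_files: Optional[set[str]] = None,
):
    """Hypothetical bundling function that gained an optional argument."""
    # Assumption for this sketch: no argument means "no new files recorded".
    new_prefix_files = set() if new_prefix_files is None else new_prefix_files
    return sorted(new_prefix_files)


# An older call site with three positional arguments keeps working.
legacy_result = bundle_files("meta", {}, {})
# A newer call site passes the extra information by keyword.
new_result = bundle_files("meta", {}, {}, new_prefix_files={"bin/foo"})
```

With a required fourth positional parameter, the first call would instead raise a TypeError about a missing argument, which is the breakage the review comment is concerned about.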

metadata: MetaData,
env,
stats,
new_prefix_files: set[str],
Contributor

Same comment here about the API breakage. Not sure how prominent the usage of wheel outputs is, though.

@@ -1299,9 +1299,10 @@ You can specify files to be included in the package in 1 of
Explicit file lists are relative paths from the root of the
build prefix. Explicit file lists support glob expressions.
Directory names are also supported, and they recursively include
- contents.
+ contents. Files installed to the prefix by host dependencies will
+ be matched by glob expressions.
Contributor

I would maybe even highlight this further with an admonition or something similar. I wasn't aware of that design issue!

Contributor Author

> with an admonition or something similar.

I don't know what you mean. You want stronger language? For example: "WARNING! Files installed to the prefix by host dependencies also will be matched by glob expressions."

> I wasn't aware of that design issue!

Yeah, I'm not sure whether this is a bug/oversight or a feature. It's definitely unexpected to me. The most active contributor in this section is @msarahan; maybe they know?

Contributor

I meant an admonition in the RST sense. Triple-backtick + warning kind of syntax. Equivalent to this:

Warning

Files installed to the prefix by host dependencies also will be matched by glob expressions.

Contributor Author

Done in 642906b

Comment on lines 11 to 17
  files:
-   - subpackage_file1
-   - somedir
-   - "*.ext"
+   include:
+     - subpackage_file1
+     - somedir
+     - "*.ext"
+   exclude:
+     - "*3.ext"
Contributor

Should we replace the old test with a new one instead of adding a new case? Maybe I would add a new output in this meta.yaml and then test the different behaviors differently, if possible (e.g. how the list-of-str behaviour does copy anything in PREFIX, but include/exclude doesn't?).

Contributor Author

I have updated the test recipe to include multiple outputs: one uses the old file matching and one uses the new file matching.

Comment on lines 1315 to 1318
Files can be excluded by specifying `files` as a dictionary separating
files to `include` from those to `exclude`. Files installed to the prefix
by host dependencies are automatically excluded when the include/exclude
syntax is used:
Member

We need to mention the precedence here, i.e., exclude entries have higher priority as per the implementation.

Member

This mentions the new behavior, i.e., "Files [...] excluded when the include/exclude syntax is used".
But since this is a deviation from the previous "include" logic (which would be more of a force-include since it allows packaging pre-existing files), I think we should make the difference more prominent and explain the behavior for the old non-dict case more explicitly.

Contributor Author

@carterbox carterbox Mar 21, 2024

> We need to mention the precedence here, i.e., exclude entries have higher priority as per the implementation.

exclude has a higher priority than which operation? It's unclear to me what ambiguity we would be removing by mentioning priority.

> This mentions the new behavior, i.e., "Files [...] excluded when the include/exclude syntax is used". But since this is a deviation from the previous "include" logic (which would be more of a force-include since it allows packaging pre-existing files), I think we should make the difference more prominent and explain the behavior for the old non-dict case more explicitly.

Suggested change
- Files can be excluded by specifying `files` as a dictionary separating
- files to `include` from those to `exclude`. Files installed to the prefix
- by host dependencies are automatically excluded when the include/exclude
- syntax is used:
+ When defining `outputs/files` as a list, any file in the prefix (including those
+ installed by host dependencies) matching one of the glob expressions is
+ included in the output. Greater control over file matching may be
+ achieved by defining `files` as a dictionary separating files to
+ `include` from those to `exclude`.
+ When using include/exclude, only files installed by
+ the current recipe are considered. i.e. files in the prefix installed
+ by host dependencies are not matched. include/exclude may not be used
+ simultaneously with glob expressions listed directly in `outputs/files`.

Contributor

>> We need to mention the precedence here, i.e., exclude entries have higher priority as per the implementation.

> exclude has a higher priority than which operation?

The kind of precedence where if you do

  files:
    include:
      - lib/libfoo.so
    exclude:
      - lib/libfoo.so

the exclusion wins.

Contributor Author

The documentation now explains that files matching both include and exclude are excluded (d1b12a1).
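
The precedence rule from this thread can be illustrated with a small sketch. `select_files` is a hypothetical helper and uses `fnmatch` rather than conda-build's glob expansion; note that `fnmatch` lets `*` match across `/`, which may differ from the real matcher.

```python
import fnmatch


def select_files(candidates, include, exclude=()):
    """Keep candidates that match any include glob and no exclude glob."""

    def matches(path, patterns):
        return any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)

    # Exclusion is checked last, so it wins when both lists match a path.
    return {p for p in candidates if matches(p, include) and not matches(p, exclude)}


candidates = {"lib/libfoo.so", "lib/libfoo.a", "bin/foo"}

# The same path listed under both keys ends up excluded.
both = select_files(candidates, include=["lib/libfoo.so"], exclude=["lib/libfoo.so"])

# A broad include narrowed by a negative pattern, as in the PR description.
no_static = select_files(candidates, include=["lib/libfoo*"], exclude=["lib/*.a"])
```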

Comment on lines 1931 to 1932
include = files.get("include", [])
exclude = files.get("exclude", [])
Contributor

This syntax will be affected by the "key is set to an empty block, which means None in YAML" kind of issue. We should change it to:

Suggested change
- include = files.get("include", [])
- exclude = files.get("exclude", [])
+ include = files.get("include") or []
+ exclude = files.get("exclude") or []

It might also be a good opportunity to sneak in @jakirkham's changes from #4971.

Contributor

@jaimergp jaimergp Mar 21, 2024

See line 1818:

- files = output.get("files", [])
+ files = output.get("files") or []

Contributor Author

My understanding is that files.get("include") may return None, but include must be a list. So we use the falsiness of None to replace it with []. For the None case, this is equivalent to:

include = files['include'] if ('include' in files and files['include'] is not None) else []

Contributor

@jaimergp jaimergp Mar 21, 2024

You might have

- name: something
  files:
    include:
      - path
    exclude:
      - not-this-path  # [linux]

and then you'll have files["exclude"] = None on macOS, for example (linux is false). Not sure at what point we are ensuring that include is a list.

Under that circumstance files.get("exclude", []) will be None instead of the expected []. Is that what you mean? I think we are saying the same 😬

Contributor Author

Yes. We are saying the same thing. I'm just trying to figure out if we should be doing something stronger than converting Falsy values into the empty list. Perhaps convert any non-list into a list? Something like:

include = files.get("include") if isinstance(files.get("include"), list) else []

Contributor

That might silence errors when users populate it with a dict or something accidentally. I think we can just play it simple with falsy promotions.
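
The difference between the two spellings discussed in this thread can be seen with a plain dictionary standing in for a parsed recipe. This is only an illustration of `dict.get` semantics; the dictionary contents are made up.

```python
# A selector removed every entry under `exclude`, so the key is present
# but its value is None (an empty YAML block).
files = {"include": ["path"], "exclude": None}

with_default = files.get("exclude", [])   # key exists, so the default is ignored
with_or = files.get("exclude") or []      # None is falsy, so [] is used instead
missing_key = files.get("missing") or []  # an absent key is covered as well

# A wrongly typed but truthy value is passed through unchanged, so later
# code still has a chance to fail loudly instead of silently ignoring it.
bad = {"include": {"oops": 1}}
passed_through = bad.get("include") or []
```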

@@ -2817,7 +2848,7 @@ def build(
# This is wrong, files has not been expanded at this time and could contain
# wildcards. Also well, I just do not understand this, because when this
# does contain wildcards, the files in to_remove will slip back in.
- if "files" in output_d:
+ if "files" in output_d and not isinstance(output_d["files"], dict):
Contributor

Suggested change
- if "files" in output_d and not isinstance(output_d["files"], dict):
+ if not isinstance(output_d.get("files") or (), dict):

Contributor Author

I have updated this condition to:

if (
    "files" in output_d
    and output_d["files"] is not None
    and not isinstance(output_d["files"], dict)
):
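
The one-line suggestion and the condition reported here do not behave identically, which may be why the longer form was chosen. A small comparison, with `suggested` and `final` as hypothetical wrappers around the two expressions:

```python
def suggested(output_d):
    # The reviewer's one-line suggestion.
    return not isinstance(output_d.get("files") or (), dict)


def final(output_d):
    # The condition the author reported adopting.
    return (
        "files" in output_d
        and output_d["files"] is not None
        and not isinstance(output_d["files"], dict)
    )


cases = {
    "list": {"files": ["bin/*"]},
    "dict": {"files": {"include": ["bin/*"]}},
    "none": {"files": None},
    "absent": {},
}
# Maps each case name to (suggested result, final result).
table = {name: (suggested(d), final(d)) for name, d in cases.items()}
```

The two agree for list and dict values, but the one-liner evaluates to True when `files` is None or missing, whereas the explicit form evaluates to False in both cases.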

@jaimergp
Contributor

tests/test_subpackages.py::test_subpackage_recipes[copying_files] is failing on macOS and Windows @carterbox

For some reason, target_platform is not set.
@wolfv
Contributor

wolfv commented Jun 1, 2024

@carterbox I've started to implement this idea in rattler-build, too: prefix-dev/rattler-build#819

Would love your feedback once it's in a testable state. This is also in preparation for a "top-level" cache build (that can be split up into multiple packages).

@carterbox
Contributor Author

Thanks, @isuruf, for fixing my platform detection logic in the tests.

@jaimergp, it looks like the failed test on Linux is unrelated to this PR. Maybe someone can restart the tests to see whether the failure is ephemeral?

isuruf previously approved these changes Jun 3, 2024
jaimergp previously approved these changes Jun 3, 2024
kenodegard previously approved these changes Jun 5, 2024
Contributor

@kenodegard kenodegard left a comment

this looks good, just a minor suggestion and a question about deprecating the old behavior

Comment on lines +1827 to +1831
else:
    keep_files = {
        os.path.normpath(pth)
        for pth in utils.expand_globs(files, metadata.config.host_prefix)
    }
Contributor

@kenodegard kenodegard Jun 5, 2024

Given the change in behavior above, which only allows new files to be included, should the old behavior of including any file in the prefix be deprecated?

Suggested change
- else:
-     keep_files = {
-         os.path.normpath(pth)
-         for pth in utils.expand_globs(files, metadata.config.host_prefix)
-     }
+ else:
+     keep_files = {
+         os.path.normpath(pth)
+         for pth in utils.expand_globs(files, metadata.config.host_prefix)
+     }
+     if keep_files - new_prefix_files:
+         deprecated.topic(...)
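
The check proposed in this suggestion is a set difference: anything matched by the plain list that the build did not install itself. A standalone sketch, using the standard library `warnings` module as a stand-in because the arguments to `deprecated.topic(...)` are elided above; `warn_on_preexisting` and the message text are hypothetical.

```python
import warnings


def warn_on_preexisting(keep_files: set[str], new_prefix_files: set[str]) -> set[str]:
    """Warn if a plain file list matched files this build did not install."""
    preexisting = keep_files - new_prefix_files
    if preexisting:
        warnings.warn(
            "outputs/files matched files that were already in the prefix: "
            + ", ".join(sorted(preexisting)),
            DeprecationWarning,
            stacklevel=2,
        )
    return preexisting


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    leftover = warn_on_preexisting(
        {"include/foo.h", "include/dep.h"},  # matched by the globs
        {"include/foo.h"},                   # installed by this build
    )
```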

Contributor

It's still useful, e.g. when building gcc_bootstrap packages. Deprecating the implicit behaviour and adding a new flag would be useful.

Contributor Author

I feel that the current behavior of including all files from the prefix for multi-output and only installed files from the prefix for single-output recipes is inconsistent and thus non-intuitive. rattler-build is adding a section called "always_include_files" for recipes that want to repackage artifacts from their host dependencies.

I'm in favor of deprecating the old behavior, but I'm uncertain about any removal or replacement schedule.

Contributor

> rattler-build is adding a section called "always_include_files" for recipes that want to repackage artifacts from their host dependencies.

This exists in conda already, and is in use in conda-forge. Main use-case for that from my POV is to be explicit about overwriting some CMake metadata if an incremental build adds more targets to an already-existing CMake file (example).

Contributor

We can deprecate that behaviour in a separate PR too. This PR has been open for too long and I wouldn't want to test your patience further @carterbox <3

Contributor

🚀 LGTM! Just wanted to understand the thoughts/plans for the old behavior.

Contributor

I'll open an issue to capture this feedback, though I think with the documentation this PR added we are in a better place already.

@carterbox carterbox dismissed stale reviews from kenodegard and jaimergp via 5d8efe9 June 5, 2024 17:55
@jaimergp jaimergp merged commit 4ec4ac7 into conda:main Jun 7, 2024
29 checks passed
@jaimergp
Contributor

jaimergp commented Jun 7, 2024

Thanks @carterbox and everyone involved! Three months and 64 comments later, this one is in!

@carterbox carterbox deleted the files-exclude branch June 7, 2024 18:12
mbargull added a commit to isuruf/r-base-feedstock that referenced this pull request Jun 21, 2024
@beeankha beeankha mentioned this pull request Jul 16, 2024
55 tasks
carterbox added a commit to carterbox/conda-build that referenced this pull request Aug 30, 2024
The expand_globs function from conda_build.utils logs an ERROR
when a glob expression returns no matches. This is overly alarming:
the user may now use negative glob expressions and not care whether
an expression matches nothing, or may want to reuse the same set of
glob expressions across platforms, where some expressions match
nothing on some platforms.
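
One possible way to soften the behavior described in this commit message is to report empty matches at a lower log level. The sketch below is an assumption about the general approach, not the change made in conda-build; `expand_globs_quietly` and the logger name are hypothetical, and the standard library `glob` module stands in for `conda_build.utils.expand_globs`.

```python
import glob
import logging
import os
import tempfile

log = logging.getLogger("glob_sketch")


def expand_globs_quietly(patterns, root):
    """Expand globs relative to root; an empty match is logged at INFO."""
    matched = []
    for pattern in patterns:
        hits = glob.glob(os.path.join(root, pattern))
        if not hits:
            # An empty result may be intentional, e.g. a platform-specific pattern.
            log.info("Glob %s did not match anything in %s", pattern, root)
        matched.extend(os.path.relpath(hit, root) for hit in hits)
    return sorted(matched)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "bin"))
    open(os.path.join(root, "bin", "foo"), "w").close()
    result = expand_globs_quietly(["bin/*", "lib/*.dll"], root)
```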

conda#5216
conda#5455
Development

Successfully merging this pull request may close these issues.

Negative matching 'files' for outputs