Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

package data in subdirectory causes warning #3340

Open
isuruf opened this issue May 29, 2022 · 54 comments
Open

package data in subdirectory causes warning #3340

isuruf opened this issue May 29, 2022 · 54 comments

Comments

@isuruf
Copy link
Contributor

isuruf commented May 29, 2022

setuptools version

62.3.2

Python version

3.10

OS

Debian with conda

Additional environment information

No response

Description

pyopencl has OpenCL files and some headers in a subdirectory pyopencl/cl and they are included as package_data so that the python module can find them.

package_data={
                    "pyopencl": [
                        "cl/*.cl",
                        "cl/*.h",
                        "cl/pyopencl-random123/*.cl",
                        "cl/pyopencl-random123/*.h",
                        ]
                    },

With new setuptools, there is a warning saying


    ############################
    # Package would be ignored #
    ############################
    Python recognizes 'pyopencl.cl' as an importable package, however it is
    included in the distribution as "data".
    This behavior is likely to change in future versions of setuptools (and
    therefore is considered deprecated).

    Please make sure that 'pyopencl.cl' is included as a package by using
    setuptools' `packages` configuration field or the proper discovery methods
    (for example by using `find_namespace_packages(...)`/`find_namespace:`
    instead of `find_packages(...)`/`find:`).

    You can read more about "package discovery" and "data files" on setuptools
    documentation page.

cc @inducer

Expected behavior

No warning

How to Reproduce

  1. clone https://github.com/inducer/pyopencl
  2. install numpy
  3. Run python setup.py install

Output

$ python setup.py install
running install
/home/idf2/miniforge3/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/home/idf2/miniforge3/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing pyopencl.egg-info/PKG-INFO
writing dependency_links to pyopencl.egg-info/dependency_links.txt
writing requirements to pyopencl.egg-info/requires.txt
writing top-level names to pyopencl.egg-info/top_level.txt
reading manifest file 'pyopencl.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'pyopencl.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
/home/idf2/miniforge3/lib/python3.10/site-packages/setuptools/command/build_py.py:153: SetuptoolsDeprecationWarning:     Installing 'pyopencl.cl' as data is deprecated, please list it in `packages`.
    !!


    ############################
    # Package would be ignored #
    ############################
    Python recognizes 'pyopencl.cl' as an importable package, however it is
    included in the distribution as "data".
    This behavior is likely to change in future versions of setuptools (and
    therefore is considered deprecated).

    Please make sure that 'pyopencl.cl' is included as a package by using
    setuptools' `packages` configuration field or the proper discovery methods
    (for example by using `find_namespace_packages(...)`/`find_namespace:`
    instead of `find_packages(...)`/`find:`).

    You can read more about "package discovery" and "data files" on setuptools
    documentation page.


!!

  check.warn(importable)
running build_ext
@isuruf isuruf added bug Needs Triage Issues that need to be evaluated for severity and status. labels May 29, 2022
@abravalheri
Copy link
Contributor

abravalheri commented May 29, 2022

Hi @isuruf thank you for bringing this subject for discussion.

This change (and the warning) was intentional, so in the future we will be able to solve a related problem with find_packages(..., exclude=...) and include_package_data.

Since PEP 420, there is no way for really differentiating a folder from a package in Python... People that install pyopencl will be able to successfully run import pyopencl.cl, which means that the interpreter does consider properly.cl a package. Nevertheless configuration file seems to want to consider cl a "data directory", which is not currently a well stablished concept in the ecosystem...

Would it possible for you guys to use find_namespace_packages() as indicated in the warning message?

@jwodder
Copy link
Contributor

jwodder commented Jun 3, 2022

I'm facing the same problem, and I'm not clear on the nature of the change suggested by @abravalheri. Are you saying that directories in the package hierarchy that contain only data files and no Python code should be included in the project's list of packages despite not actually being Python packages?

@abravalheri
Copy link
Contributor

abravalheri commented Jun 4, 2022

Hi @jwodder, thank you very much for the input. Please find my comments bellow:

directories in the package hierarchy that contain only data files and no Python ... not actually being Python packages

I think that this assumption1 no longer holds.

According to PEP 420, package is a term that "refers to Python packages as defined by Python’s import statement".
An evidence that a directory without Python code is treated by the Python import statement as a package is given by the following snippets:

rm -rf /tmp/example
mkdir -p /tmp/example/mypkg1
touch /tmp/example/mypkg1/file.txt
mkdir -p /tmp/example/mypkg2/subpkg
touch /tmp/example/mypkg2/__init__.py
touch /tmp/example/mypkg2/subpkg/file.txt
python3.10
>>> import sys
>>> sys.path.append("/tmp/example")
>>> import mypkg1
>>> mypkg1
<module 'mypkg1' (<_frozen_importlib_external._NamespaceLoader object at 0x7f3683fac2e0>)>
>>> mypkg1.__path__
_NamespacePath(['/tmp/example/mypkg1'])
>>> mypkg1.__loader__.is_package('mypkg1')
>>> import mypkg2.subpkg
>>> mypkg2.subpkg.__path__
_NamespacePath(['/tmp/example/mypkg2/subpkg'])
>>> mypkg2.subpkg.__loader__.is_package('mypkg2.subpkg')
True

Note in this example that both mypkg1 and mypkg2.subpkg can be imported, and are treated as regular packages (both are imported as module objects with the attribute __path__ set), despite not containing any Python file.
I added a mypkg.subpkg to the example to demonstrate that it does not matter whether or not the directory is nested inside another "traditional package".

Moreover, the ecosystem also considers those folders, as demonstrated by the implementation of importlib.resources:

>>> import importlib.resources
>>> importlib.resources.contents('mypkg1')
['file.txt']
>>> importlib.resources.contents('mypkg2.subpkg')
['file.txt']

I understand that not all tools in the ecosystem embrace PEP 420 by default and that sometimes it might be frustrating to see tools evolving some interfaces evolving. That is the reason why I decided to add a deprecation warning first and a transition period, instead of simply changing things in a major version bump.

This change in setuptools was implemented due to a single, very pragmatic, objective: to fix internal inconsistencies (as presented in #3260), which are caused by this division between packages and directories that don't contain Python code (which is not a division recognized by Python's import system).

For the time being, I don't plan on removing the deprecation warning, since the only way I know how to fix the internal inconsistencies requires us to eliminate this arbitrary division. However if anyone in the community finds a different way of solving the problem and are willing to provide a backward-compatible PR, I am more than happy to consider an alternative.

Footnotes

  1. Namely the assumption that a directory with no Python code is not a package.

@abravalheri abravalheri removed bug Needs Triage Issues that need to be evaluated for severity and status. labels Jun 4, 2022
@abravalheri abravalheri changed the title [BUG] package data in subdirectory causes warning package data in subdirectory causes warning Jun 4, 2022
@jwodder
Copy link
Contributor

jwodder commented Jun 4, 2022

@abravalheri So you are saying that "data-only packages" should be included in the project's list of packages? Is that the only change you're currently recommending to authors of such projects? Does it matter whether the data files are included via package_data or MANIFEST.in+include_package_data?

@abravalheri
Copy link
Contributor

Yes, I recommend adding all the sub-directories to the list of packages, even if they only include data files. On the bright side that can be done with find_namespace_packages (setup.py) or find_namespace: (setup.cfg).

It should make no difference if you are using package_data or MANIFEST.in (although I am not sure if users are going to see the warning if they don't use include_package_data=True).

@abravalheri
Copy link
Contributor

Do you have a suggestion about how to improve the warning message to make it more clear?

@jwodder
Copy link
Contributor

jwodder commented Jun 4, 2022

@abravalheri I find the first paragraph of the warning to be a bit confusing. I would rewrite it from:

Python recognizes '...' as an importable package, however it is included in the distribution as "data". This behavior is likely to change in future versions of setuptools (and therefore is considered deprecated).

to something more like:

Python recognizes '...' as an importable package, but it is not listed in "packages". It is included in the distribution because it contains package data, but this behavior is likely to change in future versions of setuptools (and therefore is considered deprecated).

jwodder added a commit to wheelodex/wheelodex that referenced this issue Jun 4, 2022
jwodder added a commit to jwodder/pyrepo that referenced this issue Jun 4, 2022
@cdfarrow
Copy link
Contributor

cdfarrow commented Jun 9, 2022

I was following this issue because I was very confused (as a newbie) by the warning that came up. Can I suggest a further edit to the warning message:
Currently {importable!r} is only added to the distribution because it may
contain data files...
-->
Currently {importable!r} has been automatically added to the distribution because it may
contain data files...
This is my understanding of what is actually happening (i.e. automatic inclusion). If this suggested change is wrong, then I'm still not following how this works...

@abravalheri
Copy link
Contributor

Thank you very much @cdfarrow, this suggestion looks good to me. I will implement it probably later this week.

@mhkline
Copy link

mhkline commented Jun 17, 2022

How about the case where we want *.py files in a subdirectory included as data files? This occurs in a couple projects for me that define a plug-in interface. Basically there are a couple python files in a package subdirectory that I provide as examples of plug-in implementation. These are also loaded through the common plug-in interface via importlib.

Will the inclusion as a namespace package still work in this case? Anything special to be aware of?

If it makes a difference, I'm using the MANIFEST.in method with include_package_data=True.

Also, I created this SO question about this issue when I first encountered it. I guess I should have come here first! :)

@abravalheri
Copy link
Contributor

abravalheri commented Jun 17, 2022

Hi @mhkline. The warning is not about the files themselves, but about the directory.
Right now there is no concept of a "data directory" for the package ecosystem.

Since PEP 420, effectively all directories are packages regardless of containing a __init__.py file or not. With this warning, my intention is to align the expectations of the users with the behaviour we observe in Python.

If you want the directory to be included in the distribution, you can include it via the packages= configuration. find_namespace_packages() in setup.py or find_namespace: in setup.cfg will do that for you, and probably make the warning go away.

jwodder added a commit to jwodder/pyrepo that referenced this issue Jun 20, 2022
- Give `pyrepo release` a `--date` flag
- Give `pyrepo begin-dev` a `--no-next-version` flag
- `pyrepo release`: Don't upload projects with "Private" classifiers to PyPI
- Support projects that use versioningit
    - Add a `uses_versioningit: bool` template variable
- Support CHANGELOG headers of the form `vDATEVERSION`
- Update for setuptools 61.0.0
- Update twine dependency to 4.0
- Templates:
    - Update `sphinx-copybutton` version to `~=0.5.0`
    - Update `.readthedocs.yml` to use a `build:` section and rename the file to `.readthedocs.yaml`
    - Update `.pre-commit-config.yaml`
    - Update `Sphinx` version to `~=5.0`
    - `setup.cfg`: Use `find_namespace:` instead of `find:` (See <pypa/setuptools#3340>)
- Internal API:
    - Store project details in a `ProjectDetails` class
    - Add `Templater` and `TemplateWriter` classes
    - Include classifiers in project details
lemon24 added a commit to lemon24/reader that referenced this issue Jun 28, 2022
@kandersolar
Copy link

The above discussion covers how to go about building a correct distribution without using deprecated setuptools functionality. Thanks all for that!

My question here is more philosophical than practical. It seems like setup(packages=...) and MANIFEST.in are at least partially overlapping in functionality. In our MANIFEST.in we specify exactly what files and directories to include in distribution bundles. It seems redundant to have to specify similar information (the directories part) via the packages= configuration. Couldn't that list of packages theoretically be inferred from the MANIFEST.in contents?

@leycec
Copy link

leycec commented May 23, 2023

Insane lunatics who are publicly disembowelling setuptools while distraught children are franticly screaming, please stop publicly disembowelling setuptools while distraught children are franticly screaming.

This means you, @abravalheri – and everyone else with setuptools push authority who mistakenly believed that Masaki Kobayashi's seminal masterpiece "Harakiri" was not, in fact, a brutal dissection of authoritarian militancy under entrenched hierarchy but instead an exemplary paragon of software best practices in the enlightened post-Python 2.7 era.

So, @leycec. Bro. What's The Big Deal, Yo?

Saliently, replacing find_packages() with find_namespace_packages() is not a valid solution for most projects. Why? Because find_namespace_packages() erroneously matches all repository subdirectories1 in the root repository as importable packages.

1. ...possibly excluding dot directories, not like it particularly matters. </shrug>

Clearly, most repository subdirectories in the root repository are not importable packages; they're superfluous workflow subdirectories like {package_name}.egg-info/, build/, dist/, docs/, pip-wheel-metadata/, and the list just goes on and on. Moreover, the set of these subdirectories significantly changes over time and is almost entirely outside the control of downstream developers. Explicitly listing these subdirectories in the exclude parameter to find_namespace_packages() is thus infeasible. Therefore, replacing find_packages() with find_namespace_packages() is not a valid solution for most projects.

So, @leycec. Bro. How Did You Fix This?

Thanks. I'm so glad you asked. The obvious solution is just to abandon setuptools.

That's what everybody else has done. But I'm curmudgeonly. I have a grubby beard and live in a mildew-infested cabin in the Canadian wilderness. People like me are disinclined to do what we should. Instead, I did what I shouldn't.

I continue using setuptools despite its repeated outbursts of insanity. In this case, I compelled setuptools to obey my perfidious will via ludicrous boilerplate which I will now copy-and-paste into every Python project I maintain – much to the shared agony of junior developers and my wife, who must now suffer my delusions in silence. This is that boilerplate:

# In "setup.py":
PACKAGE_NAME = 'your_package_name_here'  # <-- edit this, you who are sadly reading this and now contemplating watching "Harakiri" on a Friday at 2:32AM despite knowing that to be a very bad idea

# Ludicrous boilerplate: I summon you!
import os, setuptools

# Do you remember when setuptools just worked? Because @leycec remembers.
_PACKAGE_NONDATA_NAMES = setuptools.find_packages(exclude=(
    'test',
    'test.*',
))
'''
List of the fully-qualified names of all **non-data Python packages** (i.e.,
directories containing the standard ``"__init__.py"`` file and zero or more
Python modules) to be installed, including the top-level application package and
all subpackages of this package but excluding the top-level test package and
all subpackages of that package.
'''

# This assumes your package data lives in a "data/" subdirectory of your package.
# If your package data lives elsewhere, it probably shouldn't.
_PACKAGE_DATA_NAMES = [
    f'{PACKAGE_NAME}.data.{package_data_name}'
    for package_data_name in setuptools.find_namespace_packages(
        where=os.path.join(PACKAGE_NAME, 'data'))
]
'''
List of the fully-qualified names of all **data Python pseudo-packages** (i.e.,
directories containing *no* standard ``"__init__.py"`` file and zero or more
data paths) to be installed.

Note that this is largely nonsensical. Ideally, a project subdirectory
containing *no* standard ``"__init__.py"`` file would be transparently treated
by both Python itself and :mod:`setuptools` as an unimportable non-package
directory rather than an importable package. Ideally, *only* directories
containing ``"__init__.py"`` files would be treated as importable packages.
Sadly, :pep:`420` (i.e., "Implicit Namespace Packages") fundamentally broke this
reasonable expectation by unconditionally forcing *all* project subdirectories
to be importable packages regardless of developer wants, needs, or expectations.

:mod:`setuptools` now "complies" with this nonsense by requiring that data
directories by explicitly listed as namespace packages. Of course, data
directories are *not* namespace packages -- but nobody in either the official
PyPA or CPython communities appears to care. If this is *not* done,
:mod:`setuptools` now emits one deprecation warning for each data subdirectory
and file resembling:

    Installing '{data_path}' as data is deprecated, please list it in `packages`.

Lastly, note that we could also avoid this unctuous list comprehension
altogether by simply replacing the above call to
:func:`setuptools.find_packages` with
:func:`setuptools.find_namespace_packages`. Then why do we not do so? Because
doing so would make things even worse. Why? Because then :mod:`setuptools` would
erroneously match *all* subdirectories of this root repository directory as
importable packages to be installed -- including obviously irrelevant root
subdirectories like ``"{package_name}.egg-info"``, ``".github"``, ``".github"``,
``"doc"``, and ``"pip-wheel-metadata"``. Since the set of all such
subdirectories frequently changes with upstream revisions beyond our control,
explicitly specifying this set by listing these ignorable subdirectories in an
``exclude`` parameter is infeasible. In short, this is the least bad thing.

See Also
----------
https://github.com/pypa/setuptools/issues/3340
    Upstream :mod:`setuptools` issue where :mod:`setuptools` casually admit to
    breaking their entire toolchain for no demonstrably good reason.
'''

# You're welcome.
setup(
    ...
    packages=_PACKAGE_DATA_NAMES + _PACKAGE_NONDATA_NAMES,
)

I didn't make insanity. I only break it over my arthritic knee.

So, @leycec. Bro. Could You Like Stop Talking?

The party ends abruptly when @leycec walks through the door. The silence is deafening. I'm pretty sure the silence gave me tinnitus. Since everyone fled, I'll say one last thing to the empty room:

Continually waving your hands about while screeching "PEP 420 made us stab ourselves in our eyeballs!!!!!" is no valid justification for stabbing everyone else in their eyeballs, too.

leycec added a commit to betsee/betse that referenced this issue May 23, 2023
This commit resolves upstream issue pypa/setuptools#3340 in which
`setuptools` fundamentally breaks backward compatibility with the entire
ecosystem of existing `setuptools`-based packages that require
non-Python data files. Specifically, this commit forces `setuptools` to
package non-Python data files against its will through modestly clever
(but ultimately futile) abuse of the
`setuptools.find_namespace_packages()` function. Since words fail us, we
shall now resort to insensitive doggerel.
(*Fuming perfume, Panther! Bantha poodoo or that voodoo you do!?*)
@abravalheri
Copy link
Contributor

For those who have a long (potentially evolving) list of exclude in find_namespace_packages, using include is probably an easier solution.

@di
Copy link
Member

di commented May 24, 2023

Hi @leycec, I understand you're frustrated with the changes here, but please try be more more respectful and considerate in your communication in the future.

As a project of the PyPA, everyone in this issue tracker is expected to follow the Python Community Code of Conduct. The expectation is that everyone interacting here should be courteous when raising issues and disagreements.

Specifically, calling the setuptools maintainers "insane lunatics" is unacceptable: it's not constructive, it's not welcoming or inclusive, and it easily qualifies as harassment.

Additionally, your previous comments in this issue tracker ("Heads need rolling (especially those currently attached to the still-functioning torsos of managerial project leads)") is a clear example of violent language directed against another person, and is also unacceptable.

In short: we don't do this here. You are welcome to continue participating in this project, but if you continue to violate the code of conduct here, you will no longer be permitted to participate.

@webknjaz
Copy link
Member

@abravalheri hey, I skimmed through the comments above and I think I may have a case that hasn't been brought up explicitly (apologies if I missed something!)

It's about directories that aren't supposed to end up in wheels and be installed, but should be in sdists. Specifically, multidict has C header files in a subdirectory of a Python importable package. And setuptools complains about it.
These files are obviously included from C extension sources and are used by the compiler when building wheels from Git checkout or sdist.
However, I wouldn't expect them to end up in the end-users' site-packages/ since they aren't technically data files in the sense that there's no need for them in runtime whatsoever — they are part of the sources for sdist and that's it.

In aio-libs/multidict#829 somebody suggested listing that as a package, which doesn't appear to be a correct solution due to the reasons I mentioned above.

Do you have any go-to suggestion for suppressing the warning in such cases?

@webknjaz
Copy link
Member

Do you have any go-to suggestion for suppressing the warning in such cases?

UPD: turned out that it was include_package_data=True that confused setuptools. There was really no need for it and removing the setting fixed the problem.

@abravalheri I think that adding a hint suggesting to delete include_package_data=True and replace it with a fine-grained package-data mapping would be useful.

In my case, a more straightforward suggestion would've made it clearer what to do, because I wasn't the one to set up initial packaging in the project and had to attempt figuring out how and why things were set up that way.

Another source of confusion was the fact that the error message was unhappy with multidict._multilib, suggesting to use helpers like find_namespace_packages() while it did list the packages explicitly with no discovery needed, via a static list: packages=["multidict"]. This also confused a contributor who attempted adding "multidict._multilib" to that list, not understanding immediately that C-headers aren't to be installed in site-packages/ (or distributed in wheels, for that matter).

@shakfu
Copy link

shakfu commented Jan 10, 2024

@abravalheri

I appreciate your efforts to solve what is a quite challenging situation due to recent changes (PEP 420). You said very clearly early in this issue: "Since PEP 420, there is no way for really differentiating a folder from a package in Python."

But this problematic new state is not one that developers should learn to live with, they should not be compelled to adopt insecure quick fixes. The packaging distinction between code and data should be preserved. Because to do otherwise, to conflate between the two, and to make data executable by default, is to introduce additional security risks and to basically undermine the concept of having an "api".

The proposed fix which resolves the deprecation is apparently to use find_namespace_packages. But this is just wrong. Since namespace packages are meant to contain importable executable code.

In my current project which wraps a third party library. The directory structure is slightly more complex to accommodate two build variants: a default dynamically-linked build and an optional statically-linked build variant.

The default build's code is in src/cyfaust and the static build code is in src/static/cyfaust. The project consists of cython extensions and a significant amount of data which is placed in a resources directory hierarchy in the project root and softlinked into each build variant's folder so it can be picked up as below:

.
├── docs
├── include
├── resources
│    ├── architecture
│    │  ├── AU
│    │  ├── VST
│    │  ├── api
│    │  ├── daisy
│    │  ├── faust
│    │  ├── juce
│    │  ├── max-msp
│    │  ├── sam
│    │  ├── teensy
│    │  └── vcvrack
│    └── libraries
│        └── examples
├── scripts
├── src
│    ├── cyfaust
│    │  └── resources -> ../../resources
│    └── static
│        └── cyfaust
│               └── resources -> ../../../resources
└── tests

Within the resources folder structure (which is specified by the library I'm wrapping) there are 775 files, 10.4 MB in total. Most of these are templated source files and all are part of the thirdparty library itself. But all are precisely data, not to be imported in the project package, only to be used for the purposes of code generation.

The relevant part of setup.py captures this package code, data directory distinction simply and effectively:

# ...

if STATIC:

    # forces cythonize in this case
    subprocess.call("cythonize cyfaust.pyx", cwd="src/static/cyfaust", shell=True)

    with open("MANIFEST.in", "w") as f:
        f.write("graft src/static/cyfaust/resources\n")
        f.write("exclude src/static/cyfaust/*.cpp\n")

    extensions = [
        mk_extension("cyfaust.cyfaust", 
            sources=["src/static/cyfaust/cyfaust.pyx"] + RTAUDIO_SRC,
            define_macros=DEFINE_MACROS,
        ),
    ]

    setup(
        name='cyfaust',
        version=VERSION,
        ext_modules=cythonize(
            extensions,
            language_level="3",
        ),
        package_dir = {"cyfaust": "src/static/cyfaust"},
        packages=['cyfaust'],
        include_package_data=True
    )

# ----------------------------------------------------------------------------

else:

    with open("MANIFEST.in", "w") as f:
        f.write("graft src/cyfaust/resources\n")
        f.write("exclude src/cyfaust/*.cpp\n")

    extensions = [
        mk_extension("cyfaust.interp", 
            sources=["src/cyfaust/interp.pyx"] + RTAUDIO_SRC,
            define_macros=DEFINE_MACROS,
        ),
        mk_extension("cyfaust.common", ["src/cyfaust/common.pyx"]),
        mk_extension("cyfaust.signal",["src/cyfaust/signal.pyx"]),
        mk_extension("cyfaust.box", ["src/cyfaust/box.pyx"]),
    ]

    setup(
        name='cyfaust',
        version=VERSION,
        ext_modules=cythonize(
            extensions,
            language_level="3",
        ),
        package_dir = {"cyfaust": "src/cyfaust"},
        packages=['cyfaust'],
        include_package_data=True
    )

Right now, and everything just works as expected. The only problem I have is the deprecation warning and the fact that I'm put off by the proposed solution to use find_namespace_packages for the reasons given above.

May I suggest that the deprecation warning be dropped or even abbreviated somewhat until the data / code packaging distinction can be restored and preserved by some new PEP or other. It makes no sense to promote solutions, in the interim, which incorrectly conflate the two.

S

@abravalheri
Copy link
Contributor

abravalheri commented Jan 11, 2024

Hi @shakfu, thank you very much for sharing your thoughts. Please see my comments below:

[...] quite challenging situation due to recent changes (PEP 420) [...] But this problematic new state is not one that developers should learn to live with [...]

Please note that the "directory is a package" behaviour introduced in PEP 420 is actually quite old. The PEP has been approved in 19/Apr/2012 and the specified behaviour implemented in Python 3.3, over 10 years ago which is a lot in "software development years".
It is safe to assume that this behaviour is stable and that at some point in their carreers Python developers will indeed learn what a Python package means and how adding a directory nested somewhere under one of the entries of sys.path corresponds effectively to create a Python package that can be regularly imported as any other package via an import statement.

[...] they should not be compelled to adopt insecure quick fixes.

Could you please clarify in what sense the solutions presented in the warning message are insecure?
The error message presents to the user a couple of suggestions (to be chosen accordingly to what fits better their use case) which include to manually add the missing entries to the packages configuration option, or to use a convenience function provided by setuptools. In both cases, the procedure is very mature and stable.

The proposed fix which resolves the deprecation is apparently to use find_namespace_packages. But this is just wrong. Since namespace packages are meant to contain importable executable code.

I understand that this is a popular interpretation of what the concept of packages (and/or namespace packages) might mean for Python and I see where it comes from. But I don't think this interpretation is backed by the Python implementation and the way it works...

You can import directories that don't contain .py files, and having packages for holding non-Python files is actually a very useful feature1! I never found official documentation saying that packages/namespace packages are meant to contain importable executable code and cannot be used to contain only non-Python files, I don't think there is an official stance on that.

Within the resources folder structure (which is specified by the library I'm wrapping) there are 775 files, 10.4 MB in total. Most of these are templated source files and all are part of the thirdparty library itself. But all are precisely data, not to be imported in the project package, only to be used for the purposes of code generation.

If these non-Python files are not meant to be installed in the end-user's machine, I believe it is a matter of properly configuring packages/package_data/include_package_data/exclude_package_data/MANIFEST.in so that they are part of the sdist but not part of the wheel. Otherwise, if they end up nested somewhere under an entry of sys.path, they will be import packages, effectively.

If you really don't like the idea of having these directories as importable packages, then the alternative is to use data-files, which will translate into a special directory in the wheel file ({name}-{version}.data/data/). In turn, pip will stall them in a different location that will not be nested somewhere in sys.path. It is a lot of effort to align expectation and implementation, for most people it might just be worth to adapt their expectations.

May I suggest that the deprecation warning be dropped or even abbreviated somewhat until the data / code packaging distinction can be restored and preserved by some new PEP or other. It makes no sense to promote solutions, in the interim, which incorrectly conflate the two.

I think that simply dropping the warning would be unwise.
The purpose of the warning is for developers to align their configuration to their expectations.
If they want certain directories to be installed somewhere under sys.path, this effectively mean that they are asking setupotools to include certains packages/subpackages into the wheel. Within setuptools, that desire is captured by the packages configuration option.

We need users to start clarifying their configuration, because the next step is to fix other related bugs (see #3340 (comment)), and it would be bad if suddenly some folders are missing from packages.

I don't know of anyone currently attempting to introduce a data / code distinction (and what that will mean for Python packages and import system) via a new PEP. Personally, I like the status quo and I think it works quite well. Since there is no concrete plans for such change in the ecosystem, there is no foreseeable risk of conflation, and we don't need to treat this situation as "interim". As far as we know this is the stable behaviour that we should be targetting to achieve after 10 years of transition.


There is another approach2 that I have absolute no problems in considering and actually would welcome with open arms: if a member of the community is willing to contribute (i.e. design, discuss, find consensus, implement, document, fix, support ...) a different way of configuring setuptools that is more conceptually self-evident and less prone to ambiguity than packages/package_data/include_package_data/exclude_package_data/MANIFEST.in3. Extra requirements for such solution are: backward compatibility and easy maintenance.

It is a tough challenge which I don't have the resources to tackle myself, but I would be very grateful if someone else can.

Footnotes

  1. They make it really easy to find non-Python files them runtime using importlib.resources. You can also implement extensions/plugin systems on top of it, and etc...

  2. But that is orthogonal to the warning and next steps discussed here.

  3. In some sense, the automatic discovery (when the user does not specify packages) that was introduced a couple of years ago is meant to be easier and less confusing. But automatic discovery is not a fit for all and edge cases still require playing with packages/package_data/include_package_data/exclude_package_data.

@abravalheri
Copy link
Contributor

abravalheri commented Jan 11, 2024

@abravalheri I think that adding a hint suggesting to delete include_package_data=True and replace it with a fine-grained package-data mapping would be useful.

That is nice idea. Depending on what the developer is trying to achieve that will indeed solve the problem. Currently the message does mention include-package-data=False:

[...]
If you don't want {importable!r} to be distributed and are
already explicitly excluding {importable!r} via
`find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
you can try to use `exclude_package_data`, or `include-package-data=False` in
combination with a more fine grained `package-data` configuration.

But probably that can be rephrased to be more clear. Would you like to give it a shot?
(P.S.: I consider include-package-data=False a better text for the warning message than omitting, because in pyproject.toml the default value of this config is True, so it just saves the time to explain the nuances).

Another source of confusion was the fact that the error message was unhappy with multidict._multilib, suggesting to use helpers like find_namespace_packages() while it did list the packages explicitly with no discovery needed, via a static list: packages=["multidict"]. This also confused a contributor who attempted adding "multidict._multilib" to that list, not understanding immediately that C-headers aren't to be installed in site-packages/ (or distributed in wheels, for that matter).

I understand that, but unfortunately there is no single fix for all scenarios. For some people include_package_data=False is the fix, for others find_namespace_packages() will do what they are trying to achieve...

And that is the problem with the ambiguous1 configuration, setuptools should not have to guess what the user wants to do, that is why we are asking the user to tell us.

Footnotes

  1. "Ambiguous" while setuptools does not take a final action on the "directories with non-Python files are not packages" controversy... Once the warning manage to compel devs to clarify their config, we can move past that problem and actually allow people that are intentionally omitting entries in packages to use include-package-data.

@shakfu
Copy link

shakfu commented Jan 12, 2024

Hi @abravalheri, appreciate your extended and thoughtful reply to my early post (please see my comments below):

Please note that the "directory is a package" behaviour introduced in PEP 420 is actually quite old. The PEP has been approved in 19/Apr/2012 and the specified behaviour implemented in Python 3.3, over 10 years ago which is a lot in "software development years".

It is safe to assume that this behaviour is stable and that at some point in their carreers Python developers will indeed learn what a Python package means and how adding a directory nested somewhere under one of the entries of sys.path corresponds effectively to create a Python package that can be regularly imported as any other package via an import statement.

I've been working in Python since a little bit before version 1.5.2, so this is certainly not new to me. Nonetheless, I admittedly did not express myself clearly: I have no issue with the principle of "directory is a package" or that packages are importable. This is fundamental to Python since way back.

I'm also happy with having a data folder within a python package such that a module within that package can programmatically always find this data folder by retrieving its path relative to __file__. Indeed this is precisely what I'm doing in the project I described earlier: in the cyfaust.common module I have the following code:

def get_package_resources() -> tuple[str, str, str, str]:
    """provides the paths of package architecture and library folders."""
    resources = os.path.join(os.path.dirname(__file__), "resources")
    archs = os.path.join(resources, "architecture")
    libs = os.path.join(resources, "libraries")
    return ("-A", archs, "-I", libs)

PACKAGE_RESOURCES = get_package_resources()

After reading your prior reply, I appreciate that the current situation (and your understanding of it) is a lot more nuanced than I previously considered, and that the confusion is probably stemming from a couple of things here:

1. My failure to understand the full implications of PEP 420 Namespace Packages:

This PEP makes everything in a package potentially importable, even when that is not the intent!

For a hypothetical example in my project, a user of my package can (currently) import a python file that is only intended to be used as a template.

But due to PEP 420, this is possible:

from cyfaust.resources.architecture.templates import abctemplate 

Now, this can be prevented by renaming resources to a non-python friendly name such as pkg-resources.

So now this is impossible:

from cyfaust.pkg-resources.architecture.templates import abctemplate 

I did not realize that this was the default situation now... and it has nothing to do with setuptools. That's my bad.

2. The Documentation + the Deprecation + the Multiplicity of Configuration options + Different ways of doing the same thing are confusing:

In my case, this is how I tried to resolve the deprecation:

  1. My initial working setup.py:
setup(
    name='cyfaust',
    version=VERSION,
    ext_modules=cythonize(
        extensions,
        language_level="3",
    ),
    packages=['cyfaust'],
    package_dir = {"cyfaust": "src/cyfaust"},
    include_package_data=True
)

Then I saw the deprecation notice, and to figure out how to resolve it I read the relevant datafiles section of the setuptools documentation.

I chose the initial recommended include_package_data method.

This led to the following:

setup(
    name='cyfaust',
    version=VERSION,
    ext_modules=cythonize(
        extensions,
        language_level="3",
    ),
    packages=find_packages(where="src"),
    package_dir = {"cyfaust": "src/cyfaust"},
    include_package_data=True
)

And it works, it produces the same outcomes as my prior working case, but... the deprecation notice is still there.

So after some further reading and research on the web, the advice is use find_namespace_packages so I choose the "Subdirectory for Data Files" method

I naively just changed find_packages to find_namespace_packages since it is seemingly the simplest solution

setup(
    name='cyfaust',
    version=VERSION,
    ext_modules=cythonize(
        extensions,
        language_level="3",
    ),
    packages=find_namespace_packages(where="src"),
    package_dir = {"cyfaust": "src/cyfaust"},
    include_package_data=True
)

and then I get an error:

running build
running build_py
error: package directory 'static' does not exist

At this point the frustration of palpable. All I want to do is specify a subdirectory of my package as a data directory to be included in whole.

The more complex way in this method is to add a package_data entry, but filling it out seems laborious given that I have 775 files data files accross many directories with different types etc..

At this stage, I decide to just live with the deprecation!

You asked for suggestions on how to improve this api. Here are my two cents:

Ideally it should be as simple as something like this:

setup(
    name='cyfaust',
    version=VERSION,
    ext_modules=cythonize(
        extensions,
        language_level="3",
    ),
    packages = {
        'cyfaust': {
            'path': "src/cyfaust",
            'data_dirs': ["cyfaust/pkg-resources"],
        }
    }
 )

or just:

setup(
    name='cyfaust',
    version=VERSION,
    ext_modules=cythonize(
        extensions,
        language_level="3",
    ),
    packages=['cyfaust'],
    package_dir = {"cyfaust": "src/cyfaust"},
    include_package_data=True
)

In the above ideal cases, MANIFEST.in or similar is only configured if you need more granular control over what is included or excluded. It is not required in the simplest case where you want everything included.

@shakfu
Copy link

shakfu commented Jan 13, 2024

@abravalheri

As an addendum to my last post I finally found a way to remove the deprecation warning. I could not get it done with find_namespace_packages and the only way was to run os.walk in my resources folder and then manually add folders which had python friendly names to the packages list as per the below code. Note that this was the only solution I could arrive at to remove the deprecation warning. If there is a better way, please let me know otherwise, my setup.py file looks a little ridiculous. 😟

setup(
    name='cyfaust',
    version=VERSION,
    ext_modules=cythonize(
        extensions,
        language_level="3",
    ),
    package_dir = {"cyfaust": "src/cyfaust"},
    packages = [
        "cyfaust",
            # NOTE: The entries below are not really packages but data folders
            # They have been added to this list to remove a recent deprecation
            # warning caused by the deprecatation of the category of data files
            # within packages. These are now just called packages 
            # (even if that is not the intention).
            # DO NOT IMPORT, even though they are importable!!
        "cyfaust.resources.libraries",
        "cyfaust.resources.libraries.examples",
        "cyfaust.resources.libraries.examples.ambisonics",
        "cyfaust.resources.libraries.examples.analysis",
        "cyfaust.resources.libraries.examples.autodiff",
        "cyfaust.resources.libraries.examples.autodiff.delay",
        "cyfaust.resources.libraries.examples.autodiff.gain",
        "cyfaust.resources.libraries.examples.autodiff.gain_dc",
        "cyfaust.resources.libraries.examples.autodiff.gain_exp",
        "cyfaust.resources.libraries.examples.autodiff.gain_pow",
        "cyfaust.resources.libraries.examples.autodiff.gain_pow_trig",
        "cyfaust.resources.libraries.examples.autodiff.gain_sq",
        "cyfaust.resources.libraries.examples.autodiff.mem",
        "cyfaust.resources.libraries.examples.autodiff.one_zero",
        "cyfaust.resources.libraries.examples.autodiff.recursion",
        "cyfaust.resources.libraries.examples.autodiff.tremolo",
        "cyfaust.resources.libraries.examples.delayEcho",
        "cyfaust.resources.libraries.examples.dynamic",
        "cyfaust.resources.libraries.examples.filtering",
        "cyfaust.resources.libraries.examples.gameaudio",
        "cyfaust.resources.libraries.examples.generator",
        "cyfaust.resources.libraries.examples.misc",
        "cyfaust.resources.libraries.examples.old",
        "cyfaust.resources.libraries.examples.old.rewriting",
        "cyfaust.resources.libraries.examples.phasing",
        "cyfaust.resources.libraries.examples.physicalModeling",
        "cyfaust.resources.libraries.examples.physicalModeling.fds",
        "cyfaust.resources.libraries.examples.physicalModeling.old",
        "cyfaust.resources.libraries.examples.pitchShifting",
        "cyfaust.resources.libraries.examples.psychoacoustic",
        "cyfaust.resources.libraries.examples.quantizing",
        "cyfaust.resources.libraries.examples.reverb",
        "cyfaust.resources.libraries.examples.smartKeyboard",
        "cyfaust.resources.libraries.examples.smartKeyboard.associatedEffects",
        "cyfaust.resources.libraries.examples.spat",
        "cyfaust.resources.architecture",
        "cyfaust.resources.architecture.api",
        "cyfaust.resources.architecture.api.doc",
        "cyfaust.resources.architecture.AU",
        "cyfaust.resources.architecture.AU.AUPublic",
        "cyfaust.resources.architecture.AU.AUPublic.AUBase",
        "cyfaust.resources.architecture.AU.AUPublic.AUEffectBase",
        "cyfaust.resources.architecture.AU.AUPublic.AUInstrumentBase",
        "cyfaust.resources.architecture.AU.AUPublic.Utility",
        "cyfaust.resources.architecture.AU.PublicUtility",
        "cyfaust.resources.architecture.AU.Source",
        "cyfaust.resources.architecture.AU.Source.AUSource",
        "cyfaust.resources.architecture.AU.Source.CocoaUI",
        "cyfaust.resources.architecture.daisy",
        "cyfaust.resources.architecture.faust",
        "cyfaust.resources.architecture.faust.au",
        "cyfaust.resources.architecture.faust.audio",
        "cyfaust.resources.architecture.faust.dsp",
        "cyfaust.resources.architecture.faust.gui",
        "cyfaust.resources.architecture.faust.gui.Styles",
        "cyfaust.resources.architecture.faust.midi",
        "cyfaust.resources.architecture.faust.unity",
        "cyfaust.resources.architecture.faust.vst",
        "cyfaust.resources.architecture.juce",
        "cyfaust.resources.architecture.juce.plugin",
        "cyfaust.resources.architecture.juce.standalone",
        "cyfaust.resources.architecture.sam",
        "cyfaust.resources.architecture.teensy",
        "cyfaust.resources.architecture.vcvrack",
        "cyfaust.resources.architecture.vcvrack.template",
        "cyfaust.resources.architecture.vcvrack.template.res",
        "cyfaust.resources.architecture.vcvrack.template.src",
        "cyfaust.resources.architecture.VST",
    ],
    include_package_data=True
)

@abravalheri
Copy link
Contributor

abravalheri commented Jan 13, 2024

@shakfu, have you tried something like the following?

common = {
   ...,
   "include_package_data": True,
   "exclude_package_data": {"": "*.cpp"},
}

if STATIC:
   setup(
       **common,
       ext_modules=[...],
       package_dir={"": "src/static"},
       packages=find_namespace_packages(where="src/static"),
   )
else:
   setup(
       **common,
       ext_modules=[...],
       package_dir={"": "src"},
       packages=find_namespace_packages(where="src", include=["cyfaust*"]),
   )

The example above requires MANIFEST.in to be configured so that all the files necessary for the build process are present in the sdist. Alternatively if you are OK with all the files in your VCS to also be present in your sdist you can use a plugin like setuptools-scm (which potentially simplifies a lot the configuration).

@shakfu
Copy link

shakfu commented Jan 13, 2024

@abravalheri Thanks for the tip, I'll try it out to see if I can improve it as per your suggestions.

Incidentally, because I have two build variant in my setup.py file, I generate the MANIFEST.in file in each case:

    with open("MANIFEST.in", "w") as f:
        f.write("graft src/cyfaust/resources\n")
        f.write("exclude src/cyfaust/*.cpp\n")

@shakfu
Copy link

shakfu commented Jan 14, 2024

@abravalheri

Incidentally, I deleted my prior post as it included an error that should not have been flagged here. In any case, the setup.py file is working, checked-in and the deprecation is gone. Thanks very much for your help!

@merwok
Copy link
Contributor

merwok commented Jan 14, 2024

  packages=find_namespace_packages(where="src", include=["cyfaust*"]),

It does not seem good if docs have to recommend the definition of namespace packages for projects that are not using namespace packages.

@shakfu
Copy link

shakfu commented Jan 14, 2024

It does not seem good if docs have to recommend the definition of namespace packages for projects that are not using namespace packages.

This is the essence of the issue: the deprecation makes it necessary to use a more complex solution (find_namespace_packages) to a problem (bundling a data folder in one's package) that was previously relatively simple to resolve.

My case above is illustrative.

@abravalheri
Copy link
Contributor

abravalheri commented Jan 14, 2024

It does not seem good if docs have to recommend the definition of namespace packages for projects that are not using namespace packages.

@merwok, please note that if a distribution install a directory nested somewhere under sys.path and this directory that does not contain an __init__.py, then that is indeed a namespace package, and the project is, effectively, using namespace packages. You can use your REPL to check that the directory is importable and the REPL will show you something like <module '...' (namespace)> if you print the object you imported.

The assumption that certain directories that don't contain Python files are exempt of this behaviour is not backed by the implementation, and as far as I know, by the Python docs (the PEP introducing implicit namespaces pretty much says the opposite).


The warning message discussed in this issue has to deal with 2 situations:

  1. When the developer does want such directories to be included in the installation and is trying to rely on include_package_data to do so, but is forgetting to list such directories in packages (therefore, it is providing an incorrect configuration that fails match file structure and intent).
  2. When the developer does not want certain directories to be included and is specifying the correct configuration to do so (either by manually listing packages or by using find_namespace_packages(..) with the include/exclude options, or by consciously relying on the fact that find_packages(...) skip directories without __init__.py). At the same time, the developer is correctly specifying include_package_data such that files in other directories are included.

Currently there is a bug in setuptools that will incorrectly handle both situations.

For situation 1, setuptools will include packages that are not listed in packages, and therefore fail to fulfil the configuration that has been passed.

For situation 2, setuptools will not skip certain directories that are intentionally omitted from packages, and therefore also fail to fulfil the configuration that has been passed.

To fix this bug we need the warning message to bring awareness for users that fall into category 1: that their configuration does not match the intent, and for users that fall into category 2: that they will need to use a workaround while the fix has not been implemented yet.

@merwok
Copy link
Contributor

merwok commented Jan 15, 2024

the project is, effectively, using namespace packages

I understand that, but my viewpoint is about author intent.
In the examples here the developers have regular packages, source files and data files.
They need a clean way to have their packages packaged up and installed, and the data files also packaged up.

@abravalheri
Copy link
Contributor

They need a clean way to have their packages packaged up and installed, and the data files also packaged up.

The clean way is to list any package they are using to host data files in the packages directive. This mental model is aligned with the Python implementation, docs, and work great with importlib.resources.
There is no need for differentiating between types of directories, that is only conceptual overhead that does not really match the way Python works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests