[BUG] pyproject.toml: Unicode character in a license file prevents dependencies from being installed #4033
Comments
What's the output when using a valid build-system section, and what warnings get shown when using build to make the wheel? |
Add
Install and build the wheel with broken license file:
So no new warnings or errors as far as I can tell. |
I meant the build package, which makes it easier to opt out of build output hiding |
|
FWIW, omitting |
Works fine for me on Windows:
(note pytest is installed, so the |
Verbose pip install output: https://gist.github.com/yurinnick/9128d68e5f9be9b0b50362122d79346c |
I did some research and my findings so far:
[1] >>> d.metadata.values()
['2.1', 'unicode-bug', '0.1.0', '\n ', '>=3.10', 'dev', 'LICENSE', 'requests', 'pytest ; extra == "dev"']
>>> d.metadata.keys()
['Metadata-Version', 'Name', 'Version', 'License', 'Requires-Python', 'Provides-Extra', 'License-File', 'Requires-Dist', 'Requires-Dist'] [2]
|
This sounds like a setuptools bug to me. The dist-info directory in the temporary directory is generated by the build backend (setuptools) and pip only reads the result. |
Add an initial pyproject.toml file to be able to build and install the api Python package.
Thanks to Nikolay Yurin for finding the bug about the license file: https://github.com/pypa/pip/issues/12251
This project uses a standard license so it's better to just rely on the LGPL-2.1-or-later SPDX identifier instead of providing the path to the whole license file.
Development dependencies have been omitted and all the required dependencies have a fixed version in order to build reproducible Docker images for production deployment. Some follow-up changes can be made to have a better way to handle dependencies for development.
Signed-off-by: Guillaume Tucker <[email protected]>
This is definitely a setuptools bug. Transferring the issue over. :) |
Hi @yurinnick, thank you very much for reporting. I will have a look at this.
This is interesting. Is the PKG-INFO file in the top of the sdist OK? I had a quick view in the code and it seems that we do use UTF-8 to read the file pointed out in One thing that might be related is that setuptools produces The code contains some places that we use If you build the project skipping the |
Here are the files in the hex form: https://gist.github.com/yurinnick/94fc3d9b19e0bf6270f7dd332f4b0e61 |
One thing confuses me. You keep referring to a "unicode character", but the reproducer you gave (which, as I said, works fine for me on Windows) uses the character Is the problem here caused by the fact that this character is whitespace, and as such is getting stripped at some point due to something removing trailing whitespace from lines? |
The one in the license I believe is |
I added some debug statements to see when the transformation is happening and it seems to come from https://github.com/pypa/wheel/blob/0.41.2/src/wheel/bdist_wheel.py#L587:
...
serialization_policy = EmailPolicy( # email.policy.EmailPolicy
utf8=True,
mangle_from_=False,
max_line_length=0,
)
with open(pkg_info_path, "w", encoding="utf-8") as out:
Generator(out, policy=serialization_policy).flatten(pkg_info) # email.generator.Generator
...
This is how I obtained the info (patch format):
diff .venv/lib/python3.10/site-packages/wheel/orig_bdist_wheel.py .venv/lib/python3.10/site-packages/wheel/bdist_wheel.py
558a559,563
> with open(pkginfo_path, "rb") as fp:
> print(f"\n\n*** {__file__=} {sys._getframe().f_code.co_name=} ***")
> print("---------------------PKG-INFO-------------------------->>")
> print(repr(fp.read()))
> print("<<---------------------PKG-INFO--------------------------")
559a565,569
> metadata = pkg_info
>
> print("\n\n---------------------LICENSE-------------------------->>")
> print(f"{metadata['License']=!r}")
> print("<<---------------------LICENSE--------------------------")
587a598,602
>
> with open(pkg_info_path, "rb") as fp:
> print("\n\n---------------------METADATA-------------------------->>")
> print(repr(fp.read()))
> print("<<---------------------METADATA--------------------------\n\n")
And this is the output (when I run ...
*** __file__='/tmp/myproj/.venv/lib64/python3.10/site-packages/wheel/bdist_wheel.py' sys._getframe().f_code.co_name='egg2dist' ***
---------------------PKG-INFO-------------------------->>
b'Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \x0b\n \nRequires-Python: >=3.10\nProvides-Extra: dev\nLicense-File: LICENSE\n'
<<---------------------PKG-INFO--------------------------
---------------------LICENSE-------------------------->>
metadata['License']='\x0b\n '
<<---------------------LICENSE--------------------------
---------------------METADATA-------------------------->>
b"Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \n\n \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n"
<<---------------------METADATA--------------------------
... Now, I am not sure why stdlib's Maybe it is worth to discuss with Something very similar happens if you replace Footnotes
|
Hang on. It looks like bdist_wheel is failing to "properly"1 escape that value, resulting in a metadata file with an unescaped Footnotes
|
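To make the consequence of that unescaped line break concrete, here is a minimal sketch that feeds the METADATA content from the debug output above to the stdlib email parser. It assumes the consumer parses the file with `email.message_from_string` under the default `compat32` policy; the variable names are illustrative and not taken from pip or setuptools.

```python
from email import message_from_string

# METADATA content as printed in the debug output above, decoded from UTF-8.
broken_metadata = (
    "Metadata-Version: 2.1\n"
    "Name: unicode-bug\n"
    "Version: 0.1.0\n"
    "License: \n"
    "\n"  # blank line: the parser treats this as the end of the headers
    " \n"
    "Requires-Python: >=3.10\n"
    "License-File: LICENSE\n"
    "Requires-Dist: requests\n"
    "Provides-Extra: dev\n"
    "Requires-Dist: pytest ; extra == 'dev'\n"
    "\n"
)

# Default policy is compat32 (an assumption about how a consumer reads the file).
msg = message_from_string(broken_metadata)

header_names = msg.keys()                       # headers seen before the blank line
requires_dist = msg.get_all("Requires-Dist")    # None when no such header was parsed
body = msg.get_payload()                        # everything after the blank line

print(header_names)
print(requires_dist)
print("Requires-Dist: requests" in body)
```

Under these assumptions, everything after the blank line lands in the message body, so the `Requires-Dist` entries are never seen as headers. That would be consistent with dependencies silently not being installed.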
The The Footnotes
|
@agronholm do you have any thoughts in the matter? (see #4033 (comment)). The escaping behaviour seems to be inherited by |
Ah. I missed that. Looks like it's translating Yes:
produces
So that policy doesn't handle arbitrary control characters correctly... It's worth noting that the specification is (deliberately) vague on how to serialise metadata to a file:
So there's no formal answer to the question "how do we handle data like this?" - other than "don't do that, then" 🙁 |
The following is worth noting:
The email module can parse data where header values contain values like |
One minor comment is that the If we replace the Footnotes
|
Like it or not, compat32 (for reading) is what the standard says. I omitted an additional part of the spec:
And what does (mostly) work is to read the file using UTF-8 into a string, and then parse with
You can't round-trip like this, and the email module (and the spec!) gives you little or no help in writing such a metadata file, so I agree that IMO, the temporary solution here is probably for setuptools to just refuse to allow control characters in a license file. The UTF-8 question is something of a red herring here, as the current behaviour with UTF-8 is fine, it's only control character handling that seems broken. Longer term, license file handling will have to be dealt with as part of PEP 639, and if the problem of writing arbitrary user-entered text into metadata values other than license and description becomes a problem, someone will need to bite the bullet and propose a standard for how to encode such data (or we switch to a different metadata serialisation format, like JSON...) |
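The suggestion to refuse control characters in a license file could be sketched roughly as follows. This is only an illustration, not setuptools code: `find_line_breaking_chars` is a hypothetical helper, and treating `\n` and `\r` as the only acceptable line endings is an assumption made for the example.

```python
def find_line_breaking_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for characters that str.splitlines()
    treats as a line boundary, other than plain "\n" and "\r".

    Hypothetical helper for illustration only.
    """
    allowed = {"\n", "\r"}  # assumption: ordinary line endings stay allowed
    hits = []
    for index, char in enumerate(text):
        if char in allowed:
            continue
        # If the character splits "a?b" into two lines, it acts as a line break.
        if len(f"a{char}b".splitlines()) > 1:
            hits.append((index, f"U+{ord(char):04X}"))
    return hits


# Example: a form feed at the end of the first line, as in the LGPL-2.1 text.
sample = "GNU LESSER\x0c\nVersion 2.1\n"
problems = find_line_breaking_chars(sample)
print(problems)
```

A build backend could raise an error, or strip the offending characters, when the returned list is non-empty. Note that this check also flags Unicode line separators such as U+2028, because `str.splitlines()` treats them as boundaries too.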
So is there some metadata getting used by wheel that should be omitted? |
@agronholm The problem is basically that if This is because neither the metadata spec nor the email standards really say what to do with control characters in header values, and the stdlib My recommendation is to reject characters like |
Are newlines still allowed though? |
The current stdlib implementation uses
>>> import itertools, pprint
>>> pprint.pprint({f"0x{i:02x}": (chr(i), f"{chr(i)}\n".splitlines()) for i in itertools.chain(range(0x1f), [0x7f])})
{'0x00': ('\x00', ['\x00']),
'0x01': ('\x01', ['\x01']),
'0x02': ('\x02', ['\x02']),
'0x03': ('\x03', ['\x03']),
'0x04': ('\x04', ['\x04']),
'0x05': ('\x05', ['\x05']),
'0x06': ('\x06', ['\x06']),
'0x07': ('\x07', ['\x07']),
'0x08': ('\x08', ['\x08']),
'0x09': ('\t', ['\t']),
'0x0a': ('\n', ['', '']),
'0x0b': ('\x0b', ['', '']),
'0x0c': ('\x0c', ['', '']),
'0x0d': ('\r', ['']),
'0x0e': ('\x0e', ['\x0e']),
'0x0f': ('\x0f', ['\x0f']),
'0x10': ('\x10', ['\x10']),
'0x11': ('\x11', ['\x11']),
'0x12': ('\x12', ['\x12']),
'0x13': ('\x13', ['\x13']),
'0x14': ('\x14', ['\x14']),
'0x15': ('\x15', ['\x15']),
'0x16': ('\x16', ['\x16']),
'0x17': ('\x17', ['\x17']),
'0x18': ('\x18', ['\x18']),
'0x19': ('\x19', ['\x19']),
'0x1a': ('\x1a', ['\x1a']),
'0x1b': ('\x1b', ['\x1b']),
'0x1c': ('\x1c', ['', '']),
'0x1d': ('\x1d', ['', '']),
'0x1e': ('\x1e', ['', '']),
'0x7f': ('\x7f', ['\x7f'])} |
Having control characters in the LICENSE file prevents setuptools from installing dependencies when used in PEP-517 compliant mode. See: pypa/setuptools#4033
Ran into this issue; it's of note that some standard LICENSE texts will contain characters that end up being incorrectly escaped thus generating a PKG-INFO that's 'broken'. |
Description
Some licenses (in this specific example, LGPL-2.1) contain Unicode characters. In the case of LGPL 2.1 it is U+000C, the Form Feed character.
In pyproject.toml, having a license file with those characters silently breaks the installation of package dependencies.
Expected behavior
pip install -e . installs package dependencies
pip install -e .[dev] installs the optional dev dependencies group
pip version
23.2.1
Python version
3.10, 3.11
OS
Fedora Linux 38
How to Reproduce
pyproject.toml
LICENSE file with unicode character
Output
No response