Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] pyproject.toml: Unicode character in a license file prevents dependencies from being installed #4033

Open
1 task done
yurinnick opened this issue Aug 29, 2023 · 27 comments

Comments

@yurinnick
Copy link

yurinnick commented Aug 29, 2023

Description

Some licenses, in this specific example, LGPL-2.1, contains Unicode characters. In case of LGPL 2.1 it is U+000c - Form Feed character.

In pyproject.toml having a license file with those characters silently breaks package dependencies installation.

Expected behavior

  • pip install -e . installs package dependencies
  • pip install -e .[dev] installs optional dev dependencies group

pip version

23.2.1

Python version

3.10, 3.11

OS

Fedora Linux 38

How to Reproduce

  1. Create pyproject.toml
[project]
name = "unicode-bug"
version = "0.1.0"
requires-python = ">=3.10"
license = {file = "LICENSE"}
dependencies = [
    "requests"
]
[project.optional-dependencies]
dev = [
  "pytest"
]
  1. Create LICENSE file with unicode character
$ echo -e '\U000b' > LICENSE
  1. Try to install the package and optional dependencies - doesn't install dependencies
$ pip install -e .
...
Successfully built unicode-bug
Installing collected packages: unicode-bug
  Attempting uninstall: unicode-bug
    Found existing installation: unicode-bug 0.1.0
    Uninstalling unicode-bug-0.1.0:
      Successfully uninstalled unicode-bug-0.1.0
Successfully installed unicode-bug-0.1.0

$ pip install -e .[dev]
...
WARNING: unicode-bug 0.1.0 does not provide the extra 'dev'
...
  1. Remove unicode character
$ echo "" > LICENSE
  1. Try to install the package and optional dependencies - installation works
$ pip install -e .
...
Successfully built unicode-bug
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, unicode-bug
  Attempting uninstall: unicode-bug
    Found existing installation: unicode-bug 0.1.0
    Uninstalling unicode-bug-0.1.0:
      Successfully uninstalled unicode-bug-0.1.0
Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 idna-3.4 requests-2.31.0 unicode-bug-0.1.0 urllib3-2.0.4

$ pip install -e .[dev]
Successfully built unicode-bug
Installing collected packages: pluggy, packaging, iniconfig, unicode-bug, pytest
  Attempting uninstall: unicode-bug
    Found existing installation: unicode-bug 0.1.0
    Uninstalling unicode-bug-0.1.0:
      Successfully uninstalled unicode-bug-0.1.0
Successfully installed iniconfig-2.0.0 packaging-23.1 pluggy-1.3.0 pytest-7.4.0 unicode-bug-0.1.0

Output

No response

Code of Conduct

@RonnyPfannschmidt
Copy link
Contributor

What's the output when using a valid build-system section, and what warnings get shown when using build to make the wheel?

@yurinnick
Copy link
Author

yurinnick commented Aug 29, 2023

Add build-system section:

...
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

Install and build the wheel with broken license file:

$ pip install -e .
Obtaining file:///home/yurinnick/Documents/test
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: unicode-bug
  Building editable for unicode-bug (pyproject.toml) ... done
  Created wheel for unicode-bug: filename=unicode_bug-0.1.0-0.editable-py3-none-any.whl size=2922 sha256=4277b75c0ebbf7978dc5a79089f9d5b8cce84d2179e10aa837c2069139ff977d
  Stored in directory: /tmp/pip-ephem-wheel-cache-6sqe4pu4/wheels/5e/bb/78/14fd4ec013baed754dd266aec43868a061011260e628caafc4
Successfully built unicode-bug
Installing collected packages: unicode-bug
  Attempting uninstall: unicode-bug
    Found existing installation: unicode-bug 0.1.0
    Uninstalling unicode-bug-0.1.0:
      Successfully uninstalled unicode-bug-0.1.0
Successfully installed unicode-bug-0.1.0

$ pip wheel .
Processing /home/yurinnick/Documents/test
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unicode-bug
  Building wheel for unicode-bug (pyproject.toml) ... done
  Created wheel for unicode-bug: filename=unicode_bug-0.1.0-py3-none-any.whl size=1425 sha256=ff6fb74aa5124491036b8ab8f8788910f7a6ba9b5d65285db8568f45e99975a9
  Stored in directory: /tmp/pip-ephem-wheel-cache-mk_5fqz9/wheels/5e/bb/78/14fd4ec013baed754dd266aec43868a061011260e628caafc4
Successfully built unicode-bug

So no new warnings or errors as far as I can tell.

@RonnyPfannschmidt
Copy link
Contributor

I meant the build package,which has easier opt out of build output hiding

@yurinnick
Copy link
Author

python -m build yields this output: https://gist.github.com/yurinnick/bf06c9d0f97f9c361b38038f5fe8e072

@pradyunsg
Copy link
Member

FWIW, omitting build-system is definitely valid, and passing -v to pip will present the build output. There is no need to use build for it.

@pfmoore
Copy link
Member

pfmoore commented Aug 29, 2023

Works fine for me on Windows:

❯ pip install -vvv -e ".[dev]"
Using pip 23.2.1 from C:\Work\Scratch\foof\.venv\Lib\site-packages\pip (python 3.11)
Non-user install because user site-packages disabled
...
Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 colorama-0.4.6 idna-3.4 iniconfig-2.0.0 packaging-23.1 pluggy-1.3.0 pytest-7.4.0 requests-2.31.0 unicode-bug-0.1.0 urllib3-2.0.4
Remote version of pip: 23.2.1
Local version of pip:  23.2.1
Was pip installed by pip? True
Removed build tracker: 'C:\\Users\\xxx\\AppData\\Local\\Temp\\pip-build-tracker-7gbkreko'

❯ pip list
Package            Version   Editable project location
------------------ --------- -------------------------
certifi            2023.7.22
charset-normalizer 3.2.0
colorama           0.4.6
idna               3.4
iniconfig          2.0.0
packaging          23.1
pip                23.2.1
pluggy             1.3.0
pytest             7.4.0
requests           2.31.0
unicode-bug        0.1.0     C:\Work\Scratch\foof
urllib3            2.0.4

(note pytest is installed, so the dev extra was recognised).

@yurinnick
Copy link
Author

@yurinnick
Copy link
Author

I did some research and my findings so far:

  • The problem should be somewhere in the metadata parsing logic in pip._internal.metadata.importlib._dists.Distribution class
  • Distribution loaded from unicode_bug.egg_info directory has valid metadata [1]
  • Discribution loaded from temporary /tmp/pip-modern-metadata-0q88s1hj/unicode_bug-0.1.0.dist-info has invalid metadata [2]

[1]

>>> d.metadata.values()
['2.1', 'unicode-bug', '0.1.0', '\n        ', '>=3.10', 'dev', 'LICENSE', 'requests', 'pytest ; extra == "dev"']
>>> d.metadata.keys()
['Metadata-Version', 'Name', 'Version', 'License', 'Requires-Python', 'Provides-Extra', 'License-File', 'Requires-Dist', 'Requires-Dist']

[2]

[('Metadata-Version', '2.1'), ('Name', 'unicode-bug'), ('Version', '0.1.0'), ('License', ''), ('Description', "        \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n")]

@uranusjr
Copy link
Member

This sounds like a setuptools bug to me. The dist-info directory in the temporary directory is generated by the build backend (setuptools) and pip only reads the result.

gctucker referenced this issue in gctucker/kernelci-api Aug 30, 2023
Add an initial pyproject.toml file to be able to build and install the
api Python package.

Thanks to Nikolay Yurin for finding the bug about the license file:
https://github.com/pypa/pip/issues/12251

This project uses a standard license so it's better to just rely on
the LGPL-2.1-or-later SPDX indentifier instead of providing the path
to the whole license file.

Development dependencies have been omitted and all the required
dependencies have a fixed version in order to build reproducible
Docker images for production deployment.  Some follow-up changes can
be made to have a better way to handle dependencies for development.

Signed-off-by: Guillaume Tucker <[email protected]>
@pradyunsg
Copy link
Member

This is definitely a setuptools bug. Transferring the issue over. :)

@pradyunsg pradyunsg transferred this issue from pypa/pip Aug 30, 2023
@abravalheri
Copy link
Contributor

abravalheri commented Aug 30, 2023

Hi @yurinnick thank you very much for reporting, I will have a look on this.

Distribution loaded from unicode_bug.egg_info directory has valid metadata [1]\nDiscribution loaded from temporary /tmp/pip-modern-metadata-0q88s1hj/unicode_bug-0.1.0.dist-info has invalid metadata [2]

This is interesting. Is the PKG-INFO file in the top of the sdist OK? I had a quick view in the code and it seems that we do use UTF-8 to read the file pointed out in pyproject.toml. And then we use UTF-8 again to write PKG-INFO.

One thing that might be related is that setuptools produces PKG-INFO files and those are later transformed into METADATA by pypa/wheel. But then pypa/wheel also seems to deal with the files using UTF-8.

The code contains some places that we use email.message_from_file (via distutils) to read PKG-INFO when building from the sdist, this might be where UTF-8 is not explicitly used. But I am not sure this is used, I need to investigate better.

If you build the project skipping the sdist step (e.g. python -m build --wheel) is the metadata file still ill-formed?

@yurinnick
Copy link
Author

PKG-INFO in egg-info directory has the unicode character, but gets parsed correctly.
LICENSE file from dist-info directory also contains unicode, but METADATA file doesn't.

Here are the files in the hex form: https://gist.github.com/yurinnick/94fc3d9b19e0bf6270f7dd332f4b0e61

@pfmoore
Copy link
Member

pfmoore commented Aug 30, 2023

One thing confuses me. You keep referring to a "unicode character", but the reproducer you gave (which, as I said, works fine for me on Windows) uses the character \U000b, which is a perfectly ordinary ASCII character - 'LINE TABULATION' (U+000B).

Is the problem here caused by the fact that this character is whitespace, and as such is getting stripped at some point due to something removing trailing whitespace from lines?

@yurinnick
Copy link
Author

yurinnick commented Aug 30, 2023

The one in the license I believe is \U000c, I may mixed something up during my experiments though. I am not knowledgeable in encodings, so I assumed it's a unicode symbol: https://www.fileformat.info/info/unicode/char/000c/index.htm

@abravalheri
Copy link
Contributor

abravalheri commented Aug 30, 2023

I added some debug statements to see when the transformation is happening and it seems to come from https://github.com/pypa/wheel/blob/0.41.2/src/wheel/bdist_wheel.py#L587:

...
       serialization_policy = EmailPolicy(  # email.policy.EmailPolicy
            utf8=True,
            mangle_from_=False,
            max_line_length=0,
        )
        with open(pkg_info_path, "w", encoding="utf-8") as out:
            Generator(out, policy=serialization_policy).flatten(pkg_info)  # email.generator.Generator
...

This is how I obtained the info:

image

(patch format)
diff .venv/lib/python3.10/site-packages/wheel/orig_bdist_wheel.py .venv/lib/python3.10/site-packages/wheel/bdist_wheel.py
558a559,563
>             with open(pkginfo_path, "rb") as fp:
>                 print(f"\n\n*** {__file__=} {sys._getframe().f_code.co_name=} ***")
>                 print("---------------------PKG-INFO-------------------------->>")
>                 print(repr(fp.read()))
>                 print("<<---------------------PKG-INFO--------------------------")
559a565,569
>             metadata = pkg_info
>
>             print("\n\n---------------------LICENSE-------------------------->>")
>             print(f"{metadata['License']=!r}")
>             print("<<---------------------LICENSE--------------------------")
587a598,602
>
>         with open(pkg_info_path, "rb") as fp:
>             print("\n\n---------------------METADATA-------------------------->>")
>             print(repr(fp.read()))
>             print("<<---------------------METADATA--------------------------\n\n")

And this is the output (when I run python -m build --no-isolation1):

...

*** __file__='/tmp/myproj/.venv/lib64/python3.10/site-packages/wheel/bdist_wheel.py' sys._getframe().f_code.co_name='egg2dist' ***
---------------------PKG-INFO-------------------------->>
b'Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \x0b\n        \nRequires-Python: >=3.10\nProvides-Extra: dev\nLicense-File: LICENSE\n'
<<---------------------PKG-INFO--------------------------


---------------------LICENSE-------------------------->>
metadata['License']='\x0b\n        '
<<---------------------LICENSE--------------------------


---------------------METADATA-------------------------->>
b"Metadata-Version: 2.1\nName: unicode-bug\nVersion: 0.1.0\nLicense: \n\n        \nRequires-Python: >=3.10\nLicense-File: LICENSE\nRequires-Dist: requests\nProvides-Extra: dev\nRequires-Dist: pytest ; extra == 'dev'\n\n"
<<---------------------METADATA--------------------------

...

Now, I am not sure why stdlib's email.policy.EmailPolicy is replacing the UTF-8 character if utf8=True... Maybe that is part of the spec in the email message RFC? I am not sure how to tackle this.

Maybe it is worth to discuss with pypa/wheel?
Do the pip maintainers have any suggestions?

Something very similar happens if you replace \U000b with \U000c.

Footnotes

  1. I use --no-isolation to force the "patched" version of wheel to be used instead of build downloading a new one.

@pfmoore
Copy link
Member

pfmoore commented Aug 30, 2023

Hang on. License-Path isn't valid metadata. So presumably it's a setuptools-specific thing, and bdist_wheel is trying to make it into a valid value for the License field?

It looks like bdist_wheel is failing to "properly"1 escape that value, resulting in a metadata file with an unescaped \n\n in the license field, which is then treated as the end of the headers with everything following being dumped into the description (because it's now the message body).

Footnotes

  1. I have no idea what proper escaping would be for a multi-line value, it's not at all well-defined if I recall correctly.

@abravalheri
Copy link
Contributor

abravalheri commented Aug 30, 2023

The License-File is left unprocessed (which is unspecified, but was implemented in setuptools by a contribution in the context of PEP 639 a few years ago1).

The License field is already originally present in the PKG-INFO file (and it contains the UTF-8 char).
When project.licence.file is given in pyproject.toml, setuptools will read the file (using UTF-8) and populate the License core-metadata (that seems to be what is meant to happen by PEP 621).

Footnotes

  1. I initially thought it was a contribution of the original author of PEP 639, but I was mistaken.

@abravalheri
Copy link
Contributor

@agronholm do you have any thoughts in the matter? (see #4033 (comment)).

The escaping behaviour seems to be inherited by bdist_wheel from the standard library, but I am not sure if it is intentional or accidental (in the stdlib, since utf8=True is passed to the EmailPolicy)

@pfmoore
Copy link
Member

pfmoore commented Aug 30, 2023

Ah. I missed that. Looks like it's translating \x0b to \n` in that case.

Yes:

from email.policy import EmailPolicy
policy = EmailPolicy(utf8=True, mangle_from_=False, max_line_length=0)
print(repr(policy.fold("License", "a\x0bb\n")))

produces

❯ py .\em.py
'License: a\nb\n'

So that policy doesn't handle arbitrary control characters correctly...

It's worth noting that the specification is (deliberately) vague on how to serialise metadata to a file:

However, email formats have been revised several times, and exactly which email RFC applies to packaging metadata is not specified. In the absence of a precise definition, the practical standard is set by what the standard library email.parser module can parse using the compat32 policy.

So there's no formal answer to the question "how do we handle data like this?" - other than "don't do that, then" 🙁

@pfmoore
Copy link
Member

pfmoore commented Aug 30, 2023

The following is worth noting:

>>> m = email.message_from_string("Foo: a\x0b\x0bc\nBar: 12\n\nBody") 
>>> m.as_string()
'Foo: a\nc\nBar: 12\n\nBody'
>>> m["Foo"]
'a\x0b\x0bc'

The email module can parse data where header values contain values like \x0b, but it can't produce them. So it doesn't round-trip correctly.

@abravalheri
Copy link
Contributor

abravalheri commented Aug 30, 2023

One minor comment is that the compat32 policy will do a different escaping that could potentially work (I haven't tested)1. The compat32 is however a bit problematic because it does not support UTF-8.

If we replace the EmailPolicy(utf8=True, ...) with compat32 we are likely to start receiving reports of people's names/readmes not rendering properly on PyPI...

Footnotes

  1. It still removes \x0b but it does not seem to add a second \n:
    print(repr(compat32.fold("License", '\x0b\n '))) => 'License: \n'

@pfmoore
Copy link
Member

pfmoore commented Aug 30, 2023

Like it or not, compat32 (for reading) is what the standard says. I omitted an additional part of the spec:

Whenever metadata is serialised to a byte stream (for example, to save to a file), strings must be serialised using the UTF-8 encoding.

And what does (mostly) work is to read the file using UTF-8 into a string, and then parse with email.message_from_string.

>>> m = email.message_from_string("Name: André")                      
>>> m["Name"]
'André'

You can't round-trip like this, and the email module (and the spec!) gives you little or no help in writing such a metadata file, so I agree that bdist_wheel has a problem with creating a file like this, but if you can create one, it's readable.

IMO, the temporary solution here is probably for setuptools to just refuse to allow control characters in a license file. The UTF-8 question is something of a red herring here, as the current behaviour with UTF-8 is fine, it's only control character handling that seems broken.

Longer term, license file handling will have to be dealt with as part of PEP 639, and if the problem of writing arbitrary user-entered text into metadata values other than license and description becomes a problem, someone will need to bite the bullet and propose a standard for how to encode such data (or we switch to a different metadata serialisation format, like JSON...)

@agronholm
Copy link
Contributor

So is there some metadata getting used by wheel that should be omitted?

@pfmoore
Copy link
Member

pfmoore commented Aug 30, 2023

@agronholm The problem is basically that if PKG-INFO contains text like b"License: \x0b\n \n", bdist_wheel isn't encoding it correctly (i.e., in a way that can be read back).

This is because neither the metadata spec nor the email standards really say what to do with control characters in header values, and the stdlib email package appears to handle it inconsistently.

My recommendation is to reject characters like \x0b - either just fail with a helpful error, or strip them out somehow.

@agronholm
Copy link
Contributor

Are newlines still allowed though?

@abravalheri
Copy link
Contributor

abravalheri commented Aug 30, 2023

This is because neither the metadata spec nor the email standards really say what to do with control characters in header values,

The current stdlib implementation uses .splitlines() in the header value and then re-joins it. Some control characters will be interpreted as a newline:

>>> import itertools, pprint
>>> pprint.pprint({f"0x{i:02x}": (chr(i), f"{chr(i)}\n".splitlines()) for i in itertools.chain(range(0x1f), [0x7f])})
{'0x00': ('\x00', ['\x00']),
 '0x01': ('\x01', ['\x01']),
 '0x02': ('\x02', ['\x02']),
 '0x03': ('\x03', ['\x03']),
 '0x04': ('\x04', ['\x04']),
 '0x05': ('\x05', ['\x05']),
 '0x06': ('\x06', ['\x06']),
 '0x07': ('\x07', ['\x07']),
 '0x08': ('\x08', ['\x08']),
 '0x09': ('\t', ['\t']),
 '0x0a': ('\n', ['', '']),
 '0x0b': ('\x0b', ['', '']),
 '0x0c': ('\x0c', ['', '']),
 '0x0d': ('\r', ['']),
 '0x0e': ('\x0e', ['\x0e']),
 '0x0f': ('\x0f', ['\x0f']),
 '0x10': ('\x10', ['\x10']),
 '0x11': ('\x11', ['\x11']),
 '0x12': ('\x12', ['\x12']),
 '0x13': ('\x13', ['\x13']),
 '0x14': ('\x14', ['\x14']),
 '0x15': ('\x15', ['\x15']),
 '0x16': ('\x16', ['\x16']),
 '0x17': ('\x17', ['\x17']),
 '0x18': ('\x18', ['\x18']),
 '0x19': ('\x19', ['\x19']),
 '0x1a': ('\x1a', ['\x1a']),
 '0x1b': ('\x1b', ['\x1b']),
 '0x1c': ('\x1c', ['', '']),
 '0x1d': ('\x1d', ['', '']),
 '0x1e': ('\x1e', ['', '']),
 '0x7f': ('\x7f', ['\x7f'])}

supakeen added a commit to supakeen/pykickstart that referenced this issue Nov 9, 2023
Having control characters in the LICENSE file prevents setuptools from
installing dependencies when used in PEP-517 compliant mode. See:
pypa/setuptools#4033
@supakeen
Copy link

supakeen commented Nov 9, 2023

Ran into this issue; it's of note that some standard LICENSE texts will contain characters that end up being incorrectly escaped thus generating a PKG-INFO that's 'broken'.

bcl pushed a commit to pykickstart/pykickstart that referenced this issue Nov 9, 2023
Having control characters in the LICENSE file prevents setuptools from
installing dependencies when used in PEP-517 compliant mode. See:
pypa/setuptools#4033
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants