.pth files cannot contain folders with utf-8 names #77102

Closed
einaren mannequin opened this issue Feb 23, 2018 · 21 comments
Assignees
Labels
OS-windows topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@einaren
Mannequin

einaren mannequin commented Feb 23, 2018

BPO 32921
Nosy @pfmoore, @vstinner, @tjguk, @ezio-melotti, @zware, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-02-23.13:03:18.090>
labels = ['type-bug', 'expert-unicode', 'OS-windows']
title = '.pth files cannot contain folders with utf-8 names'
updated_at = <Date 2018-03-05.17:53:23.182>
user = 'https://bugs.python.org/einaren'

bugs.python.org fields:

activity = <Date 2018-03-05.17:53:23.182>
actor = 'steve.dower'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode', 'Windows']
creation = <Date 2018-02-23.13:03:18.090>
creator = 'einaren'
dependencies = []
files = []
hgrepos = []
issue_num = 32921
keywords = []
message_count = 2.0
messages = ['312635', '313273']
nosy_count = 7.0
nosy_names = ['paul.moore', 'vstinner', 'tim.golden', 'ezio.melotti', 'zach.ware', 'steve.dower', 'einaren']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue32921'
versions = ['Python 3.6']

Linked PRs

@einaren
Mannequin Author

einaren mannequin commented Feb 23, 2018

Add "G:\русский язык" to a .pth file and start Python. It fails with:

--------------

Failed to import the site module
Traceback (most recent call last):
  File "C:\Program Files\ROXAR\RMS dev_release\windows-amd64-vc_14_0-release\bin\lib\site.py", line 546, in <module>
    main()
  File "C:\Program Files\ROXAR\RMS dev_release\windows-amd64-vc_14_0-release\bin\lib\site.py", line 532, in main
    known_paths = addusersitepackages(known_paths)
  File "C:\Program Files\ROXAR\RMS dev_release\windows-amd64-vc_14_0-release\bin\lib\site.py", line 287, in addusersitepackages
    addsitedir(user_site, known_paths)
  File "C:\Program Files\ROXAR\RMS dev_release\windows-amd64-vc_14_0-release\bin\lib\site.py", line 209, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "C:\Program Files\ROXAR\RMS dev_release\windows-amd64-vc_14_0-release\bin\lib\site.py", line 165, in addpackage
    for n, line in enumerate(f):
  File "C:\Program Files\ROXAR\RMS dev_release\windows-amd64-vc_14_0-release\bin\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 8: character maps to <undefined>

This might very well have side effects, but adding "encoding='utf-8'" to the open() call in addpackage() in site.py seems to fix the issue for me.
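The decode error in the traceback can be reproduced without touching site.py. This is a minimal sketch, assuming the .pth file was saved as UTF-8 and is read back with code page 1252, as in the report:

```python
# The directory name from the report, as it would appear in a .pth file.
path = "G:\\русский язык"

# Bytes on disk if the editor saved the file as UTF-8.
data = path.encode("utf-8")

try:
    # Roughly what happens when the file is read with code page 1252.
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    failure = (exc.start, data[exc.start])
    print(f"cannot decode byte {data[exc.start]:#x} at position {exc.start}")
else:
    failure = None

# Decoding with the encoding the file was written in succeeds.
assert data.decode("utf-8") == path
```

The failing byte and position line up with the message in the traceback, which suggests the file in the report really was UTF-8 on disk.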

@einaren einaren mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Feb 23, 2018
@zooba
Member

zooba commented Mar 5, 2018

Yes, it'll have significant side effects. The default file encoding on Windows is your configured code page (1252, in your case), and there's no good way around that default. The easiest immediate fix is to re-encode that file yourself.

Perhaps what we could do instead is allow the first line of a .pth file to be a coding comment? Then site.py can reopen the file with the specified encoding.

(FWIW, when I added the ._pth file, I explicitly made it UTF-8. But it had no history at that time so it was safe to do so.)
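The "re-encode that file yourself" workaround could look roughly like the sketch below. The helper name reencode_pth is made up for illustration, and it assumes the file is currently UTF-8 and that the locale encoding is what site.py will read it with:

```python
import locale
from pathlib import Path


def reencode_pth(path, source="utf-8", target=None):
    """Rewrite a .pth file from `source` encoding to `target` encoding."""
    if target is None:
        # Assumption: the locale encoding is what site.py will read with.
        target = locale.getpreferredencoding(False)
    pth = Path(path)
    text = pth.read_bytes().decode(source)
    # Raises UnicodeEncodeError if the target cannot represent the path.
    pth.write_bytes(text.encode(target))
    return target
```

Note that this only helps when the code page can represent the directory name at all. Code page 1252 has no Cyrillic letters, so for the path in the report the encode step would raise UnicodeEncodeError.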

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@abravalheri

abravalheri commented May 30, 2023

This seems to be related to the problem in pypa/setuptools#3937.

I wonder what is the position of the core maintainers about it. Is this a valid bug, or is it considered a wontfix/not a bug at all?

Considering that Python uses UTF-8 by default for source files, and that .pth files are kind of source files (they can even contain valid Python code), my first reaction was to find the current behaviour weird...

One important point is that most of the time the .pth file is not produced by the users themselves but by the packaging system for editable installs.

While the packaging system could use locale.getpreferredencoding() to create the .pth files in the native encoding [1], if at a later stage the users set PYTHONUTF8 or invoke python -X utf8, these files would stop working. Ideally, we would like to produce valid .pth files that can be used regardless of whether the user decides to run Python in UTF-8 mode.
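To make the concern concrete, here is a small sketch of the mismatch. The path is hypothetical and cp1252 merely stands in for "the native encoding" of the machine that wrote the file:

```python
# Hypothetical project path containing a non-ASCII character.
line = "C:\\Users\\José\\project\\src\n"

# A tool writing with a Western European locale encoding (assumed: cp1252).
written = line.encode("cp1252")

# A reader that expects UTF-8 would try to decode the same bytes as UTF-8.
try:
    written.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

print(readable_as_utf8)
```

ASCII-only paths are unaffected, since they encode to the same bytes either way. The problem only shows up once a directory name contains something outside ASCII.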

@zooba do you have any advice on how to handle this?

Footnotes

  1. According to the docs, locale.getpreferredencoding() has side effects and is not thread safe (it can be invoked with do_setlocale=False, but then the docs seem to indicate the function may fail to get the correct value). Not sure if side effects and thread safety are something packaging has to worry about...

@zooba
Member

zooba commented Aug 8, 2023

> Is this a valid bug, or is it considered a wontfix/not a bug at all?

It's valid, but unfortunately not fixable outside of changing the default encoding for the entire process.

> Considering that Python uses UTF-8 by default for source files, and that .pth files are kind of source files ...

Being able to include code in the files is the weird bit, and is likely to be removed well before the default encoding is changed. If it were meant to be a source file, it'd have a .py extension (like sitecustomize.py does).

> if at a later stage the users set PYTHONUTF8 or invoke python -X utf8, these files would stop working

Yep, which is why users have to opt-in to these options, and if it breaks then they need to opt-out or fix it themselves. Other tools don't need to be concerned about this scenario - just use the default encoding.

> do you have any advice on how to handle this?

Use a regular open(path, 'w') call to write to the file, suppressing a warning if needed.

Unfortunately, it seems site.py currently requests the locale encoding explicitly when reading, when it ought to be using the default. If it used the default, everyone would move to UTF-8 when that default changes, but as it stands it'll be stuck on locale encoding forever. So I guess the advice is to explicitly use locale encoding until we fix that.
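For a tool following that advice, a writer might look like the sketch below. The function name write_pth_locale is hypothetical. The only assumption about Python itself is that open() accepts encoding="locale" from 3.10 on, which avoids an EncodingWarning without changing which encoding is used:

```python
import locale
import sys


def write_pth_locale(path, entries):
    """Write sys.path entries to a .pth file using the locale encoding."""
    if sys.version_info >= (3, 10):
        # Explicit "locale" is accepted by open() since Python 3.10 and
        # does not trigger EncodingWarning.
        encoding = "locale"
    else:
        encoding = locale.getpreferredencoding(False)
    with open(path, "w", encoding=encoding) as f:
        for entry in entries:
            f.write(entry + "\n")
```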

@abravalheri

Thank you very much @zooba, I am working on pypa/setuptools#4009 trying to follow this recommendation.

@methane
Member

methane commented Apr 12, 2024

Sorry for the delay. I believe .pth file encoding should follow UTF-8 mode.

When I implemented 4827483, I used "locale" encoding because setuptools used locale encoding.
Now setuptools uses locale encoding because I used locale encoding. This is really sad.

What do you think of this, @abravalheri?

  • Since Python 3.13, we use only UTF-8 for pth files.
  • For Python 3.11/3.12, follow UTF-8 mode. (e.g. encoding="utf-8" if sys.flags.utf8_mode else "locale")

@vstinner
Member

Can we try to decode from UTF-8, and fallback on locale on decode error?

@methane
Member

methane commented Apr 12, 2024

Yes. .pth files are small in most cases.
We can use StringIO instead of TextIOWrapper there.
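The idea of reading the whole file, trying UTF-8, and falling back to the locale encoding can be sketched as follows. This is an illustration only, not the code that ended up in site.py; read_pth_lines and its fallback parameter are made-up names:

```python
import io
import locale


def read_pth_lines(path, fallback=None):
    """Return the lines of a .pth file, preferring UTF-8 over the locale."""
    with open(path, "rb") as f:
        raw = f.read()  # .pth files are small, so read them whole
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        if fallback is None:
            fallback = locale.getpreferredencoding(False)
        text = raw.decode(fallback)
    # StringIO gives the same line-by-line iteration a text file would.
    return [line.rstrip("\r\n") for line in io.StringIO(text)]
```

Because the bytes are decoded in one go, the fallback decision is made once per file rather than in the middle of iterating over lines.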

@abravalheri

abravalheri commented Apr 12, 2024

> Can we try to decode from UTF-8, and fallback on locale on decode error?

I think this is a good idea, for the following reason: the creation of the .pth file is independent of running the .pth file (i.e. setuptools creates the file and CPython runs it). So it is difficult to "match" the specific version of setuptools and CPython and the different levels of support (e.g. old setuptools + new CPython; new setuptools + old CPython), especially because the user can do any arbitrary combination here.

>   • Since Python 3.13, we use only UTF-8 for pth files.
>   • For Python 3.11/3.12, follow UTF-8 mode. (e.g. encoding="utf-8" if sys.flags.utf8_mode else "locale")

Assuming that we are talking about editable installations [1], my comments on the suggestions are the following:

  1. Since Python 3.13, we use only UTF-8 for pth files

    That is easy for setuptools to handle.
    We can check for sys.version_info >= (3, 13) as an indicator of the capability of processing UTF-8 .pth files.

  2. For Python 3.11/3.12, follow UTF-8 mode

    What would the suggested capability check be? Something like the following?

    if (
       sys.version_info >= (3, 11, X) and sys.version_info < (3, 12)
       or sys.version_info >= (3, 12, Y)
    ):

    This is a bit less trivial to maintain, since it requires keeping an eye on the release cadence of CPython.

Footnotes

  1. As far as I know that is the only non-deprecated use case that uses .pth, but I don't claim to understand 100% of setuptools/distutils 😅. The implication of "editable installations" is that the Python used during the build process is the same Python that will be used for the installation and runtime (as far as I understood from PEP 660).

@zooba
Member

zooba commented Apr 12, 2024

UTF-8 mode is a runtime option rather than a build option, so I don't know that you'd want to use it to create a persistent config file like this. It's not really reasonable to expect UTF-8 mode to be consistently on between install processes and application runtime.

You're probably just as well defaulting to UTF-8 all the time, and offer an obscure environment variable setting to switch back to 'locale' for those who actually hit problems (telling them to run in UTF-8 mode is also an option, but likely too big an ask if they're that stuck).

Maybe on 3.12 and earlier you could even try encoding to ASCII and print a warning that Python may not be able to load it?

The sooner we make it start reading UTF-8 and then fall back to locale, the better.
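A writer along those lines might look like the sketch below. Everything named here is hypothetical: EXAMPLE_PTH_ENCODING is a made-up environment variable, and write_pth and target_version are illustrative names rather than anything setuptools provides:

```python
import os
import warnings


def write_pth(path, entries, target_version):
    """Write a .pth file as UTF-8 unless the user opts back into locale."""
    # EXAMPLE_PTH_ENCODING is a made-up variable name for this sketch.
    encoding = os.environ.get("EXAMPLE_PTH_ENCODING", "utf-8")
    content = "".join(entry + "\n" for entry in entries)
    if encoding == "utf-8" and target_version < (3, 13) and not content.isascii():
        # Older interpreters may read the file with the locale encoding.
        warnings.warn(
            "non-ASCII .pth content: this Python may not be able to load it",
            stacklevel=2,
        )
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(content)
    return encoding
```

The ASCII check is cheap and catches exactly the case where UTF-8 and the locale encoding can disagree about the bytes.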

@pfmoore
Member

pfmoore commented Apr 12, 2024

It's worth noting that tools which use the editables library will also write .pth files for support of editable installations (editables returns the file name and content as strings, the calling tool needs to choose the encoding when writing the file). And people can manually write their own. I think the important thing here is to define a standard that tools can work to. Normally I'd suggest it should be a packaging standard, but .pth files are a core feature, so that might not work. I don't know where a rule like this, for a core Python feature that needs to be documented across versions, should be published - expecting users to check the 3.11, 3.12 and 3.13+ documentation for .pth files and work out the differences themselves seems rather hostile 🙁

From the perspective of a tool (or user) writing a .pth file, I think the rule should be "use UTF-8 for Python 3.13 or later, and the default locale for Python 3.12 and earlier". It's easy to know what the target Python is, as you're typically writing the file into the site-packages for that installation. It's too late to change the default for 3.12, so this is really the only possible option.
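Expressed as code, the writer-side rule is a single version comparison. The helper name pth_encoding_for is made up for this sketch, and the "locale" return value assumes it will be passed to open() on Python 3.10 or later:

```python
import sys


def pth_encoding_for(target_version=sys.version_info):
    """Pick the encoding for a .pth file written for `target_version`."""
    # Rule as proposed above: UTF-8 from 3.13 on, locale encoding before.
    if tuple(target_version[:2]) >= (3, 13):
        return "utf-8"
    # Note: open() only accepts the string "locale" on Python 3.10+.
    return "locale"
```

A tool would call this with the version of the interpreter that owns the target site-packages directory, which may differ from the interpreter running the tool.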

From the perspective of the core reading .pth files, I'd try UTF-8 first, and in the case of an encoding error (as @vstinner suggested) fall back to the default encoding. I'd be inclined to leave the fallback in place even in 3.13+ to support users still using older tools - but I could accept 3.13 applying the stricter "only UTF-8" rule.

I don't think we should bring UTF-8 mode into it, as that's a runtime choice (and .pth files are static, not affected by runtime settings).

@methane
Member

methane commented Apr 13, 2024

Locale encoding is also a runtime configuration. When you switch languages on Windows or change environment variables on Linux, it changes. UTF-8 Mode helps ensure that the virtual environment is not destroyed by changes in language or environment variables.

When I proposed PEP 597, I did not have plans to make UTF-8 Mode the default. site.py supported UTF-8 Mode up to Python 3.9, but when we implemented PEP 597, I stopped supporting it in order to eliminate EncodingWarning. I chose locale encoding because setuptools used it.

However, I realized that the feasibility of a migration process using EncodingWarning is very low. I proposed PEP 686, which is a migration process using UTF-8 Mode, and it was approved. In UTF-8 Mode, we should use UTF-8 in all places unless there is a special reason not to.

So I think supporting UTF-8 mode again in 3.11/3.12 is a good idea. It is compatible with 3.9. It provides a reason to opt in to UTF-8 mode before Python 3.15.

Another way to support UTF-8 .pth files is #117802. It tries UTF-8 and falls back to locale encoding. I'm OK with it too.

Anyway, Python is a glue language, and interoperability with other languages is important. I don't want to force the uv developers to use a huge number of codecs that behave exactly like Python's codecs just to handle locale encoding. I want to provide a way to use UTF-8 not only in Python 3.13 but also in Python 3.11 and 3.12.

@methane
Member

methane commented Apr 13, 2024

Maybe #117802 would be the better idea for now. setuptools uses locale encoding and uv uses UTF-8.
It would be better to support both UTF-8 and locale encoding in one mode.

@hauntsaninja
Contributor

hauntsaninja commented Apr 14, 2024

(possibly dumb question) uv doesn't provide a build backend, so when is it writing pth files?

edit: I guess uv venv writes to _virtualenv.pth, but that should be ascii

@pfmoore
Member

pfmoore commented Apr 14, 2024

uv venv writes a .pth file, as you noticed, but it's ASCII-only. That's the only time uv writes a .pth file itself (as far as I know) - any other .pth files will come from installing projects, and those projects will have their own build backend (specified in the project's pyproject.toml file).

The only difficulty from .pth files created in the install process is likely to come from editable mode installs, as has been discussed here. Those files include the absolute path of the project source directory, and that can be non-ASCII (for example, if it's in the user's home directory and the user has a username that includes non-ASCII characters).

There are other problem cases:

  1. .pth files created manually by the user, which can contain anything the user chooses.
  2. .pth files included in a project, which put subdirectories installed by the project onto sys.path. For example, pywin32 does this. But actually, the only example of this that I know of is pywin32, and that has ASCII names, so this might well be a non-issue.

For (1), we need to document the rules properly, and then it's up to the user to follow them - if the user wants to add a site-packages directory (via $PYTHONPATH, for instance) that is shared across Python versions, making that work is on them, I guess [1].

For (2), things are similar, as the same package could be installed in different Python versions. I doubt a project is going to want to publish different wheels for different Python versions, just to handle .pth file encodings. But as I said, it's quite possible that there are no packages that have this problem in any case.

Footnotes

  1. Supporting locale encoding as a fallback if there are UTF-8 errors until all Python versions that don't allow UTF-8 are out of support would help with this, though.

@methane
Member

methane commented Apr 15, 2024

> uv venv writes a .pth file, as you noticed, but it's ASCII-only. That's the only time uv writes a .pth file itself (as far as I know) - any other .pth files will come from installing projects, and those projects will have their own build backend (specified in the project's pyproject.toml file).

> The only difficulty from .pth files created in the install process is likely to come from editable mode installs, as has been discussed here. Those files include the absolute path of the project source directory, and that can be non-ASCII (for example, if it's in the user's home directory and the user has a username that includes non-ASCII characters).

You are right. I thought .pth files were created by uv pip install -e, but they are created by Hatch. Hatchling uses UTF-8 when building a wheel for an editable install.

Poetry uses locale.getpreferredencoding(), which returns UTF-8 in UTF-8 mode.

> For (1), we need to document the rules properly, and then it's up to the user to follow them - if the user wants to add a site-packages directory (via $PYTHONPATH, for instance) that is shared across Python versions, making that work is on them, I guess [1].

How about this plan?

  • Python 3.11~3.13: Use UTF-8 and fall back to locale encoding.
  • Python 3.14~3.15: Same as above, but show EncodingWarning when the fallback is used.
  • Python 3.16~: UTF-8 only.

@abravalheri

abravalheri commented Apr 15, 2024

> How about this plan?

> Python 3.11~3.13: Use UTF-8 and fall back to locale encoding.
> Python 3.14~3.15: Same as above, but show EncodingWarning when the fallback is used.
> Python 3.16~: UTF-8 only.

LGTM.

Would the check sys.version_info >= (3, 13) be good enough for libraries producing .pth to know when they can emit UTF-8? (Ideally there should be a simple and easy to maintain condition that producers can check to see when using UTF-8 is guaranteed to work for the given Python installation).

@methane
Member

methane commented Apr 15, 2024

> Would the check sys.version_info >= (3, 13) be good enough for libraries producing .pth to know when they can emit UTF-8? (Ideally there should be a simple and easy to maintain condition that producers can check to see when using UTF-8 is guaranteed to work for the given Python installation).

Yes. UTF-8 support will be backported to 3.11 and 3.12 only for tools that already produce UTF-8 .pth files.

@methane
Member

methane commented Apr 16, 2024

3.11 is in security-fix-only mode, so I backported this only to 3.12.

@methane methane closed this as completed Apr 16, 2024
@vstinner
Member

vstinner commented Apr 16, 2024

Thanks @methane, using UTF-8 or falling back on locale encoding is a nice tradeoff.

@ncoghlan
Contributor

GH-119503 made a small tweak to ignore UTF-8 BOMs when reading .pth files (this aligns with the way source files are decoded)
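The effect of ignoring a BOM can be seen with the standard "utf-8-sig" codec. This is a small sketch with a made-up path; it illustrates the codec's behaviour and does not claim to mirror the exact change in that PR:

```python
# Bytes of a .pth file saved by an editor that prepends a UTF-8 BOM.
raw = b"\xef\xbb\xbf/opt/project/src\n"

# Plain UTF-8 keeps the BOM as U+FEFF at the start of the first line.
with_bom = raw.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[0]), without_bom.splitlines())
```

Without BOM handling, the first entry would start with an invisible U+FEFF character and would not match any real directory.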
