Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set encoding for tar file and use unicode path for unpacking #7668

Closed
wants to merge 1 commit into from

Conversation

chrahunt
Copy link
Member

When tarfile.TarFile decodes filenames in Python 2.7 by default it uses
sys.getfilesystemencoding. On Windows this returns "mbcs", which is
lossy when converting from proper utf-8 to bytes (results in '?' for out
of range characters).

We now pass an encoding to tarfile.open which will be used instead.
Since the encoding argument is only ever used for the PAX format, and
since the PAX format guarantees utf-8 encoded information, this should
work in all circumstances.

For filesystem APIs in Python 2, the type of the path object passed
dictates the underlying Windows API that is called. For str it is the
*A (for ANSI) APIs. For unicode it is the *W (for Wide character)
APIs. To use the second set of APIs, which properly handles unicode
filenames, we convert the byte path to utf-8.

Fixes #7667.

When tarfile.TarFile decodes filenames in Python 2.7 by default it uses
sys.getfilesystemencoding. On Windows this returns "mbcs", which is
lossy when converting from proper utf-8 to bytes (results in '?' for out
of range characters).

We now pass an encoding to tarfile.open which will be used instead.
Since the encoding argument is only ever used for the PAX format, and
since the PAX format guarantees utf-8 encoded information, this should
work in all circumstances.

For filesystem APIs in Python 2, the type of the path object passed
dictates the underlying Windows API that is called. For `str` it is the
`*A` (for ANSI) APIs. For `unicode` it is the `*W` (for Wide character)
APIs. To use the second set of APIs, which properly handles unicode
filenames, we try to convert the byte path to utf-8. Since there is no
obvious way to identify a "PAX" tar file or tar info entity, we
optimistically try to do the conversion and silently continue if it
fails.
@ncoghlan
Copy link
Member

ncoghlan commented Jan 29, 2020

This should resolve the issue on Windows, but non-Windows systems have the opposite problem: if the locale uses a non-universal encoding (e.g. ascii), then they'll trigger an encoding error when attempting to open the Unicode path.

There's no obviously correct answer for what to do in that case. Failing loudly at install time at least highlights the potential filename corruption issue, but another reasonable alternative would be to try encoding with the locale encoding first, and fall back to utf-8 if that fails (emitting a warning that the filename may not be correctly encoded due to locale issues).

@pradyunsg
Copy link
Member

I suggest that we ignore this because it's a Python 2 specific issue, and there's are more impactful tasks for volunteers to work on, than this problem that's only occurring on EoL Python versions.

It's our documented policy that pip's maintainers won't necessarily be solving problems that are Python 2 only.

@uranusjr
Copy link
Member

According to #7667 this also affects Python < 3.7 on systems using less capable encodings (e.g. C)

@ncoghlan
Copy link
Member

(Back on a real computer rather than my phone)

The Windows aspect is genuinely fixed at the interpreter level, as of CPython 3.6 (the filesystem encoding is always UTF-8 instead of mbcs). The problem still exists in Python 3.5 as well as in 2.7 (Chris's patch fixes that, but at the risk of causing new problems on non-Windows systems).

For non-Windows systems, the problem exists by default in 3.5 and 3.6, and can still be induced in 3.7 or later by setting "LC_ALL=C" (since that will not only set a bad filesystem encoding, it will also turn off locale coercion). (I'm not actually sure what Python 2.7 will do, but I suspect it will just unconditionally pass the UTF-8 bytes from the PAX file to the local filesystem)

On non-Windows systems, the problem is amenable to an environment fix, which is "Ensure your installs run with a proper locale set". At the CPython level, we officially gave up on ever getting the C locale to actually work properly in a Unicode-centric world: https://www.python.org/dev/peps/pep-0011/#legacy-c-locale

So while I do think it would be nice to actually handle this at the pip level, it's also reasonable to tell people that hit these kinds of encoding issues to make sure that they at least set "LANG=C.UTF-8" in their build environments.

@pradyunsg
Copy link
Member

Closing since this is a Python-2 only fix, and, well... it's not been updated in quite a while!

@pradyunsg pradyunsg closed this Sep 23, 2020
@uranusjr
Copy link
Member

uranusjr commented Sep 24, 2020

I’d say this affects Python 3 as well (3.6 is supported for at least another year) since more and more people are running pip in containers, where LANG is usually set to C. Fixing this would save us from answering to support tickets asking this question, which IMO is worth it given the small changeset.

@uranusjr uranusjr reopened this Sep 24, 2020
@uranusjr uranusjr self-assigned this Sep 24, 2020
@uranusjr
Copy link
Member

Reopening so I don’t forget to do something. I’ll probably file another PR against master later today to replace this.

@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Feb 4, 2021
@pradyunsg
Copy link
Member

Closing as this is significantly out of date, and has merge conflicts. Please do feel free to resubmit an updated PR for this! ^>^

@pradyunsg pradyunsg closed this Mar 5, 2021
@uranusjr
Copy link
Member

uranusjr commented Mar 5, 2021

#9569 would cover this fix post-21.0.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs rebase or merge PR has conflicts with current master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filename encoding error in some environments with PAX sdist
5 participants