-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set encoding for tar file and use unicode path for unpacking #7668
Conversation
When tarfile.TarFile decodes filenames in Python 2.7 by default it uses sys.getfilesystemencoding. On Windows this returns "mbcs", which is lossy when converting from proper utf-8 to bytes (results in '?' for out of range characters). We now pass an encoding to tarfile.open which will be used instead. Since the encoding argument is only ever used for the PAX format, and since the PAX format guarantees utf-8 encoded information, this should work in all circumstances. For filesystem APIs in Python 2, the type of the path object passed dictates the underlying Windows API that is called. For `str` it is the `*A` (for ANSI) APIs. For `unicode` it is the `*W` (for Wide character) APIs. To use the second set of APIs, which properly handles unicode filenames, we try to convert the byte path to utf-8. Since there is no obvious way to identify a "PAX" tar file or tar info entity, we optimistically try to do the conversion and silently continue if it fails.
85f95d7
to
8e59ad3
Compare
This should resolve the issue on Windows, but non-Windows systems have the opposite problem: if the locale uses a non-universal encoding (e.g. ascii), then they'll trigger an encoding error when attempting to open the Unicode path. There's no obviously correct answer for what to do in that case. Failing loudly at install time at least highlights the potential filename corruption issue, but another reasonable alternative would be to try encoding with the locale encoding first, and fall back to utf-8 if that fails (emitting a warning that the filename may not be correctly encoded due to locale issues). |
I suggest that we ignore this because it's a Python 2 specific issue, and there's are more impactful tasks for volunteers to work on, than this problem that's only occurring on EoL Python versions. It's our documented policy that pip's maintainers won't necessarily be solving problems that are Python 2 only. |
According to #7667 this also affects Python < 3.7 on systems using less capable encodings (e.g. |
(Back on a real computer rather than my phone) The Windows aspect is genuinely fixed at the interpreter level, as of CPython 3.6 (the filesystem encoding is always UTF-8 instead of mbcs). The problem still exists in Python 3.5 as well as in 2.7 (Chris's patch fixes that, but at the risk of causing new problems on non-Windows systems). For non-Windows systems, the problem exists by default in 3.5 and 3.6, and can still be induced in 3.7 or later by setting "LC_ALL=C" (since that will not only set a bad filesystem encoding, it will also turn off locale coercion). (I'm not actually sure what Python 2.7 will do, but I suspect it will just unconditionally pass the UTF-8 bytes from the PAX file to the local filesystem) On non-Windows systems, the problem is amenable to an environment fix, which is "Ensure your installs run with a proper locale set". At the CPython level, we officially gave up on ever getting the C locale to actually work properly in a Unicode-centric world: https://www.python.org/dev/peps/pep-0011/#legacy-c-locale So while I do think it would be nice to actually handle this at the pip level, it's also reasonable to tell people that hit these kinds of encoding issues to make sure that they at least set "LANG=C.UTF-8" in their build environments. |
Closing since this is a Python-2 only fix, and, well... it's not been updated in quite a while! |
I’d say this affects Python 3 as well (3.6 is supported for at least another year) since more and more people are running pip in containers, where |
Reopening so I don’t forget to do something. I’ll probably file another PR against master later today to replace this. |
Hello! I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the |
Closing as this is significantly out of date, and has merge conflicts. Please do feel free to resubmit an updated PR for this! ^>^ |
#9569 would cover this fix post-21.0. |
When tarfile.TarFile decodes filenames in Python 2.7 by default it uses
sys.getfilesystemencoding. On Windows this returns "mbcs", which is
lossy when converting from proper utf-8 to bytes (results in '?' for out
of range characters).
We now pass an encoding to tarfile.open which will be used instead.
Since the encoding argument is only ever used for the PAX format, and
since the PAX format guarantees utf-8 encoded information, this should
work in all circumstances.
For filesystem APIs in Python 2, the type of the path object passed
dictates the underlying Windows API that is called. For
str
it is the*A
(for ANSI) APIs. Forunicode
it is the*W
(for Wide character)APIs. To use the second set of APIs, which properly handles unicode
filenames, we convert the byte path to utf-8.
Fixes #7667.