Set encoding for tar file and use unicode path for unpacking
When tarfile.TarFile decodes filenames in Python 2.7, it uses sys.getfilesystemencoding by default. On Windows this returns "mbcs", which is lossy when converting from proper utf-8 to bytes (out-of-range characters become '?'). We now pass an encoding to tarfile.open, which is used instead. Since the encoding argument is only ever used for the PAX format, and since the PAX format guarantees utf-8 encoded information, this should work in all circumstances.

For filesystem APIs in Python 2, the type of the path object passed dictates the underlying Windows API that is called. For `str` it is the `*A` (for ANSI) APIs. For `unicode` it is the `*W` (for Wide character) APIs. To use the second set of APIs, which properly handle unicode filenames, we try to decode the byte path from utf-8 into a `unicode` object. Since there is no obvious way to identify a "PAX" tar file or tar info entity, we optimistically try the conversion and silently continue if it fails.
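The two steps described in the message can be sketched roughly as follows. This is an illustration only, not the code from this commit: the helper names `to_text_path`, `build_pax_tar`, and `list_member_names` are hypothetical, and the archive is built in memory just to keep the example self-contained. It assumes that passing `encoding="utf-8"` to `tarfile.open` and falling back silently on a failed decode is the behavior being described.

```python
import io
import tarfile


def to_text_path(path, encoding="utf-8"):
    """Optimistically decode a byte path to text.

    Hypothetical helper: if the bytes are not valid utf-8, the original
    value is returned unchanged, mirroring "silently continue if it fails".
    """
    if isinstance(path, bytes):
        try:
            return path.decode(encoding)
        except UnicodeDecodeError:
            return path
    return path


def build_pax_tar(member_name, payload):
    """Build a small PAX-format tar in memory (for demonstration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_member_names(tar_bytes):
    """Open the tar with an explicit encoding instead of the platform default."""
    with tarfile.open(
        fileobj=io.BytesIO(tar_bytes), mode="r", encoding="utf-8"
    ) as tar:
        return [to_text_path(member.name) for member in tar.getmembers()]


if __name__ == "__main__":
    data = build_pax_tar(u"dir/\u00e9.txt", b"hello")
    print(list_member_names(data))
```

On Python 3 member names are already text, so `to_text_path` is effectively a no-op there; the decode step matters on Python 2, where a `str` path would otherwise route through the `*A` APIs on Windows.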
Showing 3 changed files with 35 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Fix extraction of files with utf-8 encoded paths from tars. |