Pants fails to create chroot from packages containing UTF-8 encoded files when default encoding is non-UTF-8. #3823

Closed
jsirois opened this issue Aug 29, 2016 · 10 comments

@jsirois
Contributor

jsirois commented Aug 29, 2016

Saw this in an Aurora RB: https://reviews.apache.org/r/51499/

Repro'd via the following with a few prints added to the pex code:

LANG=en_US.ISO-8859-1 ./pants ...
...
17:24:18 00:01         [chroot]Unpacking package CherryPy...
Extracting /tmp/tmp8LEeNF/CherryPy-7.1.0.zip...

17:24:26 00:09   [complete]
               FAILURE
Exception caught: (<type 'exceptions.UnicodeEncodeError'>)
...

And, in fact:

$ DIR=$(mktemp -d) && unzip -qd $DIR desktop/CherryPy-7.1.0.zip && find $DIR -type f | xargs file | grep UTF-8
/tmp/tmp.LmrsVWoAVl/CherryPy-7.1.0/cherrypy/test/test_static.py:                          C++ source, UTF-8 Unicode text
/tmp/tmp.LmrsVWoAVl/CherryPy-7.1.0/cherrypy/test/test_encoding.py:                        C++ source, UTF-8 Unicode text
/tmp/tmp.LmrsVWoAVl/CherryPy-7.1.0/cherrypy/test/test_core.py:                            C++ source, UTF-8 Unicode text
/tmp/tmp.LmrsVWoAVl/CherryPy-7.1.0/CherryPy.egg-info/SOURCES.txt:                         UTF-8 Unicode text
@jsirois
Contributor Author

jsirois commented Sep 1, 2016

OK, using the following script:

#!/usr/bin/env python2
from __future__ import print_function

import codecs
import errno
import os
import tempfile
import zipfile


target = tempfile.mkdtemp()
print('Extracting zfile entry by entry to {}'.format(target))


codec = codecs.lookup('utf-8')
with zipfile.ZipFile('CherryPy-7.1.0.zip') as zfile:
  for info in zfile.infolist():
    path, _ = codec.encode(info.filename)
    if not path.endswith(b'/'):
      rel_dir = os.path.dirname(path)
      abs_dir = os.path.join(target, rel_dir)
      abs_path = os.path.join(abs_dir, os.path.basename(path))
      try:
        os.makedirs(abs_dir)
      except os.error as e:
        if e.errno != errno.EEXIST:
          raise e
      with open(abs_path, 'wb') as tfp:
        tfp.write(zfile.read(info))


target = tempfile.mkdtemp()
print('Extracting zfile using extractall to {}'.format(target))


with zipfile.ZipFile('CherryPy-7.1.0.zip') as zfile:
  zfile.extractall(target)

I find on my default UTF-8 encoding machine:

$ LANG=en_US.ISO-8859-1 ./test.py
Extracting zfile entry by entry to /tmp/tmpQR7sFT
Extracting zfile using extractall to /tmp/tmpXKYKSn
Traceback (most recent call last):
  File "./test.py", line 37, in <module>
    zfile.extractall(target)
  File "/usr/lib/python2.7/zipfile.py", line 1040, in extractall
    self.extract(zipinfo, path, pwd)
  File "/usr/lib/python2.7/zipfile.py", line 1028, in extract
    return self._extract_member(member, path, pwd)
  File "/usr/lib/python2.7/zipfile.py", line 1083, in _extract_member
    file(targetpath, "wb") as target:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 51-55: ordinal not in range(256)

So the entry-by-entry extraction might be the way to go over in pex. Thinking on this a bit before filing an issue over there, but maybe @kwlzn can spot the problem with doing this more quickly than I.
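
A minimal sketch of why `extractall` fails while the entry-by-entry loop succeeds, assuming the archive contains an entry whose name has characters outside Latin-1. The name below is hypothetical and is not taken from the CherryPy archive. In Python 2, opening a `unicode` path encodes it with the filesystem encoding, whereas the loop above encodes each name to UTF-8 bytes itself before touching the filesystem:

```python
from __future__ import print_function

# Hypothetical zip entry name containing Cyrillic characters (an assumption
# for illustration; the real offending entry is not shown in this thread).
name = u'cherrypy/test/static/\u0420\u0443\u0441.html'


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 can represent any Unicode name, so explicit encoding always works.
print('utf-8  :', can_encode(name, 'utf-8'))
# Latin-1 only covers code points below 256, so this is the failing case
# when the filesystem encoding is derived from an ISO-8859-1 locale.
print('latin-1:', can_encode(name, 'latin-1'))
```

This only illustrates the encoding step; it does not exercise `zipfile` itself.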

@kwlzn
Member

kwlzn commented Sep 1, 2016

seems reasonable to me. tho it's also worth noting that I don't seem to repro the failure on OSX (10.11):

[illuminati 1]$ echo $LANG
en_US.UTF-8
[illuminati 1]$ DIR=$(mktemp -d) && unzip -qd $DIR ./CherryPy-7.1.0.zip && find $DIR -type f | xargs file | grep UTF-8
/var/folders/3t/xkwqrkld4xxgklk2s4n41jb80000gn/T/tmp.S91td9u2/CherryPy-7.1.0/cherrypy/test/test_core.py:                            UTF-8 Unicode English text
/var/folders/3t/xkwqrkld4xxgklk2s4n41jb80000gn/T/tmp.S91td9u2/CherryPy-7.1.0/cherrypy/test/test_encoding.py:                        UTF-8 Unicode Java program text
/var/folders/3t/xkwqrkld4xxgklk2s4n41jb80000gn/T/tmp.S91td9u2/CherryPy-7.1.0/cherrypy/test/test_static.py:                          UTF-8 Unicode Java program text
/var/folders/3t/xkwqrkld4xxgklk2s4n41jb80000gn/T/tmp.S91td9u2/CherryPy-7.1.0/CherryPy.egg-info/SOURCES.txt:                         UTF-8 Unicode text
[illuminati 1]$ LANG=en_US.ISO-8859-1 python2.7 test.py
Extracting zfile entry by entry to /var/folders/3t/xkwqrkld4xxgklk2s4n41jb80000gn/T/tmpMRVF5J
Extracting zfile using extractall to /var/folders/3t/xkwqrkld4xxgklk2s4n41jb80000gn/T/tmpDznVte
[illuminati 1]$

I wonder what's different between our envs?

@jsirois
Contributor Author

jsirois commented Sep 1, 2016

See my LANG=... CMD line prefix


@kwlzn
Member

kwlzn commented Sep 1, 2016

yup - present in my paste.

@jsirois
Contributor Author

jsirois commented Sep 1, 2016

Aha - not sure, but the environment that spawned this ticket from Aurora CI is Linux.

@jsirois
Contributor Author

jsirois commented Sep 2, 2016

Aha: https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding
So this is a non-OSX-specific problem for a change.
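
For comparing environments, a small sketch that prints the two encodings in play. As the linked docs describe it, the filesystem encoding is always 'utf-8' on Mac OS X, while on other Unix systems it follows the locale, which would explain why the `LANG=en_US.ISO-8859-1` prefix only changes behavior on Linux. The helper name `describe_encodings` is made up for this example:

```python
from __future__ import print_function

import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses for file names and for the locale."""
    return {
        # Encoding applied to unicode file names passed to open(), os.listdir(), etc.
        'filesystem': sys.getfilesystemencoding(),
        # Encoding implied by the current locale settings (LANG, LC_ALL, ...).
        'preferred': locale.getpreferredencoding(False),
    }


for key, value in sorted(describe_encodings().items()):
    print('{}: {}'.format(key, value))
```

Running it once with and once without the `LANG=en_US.ISO-8859-1` prefix on each machine should show whether the filesystem encoding actually changes there.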

@jsirois
Contributor Author

jsirois commented Sep 2, 2016

Noting that CherryPy has "fixed" the motivating case in 8.0.0: cherrypy/cherrypy@b8e2518

This solves the motivating error; see Aurora RB: https://reviews.apache.org/r/51615/

@benjyw
Contributor

benjyw commented Feb 14, 2017

To work around this in a session:

LANG=en_US.utf8 ./pants ...

But note that you may first need to:

locale-gen en_US.UTF-8
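
To tell whether the `locale-gen` step is needed, one option is to ask Python whether the locale can be activated. This is a sketch only; `locale_available` is a hypothetical helper, not part of Pants or Pex:

```python
from __future__ import print_function

import locale


def locale_available(name):
    """Return True if the C library accepts `name` as an LC_CTYPE locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the locale has not been generated or installed.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the original setting


print('en_US.UTF-8 available:', locale_available('en_US.UTF-8'))
```

If this prints `False`, setting `LANG=en_US.utf8` alone is unlikely to help until the locale has been generated.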

@Eric-Arellano
Contributor

Likely fixed by the Python 3 migration.

@jsirois
Contributor Author

jsirois commented Oct 1, 2024

> Likely fixed by the Python 3 migration.

This was certainly never fixed by Pants' own internal code migration since, in May 2020, Pants was already calling Pex in a sub-process and that sub-process could be running a Python 2.7 interpreter. In fact, to this day, FWICT, Pants supports Python 2.7 since Pex does; so this continues to be a bug for the ~0 Pants users using Python 2.7.

All that said, the issue could now be closed for real here once Pants upgrades to Pex 2.20.2, which fixed this issue: https://github.com/pex-tool/pex/releases/tag/v2.20.2

4 participants