BUG: UnicodeEncodeError in test_to_latex_filename (pandas.tests.test_format.TestDataFrameFormatting) #12337

Closed
dhomeier opened this issue Feb 15, 2016 · 17 comments
Labels
Bug IO LaTeX to_latex Unicode Unicode strings Unreliable Test Unit tests that occasionally fail

Comments

@dhomeier

I get this error if (and I think only if) LANG is undefined or not set to a UTF-8 locale, on 0.18.0rc1 (Mac OS X 10.10, python 3.4.4, numpy 0.11.0b3):

ERROR: test_to_latex_filename (pandas.tests.test_format.TestDataFrameFormatting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch.noindex/fink.build/pandas-py34-0.18.0rc1-1/pandas-0.18.0rc1/pandas/tests/test_format.py", line 2614, in test_to_latex_filename
    df.to_latex(path)
  File "/scratch.noindex/fink.build/pandas-py34-0.18.0rc1-1/pandas-0.18.0rc1/pandas/core/frame.py", line 1593, in to_latex
    encoding=encoding)
  File "/scratch.noindex/fink.build/pandas-py34-0.18.0rc1-1/pandas-0.18.0rc1/pandas/core/format.py", line 641, in to_latex
    latex_renderer.write_result(f)
  File "/scratch.noindex/fink.build/pandas-py34-0.18.0rc1-1/pandas-0.18.0rc1/pandas/core/format.py", line 877, in write_result
    buf.write(' & '.join(crow))
UnicodeEncodeError: 'ascii' codec can't encode character '\xdf' in position 7: ordinal not in range(128)

Don't know about the inner workings of this test; if there are any open files involved, this might be relevant: scipy/scipy#5694

@jreback
Contributor

jreback commented Feb 16, 2016

Hmm, I can repro this, but only on Mac (using 3.5 and the latest numpy, but I don't think numpy matters).

cc @nbonnotte

@jreback jreback added Bug IO LaTeX to_latex labels Feb 16, 2016
@jreback jreback added this to the 0.18.0 milestone Feb 16, 2016
@jreback
Contributor

jreback commented Feb 16, 2016

@dhomeier, feel free to experiment with this one; I'm not really sure what the issue is.

@dhomeier
Author

The problem seems to me to be this, at test_format.py:2611:

        # test with utf-8 without encoding option
        if compat.PY3:  # python3 default encoding is utf-8

I believe this is not correct; the default encoding is whatever is specified by the LANG (or possibly one of the LC_*) environment variable. If that's not set it falls back to 'ascii'. That's at least what the docs for the builtin open() state:

    In text mode, if encoding is not specified the encoding used is platform
    dependent: locale.getpreferredencoding(False) is called to get the
    current locale encoding.
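
To make the point above concrete, here is a minimal sketch (not taken from pandas) that writes the same kind of row to a text-mode file. Passing encoding="ascii" explicitly stands in for an environment where LANG is unset; the helper name write_text and the sample row are assumptions for illustration only.

```python
import locale
import os
import tempfile


def write_text(text, encoding=None):
    """Write text to a temp file opened in text mode; return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".tex")
    os.close(fd)
    try:
        # encoding=None means open() picks the locale-dependent default.
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


row = "a & stra\xdfe"  # contains the '\xdf' character from the traceback
print("locale default:", locale.getpreferredencoding(False))
print(write_text(row, encoding="utf-8"))
try:
    write_text(row, encoding="ascii")  # simulates an ASCII-only locale
except UnicodeEncodeError as exc:
    print("ascii failed:", exc)
```

Calling write_text(row) with no encoding would succeed or fail depending on the environment, which is the behaviour reported here.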

@nbonnotte
Contributor

Uh, I did write that part, but what I meant was "the default for pandas in a python3 environment is utf8", not "the default in python3 is utf8". This was to be consistent with to_csv:

encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

It would be interesting to compare to_latex and to_csv, because I don't see any reason why it should work in one case and not in another. I may have missed something, I'll have a look.

@nbonnotte
Contributor

Uh, I can't repro this, even though I'm on Mac OS X 10.10, python 3.5.1, numpy 1.10.4, and my LANG is empty.

@dhomeier
Author

So is the default supposed to be initialised somewhere in the pandas setup independently of the system settings? Otherwise, do you have LC_ALL set? What do you get for this?

>>> import locale
>>> locale.getdefaultlocale()
(None, None)
>>> locale.getpreferredencoding(False)
'US-ASCII'

@jreback
Contributor

jreback commented Feb 16, 2016

This gets the pandas encoding.

In [1]: pd.get_option('display.encoding')
Out[1]: 'UTF-8'

I think this is correct in this case; rather, it's the comparison that's the issue. You need to do it the way it's done above:

            with codecs.open(path, 'r', encoding='utf-8') as f:
                self.assertEqual(df.to_latex(), f.read())

@nbonnotte
Contributor

Hmm, I think I did overlook the default behaviour in Python 3, as pointed out by @dhomeier.

See this line: the parameter encoding can be None, in which case we go for the default python behavior (which depends on the locale), but both the documentation and the tests expect the default pandas behavior (i.e. UTF-8).

A simple solution could be to replace encoding=None with either 'ascii' or 'utf-8', depending on the version of Python being used. Although I don't much like hardcoding it that way...

@dhomeier
Author

AFAICS to_csv ultimately gets the default from UnicodeWriter, which sets the default encoding to "utf-8" regardless of the Python version.
I thought to_latex would get it from codecs.open(), which is supposed to use sys.getdefaultencoding().
But this indeed returns 'ascii' in Python 2.7, and 'utf-8' in Python 3, regardless of the environment setting. Just added this check to make sure:

        # test with utf-8 without encoding option
        if compat.PY3:  # python3 default encoding is utf-8
            self.assertEqual(sys.getdefaultencoding(), 'utf-8')
            with tm.ensure_clean('test.tex') as path:
                df.to_latex(path)
                with codecs.open(path, 'r') as f:
                    self.assertEqual(df.to_latex(), f.read())

It's still throwing the error in

            with codecs.open(self.buf, 'w', encoding=encoding) as f:
                latex_renderer.write_result(f)

so maybe the codecs.open() encoding is not passed on to LatexFormatter.write_result(f) (should it be?)...
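
The two "defaults" mentioned so far are separate settings, and a small sketch can print them side by side. This assumes Python 3; the function name default_encodings is made up for the example. As far as I can tell from the codecs source, codecs.open() with encoding=None just returns the file from the built-in open(), so it would follow the locale value, not sys.getdefaultencoding().

```python
import locale
import sys


def default_encodings():
    """Return the interpreter-wide and the locale-derived default encodings."""
    return {
        # Fixed at 'utf-8' on Python 3, independent of LANG / LC_*.
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        # Derived from the environment; normally what open() uses when
        # no encoding argument is given.
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
    }


for name, value in default_encodings().items():
    print(name, "->", value)
```

If the two values differ, an assertion on sys.getdefaultencoding() can pass while the write still fails.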

@dhomeier
Author

@nbonnotte, you could perhaps explicitly call sys.getdefaultencoding() in to_latex() if encoding is None. Though I am wondering now if it makes sense to allow non-ASCII characters in LaTeX output; might still depend on your TeX installation if they are accepted? But for consistency with to_csv one should perhaps use the same as there (or have both resort to sys.getdefaultencoding()).
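
A rough sketch of that suggestion, with the read-back comparison done under the same explicit encoding. write_rows and round_trip are hypothetical stand-ins, not the pandas implementation; write_rows only joins cells with ' & ' the way the line in the traceback does.

```python
import os
import sys
import tempfile


def write_rows(path, rows, encoding=None):
    """Hypothetical stand-in for the writer; returns the encoding actually used."""
    if encoding is None:
        # Assumption: fall back to the interpreter default, not the locale.
        encoding = sys.getdefaultencoding()
    with open(path, "w", encoding=encoding) as f:
        for row in rows:
            f.write(" & ".join(row) + "\n")
    return encoding


def round_trip(rows):
    """Write rows with the resolved default and read them back the same way."""
    fd, path = tempfile.mkstemp(suffix=".tex")
    os.close(fd)
    try:
        used = write_rows(path, rows)
        with open(path, "r", encoding=used) as f:
            return f.read()
    finally:
        os.remove(path)


# ascii() keeps the printed result safe on an ASCII-only terminal.
print(ascii(round_trip([["a", "stra\xdfe"], ["b", "caf\xe9"]])))
```

Because the encoding is resolved once and passed explicitly to both open() calls, the result no longer depends on LANG or LC_ALL.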

@nbonnotte
Contributor

@dhomeier Of course we want non-ascii characters in LaTeX, they're handled by the inputenc package (or directly by XeLaTeX).

Explicitly calling sys.getdefaultencoding() seems a good idea to me, if that's not what codecs.open looks for implicitly. Can you do a PR?

@dhomeier
Author

Yes, I may have misread the codecs docstring; for codecs.open it does not explicitly state any default for the encoding. Should we use sys.getdefaultencoding() then or pd.get_option('display.encoding') as put forth by @jreback? And for to_latex only or the same for to_csv? The latter would default to csv.writer with encoding=None, which again would accept unicode characters in Python 3, but not 2; but I don't see any tests for this.

@jreback
Contributor

jreback commented Feb 16, 2016

Ideally you could do something to make this fail on Travis (as it is currently). I suspect that in one of the alternate encoding builds (where we override the LOCALE), we may need to set some other variable to get this to fail. That way you can test whether a fix works.

It may be that we need to set an alternate py3 build to use LOCALE (e.g. you can do this with the 3.4 slow build)
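
One possible way to check what such a build would see is to ask a child interpreter for its locale encoding under a forced LC_ALL. This is only a sketch and the helper name preferred_encoding_under is invented. It assumes a POSIX-like system where LC_ALL is honoured. It also sets PYTHONCOERCECLOCALE=0 and PYTHONUTF8=0, because newer Python 3 releases may otherwise coerce the C locale or switch to UTF-8 mode and hide the problem; older interpreters ignore these variables.

```python
import os
import subprocess
import sys


def preferred_encoding_under(locale_name):
    """Ask a child interpreter for its locale encoding with LC_ALL=locale_name."""
    env = dict(os.environ, LC_ALL=locale_name)
    # Assumption: turn off C-locale coercion and UTF-8 mode so the child
    # behaves like the Python 3.4/3.5 interpreters in the reports above.
    env["PYTHONCOERCECLOCALE"] = "0"
    env["PYTHONUTF8"] = "0"
    code = "import locale; print(locale.getpreferredencoding(False))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


print("LC_ALL=C ->", preferred_encoding_under("C"))
```

If this reports an ASCII encoding, running the test under the same environment should reproduce the failure.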

@dhomeier
Author

Not sure I understand what you intend - add tests (for to_latex and to_csv?) that will fail if no LOCALE is set in the environment?

@jreback jreback modified the milestones: 0.18.1, 0.18.0 Feb 21, 2016
@yarikoptic
Contributor

FWIW -- running into the same issue while building the package for 0.18.0-114-g6c692ae on debian sid. Will skip this test for now

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@0-wiz-0

0-wiz-0 commented Aug 19, 2016

I see this too when running the tests on NetBSD with the default LC_ALL=C (which is ASCII) and python-3.5.2.

jreback pushed a commit that referenced this issue Sep 10, 2016
xref #12337

Author: Nicolas Bonnotte

Closes #14114 from nbonnotte/unicode-to_latex-12337 and squashes the following commits:

dadf73c [Nicolas Bonnotte] New tentative with C locale
b876296 [Nicolas Bonnotte] Base matrix configuration
c825f86 [Nicolas Bonnotte] New files requirements-3.5_ASCII.*
3b4c6a5 [Nicolas Bonnotte] Travis conf: new test with python 3.5 and LC_ALL=C
3b859ce [Nicolas Bonnotte] Test for Python 3.4 with C locale
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 Sep 12, 2016
@jreback jreback modified the milestones: 0.19.0, 0.19.1 Sep 28, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.1 Oct 22, 2016
@jreback jreback modified the milestones: 0.20.0, 0.21.0, Next Major Release Mar 23, 2017
@jbrockmendel jbrockmendel added the Unreliable Test Unit tests that occasionally fail label Dec 19, 2019
@mroeschke
Member

Since we don't support Python versions less than 3.6.1 and the CI hasn't had issues with this test, I imagine this is no longer an issue. Happy to reopen if anyone else experiences issues.

8 participants