Issue with Unicode in Python 2 #29

goerlitz · 2018-03-17T21:44:04Z

I'm trying to annotate some Unicode strings, but the following examples throw errors.

Case 1: Passing Unicode strings.

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000/')
text = u'Köln is a city in Germany.'
print(nlp.annotate(text))

throws

AssertionError                            Traceback (most recent call last)
/home/jovyan/work/python/pycorenlp/corenlp.py in annotate(self, text, properties)
---> 11         assert isinstance(text, str)

AssertionError:

because it's a string of type 'unicode' in Python 2.
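
One way the check on line 11 could tolerate both Python versions is to test against a version-dependent tuple of text types. This is only a sketch: `TEXT_TYPES` and `is_text` are hypothetical names, not part of pycorenlp.

```python
import sys

# Hypothetical names for illustration; pycorenlp does not define them.
if sys.version_info[0] >= 3:
    TEXT_TYPES = (str,)            # Python 3: str is already Unicode text
else:
    TEXT_TYPES = (str, unicode)    # Python 2: accept byte strings and unicode objects


def is_text(value):
    """Return True if value is a text type on the running interpreter."""
    return isinstance(value, TEXT_TYPES)


# The escape \xf6 stands for the o-umlaut, so the file needs no coding declaration.
assert is_text(u'K\xf6ln is a city in Germany.')
```

The `unicode` name is only evaluated on Python 2, so the module still imports cleanly on Python 3.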

Case 2: Passing encoded Unicode strings:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000/')
text = u'Köln is a city in Germany.'.encode('utf-8')
print(nlp.annotate(text))

throws

UnicodeDecodeError                        Traceback (most recent call last)
/home/jovyan/work/python/pycorenlp/corenlp.py in annotate(self, text, properties)
---> 25         data = text.encode()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

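because the string has already been encoded. In Python 2, calling encode() on a byte string first decodes it implicitly with the ASCII codec, and that decode fails on the non-ASCII byte 0xc3.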
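
The decode failure can be reproduced without a CoreNLP server. The sketch below performs explicitly the ASCII decode that Python 2 does implicitly when encode() is called on a byte string; `ascii_decode_error` is a hypothetical helper name used only for illustration.

```python
def ascii_decode_error(raw_bytes):
    """Return the error message from decoding raw_bytes as ASCII, or None if it succeeds.

    Hypothetical helper: it mimics the implicit decode step that Python 2
    runs before str.encode() on a byte string.
    """
    try:
        raw_bytes.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# The UTF-8 encoding of o-umlaut is the two bytes 0xc3 0xb6.
encoded = u'K\xf6ln is a city in Germany.'.encode('utf-8')
print(ascii_decode_error(encoded))
```

Pure ASCII input decodes without error, which is why the problem only appears once the text contains a character such as the o-umlaut.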

These two lines of code in the error messages were both introduced in #6 in May 2016 to fix some Unicode issues.
~~However, it seems the explicit encoding in line 25 is not required anymore, because if it is removed, case 2 works perfectly (both in Python 2 and Python 3).~~

~~Note also that encoding issues were fixed in CoreNLP in October 2016 (stanfordnlp/CoreNLP#270).~~

goerlitz · 2018-03-17T23:19:16Z

I have to correct myself. The character encoding in Python 3 breaks again if text.encode() is removed. So this string problem seems to be caused by one of the incompatible changes in Python 3.

I found out that a different wrapper implementation uses this piece of code to fix the issue:

if sys.version_info.major >= 3:
    text = text.encode('utf-8')
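
A version-independent variant of the same idea is sketched below. It is not the code from that wrapper or from pycorenlp: `to_utf8_bytes` is a hypothetical helper, and it assumes that any byte string passed in is already UTF-8 encoded.

```python
def to_utf8_bytes(text):
    """Return text as UTF-8 bytes on both Python 2 and Python 3.

    Hypothetical helper. Assumption: byte strings are already UTF-8.
    """
    # On Python 2, bytes is an alias for str, so this branch also covers case 2.
    if isinstance(text, bytes):
        return text
    # Python 2 unicode or Python 3 str: encode explicitly instead of relying
    # on the interpreter's default codec.
    return text.encode('utf-8')


data = to_utf8_bytes(u'K\xf6ln is a city in Germany.')
```

Branching on the value's type rather than on `sys.version_info` means Unicode objects in Python 2 are encoded as well, while already-encoded input passes through untouched.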

goerlitz mentioned this issue Mar 18, 2018

Fixed Unicode string encoding issues for python 2/3. #30

Open
