Issue with Unicode in Python 2 #29

Open
goerlitz opened this issue Mar 17, 2018 · 1 comment

goerlitz commented Mar 17, 2018

I'm trying to annotate some Unicode strings, but the following examples throw errors.

Case 1: Passing Unicode strings.

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000/')
text = u'Köln is a city in Germany.'
print(nlp.annotate(text))

throws

AssertionError                            Traceback (most recent call last)
/home/jovyan/work/python/pycorenlp/corenlp.py in annotate(self, text, properties)
---> 11         assert isinstance(text, str)

AssertionError:

because in Python 2 the literal u'...' has type 'unicode', which is not an instance of 'str'.
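A minimal sketch of a type check that would accept both kinds of string on either interpreter. The name `is_annotatable` is hypothetical and not part of pycorenlp; it only illustrates why `isinstance(text, str)` is too narrow in Python 2.

```python
# Text type on either interpreter: 'unicode' on Python 2, 'str' on Python 3.
TEXT_TYPE = type(u'')


def is_annotatable(text):
    """Hypothetical check: accept text strings and byte strings alike."""
    # 'bytes' is an alias for 'str' on Python 2, so plain str passes there too.
    return isinstance(text, (TEXT_TYPE, bytes))
```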

Case 2: Passing encoded Unicode strings:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000/')
text = u'Köln is a city in Germany.'.encode('utf-8')
print(nlp.annotate(text))

throws

UnicodeDecodeError                        Traceback (most recent call last)
/home/jovyan/work/python/pycorenlp/corenlp.py in annotate(self, text, properties)
---> 25         data = text.encode()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

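because the string has already been encoded: in Python 2, calling encode() on a byte string first decodes it implicitly with the ASCII codec, which fails on the non-ASCII byte 0xc3.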
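The failure in case 2 can be reproduced on any Python version by spelling out the ASCII decode that Python 2 performs implicitly before re-encoding a byte string. This is only an illustration of the mechanism, not code taken from pycorenlp.

```python
# UTF-8 bytes for u'Köln is a city in Germany.' (\xf6 is 'ö').
encoded = u'K\xf6ln is a city in Germany.'.encode('utf-8')

message = None
try:
    # Python 2 runs this step implicitly when encode() is called on a byte string.
    encoded.decode('ascii')
except UnicodeDecodeError as err:
    message = str(err)

print(message)
```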

Both lines of code shown in the error messages were introduced in #6 in May 2016 to fix some Unicode issues.
However, it seems the explicit encoding in line 25 is no longer required: if it is removed, case 2 works perfectly (in both Python 2 and Python 3).

Note also that encoding issues were fixed in CoreNLP in October 2016 (stanfordnlp/CoreNLP#270).

@goerlitz

I have to correct myself. The character encoding in Python 3 is broken again if text.encode() is removed, so this string problem seems to be caused by one of the incompatible changes in Python 3.

I found out that a different wrapper implementation uses this piece of code to fix the issue:

import sys
# On Python 3, str is text and has to be encoded to bytes before sending.
if sys.version_info.major >= 3:
    text = text.encode('utf-8')
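As an alternative to branching on the interpreter version, one could branch on the type of the value. The sketch below is an assumption about how that might look; `to_utf8_bytes` is a hypothetical helper, not something pycorenlp or the other wrapper provides. It leaves byte strings untouched and encodes text strings exactly once, which would cover case 1 and case 2 on both Python 2 and Python 3.

```python
# Text type on either interpreter: 'unicode' on Python 2, 'str' on Python 3.
TEXT_TYPE = type(u'')


def to_utf8_bytes(text):
    """Hypothetical helper: return UTF-8 bytes for a text or byte string."""
    if isinstance(text, TEXT_TYPE):
        # Text string: encode exactly once.
        return text.encode('utf-8')
    if isinstance(text, bytes):
        # Already encoded (this includes plain 'str' on Python 2): pass through.
        return text
    raise TypeError('expected a text or byte string, got %r' % type(text))
```

The request body would then be built with `data = to_utf8_bytes(text)` in place of `data = text.encode()`. The caller remains responsible for making sure that byte strings passed in are really UTF-8.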
