Handling special characters with CoreNLP server. #270

Threynaud · 2016-09-30T12:32:23Z

I want to do some Named Entity Recognition on several documents using a CoreNLP server that I launch this way:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

Basic sentences like "The quick brown fox jumps over the lazy dog" work well on the test page at http://localhost:9000/, but as soon as special characters are involved, like the "é" in "I went to a café yesterday", I get the error:

{"sentences":[{"index":0,"parse":"SENTENCE_SKIPPED_OR_UNPARSABLE","basic-dependencies": ...

Note that this sentence works well on http://corenlp.run/.

I tried -strict and -encode utf-8, as well as adding parse to the annotators, with no success.

I am using version 3.6.0, downloaded from http://stanfordnlp.github.io/CoreNLP/download.html

Calling the server through the Python wrapper corenlp_pywrap returns the following error:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-163-aa251fb39f9e> in <module>()
      2     text = profile['text']
      3     processed_chunks = process_chunks(text)
----> 4     print(idx, (find_name_in_text(processed_chunks, profile), profile['full_name']))

<ipython-input-162-2c34ab37440e> in find_name_in_text(processed_chunks, profile, stop)
      1 def find_name_in_text(processed_chunks, profile, stop=10):
      2     for chunk in processed_chunks[:stop]:
----> 3         token_dict = cn.arrange(chunk)
      4         detected_names = find_names_in_dict(token_dict)
      5         if detected_names:

/usr/local/lib/python3.5/site-packages/corenlp_pywrap/pywrap.py in arrange(self, data)
    151 
    152         current_url = self.url_calc()
--> 153         r = self.server_connection(current_url, data)
    154         try:
    155             r = r.json()

/usr/local/lib/python3.5/site-packages/corenlp_pywrap/pywrap.py in server_connection(current_url, data)
     50             server_out = requests.post(current_url, 
     51                                         data,
---> 52                                         headers={'Connection': 'close'})
     53         except requests.exceptions.ConnectionError:
     54             root.error('Connection Error, check you have server running')

/usr/local/lib/python3.5/site-packages/requests/api.py in post(url, data, json, **kwargs)
    108     """
    109 
--> 110     return request('post', url, data=data, json=json, **kwargs)
    111 
    112 

/usr/local/lib/python3.5/site-packages/requests/api.py in request(method, url, **kwargs)
     54     # cases, and look like a memory leak in others.
     55     with sessions.Session() as session:
---> 56         return session.request(method=method, url=url, **kwargs)
     57 
     58 

/usr/local/lib/python3.5/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    473         }
    474         send_kwargs.update(settings)
--> 475         resp = self.send(prep, **send_kwargs)
    476 
    477         return resp

/usr/local/lib/python3.5/site-packages/requests/sessions.py in send(self, request, **kwargs)
    594 
    595         # Send the request
--> 596         r = adapter.send(request, **kwargs)
    597 
    598         # Total elapsed time of the request (approximately)

/usr/local/lib/python3.5/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    421                     decode_content=False,
    422                     retries=self.max_retries,
--> 423                     timeout=timeout
    424                 )
    425 

/usr/local/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, **response_kw)
    593                                                   timeout=timeout_obj,
    594                                                   body=body, headers=headers,
--> 595                                                   chunked=chunked)
    596 
    597             # If we're going to release the connection in ``finally:``, then

/usr/local/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    361             conn.request_chunked(method, url, **httplib_request_kw)
    362         else:
--> 363             conn.request(method, url, **httplib_request_kw)
    364 
    365         # Reset the timeout for the recv() on the socket

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py in request(self, method, url, body, headers)
   1104     def request(self, method, url, body=None, headers={}):
   1105         """Send a complete request to the server."""
-> 1106         self._send_request(method, url, body, headers)
   1107 
   1108     def _set_content_length(self, body, method):

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py in _send_request(self, method, url, body, headers)
   1148             # RFC 2616 Section 3.7.1 says that text default has a
   1149             # default charset of iso-8859-1.
-> 1150             body = _encode(body, 'body')
   1151         self.endheaders(body)
   1152 

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py in _encode(data, name)
    159             "%s (%.20r) is not valid Latin-1. Use %s.encode('utf-8') "
    160             "if you want to send it encoded in UTF-8." %
--> 161             (name.title(), data[err.start:err.end], name)) from None
    162 
    163 

UnicodeEncodeError: 'latin-1' codec can't encode character '\uf02a' in position 1: Body ('\uf02a') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

It looks like the problem comes from how the request sent to the server is processed.
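
For the Python side specifically, the last line of the traceback suggests encoding the body as UTF-8 before it reaches http.client, which otherwise falls back to Latin-1 for str bodies. Below is a minimal sketch of that idea using only the standard library instead of corenlp_pywrap. The helper names build_corenlp_request and annotate are hypothetical, and the URL layout (a JSON "properties" query parameter with "annotators" and "outputFormat") and the annotator list are assumptions about how the server is called. This only addresses the client-side UnicodeEncodeError; it does not change how the server itself decodes the request.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(text, server_url="http://localhost:9000/",
                          annotators="tokenize,ssplit,pos,lemma,ner"):
    """Build a POST request whose body is explicitly UTF-8 encoded.

    Hypothetical helper: the URL layout and default annotators are
    assumptions, not taken from corenlp_pywrap.
    """
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    # Passing bytes (not str) means http.client never tries Latin-1.
    body = text.encode("utf-8")
    return urllib.request.Request(
        server_url + "?" + query,
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def annotate(text, **kwargs):
    """Send the text to a running server and decode the JSON reply."""
    request = build_corenlp_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, annotate("I went to a café yesterday") would send the sentence as UTF-8 bytes; characters such as '\uf02a', which have no Latin-1 representation, no longer raise on the client.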

gangeli · 2016-09-30T15:35:30Z

This is a known bug in 3.6.0. If you use the HEAD of the GitHub repo, it should fix this (along with a number of other improvements :) ).

manning added the bug label Oct 9, 2016

manning closed this as completed Oct 9, 2016

goerlitz mentioned this issue Mar 17, 2018

Issue with Unicode in Python 2 smilli/py-corenlp#29

Open
