Handling special characters with CoreNLP server. #270

Closed
Threynaud opened this issue Sep 30, 2016 · 1 comment

@Threynaud

I want to do some Named Entity Recognition on several documents using a CoreNLP server that I launch this way:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

Basic sentences like "The quick brown fox jumps over the lazy dog" work well on the test page at http://localhost:9000/, but as soon as special characters are involved, like the "é" in "I went to a café yesterday", I get the error:

{"sentences":[{"index":0,"parse":"SENTENCE_SKIPPED_OR_UNPARSABLE","basic-dependencies": ...

Note that this sentence works well on [http://corenlp.run/].

I tried -strict and -encode utf-8, and also adding parse to the annotators, with no success.

I am using v3.6.0, downloaded from [http://stanfordnlp.github.io/CoreNLP/download.html]

Calling the server through the corenlp_pywrap Python wrapper raises the following error:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-163-aa251fb39f9e> in <module>()
      2     text = profile['text']
      3     processed_chunks = process_chunks(text)
----> 4     print(idx, (find_name_in_text(processed_chunks, profile), profile['full_name']))

<ipython-input-162-2c34ab37440e> in find_name_in_text(processed_chunks, profile, stop)
      1 def find_name_in_text(processed_chunks, profile, stop=10):
      2     for chunk in processed_chunks[:stop]:
----> 3         token_dict = cn.arrange(chunk)
      4         detected_names = find_names_in_dict(token_dict)
      5         if detected_names:

/usr/local/lib/python3.5/site-packages/corenlp_pywrap/pywrap.py in arrange(self, data)
    151 
    152         current_url = self.url_calc()
--> 153         r = self.server_connection(current_url, data)
    154         try:
    155             r = r.json()

/usr/local/lib/python3.5/site-packages/corenlp_pywrap/pywrap.py in server_connection(current_url, data)
     50             server_out = requests.post(current_url, 
     51                                         data,
---> 52                                         headers={'Connection': 'close'})
     53         except requests.exceptions.ConnectionError:
     54             root.error('Connection Error, check you have server running')

/usr/local/lib/python3.5/site-packages/requests/api.py in post(url, data, json, **kwargs)
    108     """
    109 
--> 110     return request('post', url, data=data, json=json, **kwargs)
    111 
    112 

/usr/local/lib/python3.5/site-packages/requests/api.py in request(method, url, **kwargs)
     54     # cases, and look like a memory leak in others.
     55     with sessions.Session() as session:
---> 56         return session.request(method=method, url=url, **kwargs)
     57 
     58 

/usr/local/lib/python3.5/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    473         }
    474         send_kwargs.update(settings)
--> 475         resp = self.send(prep, **send_kwargs)
    476 
    477         return resp

/usr/local/lib/python3.5/site-packages/requests/sessions.py in send(self, request, **kwargs)
    594 
    595         # Send the request
--> 596         r = adapter.send(request, **kwargs)
    597 
    598         # Total elapsed time of the request (approximately)

/usr/local/lib/python3.5/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    421                     decode_content=False,
    422                     retries=self.max_retries,
--> 423                     timeout=timeout
    424                 )
    425 

/usr/local/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, **response_kw)
    593                                                   timeout=timeout_obj,
    594                                                   body=body, headers=headers,
--> 595                                                   chunked=chunked)
    596 
    597             # If we're going to release the connection in ``finally:``, then

/usr/local/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    361             conn.request_chunked(method, url, **httplib_request_kw)
    362         else:
--> 363             conn.request(method, url, **httplib_request_kw)
    364 
    365         # Reset the timeout for the recv() on the socket

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py in request(self, method, url, body, headers)
   1104     def request(self, method, url, body=None, headers={}):
   1105         """Send a complete request to the server."""
-> 1106         self._send_request(method, url, body, headers)
   1107 
   1108     def _set_content_length(self, body, method):

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py in _send_request(self, method, url, body, headers)
   1148             # RFC 2616 Section 3.7.1 says that text default has a
   1149             # default charset of iso-8859-1.
-> 1150             body = _encode(body, 'body')
   1151         self.endheaders(body)
   1152 

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py in _encode(data, name)
    159             "%s (%.20r) is not valid Latin-1. Use %s.encode('utf-8') "
    160             "if you want to send it encoded in UTF-8." %
--> 161             (name.title(), data[err.start:err.end], name)) from None
    162 
    163 

UnicodeEncodeError: 'latin-1' codec can't encode character '\uf02a' in position 1: Body ('\uf02a') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
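
The last frames show http.client falling back to Latin-1 when the request body is a str. Note that "é" is itself representable in Latin-1, so this particular exception is triggered by '\uf02a' (a private-use code point) elsewhere in the text. A minimal sketch of that check, where latin1_encodable is a hypothetical helper name used only for illustration:

```python
def latin1_encodable(text):
    """Return True if every character of text fits in Latin-1 (ISO-8859-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# "é" (U+00E9) is inside the Latin-1 range, so this string encodes fine.
print(latin1_encodable("I went to a café yesterday"))

# U+F02A is outside Latin-1, which is the case reported in the traceback.
print(latin1_encodable("\uf02a"))
```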

It looks like the problem comes from how the request sent to the server is processed.
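
On the client side, the exception message itself suggests the workaround: encode the body as UTF-8 bytes before sending it. The sketch below does this with the standard library only. It assumes the server listens on http://localhost:9000/ as above and accepts a JSON `properties` URL parameter; the annotator list and the helper names build_request and annotate are illustrative assumptions, not part of corenlp_pywrap. This only avoids the Python UnicodeEncodeError; it may not change how a 3.6.0 server handles the text.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of a locally launched CoreNLP server.
SERVER_URL = "http://localhost:9000/"


def build_request(text, annotators="tokenize,ssplit,pos,lemma,ner"):
    """Build a POST request whose body is UTF-8 bytes rather than a str."""
    props = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    # Encoding explicitly keeps http.client from falling back to Latin-1.
    body = text.encode("utf-8")
    return urllib.request.Request(
        SERVER_URL + "?" + query,
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def annotate(text):
    """Send the text to the server and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(text)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; only annotate() does.
request = build_request("I went to a café yesterday")
print(request.data)
```

With the requests library, the equivalent change would be passing `data=text.encode('utf-8')` instead of the raw string.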

@gangeli
Member

gangeli commented Sep 30, 2016

This is a known bug in 3.6.0. If you use the HEAD of the GitHub repo, it should fix this (along with a number of other improvements :) ).
