Unicode error due to using surrogates #18
Comments
After some more inspection, the error seems to be that the elasticsearch module expects Unicode strings that can be directly encoded as UTF-8, but lone surrogates are Unicode code points that cannot. The problem seems to come from the way Perceval reads the data from the git log (thanks @sduenas for pointing this out):

for line in gitlog:
    line = line.decode(encoding, errors='surrogateescape')
    yield line

This means that Perceval is producing Unicode strings with surrogates, which raise the exception UnicodeEncodeError when elasticsearch tries to encode them as UTF-8.
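The failure described above can be reproduced without git or Elasticsearch. This is a minimal sketch, assuming Python 3; the byte string `raw` is a made-up stand-in for a git log line containing a byte that is not valid UTF-8.

```python
# Hypothetical input: 0xa0 on its own is not a valid UTF-8 sequence.
raw = b"caf\xa0"

# surrogateescape maps the undecodable byte 0xa0 to the lone surrogate U+DCA0.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udca0"

# Strict UTF-8 encoding rejects lone surrogates.
try:
    text.encode("utf-8")
    reason = None
except UnicodeEncodeError as e:
    reason = e.reason

print(reason)

# Encoding with the same handler restores the original bytes, which is
# what surrogateescape is designed for (lossless round trips).
assert text.encode("utf-8", errors="surrogateescape") == raw
```

So the string is fine as long as it is encoded again with surrogateescape, but any consumer that encodes strictly as UTF-8 will raise.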
After discussing this with @sduenas and @acs, the solutions I see are:
for line in gitlog:
    line = line.decode(encoding, errors='backslashreplace')
    yield line
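To illustrate what this decoding variant produces, here is a small self-contained sketch. It assumes Python 3.5 or later (where backslashreplace is also accepted when decoding); `read_lines` and the in-memory `gitlog` are hypothetical stand-ins, not Perceval code.

```python
import io


def read_lines(stream, encoding="utf-8"):
    """Yield decoded lines; undecodable bytes become literal \\xNN escapes."""
    for line in stream:
        yield line.decode(encoding, errors="backslashreplace")


# Hypothetical stand-in for the git log: two lines with invalid UTF-8 bytes.
gitlog = io.BytesIO(b"Author: Jos\xe9\nfix \xa0 typo\n")
lines = list(read_lines(gitlog))

# Every resulting string can now be encoded strictly as UTF-8.
encoded = [line.encode("utf-8") for line in lines]
```

The offending bytes stay visible as `\xe9` and `\xa0` in the text, so nothing is dropped, but the original bytes can no longer be recovered by simply re-encoding.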
try:
    body.encode("utf-8")
except UnicodeEncodeError as e:
    if e.reason == 'surrogates not allowed':
        body = body.encode('utf-8', "backslashreplace").decode('utf-8')
res = self.es.index(index=self.index, doc_type=self.type,
                    id=self._id(item), body=item)
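The check-and-sanitize step in the snippet above can be isolated into a helper, which makes it easy to test on its own. This is only a sketch: `to_utf8_safe` is a hypothetical name, and it assumes the value being uploaded is a plain `str`.

```python
def to_utf8_safe(text):
    """Return text unchanged if it encodes as UTF-8, else escape lone surrogates."""
    try:
        text.encode("utf-8")
        return text
    except UnicodeEncodeError:
        # backslashreplace turns a lone surrogate such as U+DCA0 into the
        # six ASCII characters \udca0, which encode without problems.
        return text.encode("utf-8", errors="backslashreplace").decode("utf-8")


clean = to_utf8_safe("caf\udca0")
untouched = to_utf8_safe("café")
```

Strings that are already valid pass through untouched, so the only cost for the common case is one trial encode.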
Maybe (1) is the best solution (trying other encodings, and if none works, using "backslashreplace"), since it allows all strings to be encodable as UTF-8, which is a good thing. I would avoid (3), because that way some characters could be lost with a trace, but it is clearly better than doing nothing.

If (1) cannot be done for some reason, maybe (2) (use "backslashreplace") could be the next best thing to do. That would capture most of the problematic cases, without losing any character, while still being able to encode as UTF-8.

The main drawback of this method is that it would force all strings to be checked for UTF-8 compliance. But if the operation raising the exception is idempotent, the code could be more efficient, sanitizing strings only when the exception is raised:

try:
    res = self.es.index(index=self.index, doc_type=self.type,
                        id=self._id(item), body=item)
except UnicodeEncodeError as e:
    if e.reason == 'surrogates not allowed':
        body = body.encode('utf-8', "backslashreplace").decode('utf-8')
        res = self.es.index(index=self.index, doc_type=self.type,
                            id=self._id(item), body=item)

Therefore, could (1) be implemented in Perceval (at least for git)? I can produce a PR for that, if convenient...

Some references which I read before making up my mind:
My idea using (1) sounds good but I have to find a way to make it clean because it can add more complexity to the code when it isn't really needed.
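One way option (1) might be kept small is a single helper that tries a short list of encodings strictly and only then falls back to backslashreplace. This is a sketch under assumptions: `decode_with_fallbacks` is a hypothetical name, and the candidate list (`utf-8`, then `cp1252`) is purely illustrative, not something Perceval is known to use.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes with the first encoding that accepts them strictly.

    If no candidate works, decode with the first encoding and escape the
    undecodable bytes, so the result is always encodable as UTF-8.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors="backslashreplace")


valid_utf8 = decode_with_fallbacks(b"caf\xc3\xa9")   # decoded as UTF-8
legacy = decode_with_fallbacks(b"caf\xe9")           # decoded as cp1252
escaped = decode_with_fallbacks(b"\x81")             # no candidate accepts it
```

A caveat with this approach is that a fallback encoding may accept bytes it was never meant for and silently produce the wrong characters, so the candidate list should stay short and deliberate.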
When decoding as UTF-8, if a character cannot be decoded, use the backslashreplace error handler instead of the surrogateescape error handler. Fixes #18 for the git backend; maybe others should be fixed too.
When running the git backend to analyze the git repo of the torvalds/linux GitHub repository, and uploading it to ElasticSearch (using the Python elasticsearch module), I get an error which apparently is due to trying to UTF-8 encode a "surrogated" string. The relevant part of the debugging messages and the exception I get is:
The code causing the error is:
The problem seems to be that there is a character, decoded from the git log by Perceval into a Unicode string, which is using surrogates (the character is '\udca0'), and it cannot be properly encoded in UTF-8 by the elasticsearch code, thus raising the exception.