Unicode error due to using surrogates #18

jgbarah · 2016-03-15T22:37:23Z

When running the git backend to analyze the git repo of the torvalds/linux GitHub repository, and uploading it to ElasticSearch (using the Python elasticsearch module) i get an error which apparently is due to trying to UTF-8 encode a "surrogated" string. The relevant part of the debugging messages and the exception I get is:

...
'message': "[PATCH] intel8x0: AC'97 audio patch for Intel ESB2\n\nThis
patch adds the Intel ESB2 DID's to the intel8x0.c file for AC'97
audio\nsupport.\n\nSigned-off-by: \udca0Jason Gaston
<[email protected]>\nSigned-off-by: Andrew Morton
<[email protected]>\nSigned-off-by: Linus Torvalds <[email protected]>",
'commit': 'c4c8ea948aa21527d502e87227b2f1d951bc506d'}
...
Traceback (most recent call last):
...
  File "/home/jgb/venvs/grimoirelab/lib/python3.5/site-packages/elasticsearch/transport.py", line 312, in perform_request
     body = body.encode('utf-8')
 UnicodeEncodeError: 'utf-8' codec can't encode character '\udca0' in position 984: surrogates not allowed

The code causing the error is:

res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)

The problem seems to be that there is a character, decoded from the git log by Perceval into a Unicode string, which is using surrogates (the character is '\udca0'), and it cannot be properly encoded in UTF-8 by the elasticsearch code, thus raising the exception.

The text was updated successfully, but these errors were encountered:

jgbarah · 2016-03-15T22:48:29Z

After some more inspection, the error seems to be that the elasticsearch module expects Unicode strings that can be directly encoded as UTF-8, but surrogates are Unicode characters that cannot.

The problem sees to come from the way Perceval reads the data from the git log (thanks @sduenas for pointing this out):

        for line in gitlog:
            line = line.decode(encoding, errors='surrogateescape')
            yield line

This means that Perceval is producing Unicode strings with surrogates, which raise the exception UnicodeEncodeError when elasticsearch tries to encode them as UTF-8.

jgbarah · 2016-03-15T23:17:18Z

After commenting with @sduenas and @acs the solutions I see are:

(1) Use some other method for non-UTF-8 strings in the git log. I suggest trying the "usual" Latin-1 encoding if UTF-8 fails, and then 'backslashreplace' as the error handler instead of 'surrogateescape'. A simple code (without trying another encoding if UTF-8 fails) would be:

        for line in gitlog:
            line = line.decode(encoding, errors='backslashreplace')
            yield line

(2) Removing surrogates when they are found. That is, "sanitizing" the strings to make sure they can be encoded to UTF-8. For example, in my code, this could amount to:

        try:
            body.encode("utf-8")
        except UnicodeEncodeError as e:
            if e.reason == 'surrogates not allowed':
                body = body.encode('utf-8', "backslashreplace").decode('utf-8')
        res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)

(3) Ignoring surrogated characters, by using "ignore" instead of "backslashreplace" in the code above.

Maybe (1) is the best solution (trying with another encodings, and if none works, using "backslashreplace"), since it allows for all strings to be "encodeable" as UTF-8, which is a good thing.

I would avoid (3), because that way some characters could be lost with a trace, but it is clearly better than doing nothing.

If (1) cannot be done for some reason, maybe (2) (use "backslashreplace") could be the next better thing to do. That would allow to capture most of the problematic cases, not losing any character, and still being able of encoding as UTF-8. The main drawback of this method is that it would force all strings to be checked for UTF-8 compliance. But if the operation raising the exception is idempotent, the code could be more efficient, sanitizing strings only when the exception is raised:

        try:
            res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)
        except UnicodeEncodeError as e:
            if e.reason == 'surrogates not allowed':
                body = body.encode('utf-8', "backslashreplace").decode('utf-8')
                res = self.es.index(index = self.index, doc_type = self.type,
                            id = self._id(item), body = item)

Therefore, could (1) be implemented in Perceval (at least for git)? I can produce a PR for that, if convenient...

Some references which I read before making up my mind:

The Updated Guide to Unicode in Python: this gave me the clues to really understand what was happening
Python3 Unicode Howto|The String Type: this mentions the "backslashreplace" error handler
Processing Text Files in Python3

sduenas · 2016-03-18T11:10:51Z

My idea using subrogateescape was the same you can get with blackslashreplace but I didn't know the weird behaviour of store those characters inside the Unicode private area.

(1) sounds good but I have to find a way to make it clean because it can add more complexity to the code when it isn't really needed.

When decoding as utf8, if the character cannnot be decoded, use the backslashreplace error handler, instead of the surrogateescape error handler. Fixes #18 for git backend, maybe others should be fixed too.

When decoding as utf8, if the character cannnot be decoded, use the backslashreplace error handler, instead of the surrogateescape error handler. Fixes chaoss#18 for git backend, maybe others should be fixed too.

Closes chaoss#18

jgbarah linked a pull request Mar 18, 2016 that will close this issue

[git] Use "backslashreplace" instead of "surrogateescape". #22

Open

valeriocos pushed a commit to valeriocos/perceval that referenced this issue Dec 6, 2017

Migrate TravisCI from legacy to container-based infrastructure

eaaa169

Closes chaoss#18

sduenas added bug good first issue Good issue for first-time contributors labels Oct 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode error due to using surrogates #18

Unicode error due to using surrogates #18

jgbarah commented Mar 15, 2016

jgbarah commented Mar 15, 2016

jgbarah commented Mar 15, 2016

sduenas commented Mar 18, 2016

Unicode error due to using surrogates #18

Unicode error due to using surrogates #18

Comments

jgbarah commented Mar 15, 2016

jgbarah commented Mar 15, 2016

jgbarah commented Mar 15, 2016

sduenas commented Mar 18, 2016