Fix crash in `ia tasks` when a task log contains invalid UTF-8 #360

JustAnotherArchivist · 2020-06-10T00:54:43Z

Fixes #359

Since simply returning bytes from `CatalogTask.get_task_log` is not an option (it would break the API), the `bytes.decode` call needs an `errors` kwarg. `get_task_log` is a library function, and therefore I think it should return the data as close to the original as possible. Python's solution for that problem is `surrogateescape`, which uses lone surrogate code points to represent the invalid bytes. Encoding again with the same handler produces the exact same bytes that were decoded, i.e. it is a lossless operation. However, surrogates can't be printed, so simply using that handler just causes `ia tasks` to crash a bit later, on line 103 in `cli/ia_tasks.py` (with `UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 413070: surrogates not allowed`). The surrogates need to be replaced before printing; the best replacement for that is U+FFFD (REPLACEMENT CHARACTER). All error handlers other than `surrogateescape` are lossy and therefore not suitable for a library function in my opinion.
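To make the behaviour described above concrete, here is a minimal sketch. The sample bytes and variable names are made up for illustration; this is not the code from the PR itself.

```python
# Hypothetical log fragment containing one invalid UTF-8 byte (0xd0 followed by a space).
raw = b'task log \xd0 end'

# surrogateescape maps the undecodable byte 0xd0 to the lone surrogate U+DCD0.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode('utf-8', 'surrogateescape')

# A strict encode (which is what writing to a UTF-8 stream does) rejects the surrogate.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display, go back to bytes losslessly and then decode lossily with 'replace'.
printable = text.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')

# ascii() keeps the output independent of the terminal encoding.
print(ascii(printable))
```

The lossless string stays available to library callers, while only the CLI pays for the lossy conversion right before printing.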

While the extra encode/decode round-trip is not very elegant, it is by far the fastest of the options compared below. Even on this fairly large log (1.7 MB), it only adds a negligible 5 ms to the runtime.

Performance comparison with other options

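import re
import timeit

# Load the raw task log and decode it losslessly; invalid bytes become lone surrogates.
with open('task-1750701745.log', 'rb') as fp:
	d = fp.read()
s = d.decode('utf-8', 'surrogateescape')


def replsurrogates(s):
	# surrogateescape maps the invalid bytes 0x80-0xFF to U+DC80-U+DCFF.
	for o in range(0xDC80, 0xDD00):
		c = chr(o)
		if c in s:  # skip the replace call for surrogates that never occur
			s = s.replace(c, '\ufffd')
	return s


def resurrogates(s):
	return re.sub('[\udc80-\udcff]', '\ufffd', s)


# Time each approach three times, 100 iterations per run.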
print('encode/decode')
print(timeit.timeit('s.encode("utf-8", "surrogateescape").decode("utf-8", "replace")', globals = globals(), number = 100))
print(timeit.timeit('s.encode("utf-8", "surrogateescape").decode("utf-8", "replace")', globals = globals(), number = 100))
print(timeit.timeit('s.encode("utf-8", "surrogateescape").decode("utf-8", "replace")', globals = globals(), number = 100))

print('re')
print(timeit.timeit('resurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('resurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('resurrogates(s)', globals = globals(), number = 100))

print('str.replace')
print(timeit.timeit('replsurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('replsurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('replsurrogates(s)', globals = globals(), number = 100))

where `task-1750701745.log` was retrieved using `curl -H 'Authorization: LOW access:secret' https://catalogd.archive.org/services/tasks.php?task_log=1750701745 >task-1750701745.log`.

The functions were based on this analysis on Stack Overflow. Since surrogates are expected to be rare, the `in` check in the `str.replace` approach should significantly improve performance in the normal case where no invalid UTF-8 is present, and I didn't even bother to test without it.

On my test machine (seconds per 100 iterations):

encode/decode
0.5847692609968362
0.6027463229984278
0.5812073729903204
re
1.579910109998309
1.5495683979970636
1.5630015220085625
str.replace
6.836233349997201
8.074072139002965
7.55777670200041
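As a quick sanity check that the three candidates timed above produce the same text, the following sketch runs them on a small made-up sample. The function names `via_roundtrip`, `via_re` and `via_replace` are illustrative only. Agreement on this one input is not a proof; other kinds of invalid sequences are not covered here.

```python
import re

# Made-up sample: the byte 0xff can never appear in valid UTF-8.
sample = b'ok \xff ok'.decode('utf-8', 'surrogateescape')


def via_roundtrip(s):
    # Back to the original bytes, then decode lossily.
    return s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')


def via_re(s):
    # Replace every escaped-byte surrogate with U+FFFD.
    return re.sub('[\udc80-\udcff]', '\ufffd', s)


def via_replace(s):
    # One str.replace per surrogate that actually occurs in the string.
    for o in range(0xDC80, 0xDD00):
        c = chr(o)
        if c in s:
            s = s.replace(c, '\ufffd')
    return s


# Collect the distinct outputs; a single element means all three agree.
results = {f(sample) for f in (via_roundtrip, via_re, via_replace)}
print(len(results))
```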

There is one issue with this solution: `surrogateescape` was only added in Python 3.1, and therefore this won't work on Python 2 (and the test suite on this PR ~~will fail for this reason~~ Edit: would have failed if it covered that part of the code). I'm not aware of a solution for Python 2 since this error handler simply didn't exist. I'm not sure if this is still a concern though; Python 2 has been EOL since January and really shouldn't be used by anyone anymore. I don't think there's any good reason to still support it. If you really want to keep it anyway, another `if six.PY2` will be needed inside `get_task_log` to not use the error handler there. However, #359 is probably impossible to fix in Python 2 without breaking the API and returning bytes from `get_task_log`.
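A minimal sketch of what such a version guard could look like follows. The helper `decode_task_log` is hypothetical and not part of the library, and `sys.version_info` stands in for `six.PY2` so the example runs without third-party packages. Falling back to `'replace'` on Python 2 is an assumption made for illustration; the suggestion above is only to not pass the new handler there.

```python
import sys

# Stand-in for six.PY2 (assumption: avoids the third-party dependency here).
PY2 = sys.version_info[0] == 2


def decode_task_log(raw):
    """Hypothetical helper: decode task log bytes to text."""
    if PY2:
        # surrogateescape does not exist on Python 2; 'replace' is lossy.
        return raw.decode('utf-8', 'replace')
    # Python 3.1+: keep invalid bytes recoverable as lone surrogates.
    return raw.decode('utf-8', 'surrogateescape')


decoded = decode_task_log(b'a\xffb')
```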

jjjake · 2020-06-22T19:27:02Z

looks great, thanks @JustAnotherArchivist!

JustAnotherArchivist added 3 commits June 10, 2020 00:18

Fix crash in ia tasks when a task log contains invalid UTF-8

d2a9e62

Remove spaces on kwargs as the test suite doesn't like them

49cc804

Split up line considered too long by the test suite

90c9d4f

jjjake merged commit ae36b29 into jjjake:master Jun 22, 2020

JustAnotherArchivist mentioned this pull request Sep 18, 2021

Consider dropping Python 2 support #435

Closed

JustAnotherArchivist mentioned this pull request Feb 7, 2022

Occasional UnicodeDecodeError from get_task_log() #310

Closed
