Fix crash in ia tasks when a task log contains invalid UTF-8 #360

Merged
merged 3 commits into from
Jun 22, 2020

JustAnotherArchivist
Contributor

@JustAnotherArchivist JustAnotherArchivist commented Jun 10, 2020

Fixes #359

Since simply returning bytes from CatalogTask.get_task_log is not an option (it would break the API), the bytes.decode call needs an errors kwarg. get_task_log is a library function, so I think it should return the data as close to the original as possible. Python's solution for that problem is surrogateescape, which maps each invalid byte to a lone surrogate code point (U+DC80 to U+DCFF). Encoding again with the same handler produces the exact same bytes that were decoded, i.e. it is a lossless operation. However, surrogates can't be printed, so using that handler alone just causes ia tasks to crash a bit later, on line 103 in cli/ia_tasks.py (with UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 413070: surrogates not allowed). The surrogates therefore need to be replaced before printing; the best replacement is U+FFFD (REPLACEMENT CHARACTER). All error handlers other than surrogateescape are lossy and therefore not suitable for a library function, in my opinion.
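
To illustrate the behavior described above, here is a minimal sketch using a made-up byte string instead of a real task log (the sample data and variable names are assumptions for illustration only). It shows the strict decode failing, the lossless round-trip with surrogateescape, and the later failure when the surrogate has to be encoded for output:

```python
# Hypothetical sample data: valid UTF-8 text with one invalid byte (0xD0) in it.
raw = b'task started \xd0 task finished'

# Strict decoding fails on the invalid byte.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the invalid byte 0xD0 to the lone surrogate U+DCD0.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode('utf-8', 'surrogateescape')

# A plain UTF-8 encode (which is what printing to a UTF-8 stream needs) rejects the surrogate.
try:
    text.encode('utf-8')
    printable = True
except UnicodeEncodeError:
    printable = False

print(strict_ok, '\udcd0' in text, roundtrip == raw, printable)
```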

While the extra encode/decode round-trip is not very elegant, it is by far the fastest option. Even on this fairly large log (1.7 MB), it only adds a negligible 5 ms to the runtime.
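
The encode/decode round-trip mentioned above can be sketched as a small helper. The name sanitize_for_display and the sample bytes are hypothetical and not taken from the library; this only demonstrates the technique on a string that was decoded with surrogateescape:

```python
def sanitize_for_display(text):
    """Replace surrogate-escaped bytes with U+FFFD (hypothetical helper name)."""
    # Step 1: surrogateescape turns the lone surrogates back into the original invalid bytes.
    # Step 2: decoding with 'replace' substitutes U+FFFD for those invalid bytes.
    return text.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')


# Made-up sample: one invalid byte (0xFF) between valid ASCII text.
log_text = b'ok \xff done'.decode('utf-8', 'surrogateescape')
clean = sanitize_for_display(log_text)

# The cleaned string can be encoded as plain UTF-8, so printing it no longer fails.
encoded = clean.encode('utf-8')
print(clean)
```

Text without any surrogates passes through unchanged, so the helper can be applied unconditionally right before printing.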

Performance comparison with other options
with open('task-1750701745.log', 'rb') as fp:
	d = fp.read()
s = d.decode('utf-8', 'surrogateescape')


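def replsurrogates(s):
	# str.replace approach: check each possible escape surrogate (U+DC80 to U+DCFF)
	# and only call replace for the ones that actually occur in the string.
	for o in range(0xDC80, 0xDD00):
		c = chr(o)
		if c in s:
			s = s.replace(c, '\ufffd')
	return s

import re
def resurrogates(s):
	# Regex approach: replace the whole escape surrogate range in one pass.
	return re.sub('[\udc80-\udcff]', '\ufffd', s)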


import timeit
print('encode/decode')
print(timeit.timeit('s.encode("utf-8", "surrogateescape").decode("utf-8", "replace")', globals = globals(), number = 100))
print(timeit.timeit('s.encode("utf-8", "surrogateescape").decode("utf-8", "replace")', globals = globals(), number = 100))
print(timeit.timeit('s.encode("utf-8", "surrogateescape").decode("utf-8", "replace")', globals = globals(), number = 100))

print('re')
print(timeit.timeit('resurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('resurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('resurrogates(s)', globals = globals(), number = 100))

print('str.replace')
print(timeit.timeit('replsurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('replsurrogates(s)', globals = globals(), number = 100))
print(timeit.timeit('replsurrogates(s)', globals = globals(), number = 100))

Here, task-1750701745.log was retrieved using curl -H 'Authorization: LOW access:secret' 'https://catalogd.archive.org/services/tasks.php?task_log=1750701745' >task-1750701745.log.

The functions were based on this analysis on Stack Overflow. Since surrogates are expected to be rare, the in check in the str.replace approach will significantly improve performance in the normal case where no invalid UTF-8 is present, and I didn't even bother to test without it.

On my test machine:

encode/decode
0.5847692609968362
0.6027463229984278
0.5812073729903204
re
1.579910109998309
1.5495683979970636
1.5630015220085625
str.replace
6.836233349997201
8.074072139002965
7.55777670200041
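
Each figure above is the total time for 100 calls, so the per-call cost is easier to compare after averaging and dividing. The following sketch only does that arithmetic on the numbers already listed; nothing is re-measured:

```python
# Timings copied from the output above; each value is the total for number=100 calls.
timings = {
    'encode/decode': [0.5847692609968362, 0.6027463229984278, 0.5812073729903204],
    're': [1.579910109998309, 1.5495683979970636, 1.5630015220085625],
    'str.replace': [6.836233349997201, 8.074072139002965, 7.55777670200041],
}
CALLS_PER_RUN = 100

# Mean over the three runs, divided by the calls per run, converted to milliseconds.
per_call_ms = {
    name: sum(runs) / len(runs) / CALLS_PER_RUN * 1000
    for name, runs in timings.items()
}

for name, ms in per_call_ms.items():
    print(f'{name}: {ms:.1f} ms per call')
```

This works out to roughly 6 ms per call for encode/decode, 16 ms for the regex, and 75 ms for str.replace on this log, i.e. the alternatives are about 2.7 and 12.7 times slower.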

There is one issue with this solution: surrogateescape was only added in Python 3.1, so this won't work on Python 2. (The test suite on this PR will fail for this reason. Edit: it would have failed if it covered that part of the code.) I'm not aware of a solution for Python 2, since this error handler simply doesn't exist there. I'm not sure this is still a concern, though: Python 2 has been EOL since January and really shouldn't be used by anyone anymore, and I don't think there's any good reason to still support it. If you really want to keep it anyway, another if six.PY2 check will be needed inside get_task_log to not use the error handler there. However, #359 is probably impossible to fix on Python 2 without breaking the API and returning bytes from get_task_log.
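
For illustration, a version guard of the kind discussed above could look roughly like this. This is a sketch under assumptions, not the library's code: decode_task_log is a hypothetical stand-in for the decoding step inside get_task_log, it checks sys.version_info instead of six.PY2 to stay self-contained, and the lossy 'replace' fallback on Python 2 is a choice made only for this example (the paragraph above merely says the handler would have to be skipped there):

```python
import sys


def decode_task_log(raw):
    """Decode raw log bytes to text (hypothetical helper, not the real get_task_log)."""
    if sys.version_info[0] == 2:
        # Python 2 has no surrogateescape handler. 'replace' is a lossy stand-in
        # chosen for this sketch; the original bytes cannot be recovered from it.
        return raw.decode('utf-8', 'replace')
    # Python 3: invalid bytes are kept losslessly as lone surrogates.
    return raw.decode('utf-8', 'surrogateescape')


# Made-up sample with one invalid byte.
decoded = decode_task_log(b'a\xffb')
```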

@jjjake jjjake merged commit ae36b29 into jjjake:master Jun 22, 2020
Owner

jjjake commented Jun 22, 2020

looks great, thanks @JustAnotherArchivist!

Successfully merging this pull request may close these issues.

UnicodeDecodeError when trying to display task logs with invalid UTF-8