Fix crash in ia tasks
when a task log contains invalid UTF-8
#360
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #359
Since simply returning bytes from
CatalogTask.get_task_log
is not an option (would break API), thebytes.decode
call needs anerrors
kwarg.get_task_log
is a library function, and therefore I think it should return the data as close to the original as possible. Python's solution for that problem issurrogateescape
, which uses surrogate codes to encode those invalid bytes. Encoding again with the same handler produces the exact samebytes
that were decoded, i.e. it is a lossless operation. However, surrogates can't be printed, so simply using that handler just causesia tasks
to crash a bit later on line 103 incli/ia_tasks.py
(withUnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 413070: surrogates not allowed
). The surrogates need to be replaced before printing; the best replacement for that isU+FFFD
(REPLACEMENT CHARACTER). All error handlers other thansurrogateescape
are lossy and therefore not suitable for a library function in my opinion.While the extra encode/decode round-trip is not very elegant, it is by far the fastest option. Even on this fairly large log (1.7 MB), it only adds a negligible 5 ms to the runtime.
Performance comparison with other options
where
task-1750701745.log
was retrieved usingcurl -H 'Authorization: LOW access:secret' https://catalogd.archive.org/services/tasks.php?task_log=1750701745 >task-1750701745.log
.The functions were based on this analysis on Stack Overflow. Since surrogates are expected to be rare, the
in
check in thestr.replace
approach will significantly improve performance in the normal case where no invalid UTF-8 is present, and I didn't even bother to test without it.On my test machine:
There is one issue with this solution:
surrogateescape
was only added in Python 3.1, and therefore this won't work on Python 2 (and the test suite on this PRwill fail for this reasonEdit: would have failed if it covered that part of the code). I'm not aware of a solution for Python 2 since this error handler simply didn't exist. I'm not sure if this is still a concern though; Python 2 is EOL since January and really shouldn't be used anymore by anyone. I don't think there's any good reason to still support it. If you really want to keep it anyway, anotherif six.PY2
will be needed insideget_task_log
to not use the error handler there. However, #359 is probably impossible to fix in Python 2 without breaking the API and returning bytes fromget_task_log
.