fix(`decode_bytes`): always backslashreplace when asked to #742

mih · 2024-07-11T04:37:18Z

The previous implementation enabled error handling in decoding only for the segment of a bytestring that an exception was raised for.

However, it may well be that more decoding errors exist in other parts of the bytestring. I have a complicated real-world case where this happens, i.e. raising `UnicodeDecodeError` again, even though `decode_bytes` was called with `backslash_replace=True`. Unfortunately the data is so large that I did not manage to catch the condition exactly.

It seems to be a needless sophistication to decode some part of the bytestring with error handling, but not another.

There is a good chance that this patch interacts badly with the logic that obtains the next chunk before decoding is attempted again.
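To make the failure mode concrete, here is a minimal sketch using only the standard `bytes.decode` API. It is not the datalad-next code; the sample bytes and variable names are made up for illustration. It shows that a `UnicodeDecodeError` reports only the first undecodable range, so applying error handling to that range alone leaves later errors in place.

```python
# Illustrative sample with two undecodable bytes (0xaf is not a valid UTF-8 start byte).
data = b'ok \xaf mid \xaf end'

try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    # The exception describes only the first offending range.
    first_error = (exc.start, exc.end)
    head = data[:exc.start].decode('utf-8')
    bad = data[exc.start:exc.end].decode('utf-8', errors='backslashreplace')
    remainder = data[exc.end:]

# The remainder still holds an undecodable byte, so a strict decode fails again.
try:
    remainder.decode('utf-8')
    second_failure = False
except UnicodeDecodeError:
    second_failure = True
```

Any code path that decodes the remainder strictly will therefore raise again, which matches the symptom described above.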

codecov · 2024-07-11T05:38:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.51%. Comparing base (f00cfdb) to head (0ac7974).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #742      +/-   ##
==========================================
+ Coverage   92.47%   92.51%   +0.03%     
==========================================
  Files         195      195              
  Lines       14301    14301              
  Branches     2162     2162              
==========================================
+ Hits        13225    13230       +5     
+ Misses        812      806       -6     
- Partials      264      265       +1

christian-monch

Thx @mih for catching that. The fix needs a change IMHO.

christian-monch · 2024-07-11T07:24:29Z

datalad_next/itertools/decode_bytes.py

            return (
                position + exc.end,
-                joined_data[:position + exc.start].decode(encoding)
-                + joined_data[position + exc.start:position + exc.end].decode(
-                    encoding,
-                    errors='backslashreplace'
-                ),
+                joined_data[:position + exc.end].decode(
+                    encoding, errors='backslashreplace')


I think this does not work. The code duplicates input data, as in the following example:

    >>> tuple(decode_bytes([b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3']))
    ('08 War \\xaf', '08 War \\xaf No \\xaf', '08 War \\xaf No \\xaf More \\xaf', ' Trouble.shn.mp3')

I think the return statement should be (a missing position index is added):

            return (
                position + exc.end,
                joined_data[position:position + exc.start].decode(encoding)
                + joined_data[position + exc.start:position + exc.end].decode(
                    encoding,
                    errors='backslashreplace'
                ),
            )
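The role of the `position` offset can be seen in a standalone sketch. The generator below is a hypothetical helper named `decode_replacing`, not the actual `decode_bytes` implementation. It assumes the whole input is available as one `bytes` object and ignores the chunk-fetching logic, so it does not wait for more data when a multi-byte sequence is cut off at the end.

```python
def decode_replacing(data: bytes, encoding: str = 'utf-8'):
    """Yield decoded pieces of `data`, escaping every undecodable byte.

    Hypothetical helper for illustration; not the datalad-next implementation.
    """
    position = 0
    while position < len(data):
        try:
            # Everything from `position` onward decodes cleanly: emit it and stop.
            yield data[position:].decode(encoding)
            return
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice data[position:],
            # so every slice below has to be offset by `position`.
            yield (
                data[position:position + exc.start].decode(encoding)
                + data[position + exc.start:position + exc.end].decode(
                    encoding, errors='backslashreplace'
                )
            )
            position += exc.end


pieces = list(decode_replacing(b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3'))
```

Starting the clean slice at `position` rather than at index 0 is what prevents already emitted text from being repeated in later pieces.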

christian-monch · 2024-07-11T07:30:03Z

We should also add a test for proper handling of multiple errors in a single input chunk. For example

def test_multiple_errors():
    r = ''.join(decode_bytes([b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3']))
    assert r == '08 War \\xaf No \\xaf More \\xaf Trouble.shn.mp3'
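As a cross-check on the expected value in that test, the same string results from a single-pass decode of the whole input with the standard `backslashreplace` error handler. This sketch assumes UTF-8 as the encoding and uses only `bytes.decode`; it does not call `decode_bytes`.

```python
raw = b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3'

# Decode everything in one pass; each undecodable byte becomes a \xNN escape.
decoded = raw.decode('utf-8', errors='backslashreplace')
print(decoded)
```

A chunk-wise decoder that handles multiple errors correctly should produce output that joins to this same string.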

mih · 2024-07-11T08:35:48Z

Thanks @christian-monch! Can you make the changes directly and push them into this PR? You have a better understanding of the logic, and I fear that more iterations would be needed if I did it. Thanks!

mih · 2024-07-11T09:13:35Z

I just saw that this code has already been migrated to datasalad (not yet released). The fix needs to be ported over before or when this code is removed here.

mih · 2024-07-11T09:21:28Z

An alternative to fixing it here is to fix it in datasalad. This code has already been migrated and is pending a release.

This commit fixes an issue in the handling of multiple errors, where parts of the input strings were repeated in the output of `decode_bytes`. It also adds a regression test to ensure that multiple encoding errors in a single input chunk are handled properly.

mih

Thanks! Also merged into datasalad!

mih mentioned this pull request Jul 11, 2024

decode_bytes() improper error handling datalad/datasalad#35

Closed

christian-monch requested changes Jul 11, 2024

christian-monch mentioned this pull request Jul 11, 2024

Fix handling of multiple encoding errors in decode_bytes input chunks. datalad/datasalad#36

Merged

mih merged commit b20e779 into datalad:main Jul 11, 2024
6 of 7 checks passed

mih deleted the decodebytes branch July 11, 2024 10:26
