Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(decode_bytes): always backslashreplace when asked to #742

Merged
merged 2 commits into from
Jul 11, 2024

Conversation

mih
Copy link
Member

@mih mih commented Jul 11, 2024

The previous implementation enabled error handling in decoding only for the segment of a bytestring that an exception was raised for.

However, it may well be that more decoding errors exist in other parts of the bytestring. I have a complicated real-world case where this happens, i.e. raising UnicodeDecodeError again, even though decode_bytes was called with backslash_replace=True. Unfortunately the data is so large that I did not manage to catch the condition exactly.

It seems to be a needless sophistication to decode some part of the bytestring with error handling, but not another.

There is a good chance that this patch is badly interacting with the logic to obtain the next chunk before attempting a decoding again.

The previous implementation enabled error handling in decoding only for
the segment of a bytestring that an exception was raised for.

However, it may well be that more decoding errors exist in other parts
of the bytestring. I have a complicated real-world case where this
happens, i.e. raising `UnicodeDecodeError` again, even though
`decode_bytes` was called with `backslash_replace=True`.  Unfortunately
the data is so large that I did not manage to catch the condition
exactly.

It seems to be a needless sophistication to decode some part of the
bytestring with error handling, but not another.

There is a good chance that this patch is badly interacting with the
logic to obtain the next chunk before attempting a decoding again.
Copy link

codecov bot commented Jul 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.51%. Comparing base (f00cfdb) to head (0ac7974).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #742      +/-   ##
==========================================
+ Coverage   92.47%   92.51%   +0.03%     
==========================================
  Files         195      195              
  Lines       14301    14301              
  Branches     2162     2162              
==========================================
+ Hits        13225    13230       +5     
+ Misses        812      806       -6     
- Partials      264      265       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@christian-monch christian-monch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @mih for catching that. The fix needs a change IMHO.

Comment on lines 104 to 107
return (
position + exc.end,
joined_data[:position + exc.start].decode(encoding)
+ joined_data[position + exc.start:position + exc.end].decode(
encoding,
errors='backslashreplace'
),
joined_data[:position + exc.end].decode(
encoding, errors='backslashreplace')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does not work. The code duplicates input data, as in the following example:

>>> tuple(decode_bytes([b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3'])
('08 War \\xaf', '08 War \\xaf No \\xaf', '08 War \\xaf No \\xaf More \\xaf', ' Trouble.shn.mp3')

I think the return statement should be (a missing position index is added):

            return (
                position + exc.end,
                joined_data[position:position + exc.start].decode(encoding)
                + joined_data[position + exc.start:position + exc.end].decode(
                    encoding,
                    errors='backslashreplace'
                ),
            )

@christian-monch
Copy link
Contributor

We should also add a test for proper handling of multiple errors in a single input chunk. For example

def test_multiple_errors():
    r = ''.join(decode_bytes([b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3']))
    assert r == '08 War \\xaf No \\xaf More \\xaf Trouble.shn.mp3'

@mih
Copy link
Member Author

mih commented Jul 11, 2024

Thanks @christian-monch ! Can you make the changes directly, and push into this PR. You have a better understanding of the logic and a fear that more iterations are needed, if I do it. Thanks!

@mih
Copy link
Member Author

mih commented Jul 11, 2024

I just saw that this code has already been migrated to datasalad (unreleased yet). The fix needs to be ported over, before/when this code is removed here.

@mih
Copy link
Member Author

mih commented Jul 11, 2024

An alternative to fixing it here, is to fix it in datasalad. This code has already been migrated and is pending a release.

This commit fixes an issue in multiple error
handling where parts of the input strings were
repeated in the output of `decode_bytes`.

It also adds a regreesion test to enure that
multiple encoding errors in a single input
chunk are handled properly.
@mih mih merged commit b20e779 into datalad:main Jul 11, 2024
6 of 7 checks passed
@mih mih deleted the decodebytes branch July 11, 2024 10:26
Copy link
Member Author

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Also merged into datasalad!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants