decode_bytes() improper error handling #35

Closed
mih opened this issue Jul 10, 2024 · 5 comments · Fixed by #37
Comments

@mih
Member

mih commented Jul 10, 2024

I have a line from git ls-files that comes out of itemize() like so:

b'120000 39d6af4c8d20bfdd8effecf5babaaacc971873e6 0\tBob Marley & The Wailers/Live 9-23-80/08 War \xaf No More Trouble.shn.mp3'

decode_bytes() trips on the \xaf and yields

'120000 39d6af4c8d20bfdd8effecf5babaaacc971873e6 0\tBob Marley & The Wailers/Live 9-23-80/08 War \\xaf'

i.e. it swallows the end of the file name.

A test like the following is sufficient to reproduce the problem

diff --git a/datalad_next/itertools/tests/test_decode_bytes.py b/datalad_next/itertools/tests/test_decode_bytes.py
index a463cc4..ee5517f 100644
--- a/datalad_next/itertools/tests/test_decode_bytes.py
+++ b/datalad_next/itertools/tests/test_decode_bytes.py
@@ -35,3 +35,9 @@ def test_no_empty_strings():
     # check that empty strings are not yielded
     r = tuple(decode_bytes([b'\xc3', b'\xb6']))
     assert r == ('ö',)
+
+
+def test_marley():
+    beits = b'08 War \xaf No More Trouble.shn.mp3'
+    r = tuple(decode_bytes([beits]))
+    assert r[0] == '08 War \xaf No More Trouble.shn.mp3'

Giving:

>       assert r[0] == '08 War \xaf No More Trouble.shn.mp3'
E       AssertionError: assert '08 War \\xaf' == '08 War ¯ No ...ouble.shn.mp3'
E         
E         - 08 War ¯ No More Trouble.shn.mp3
E         + 08 War \xaf

(the test target is probably not the real thing, but it shows the problematic behavior)
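
A minimal standard-library sketch of why that test target differs from the raw input. It assumes UTF-8 decoding with `backslashreplace` error handling, which is consistent with the `\\xaf` in the output above; the variable names are illustrative only.

```python
raw = b'08 War \xaf No More Trouble.shn.mp3'

# In a str literal, '\xaf' is the code point U+00AF (MACRON), not a raw byte.
target = '08 War \xaf No More Trouble.shn.mp3'

# Encoded as UTF-8, U+00AF becomes the two bytes b'\xc2\xaf', so the
# test target does not correspond to the raw bytes read from git.
target_utf8 = target.encode('utf-8')

# The lone byte 0xaf is not valid UTF-8. With backslash replacement the
# whole name decodes to a single string containing the escaped byte.
whole = raw.decode('utf-8', errors='backslashreplace')

# Only a single-byte codec such as latin-1 maps the raw bytes to the target.
as_latin1 = raw.decode('latin-1')

print(repr(whole))
```

Under these assumptions, a full-length decoded name would carry the literal text `\xaf` rather than the `¯` character.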

@christian-monch
Contributor

christian-monch commented Jul 10, 2024

The decode_bytes iterator yields the result of decode_bytes([b'08 War \xaf No More Trouble.shn.mp3']) as two strings:

>>> tuple(decode_bytes([b'08 War \xaf No More Trouble.shn.mp3']))
('08 War \\xaf', ' No More Trouble.shn.mp3')

This behavior allows for relatively simple error handling in the presence of multiple decoding errors in an input chunk.

Generally, there is no guarantee that decode_bytes yields the same number of chunks as its input iterable. It might split input byte strings at error locations, and it might join input byte strings if a multi-byte character is spread over multiple chunks.

If one wants to receive individual decoded lines, one can decode first and itemize the decoded stream. The following examples illustrate the difference:

>>> tuple(decode_bytes(itemize([b'08 War \xaf No More Trouble.shn.mp3'], None)))
('08 War \\xaf', ' No More Trouble.shn.mp3')
>>> tuple(itemize(decode_bytes([b'08 War \xaf No More Trouble.shn.mp3']), None))
('08 War \\xaf No More Trouble.shn.mp3',)
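
The effect of the ordering can be reproduced without the library. The following is a simplified sketch, not the actual implementation: `decode_sketch` and `itemize_sketch` are hypothetical stand-ins that only mimic the behavior described above (split at error locations, join multi-byte sequences across chunks, split items at a separator).

```python
def decode_sketch(chunks, encoding='utf-8'):
    """Hypothetical stand-in: decode byte chunks, splitting at errors."""
    buf = b''
    for chunk in chunks:
        buf += chunk
        while buf:
            try:
                text = buf.decode(encoding)
            except UnicodeDecodeError as exc:
                if exc.end == len(buf):
                    # The error reaches the end of the buffer. It may be an
                    # incomplete multi-byte sequence, so wait for more data.
                    break
                # Yield everything up to and including the escaped bad
                # bytes, then continue with the remainder of the buffer.
                good = buf[:exc.start].decode(encoding)
                bad = buf[exc.start:exc.end].decode(
                    encoding, errors='backslashreplace')
                buf = buf[exc.end:]
                yield good + bad
            else:
                buf = b''
                yield text
    if buf:
        # Input ended inside an undecodable sequence: escape what is left.
        yield buf.decode(encoding, errors='backslashreplace')


def itemize_sketch(chunks, sep):
    """Hypothetical stand-in: re-chunk str or bytes input at `sep`."""
    pending = sep[:0]  # empty str or bytes, matching the type of `sep`
    for chunk in chunks:
        pending += chunk
        *complete, pending = pending.split(sep)
        yield from complete
    if pending:
        yield pending


data = [b'08 War \xaf No More', b' Trouble.shn.mp3\nnext\n']

# Itemize the bytes first, then decode: the first line is split in two.
split_items = tuple(decode_sketch(itemize_sketch(data, b'\n')))

# Decode first, then itemize the text: one item per line.
whole_lines = tuple(itemize_sketch(decode_sketch(data), '\n'))

# A multi-byte character spread over two chunks is joined into one string.
joined = tuple(decode_sketch([b'\xc3', b'\xb6']))

print(split_items)
print(whole_lines)
print(joined)
```

In this sketch the first ordering yields three strings for two lines, while the second yields exactly one string per line, which matches the difference shown in the REPL examples above.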

WDYT about extending the documentation to make users aware of this effect?

mih referenced this issue in mih/datalad-next Jul 11, 2024
`decode_bytes()` can yield multiple output chunks for one input chunk
(in the case of decoding errors). This implies that it cannot be
meaningfully used after `itemize()` without threatening the semantics
of the items (i.e., with split-by-line items are no longer unique
lines).

For this reason, this commit changes the order of these helpers in
the `gitworktree()` implementation.

Refs: https://github.com/datalad/datalad-next/issues/740
@mih
Member Author

mih commented Jul 11, 2024

Thanks! This makes sense and it should be documented, I agree.

I addressed the problem for iter_gitworktree in datalad/datalad-next#741. But pretty much all other iterators are affected too.

This also revealed another issue datalad/datalad-next#742

@mih
Member Author

mih commented Jul 11, 2024

I am moving this issue to datasalad. The respective code (still affected by datalad/datalad-next#742) has already been migrated there.

mih transferred this issue from datalad/datalad-next Jul 11, 2024
@mih
Member Author

mih commented Jul 11, 2024

Fixed by #36

mih closed this as completed Jul 11, 2024
@mih
Member Author

mih commented Jul 11, 2024

Reopening, because the docs are not yet adjusted
