
Implement a fast path for character counting in wc. #3735

Merged (2 commits) on Jul 22, 2022


resistor
Contributor

When `wc` is invoked with only the `-m` flag, we only need to count the
number of Unicode characters in the input. To do so, we don't
actually need to decode the input bytes into characters. Instead, we can
simply count the number of non-continuation bytes in the UTF-8 stream,
since every character in valid UTF-8 contains exactly one non-continuation byte.

On my laptop, this speeds up `wc -m odyssey1024.txt` from 745ms to
109ms.
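
For readers following along, here is a minimal sketch of the counting idea described above. It is an illustration, not the code from this PR: the function name `count_chars_fast` is hypothetical, and the sketch operates on an in-memory slice rather than a buffered reader. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`.

```rust
/// Counts characters in a UTF-8 byte stream without decoding it.
/// Hypothetical helper for illustration; not the function added in this PR.
fn count_chars_fast(bytes: &[u8]) -> usize {
    // A continuation byte has its top two bits set to `10`.
    // Every other byte starts a new character, so we count those.
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    let text = "héllo, wörld";
    // For valid UTF-8, the fast count agrees with full decoding.
    assert_eq!(count_chars_fast(text.as_bytes()), text.chars().count());
    println!("{}", count_chars_fast(text.as_bytes()));
}
```

Because the check is a single mask and compare per byte, it avoids the branching that full decoding needs.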

@tertsdiepraam
Member

Wow, that's excellent! Are there any edge cases here we need to think about with invalid UTF-8? For example, could there be an off-by-one error if the last byte is a continuation byte? Looking at the other specialized function, it looks like we don't handle that case at all, but it might still be interesting to document and to check what GNU does.

@resistor
Contributor Author

> Wow, that's excellent! Are there any edge cases here we need to think about with invalid UTF-8? For example, could there be an off-by-one error if the last byte is a continuation byte? Looking at the other specialized function, it looks like we don't handle that case at all, but it might still be interesting to document and to check what GNU does.

The only corner case is when the input stream ends with continuation bytes, which means the stream is invalid UTF-8 to begin with.

I tested GNU coreutils `wc` 8.32. For a file containing three `0x80` bytes (all continuation bytes), `wc -m` prints 0. This behavior does differ from the system `wc` on my Mac, which reports an error and then prints 3. I assume we only aim to replicate the GNU coreutils behavior?
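
As a rough check of that corner case, the sketch below applies the same non-continuation-byte rule to a buffer of three `0x80` bytes. The helper name `count_non_continuation` is hypothetical and this is not the PR's implementation; it only shows that the counting rule itself yields 0 for that input, in line with the GNU result reported above. The second input (a valid character followed by a stray continuation byte) is included purely for illustration and was not compared against GNU.

```rust
/// Hypothetical helper: counts bytes that are not UTF-8 continuation bytes.
fn count_non_continuation(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    // Three continuation bytes and nothing else: invalid UTF-8.
    let only_continuations = [0x80u8, 0x80, 0x80];
    assert_eq!(count_non_continuation(&only_continuations), 0);

    // One ASCII character followed by a stray continuation byte.
    let trailing = [b'a', 0x80];
    assert_eq!(count_non_continuation(&trailing), 1);
}
```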

@tertsdiepraam
Member

Interesting! Thanks for checking! So the Mac `wc` then falls back to counting bytes? It sounds like GNU has the "correct" behaviour here, which is the same as what you implemented, right? We do indeed follow GNU by default.

@resistor
Contributor Author

Correct, the Mac `wc` seems to fall back to byte counting. What is implemented here matches the GNU behavior.

@sylvestre sylvestre merged commit ec9130a into uutils:main Jul 22, 2022