Implement a fast path for character counting in wc. #3735

resistor · 2022-07-21T05:36:37Z

When wc is invoked with only the -m flag, we only need to count the
number of Unicode characters in the input. In order to do so, we don't
actually need to decode the input bytes into characters. Rather, we can
simply count the number of non-continuation bytes in the UTF-8 stream,
since every character will contain exactly one non-continuation byte.

On my laptop, this speeds up wc -m odyssey1024.txt from 745ms to
109ms.
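The counting idea described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `count_non_continuation_bytes` is hypothetical, and it works on an in-memory slice rather than a stream. It relies only on the fact that UTF-8 continuation bytes have the bit pattern `10xxxxxx`.

```rust
/// Hypothetical helper (not the PR's actual function): counts UTF-8
/// characters without decoding, by counting every byte that is NOT a
/// continuation byte. Continuation bytes match the pattern 10xxxxxx,
/// i.e. (byte & 0xC0) == 0x80.
fn count_non_continuation_bytes(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

For valid UTF-8 the result should agree with `str::chars().count()`, since each character contributes exactly one leading (non-continuation) byte. Because each byte is classified independently, no state has to be carried across buffer boundaries when the input is read in chunks.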

src/uu/wc/src/count_fast.rs

tertsdiepraam · 2022-07-21T07:45:05Z

Wow, that's excellent! Are there any edge cases here we need to think about with invalid UTF-8? For example, could there be an off-by-one error if the last byte is a continuation byte? Looking at the other specialized function, it looks like we don't handle that case at all, but it might still be interesting to document and to check what GNU does.

resistor · 2022-07-21T21:12:32Z

> Wow, that's excellent! Are there any edge cases here we need to think about with invalid UTF-8? For example, could there be an off-by-one error if the last byte is a continuation byte? Looking at the other specialized function, it looks like we don't handle that case at all, but it might still be interesting to document and to check what GNU does.

The only corner case is when the input stream ends with continuation bytes, which means the stream is invalid UTF-8 to begin with.

I tested GNU coreutils wc 8.32. For a file containing three 0x80 bytes (all continuation bytes), wc -m prints 0. This behavior does differ from the system wc on my Mac, which reports an error and then prints 3. I assume we only aim to replicate the GNU coreutils behavior?

tertsdiepraam · 2022-07-21T22:07:53Z

Interesting! Thanks for checking! So the Mac wc then falls back to counting bytes? It sounds like GNU has the "correct" behaviour here, which is the same as what you implemented, right? We do indeed follow GNU by default.

resistor · 2022-07-21T23:14:35Z

Correct, the Mac wc seems to fall back to byte counting. What is implemented here matches the GNU behavior.
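To make the invalid-input case from this thread concrete, here is a small sketch under the same assumption (count every byte that is not of the form `10xxxxxx`). The names `is_continuation` and `count_chars_lenient` are hypothetical and are not taken from the PR.

```rust
/// Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx).
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Hypothetical helper: counts non-continuation bytes and performs no
/// validation, so invalid UTF-8 never produces an error.
fn count_chars_lenient(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| !is_continuation(b)).count()
}
```

With this sketch, `count_chars_lenient(&[0x80, 0x80, 0x80])` returns 0, the same count reported above for GNU wc 8.32 on a file of three 0x80 bytes. Stray continuation bytes at the end of otherwise valid input are simply not counted, so they cannot cause an off-by-one error.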

sylvestre reviewed Jul 21, 2022

src/uu/wc/src/count_fast.rs

Add rustdoc comment.

417ad0e

sylvestre merged commit ec9130a into uutils:main Jul 22, 2022
