join: add support for multibyte separators #6736

jtracey · 2024-09-25T06:16:30Z

GNU join previously did not support separators longer than 1 byte, and had tests ensuring such separators returned an error. As such, years ago, we removed our support for multibyte Unicode characters, added support for single non-Unicode bytes to match GNU's behavior, and got the GNU tests passing. Now, GNU join supports multibyte separators when they're valid characters in the current locale's encoding. We're not in a position to add proper locale support yet, but this adds the next best thing: by assuming UTF-8 encodings, it should get the new GNU tests to pass in CI. This is just straight up wrong on Windows, which always uses UTF-16, but that was always weird, and the "right" thing on Windows at this point would be to reject all possible separators other than null, which doesn't sound very useful.
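
As a rough illustration of the "assume UTF-8" rule described above, the sketch below classifies a separator argument as either a single arbitrary byte or a single multibyte UTF-8 character. The names `SepKind` and `classify_separator` are hypothetical and are not taken from the patch. The special cases join has for an empty separator and for `\0` are deliberately left out, so an empty argument is simply rejected here.

```rust
/// Hypothetical classification of a `-t` argument (illustrative names only).
#[derive(Debug, PartialEq)]
enum SepKind {
    /// Exactly one byte, which need not be valid UTF-8 on its own.
    Byte(u8),
    /// One UTF-8 encoded character occupying 2 to 4 bytes.
    Char(Vec<u8>),
}

/// Accept a single byte, or a single multibyte character assuming UTF-8.
/// Anything else (invalid UTF-8, several code points, empty input) is an error.
fn classify_separator(arg: &[u8]) -> Result<SepKind, String> {
    // A lone byte is always accepted, matching the older single-byte behavior.
    if let [b] = arg {
        return Ok(SepKind::Byte(*b));
    }
    let s = std::str::from_utf8(arg)
        .map_err(|_| "separator is not valid UTF-8".to_string())?;
    let mut chars = s.chars();
    match (chars.next(), chars.next()) {
        // Exactly one code point, encoded in more than one byte.
        (Some(_), None) => Ok(SepKind::Char(arg.to_vec())),
        _ => Err("separator must be a single character".to_string()),
    }
}
```

With this sketch, `classify_separator(b",")` yields a `Byte`, `classify_separator("§".as_bytes())` yields a `Char` holding the two encoded bytes, and `classify_separator(b"ab")` is rejected.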

The first commit fixes our tests to better reflect the new GNU behavior. The second commit gets multibyte separators working in a way closely resembling what we were already doing. But this bloats the separator enum we were using and adds a bunch of clones or (de)refs, which profiling showed caused a noticeable performance hit (though still significantly faster than GNU), all to support a few extra bytes that will likely be rarely used. The third commit therefore changes the enum into a trait, so we can use generics instead of matching, getting us most of that lost performance back (or possibly all; it's hard to get consistent measurements).
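
To make the enum-to-trait change concrete, here is a minimal sketch of the general technique, not the code in this PR. The trait `Separator` is named after the bound that appears in the diff, but its method, the two implementing types, and `split_fields` are assumptions made for illustration. The generic function is monomorphized per separator type, so there is no `match` on a separator enum in the inner loop.

```rust
/// Hypothetical trait: each separator type knows how to find its next occurrence.
trait Separator {
    /// Returns the start and end offsets of the next separator in `haystack`, if any.
    fn find(&self, haystack: &[u8]) -> Option<(usize, usize)>;
}

/// A separator that is exactly one byte.
struct OneByteSep(u8);

/// A separator made of several bytes; assumed to be non-empty.
struct MultiByteSep(Vec<u8>);

impl Separator for OneByteSep {
    fn find(&self, haystack: &[u8]) -> Option<(usize, usize)> {
        haystack.iter().position(|&b| b == self.0).map(|i| (i, i + 1))
    }
}

impl Separator for MultiByteSep {
    fn find(&self, haystack: &[u8]) -> Option<(usize, usize)> {
        // `windows` panics on a zero size, hence the non-empty assumption above.
        haystack
            .windows(self.0.len())
            .position(|w| w == self.0.as_slice())
            .map(|i| (i, i + self.0.len()))
    }
}

/// Generic over the separator type, so each instantiation is specialized
/// at compile time instead of branching on an enum variant at run time.
fn split_fields<'a, S: Separator>(line: &'a [u8], sep: &S) -> Vec<&'a [u8]> {
    let mut fields = Vec::new();
    let mut rest = line;
    while let Some((start, end)) = sep.find(rest) {
        fields.push(&rest[..start]);
        rest = &rest[end..];
    }
    fields.push(rest);
    fields
}
```

For example, `split_fields(b"a,b,,c", &OneByteSep(b','))` produces the fields `a`, `b`, an empty field, and `c`. The same function works unchanged with `MultiByteSep("§".as_bytes().to_vec())`.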

sylvestre · 2024-09-25T06:29:03Z

clippy is complaining on:

 error: this argument is passed by value, but not consumed in the function body
   --> src/uu/join/src/join.rs:366:56
    |
366 |     fn new<Sep: Separator>(string: Vec<u8>, separator: Sep, len_guess: usize) -> Self {
    |                                                        ^^^ help: consider taking a reference instead: `&Sep`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_pass_by_value

github-actions · 2024-09-25T06:43:04Z

GNU testsuite comparison:

GNU test failed: tests/join/join. tests/join/join is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!

sylvestre · 2024-09-25T14:13:46Z

> Congrats! The gnu test tests/join/join-utf8 is no longer failing!

well done :)

github-actions · 2024-09-26T03:30:10Z

GNU testsuite comparison:

GNU test failed: tests/join/join. tests/join/join is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!
Congrats! The gnu test tests/timeout/timeout is no longer failing!

github-actions · 2024-09-26T04:38:33Z

GNU testsuite comparison:

GNU test failed: tests/join/join. tests/join/join is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!

github-actions · 2024-09-27T20:16:29Z

GNU testsuite comparison:

Congrats! The gnu test tests/join/join-utf8 is no longer failing!

jtracey · 2024-09-27T22:04:50Z

Sorry for all the force pushes, made a series of dumb mistakes by trying to fix things too quickly. The new additional commit adds a test that should have been there already and would have saved me some time (join handles whitespace separators differently than every other kind of separator).
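
The difference mentioned above can be sketched as follows. This is an illustration of the general behavior rather than the code in the patch: with the default whitespace separator, runs of blanks are merged and leading blanks are ignored, while an explicit separator byte delimits fields one-to-one, so empty fields are preserved. The function names are hypothetical, and treating only space and tab as blanks is an assumption.

```rust
/// Default mode: fields are separated by runs of blanks, and leading or
/// trailing blanks do not create empty fields.
/// Assumption: only space and tab count as blanks here.
fn split_on_blanks(line: &[u8]) -> Vec<&[u8]> {
    line.split(|b| *b == b' ' || *b == b'\t')
        .filter(|field| !field.is_empty())
        .collect()
}

/// Explicit separator: every occurrence ends a field, so adjacent
/// separators yield empty fields.
fn split_on_byte(line: &[u8], sep: u8) -> Vec<&[u8]> {
    line.split(|b| *b == sep).collect()
}
```

Under these assumptions, `split_on_blanks(b"  a  b\tc ")` gives three fields (`a`, `b`, `c`), whereas `split_on_byte(b",a,,b", b',')` gives four, two of them empty.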

github-actions · 2024-09-27T22:20:40Z

GNU testsuite comparison:

Congrats! The gnu test tests/join/join-utf8 is no longer failing!

sylvestre · 2024-09-28T07:05:29Z

src/uu/join/src/join.rs

+    Byte(u8),
+    Char(Vec<u8>),


please document these two, the diff isn't obvious :)

jtracey · 2024-09-28

Sure, done.

jtracey · 2024-09-28T18:22:24Z

It occurs to me that even though GNU join doesn't allow multibyte separators with invalid or multiple code points, I don't actually know why they have this restriction. When it was only single byte separators allowed, that made perfect sense, as it simplifies the algorithms and makes them faster. But once you allow any code point, I don't know why you would bother restricting it. I suppose there are some additional performance hacks you could do by assuming UTF-8 and knowing the separator is at most 4 bytes, but GNU supports other encodings, and we're not doing any such hacks. Should we just remove those checks, and allow using any arbitrary byte string as the separator?
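
For illustration only, one form the "at most 4 bytes" hack mentioned above could take is storing the separator inline in a fixed-size array instead of a heap-allocated `Vec<u8>`. The type `SmallSep` is hypothetical and, as the comment says, nothing like it is done in this PR. It relies only on the fact that a UTF-8 encoded `char` never exceeds 4 bytes.

```rust
/// Hypothetical inline storage for a single UTF-8 encoded separator.
#[derive(Clone, Copy)]
struct SmallSep {
    bytes: [u8; 4],
    len: u8,
}

impl SmallSep {
    /// Encode `c` as UTF-8 into the inline buffer (1 to 4 bytes).
    fn from_char(c: char) -> Self {
        let mut bytes = [0u8; 4];
        let len = c.encode_utf8(&mut bytes).len() as u8;
        Self { bytes, len }
    }

    /// The encoded separator, without any trailing padding bytes.
    fn as_bytes(&self) -> &[u8] {
        &self.bytes[..self.len as usize]
    }
}
```

Allowing arbitrary byte strings as separators, as proposed above, would rule out this kind of fixed-size representation, since the length would no longer be bounded.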

github-actions · 2024-09-28T18:43:21Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/join/join-utf8 is no longer failing!

sylvestre · 2024-10-06T09:39:11Z

> I don't actually know why they have this restriction

maybe ask on their mailing list?

> Should we just remove those checks, and allow using any arbitrary byte string as the separator?

Sure but in a different PR
and documented here:
https://github.com/uutils/coreutils/blob/main/docs/src/extensions.md

jtracey added 2 commits September 25, 2024 01:50

join: add test for multibyte separators

395c441

join: implement support for multibyte separators

2e96f64

jtracey force-pushed the join-multibyte branch 2 times, most recently from 69d71f7 to 849479c Compare September 26, 2024 03:03

jtracey force-pushed the join-multibyte branch 2 times, most recently from e15103c to 54b2b0c Compare September 26, 2024 04:12

jtracey marked this pull request as draft September 26, 2024 04:58

sylvestre reviewed Sep 26, 2024

jtracey force-pushed the join-multibyte branch from 54b2b0c to 796cbc1 Compare September 27, 2024 19:49

jtracey force-pushed the join-multibyte branch 6 times, most recently from 3ddebb3 to cfd32e2 Compare September 27, 2024 21:54

jtracey marked this pull request as ready for review September 27, 2024 22:04

sylvestre reviewed Sep 28, 2024

jtracey added 2 commits September 28, 2024 14:15

join: use a trait instead of an enum for separator

c6cff91

join: test whitespace merging

879e1ea

jtracey force-pushed the join-multibyte branch from cfd32e2 to 879e1ea Compare September 28, 2024 18:16

sylvestre merged commit a51a731 into uutils:main Oct 6, 2024
67 of 68 checks passed

jtracey deleted the join-multibyte branch October 10, 2024 18:37

BrewTestBot mentioned this pull request Nov 16, 2024

uutils-coreutils 0.0.28 Homebrew/homebrew-core#197947

Merged
