join: add support for multibyte separators #6736
clippy is complaining on:
GNU testsuite comparison:
well done :)
Force-pushed from 69d71f7 to 849479c
Force-pushed from e15103c to 54b2b0c
Force-pushed from 54b2b0c to 796cbc1
Force-pushed from 3ddebb3 to cfd32e2
Sorry for all the force pushes; I made a series of dumb mistakes by trying to fix things too quickly. The new additional commit adds a test that should have been there already and would have saved me some time (join handles whitespace separators differently than every other kind of separator).
src/uu/join/src/join.rs
Outdated
    Byte(u8),
    Char(Vec<u8>),
please document these two, the diff isn't obvious :)
Sure, done.
Force-pushed from cfd32e2 to 879e1ea
It occurs to me that even though GNU join doesn't allow multibyte separators with invalid or multiple code points, I don't actually know why they have this restriction. When it was only single byte separators allowed, that made perfect sense, as it simplifies the algorithms and makes them faster. But once you allow any code point, I don't know why you would bother restricting it. I suppose there are some additional performance hacks you could do by assuming UTF-8 and knowing the separator is at most 4 bytes, but GNU supports other encodings, and we're not doing any such hacks. Should we just remove those checks, and allow using any arbitrary byte string as the separator?
maybe ask on their mailing list?
Sure but in a different PR
GNU join used to reject any separator over 1 byte in length, and had tests ensuring such separators would return an error. As such, years ago, we removed our support for multibyte Unicode characters, added support for single non-Unicode bytes to match their behavior, and got the GNU tests passing. Now, GNU join supports multibyte separators when they're valid characters in the current locale's encoding. We're not in a position to add proper locale support yet, but this adds the next best thing: by assuming UTF-8 encoding, it should get the new GNU tests to pass in CI. This is just straight up wrong on Windows, which always uses UTF-16, but that was always weird, and the "right" thing on Windows at this point would be to reject all possible separators other than null, which doesn't sound very useful.
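As a rough illustration of the rule described above (any single byte is accepted; anything longer must be exactly one valid UTF-8 code point), here is a minimal sketch. The names `Sep` and `parse_separator` and the error strings are hypothetical, not taken from this PR, and rejecting an empty separator is an assumption of the sketch rather than a statement about join's actual behavior.

```rust
/// Hypothetical stand-in for the separator enum discussed in the review.
#[derive(Debug, PartialEq)]
enum Sep {
    /// Exactly one byte, which need not be valid UTF-8.
    Byte(u8),
    /// The UTF-8 encoding of exactly one multibyte code point.
    Char(Vec<u8>),
}

/// Classify a raw separator argument, assuming a UTF-8 locale.
fn parse_separator(raw: &[u8]) -> Result<Sep, String> {
    match raw {
        // Assumption of this sketch: empty input is simply rejected.
        [] => Err("empty separator (not handled in this sketch)".to_string()),
        // Any single byte is accepted, valid UTF-8 or not.
        [b] => Ok(Sep::Byte(*b)),
        _ => {
            // Longer separators must decode as UTF-8...
            let s = std::str::from_utf8(raw)
                .map_err(|_| "multibyte separator is not valid UTF-8".to_string())?;
            // ...and contain exactly one code point.
            let mut chars = s.chars();
            match (chars.next(), chars.next()) {
                (Some(_), None) => Ok(Sep::Char(raw.to_vec())),
                _ => Err("separator has more than one code point".to_string()),
            }
        }
    }
}
```

Under these assumptions, `parse_separator(b",")` yields `Sep::Byte`, the bytes of `é` yield `Sep::Char`, and `b"ab"` is rejected for containing two code points.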
The first commit fixes our tests to better reflect the new GNU behavior. The second commit gets multibyte separators working in a way closely resembling what we were already doing. But this bloats the separator enum we were using, and adds a bunch of clones or (de)refs, which profiling showed added a notable hit to performance (though still significantly faster than GNU), all to support a few extra bytes that will likely be rarely used. The third commit therefore changes the enum into a trait, so we can use generics instead of matching, getting us most of that lost performance back (or possibly all; it's hard to get consistent measurements).
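To make the enum-to-trait idea concrete, here is a minimal sketch of the general technique, not the code in this PR. The trait `Separator`, the types `OneByte` and `MultiByte`, and the function `split_fields` are all hypothetical names. The sketch also ignores the whitespace-separator case, which join handles differently.

```rust
/// Hypothetical trait: one implementation per separator kind.
trait Separator {
    /// Returns (start, length) of the first separator occurrence in `line`.
    fn find(&self, line: &[u8]) -> Option<(usize, usize)>;
}

/// A single-byte separator.
struct OneByte(u8);

/// A multibyte separator, stored as its raw bytes.
struct MultiByte(Vec<u8>);

impl Separator for OneByte {
    fn find(&self, line: &[u8]) -> Option<(usize, usize)> {
        line.iter().position(|&b| b == self.0).map(|i| (i, 1))
    }
}

impl Separator for MultiByte {
    fn find(&self, line: &[u8]) -> Option<(usize, usize)> {
        // Guard: `windows(0)` would panic, so an empty pattern never matches.
        if self.0.is_empty() {
            return None;
        }
        line.windows(self.0.len())
            .position(|w| w == self.0.as_slice())
            .map(|i| (i, self.0.len()))
    }
}

/// Generic over the separator type: the compiler emits one specialized copy
/// per implementation, so there is no `match` on a separator enum per field.
fn split_fields<'a, S: Separator>(sep: &S, mut line: &'a [u8]) -> Vec<&'a [u8]> {
    let mut fields = Vec::new();
    while let Some((start, len)) = sep.find(line) {
        fields.push(&line[..start]);
        line = &line[start + len..];
    }
    fields.push(line);
    fields
}
```

With this shape, the choice between `OneByte` and `MultiByte` is made once, when the separator is parsed, and the hot loop is compiled separately for each type. Whether that recovers the performance reported above depends on the real code paths, which this sketch does not reproduce.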