Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add impl UnicodeSegmentation for [u8] #46

Open
BurntSushi opened this issue Oct 7, 2018 · 7 comments
Open

add impl UnicodeSegmentation for [u8] #46

BurntSushi opened this issue Oct 7, 2018 · 7 comments

Comments

@BurntSushi
Copy link

Would the maintainers be willing to accept an implementation of the trait for [u8]? Specifically, for possibly invalid UTF-8. We could specify that if invalid UTF-8 bytes are found, then we lossily emit a Unicode replacement codepoint.

cc @SimonSapin

@SimonSapin
Copy link
Contributor

In terms of implementation, this doesn’t look easy as the code seems to make pervasive use of &str.

In terms of API, I mildly oppose to silently doing replacement without users having to type the name of some items that contains the word lossy or similar. For example, this could mean implementing the trait for struct AsUtf8Lossy<'a>(&'a [u8]);.

But then we have trait methods that return itertors of &'a str or (usize, &'a str). The replacement characters "\u{FFFD}" is three bytes long, but it would replace byte sequences of variable length. This might require allocating a new string, whose ownership is unclear to me right now. But if we’re allocating, the caller might as well use String::from_utf8_lossy themselves and then the str implementation.

So instead, maybe the iterator types could be made generic to support both str and [u8]? They wouldn’t actually replace invalid byte sequence, but maybe only behave as if those sequences were U+FFFD. This alleviate my initial API concern, but makes implementation that much more involved.


By the way, in case you’re also tempted to do incremental processing where not all of the input bytes are in memory at the same time (or simply not in contiguous memory), note that segmentation does not "distribute" over concatenation: segment(a + b) does not always equal segment(a) + segment(b). Not only at the byte level since a single code point’s bytes could be split up in multiple input chunks, but even if you keep code points together a single grapheme cluster can be made of multiple code points from multiple chunks, for example.

@BurntSushi
Copy link
Author

BurntSushi commented Oct 8, 2018

In terms of implementation, this doesn’t look easy as the code seems to make pervasive use of &str.

Yeah. It's been a while since I looked at the implementation in this crate, but I suspected as much. I think that's definitely part of my question here. Namely, if I were to implement this, and it resulted in a fairly pervasive change, is that something the maintainers would be willing to accept?

In terms of API, I mildly oppose to silently doing replacement without users having to type the name of some items that contains the word lossy or similar. For example, this could mean implementing the trait for struct AsUtf8Lossy<'a>(&'a [u8]);.

Hmm. OK. I don't think I strongly disagree with this.

But then we have trait methods that return itertors of &'a str or (usize, &'a str). The replacement characters "\u{FFFD}" is three bytes long, but it would replace byte sequences of variable length. This might require allocating a new string, whose ownership is unclear to me right now. But if we’re allocating, the caller might as well use String::from_utf8_lossy themselves and then the str implementation.

Interesting. Hmm. OK. I think I'm still playing catch up here, and I think there are parts that I don't quite grok yet. In particular, I have not yet deeply familiarized myself with the segmentation algorithm, so I might be missing some critical context here.

What I had in mind shouldn't require any allocations. In particular, if an invalid UTF-8 sequence is seen, then that sequence is skipped and "\u{FFFD}" is yielded for that iteration. I think I'm missing where you think the additional allocation comes in?

So instead, maybe the iterator types could be made generic to support both str and [u8]? They wouldn’t actually replace invalid byte sequence, but maybe only behave as if those sequences were U+FFFD. This alleviate my initial API concern, but makes implementation that much more involved.

This is interesting! So in this sense, the API for [u8] wouldn't be lossy at all, and iterations with invalid sequences would yield the invalid bytes themselves. At that point, it would be the caller's responsibility to do a second UTF-8 validation step if they wanted an &str.

I'm not sure which API is better. It's conceivable both could be desirable.

By the way, in case you’re also tempted to do incremental processing where not all of the input bytes are in memory at the same time (or simply not in contiguous memory), note that segmentation does not "distribute" over concatenation: segment(a + b) does not always equal segment(a) + segment(b). Not only at the byte level since a single code point’s bytes could be split up in multiple input chunks, but even if you keep code points together a single grapheme cluster can be made of multiple code points from multiple chunks, for example.

Right, yeah, that makes sense. I imagine if I were doing this, then I'd need a buffer until I knew I had a single complete grapheme.


To pop up a level to give you an idea of what I want to do: basically, in my work, I frequently deal with &[u8]. Ideally, that &[u8] is valid UTF-8, but of course, it may not be in some cases. The common scenario is to treat &[u8] as if it were valid UTF-8. This doesn't work with &str, obviously, so I need to stay in &[u8] land. I've been kind of skirting around the issue and making the regex engine deal with this for the most part, but difficulties pop up now and then. For example, ripgrep wants to grow a feature that limits the length of a line that it prints. Right now, that's implemented by simply comparing the total number of bytes, and if it exceeds some limit, replacing the entire line with a message that it was stripped. Ideally, we could show the first N characters in that line first. The best way I know to implement that is to decode the first N grapheme clusters and print those.

However, it's not feasible to do that, AFAIK, with the current API of this crate. I could do a lossy conversion of the entire line, but the line could be quite large (potentially arbitrarily large), so it would be much nicer to just decode what I need and be done with it.

More generally, I've been toying with the idea of creating a new string type that is "conventionally UTF-8, but not required to be," similar to the semantics of string in Go. I don't know yet whether it's a niche type or not, but honestly, it is very frequently the actual thing I want in my work. In order to make a type like that work with Unicode, we basically need lossy (or similar) APIs for everything that work on &[u8] or otherwise possibly valid UTF-8. So this issue is me trying to feel out how amenable our Unicode ecosystem is to this or whether I should just go off and prototype this on my own first.

I'd very much welcome thoughts you have on my overall plan here. :-) Apologies if it's a little side tracked from the specific issue here, but I figured giving some context on what I'm trying to do would help.

@mbrubeck
Copy link
Contributor

mbrubeck commented Oct 8, 2018

Speaking as a sometime-contributor to this crate, I wouldn't object to the proposed implementation/API changes.

But I'd also note that this crate is not very actively maintained at all, and the fork in https://crates.io/crates/unic-segment might be a better place for new development. It hasn't been very actively developed either, but the "rust-unic" project as a whole has been more active recently than the "unicode-rs" project.

@BurntSushi
Copy link
Author

@mbrubeck Yeah I noticed that. It's tough keeping up with maintenance on these types of crates; there's a high context switch overhead for diving into it (in my experience, anyway).

Thanks for pointing out unic-segment. I hadn't seen that.

If I do decide to move forward with this, I might just go ahead and prototype this in my own space, with the intent of hopefully merging efforts at some point in the future.

@aochagavia
Copy link

aochagavia commented Jan 3, 2023

Just out of interest, did @BurntSushi or anyone else end up prototyping / implementing this somewhere? I have come across a use case where this might be useful.

Update: it looks like @BurntSushi went on and created bstr, which does segmentation too 😄

@BurntSushi
Copy link
Author

BurntSushi commented Jan 3, 2023

@aochagavia Yeah, indeed. I also took a different implementation technique than unicode-segmentation and used regexes to generate offline DFAs that get compiled into the library. For example: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97eb16145b3ad6ac22aacd3/scripts/regex/grapheme.sh

I generally think my technique was mostly a failure. It works, but not as well as I'd hoped. The actual binary size is still big and I'm not sure it's meaningfully faster. ICU4X is probably a good project to look into if you want the best, and I believe it supports [u8].

As for APIs, I just exposed, e.g., graphemes and graphemes_indices on [u8] directly. As I mentioned above, the Unicode replacement codepoint is automatically substituted when invalid UTF-8 is seen, but the grapheme_indices API provides offsets into the original slice that permit accessing the original invalid bytes when necessary.

@Manishearth
Copy link
Member

Also worth noting that icu_segmenter supports a bunch of different encodings (including invalid ones), and supports rule-based and dictionary-based segmentation: https://docs.rs/icu_segmenter

otoh it is probably not as fast as hardcoded segmenters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants