implementation for io::Read/io::Write #8
In the abstract, yes, something like this looks like it could make sense for this crate. My main worry is that I'd like to avoid shipping the wrong thing. There is also the matter of losing the type-system-level UTF-8 validity indication.
Unclear. It could well be. However, the currently foreseeable C++ and Rust use cases in Gecko don't seem to need it. (I expect the first Rust users in Gecko to use encoding_rs in the non-streaming mode in order to benefit from borrowing when possible. The main streaming users will be the HTML and XML parsers, which will remain C++ for the time being.) At this point, I'd like to gain a better understanding of how the not-quite-rightness of `io::Read` for this purpose would show up in practice.
There are two axes to this: decoding vs. encoding (which side has bytes and which side has Unicode), and “pushing” data into a {de,en}coder vs. “pulling” data from it. That’s four traits to represent the different kinds of streams. So to be complete, I think four types would be needed. They could live in this crate or in a companion crate.
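Concretely, something like this (a rough sketch; all of these names are placeholders, not existing encoding_rs API):

```rust
use std::io;

use encoding_rs::{Decoder, Encoder};

// Pull Unicode out of a byte source: wraps a byte-oriented `io::Read`
// and yields UTF-8 on the Unicode side.
pub struct DecodingReader<R: io::Read> {
    inner: R,
    decoder: Decoder,
}

// Push bytes into a Unicode sink: accepts encoded bytes, decodes them,
// and forwards UTF-8 to the wrapped writer.
pub struct DecodingWriter<W: io::Write> {
    inner: W,
    decoder: Decoder,
}

// Pull bytes out of a Unicode source: wraps a reader that yields UTF-8
// and exposes the legacy-encoded bytes.
pub struct EncodingReader<R: io::Read> {
    inner: R,
    encoder: Encoder,
}

// Push Unicode into a byte sink: accepts UTF-8 and writes encoded bytes
// to the wrapped writer.
pub struct EncodingWriter<W: io::Write> {
    inner: W,
    encoder: Encoder,
}
```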
@BurntSushi Now, while all of these stream wrappers are possible, I don’t know if they’re all equally useful.
@SimonSapin I'm not sure if I was clear about this, but the internals of my implementation already work roughly along those lines. It sounds like there's enough design space here that I could drown pretty easily if I tried to upstream this right away. I will take a crack at implementing it on my own first. N.B. For other work that I've enjoyed on this topic, Go's `transform` package is worth a look.
@BurntSushi Right, I understood what your implementation does. Even with UTF-8 as the conventional Unicode-side representation, there are still four combinations of transformers, right? (read/write × decode/encode)
Ah, gotcha!
I guess implicitly, sure. But you build them yourself, and everything still satisfies `io.Reader`/`io.Writer`.
(Ah, I see. Since the type is the same on both “sides” (I/O vs. Unicode) in Go, encoders and decoders can implement the same `Transformer` API.)
I’ve been thinking about the wrapper types I proposed above, so I wrote them down: #9.
Below is my current thinking on this. I still don't know if this is the right thing; my main problem is that I've been writing so little I/O code in Rust that I don't have solid experience to draw expectations from.

It's not clear to me that a wrapper type for writing is worthwhile. In general, software these days should use UTF-8 for interchange, so if your program is generating something other than UTF-8 for interchange, chances are that you are doing something wrong. Dealing with legacy encodings on the input side means dealing with someone else doing something wrong, or with content that was written a long time ago. Furthermore, due to the heavy Firefox bias of encoding_rs, and libxul size being a concern especially on Android, apps that want to use encoding_rs for output in scenarios without the mitigating factors that make legacy-encoder performance not really matter in Firefox would currently have a very bad time. (Just so I don't have to repeat this caveat every time, I feel tempted to add some kind of standing note about it.)

Anyway, for today, I'm going to focus on the decoder side, and on providing an implementation of `std::io::Read` rather than only a new UTF-8-specific reader type.
Providing an implementation of `std::io::Read` on top of a `Decoder` involves some things that don't quite fit. If we go the route where the things that don't quite fit don't panic but implement the API faithfully by means of more complexity internally, that complexity could get quite deep, making the implementation, and especially the testing, of this feature more time-consuming than it would seem on the surface. Unless there is a use case in Firefox, it would probably be hard for me to justify spending a lot of time on this right now. Still, I think it's worthwhile to at least write my thoughts down.

I'm thinking that the wrapper type would heap-allocate a secondary buffer with capacity comparable to the caller's buffer, decode directly into the caller's buffer on the fast path, and fall back to the secondary buffer only when necessary; the fast path is sketched below. Unfortunately, there is plenty of opportunity to fall off the fast path: as far as I can tell, a number of conditions could trigger the slower, double-buffered mode.
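Roughly, the fast path could look like this (a sketch; the function shape and names are illustrative, and only `decode_to_utf8` is actual encoding_rs API):

```rust
use std::io;

use encoding_rs::Decoder;

// Fast path of a hypothetical `Read` wrapper: decode straight into the
// buffer the caller handed to `read()`, with no intermediate copy.
// `src` holds raw bytes already read from the underlying reader.
fn read_fast_path(
    decoder: &mut Decoder,
    src: &mut Vec<u8>,
    dst: &mut [u8],
    last: bool,
) -> io::Result<usize> {
    let (_result, read, written, _had_errors) =
        decoder.decode_to_utf8(&src[..], dst, last);
    src.drain(..read);
    // Caveat: if `dst` is too small to hold even one scalar value,
    // `written` is 0, which `Read` callers would misread as EOF.
    // That is one of the cases that forces the double-buffered fallback.
    Ok(written)
}
```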
Could this second buffer be much smaller?
Yes, but then the cases that trigger its use would be even slower, because there would be per-code-point entry into and exit out of the decoder calls. That is, making the secondary buffer tiny would make the performance cliff worse.
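To illustrate the degenerate case (a self-contained sketch, not proposed API):

```rust
use encoding_rs::{CoderResult, Encoding};

// Decoding through a deliberately tiny output buffer: every loop
// iteration pays full decoder-call overhead for at most one scalar
// value of progress.
fn decode_with_tiny_buffer(encoding: &'static Encoding, mut src: &[u8]) -> String {
    let mut decoder = encoding.new_decoder();
    let mut out = String::new();
    let mut tiny = [0u8; 4]; // barely room for one UTF-8 scalar value
    loop {
        // `last: true` may be passed on every call of the final sequence;
        // we keep calling until the decoder reports InputEmpty.
        let (result, read, written, _had_errors) =
            decoder.decode_to_utf8(src, &mut tiny, true);
        src = &src[read..];
        out.push_str(std::str::from_utf8(&tiny[..written]).unwrap());
        if result == CoderResult::InputEmpty {
            return out;
        }
    }
}
```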
It seems to me like it would be a better match for the byte-oriented `Read` and `BufRead` APIs if the buffer held the decoded data, rather than decoding at the time data is fetched from it. It also seems a lot easier to implement :) I tried implementing a prototype: https://pastebin.com/bjzeLmda. There are probably plenty of rough edges, but it seems to work, and the following trivial driver program seems to outperform the recode_rs example by a few percent:

```rust
extern crate decode_reader;
extern crate encoding_rs;

use std::io::{self, BufRead, Write};

use decode_reader::DecodeReader;
use encoding_rs::WINDOWS_1252;

fn main() {
    let stdin = io::stdin();
    let in_rdr = stdin.lock();
    // Wrap the locked stdin in the prototype decoder; it yields UTF-8 bytes.
    let mut rdr = DecodeReader::new(in_rdr, WINDOWS_1252);
    let stdout = io::stdout();
    let mut out_wr = stdout.lock();
    loop {
        let written = {
            let buf = rdr.fill_buf().unwrap();
            if buf.is_empty() {
                break;
            }
            out_wr.write(buf).unwrap()
        };
        rdr.consume(written);
    }
}
```

Does this look like a reasonable way to implement this?
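For the discussion, the overall shape of the prototype is roughly this (a simplified re-sketch rather than the exact pastebin code; buffer sizes are arbitrary):

```rust
use std::io::{self, BufRead, Read};

use encoding_rs::{CoderResult, Decoder, Encoding};

/// Wraps a byte reader and exposes its contents decoded to UTF-8.
/// The internal `out` buffer holds already-decoded data, so `fill_buf`
/// only decodes when the buffer has been drained.
pub struct DecodeReader<R: Read> {
    inner: R,
    decoder: Decoder,
    raw: [u8; 4096], // undecoded bytes read from `inner`
    raw_len: usize,
    out: [u8; 8192], // decoded UTF-8 awaiting consumption
    out_start: usize,
    out_len: usize,
    eof: bool,
    finished: bool,
}

impl<R: Read> DecodeReader<R> {
    pub fn new(inner: R, encoding: &'static Encoding) -> Self {
        DecodeReader {
            inner,
            decoder: encoding.new_decoder(),
            raw: [0; 4096],
            raw_len: 0,
            out: [0; 8192],
            out_start: 0,
            out_len: 0,
            eof: false,
            finished: false,
        }
    }
}

impl<R: Read> BufRead for DecodeReader<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        while self.out_start == self.out_len && !self.finished {
            self.out_start = 0;
            self.out_len = 0;
            if !self.eof && self.raw_len == 0 {
                self.raw_len = self.inner.read(&mut self.raw)?;
                self.eof = self.raw_len == 0;
            }
            let (result, read, written, _) = self.decoder.decode_to_utf8(
                &self.raw[..self.raw_len],
                &mut self.out,
                self.eof,
            );
            // Keep any undecoded leftover bytes for the next round.
            self.raw.copy_within(read..self.raw_len, 0);
            self.raw_len -= read;
            self.out_len = written;
            if self.eof && result == CoderResult::InputEmpty {
                self.finished = true; // decoder fully flushed
            }
        }
        Ok(&self.out[self.out_start..self.out_len])
    }

    fn consume(&mut self, amt: usize) {
        self.out_start += amt;
    }
}

impl<R: Read> Read for DecodeReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = {
            let chunk = self.fill_buf()?;
            let n = chunk.len().min(buf.len());
            buf[..n].copy_from_slice(&chunk[..n]);
            n
        };
        self.consume(n);
        Ok(n)
    }
}
```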
I've documented and polished the transcoder I wrote for ripgrep and extracted it to a separate crate: https://github.com/BurntSushi/encoding_rs_io

I also summarized some possible future work (mostly outlined by @SimonSapin above) in the crate docs as a possible path forward. My transcoder, I believe, does handle all of the corner cases discussed above.

@hsivonen What do you think about closing this issue, given that there is an external place for this particular problem to develop?
@bobkare I believe your implementation has an extra copy in it that is avoided by transcoding into the caller-provided buffer directly.
@bobkare, sorry that your comment from May slipped off the top of my list of GitHub issues to reply to.
This is great! Thank you!
I find it surprising, though, that by default it has different REPLACEMENT CHARACTER behavior for BOMless UTF-8 and BOMful UTF-8. I also have a couple of other minor observations.
Let's do that. (I'll add a pointer to encoding_rs_io to the encoding_rs documentation.)
@hsivonen Awesome, thanks for checking it out and giving feedback!
The thing is, though, that BOMless UTF-8 isn't necessarily UTF-8. Basically, in the absence of a BOM or an otherwise explicitly specified encoding, the reader just passes the bytes through untouched. You can also go the reverse direction, where BOMful UTF-8 passes the bytes through untouched just as if there were no BOM, via an opt-in knob on the builder. This API choice probably doesn't make sense for other types of decoders, e.g., ones that decode into a UTF-8-guaranteed `&str`.
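For example, explicitly overriding the source encoding looks roughly like this (based on the crate's documented builder API; treat the exact method set as an assumption, since it may evolve):

```rust
extern crate encoding_rs;
extern crate encoding_rs_io;

use std::io::Read;

use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> std::io::Result<()> {
    // "hello" as BOMless UTF-16LE; without the explicit encoding below,
    // these bytes would be passed through untouched.
    let source: &[u8] = &[0x68, 0, 0x65, 0, 0x6C, 0, 0x6C, 0, 0x6F, 0];
    let mut rdr = DecodeReaderBytesBuilder::new()
        .encoding(Some(encoding_rs::UTF_16LE))
        .build(source);

    // The wrapper yields UTF-8, so reading into a String succeeds.
    let mut dest = String::new();
    rdr.read_to_string(&mut dest)?;
    assert_eq!(dest, "hello");
    Ok(())
}
```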
Yeah, I believe that's not possible at the moment. I'm definitely open to exploring adding it, but I didn't have a specific use case in mind for myself, so I just went without it. I created an issue for it here: BurntSushi/encoding_rs_io#3
Ah yeah, nice idea! Added an issue for that here: BurntSushi/encoding_rs_io#4
Thanks for getting the issues on file. I added links to encoding_rs_io to the encoding_rs documentation.
What are your thoughts on providing implementations of the `io::Read`/`io::Write` traits as a convenience for handling stream encoding/decoding?

Here is the specific problem I'd like to solve. Simplifying, I have a function that looks like the following:
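(Sketching the shape from the surrounding description; the exact signature and result type here are assumptions:)

```rust
use std::io;

// Hypothetical: searches the contents of `rdr`, using only `io::Read`
// methods and a constant amount of heap space.
fn search<R: io::Read>(rdr: R) -> io::Result<u64> {
    // ... incremental search over the bytes yielded by `rdr` ...
    unimplemented!()
}
```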
Internally, the `search` function limits itself to the methods of `io::Read` to execute a search on its contents. The search is exhaustive, but is guaranteed to use a constant amount of heap space. The search routine expects the buffer to be UTF-8 encoded (and will handle invalid UTF-8 gracefully). I'd like to use this same search routine even if the contents of `rdr` are, say, UTF-16. I claim that this is possible if I wrap `rdr` in something that satisfies `io::Read` but uses an `encoding_rs::Decoder` internally to convert UTF-16 to UTF-8. I would expect the callers of `search` to do that wrapping. If there's invalid UTF-16, then inserting replacement characters is OK.

Does this sound like something you'd be willing to maintain? I would be happy to take an initial crack at an implementation if so. (In fact, I must do this. The point of this issue is asking whether I should try to upstream it or not.) However, I think there are some interesting points worth mentioning. (There may be more!)

- The `io::Read` interface feels not-quite-right in some respects. For example, `io::Read` primarily operates on a `&[u8]`. But if encoding_rs is used to provide an `io::Read` implementation, then it necessarily guarantees that all consumers of that implementation will read valid UTF-8, which means converting the `&[u8]` bytes to `&str` safely will incur an unnecessary cost. I'm not sure what to make of this and how much one might care, but it seems worth pointing out. (This particular issue isn't a problem for me, since the search routine itself handles UTF-8 implicitly.) A sketch of that cost follows.
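(To make the cost concrete, a sketch; it assumes the hypothetical wrapper never splits a scalar value across reads:)

```rust
use std::io::Read;
use std::str;

// Consumes a reader that is guaranteed (by convention, not by type) to
// yield valid UTF-8.
fn consume_utf8<R: Read>(mut rdr: R) -> std::io::Result<()> {
    let mut buf = vec![0u8; 8192];
    let n = rdr.read(&mut buf)?;
    // The type system doesn't know the bytes are UTF-8, so the safe
    // conversion re-validates every byte...
    let s = str::from_utf8(&buf[..n]).expect("wrapper guarantees UTF-8");
    // ...and skipping that validation would require `unsafe`:
    // let s = unsafe { str::from_utf8_unchecked(&buf[..n]) };
    println!("{}", s);
    Ok(())
}
```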