
implementation for io::Read/io::Write #8

Closed
BurntSushi opened this issue Mar 8, 2017 · 17 comments

@BurntSushi
Contributor

What are your thoughts on providing implementations of the io::Read/io::Write traits as a convenience for handling stream encoding/decoding?

Here is the specific problem I'd like to solve. Simplifying, I have a function that looks like the following:

fn search<R: io::Read>(rdr: R) -> io::Result<SearchResults> { ... }

Internally, the search function limits itself to the methods of io::Read to execute a search on its contents. The search is exhaustive, but is guaranteed to use a constant amount of heap space. The search routine expects the buffer to be UTF-8 encoded (and will handle invalid UTF-8 gracefully). I'd like to use this same search routine even if the contents of rdr are, say, UTF-16. I claim that this is possible if I wrap rdr in something that satisfies io::Read but uses an encoding_rs::Decoder internally to convert UTF-16 to UTF-8. I would expect the callers of search to do that wrapping. If there's invalid UTF-16, then inserting replacement characters is OK.
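To make the wrapping concrete, here is a hypothetical, std-only sketch of such an adapter (the type name Utf16LeReader is invented for illustration; a real adapter would use encoding_rs::Decoder and stream in constant space, whereas this toy version decodes the whole input up front):

```rust
use std::io::{self, Read};

/// Toy stand-in for an encoding_rs-backed adapter: wraps an io::Read
/// yielding UTF-16LE bytes and exposes the text as UTF-8 via io::Read.
/// Decodes everything eagerly; a real adapter would decode incrementally.
struct Utf16LeReader {
    buf: io::Cursor<Vec<u8>>,
}

impl Utf16LeReader {
    fn new<R: Read>(mut inner: R) -> io::Result<Self> {
        let mut raw = Vec::new();
        inner.read_to_end(&mut raw)?;
        let units = raw
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]));
        // Invalid UTF-16 (lone surrogates) becomes U+FFFD, as does a
        // dangling trailing byte.
        let mut s: String = std::char::decode_utf16(units)
            .map(|r| r.unwrap_or('\u{FFFD}'))
            .collect();
        if raw.len() % 2 == 1 {
            s.push('\u{FFFD}');
        }
        Ok(Utf16LeReader {
            buf: io::Cursor::new(s.into_bytes()),
        })
    }
}

impl Read for Utf16LeReader {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        self.buf.read(out)
    }
}
```

With something like this, a caller could pass `Utf16LeReader::new(file)?` to search without search knowing anything about the source encoding.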

Does this sound like something you'd be willing to maintain? I would be happy to take an initial crack at an implementation if so. (In fact, I must do this. The point of this issue is asking whether I should try to upstream it or not.) However, I think there are some interesting points worth mentioning. (There may be more!)

  1. Is this type of API useful in the context of the web? If not, then maybe it shouldn't live in this crate.
  2. The io::Read interface feels not-quite-right in some respects. For example, io::Read primarily operates on a &[u8]. But if encoding_rs is used to provide an io::Read implementation, then it necessarily guarantees that all consumers of that implementation will read valid UTF-8, which means converting the &[u8] bytes to &str safely will incur an unnecessary cost. I'm not sure what to make of this or how much one might care, but it seems worth pointing out. (This particular issue isn't a problem for me, since the search routine itself handles UTF-8 implicitly.)
@hsivonen
Owner

hsivonen commented Mar 8, 2017

Does this sound like something you'd be willing to maintain?

In abstract, yes, something like this looks like something that could make sense for this crate. My main worry is that I'd like to avoid shipping the wrong thing and io::Read doesn't seem like quite the right thing as @SimonSapin remarked in the rust-encoding issue. That is, it seems wrong for the type system-level UTF-8 validity indication to be lost. (I realize this doesn't matter for your use case.)

Apart from losing the type system-level UTF-8 validity indication, io::Read has the problem that the length of the buffer that the caller asks the io::Read object to fill is allowed to be arbitrarily short. While it's certainly possible to write adapter code that deals with very short buffers using copying, it would be more efficient to have the freedom to require the buffer to be at least 4 bytes long, since encoding_rs wants to output unsplit UTF-8 byte sequences.
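The 4-byte minimum follows from UTF-8 itself: one Unicode scalar value can occupy up to 4 bytes, so a decoder that refuses to split a sequence needs at least that much output room to guarantee forward progress on any input. A quick std-only check (the function name is illustrative only):

```rust
// Why at least 4 bytes: a single scalar value encodes to 1-4 bytes of UTF-8.
fn max_utf8_bytes_demo() {
    let mut buf = [0u8; 4];
    assert_eq!('A'.encode_utf8(&mut buf).len(), 1); // U+0041
    assert_eq!('é'.encode_utf8(&mut buf).len(), 2); // U+00E9
    assert_eq!('€'.encode_utf8(&mut buf).len(), 3); // U+20AC
    assert_eq!('𝄞'.encode_utf8(&mut buf).len(), 4); // U+1D11E, the worst case
}
```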

Is this type of API useful in the context of the web?

Unclear. It could well be. However, the currently foreseeable C++ and Rust use cases in Gecko don't seem to need it. (I expect the first Rust users in Gecko to use encoding_rs in the non-streaming mode in order to benefit from borrowing when possible. The main streaming users will be the HTML and XML parsers, which will remain as C++ for the time being.)

At this point, I'd like to gain a better understanding of how the not-quite-rightness of io::Read should be dealt with before committing.

@SimonSapin

There are two axes to this: decoding vs encoding (which side has bytes and which side has Unicode), and “pushing” data into a {de,en}coder vs “pulling” data from it. That’s four traits to represent different kinds of streams:

  • std::io::Write (pushing bytes)
  • std::io::Read (pulling bytes)
  • std::fmt::Write (pushing Unicode)
  • Some trait that doesn’t exist in std to pull Unicode. I had trouble coming up with this before, but it turns out that &mut str is a thing so something like this would work:
    pub trait UnicodeRead {
        fn read(&mut self, buf: &mut str) -> Result<usize>;
    }
    &mut str is a bit unusual: trait implementations would need to unsafely transmute it to &mut [u8] to write to it, and be careful to preserve the UTF-8 invariant (which means zeroing up to three extra bytes when writing a prefix sub-slice). encoding-rs’ Decoder::decode_to_str* methods already do this.
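For concreteness, here is one way such an implementation could write a UTF-8 prefix into a &mut str while preserving the invariant, zeroing at most three continuation bytes of the old contents (the helper name write_prefix is hypothetical):

```rust
/// Copy as much of `src` as fits into `dst`, ending on a char boundary, then
/// zero any continuation bytes of the old contents right after the prefix so
/// the whole buffer stays valid UTF-8 (NUL is a valid one-byte scalar).
fn write_prefix(dst: &mut str, src: &str) -> usize {
    let mut n = src.len().min(dst.len());
    while !src.is_char_boundary(n) {
        n -= 1;
    }
    // Safety: we only write complete UTF-8 sequences, and we repair the
    // boundary immediately after the written prefix below.
    let bytes = unsafe { dst.as_bytes_mut() };
    bytes[..n].copy_from_slice(&src.as_bytes()[..n]);
    let mut i = n;
    while i < bytes.len() && bytes[i] & 0xC0 == 0x80 {
        bytes[i] = 0; // at most 3 iterations: a sequence has <= 3 continuation bytes
        i += 1;
    }
    n
}
```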

So to be complete, I think four types would be needed. They could be in an encoding_rs::stream module:

  • WriteEncoder that contains Encoder and W: std::io::Write, and implements std::fmt::Write
  • WriteDecoder that contains Decoder and W: std::fmt::Write, and implements std::io::Write
  • ReadEncoder that contains Encoder and R: UnicodeRead, and implements std::io::Read
  • ReadDecoder that contains Decoder and R: std::io::Read, and implements UnicodeRead

The Write* ones would need a buffer allocation strategy. Perhaps [u8; 1024] on the stack is reasonable. (With possibly multiple calls to the encoder/decoder for each incoming write call.)

@BurntSushi Your search function uses std::io::Read with bytes despite being on the "Unicode side". It is possible to do that and give up on UTF-8 well-formedness being enforced by the type system (as str would enforce it). As you suggested, implementing this on top of ReadDecoder does require expensive zeroing (or something) of the input &mut [u8] buffer before it can be passed as &mut str to UnicodeRead. Instead this could be done not on top of UnicodeRead, but with a fifth wrapper type ReadBytesDecoder that has std::io::Read on both sides and uses encoding-rs’ Decoder::decode_to_utf8* methods that take &mut [u8].


Now, while all of these five stream wrappers are possible, I don’t know if they’re all equally useful.

@SimonSapin

About std::fmt::Write being in std::fmt: it lives there because formatting is the justification for its inclusion in the standard library, but it works quite well as a general purpose Unicode stream not necessarily related to formatting. We use it in Servo for CSS serialization.

@BurntSushi
Contributor Author

BurntSushi commented Mar 8, 2017

@SimonSapin I'm not sure if I was clear about this, but the internals of search never materialize an &str. Searching is done directly on the &[u8], so I never need to do any kind of zeroing. Invalid UTF-8 is handled implicitly. (The search is executed using an FSM, so if there's invalid UTF-8 and the FSM requires valid UTF-8, the FSM simply won't match.)

The above is why io::Read makes perfect sense for my use case, even at the type system level. But I can see how my case might be a bit specialized, and therefore, the general usefulness may be questionable.

It sounds like there's enough design space here that I could drown pretty easily if I tried to upstream this right away. I will take a crack at implementing io::Read in my project and report back.

N.B. For other work that I've enjoyed on this topic, Go's golang.org/x/text/encoding package uses transformers that provide io.Reader/io.Writer impls and it works quite nicely. Likely the key difference is the fact that a Go string is only conventionally valid UTF-8. (And also because a []byte isn't quite as inconvenient to work with as if it were text as &[u8] is in Rust.)

@SimonSapin

@BurntSushi Right, I understood what your search does. I meant zeroing if you had to implement Read on top of UnicodeRead.

Even with conventionally valid UTF-8 there are still four combinations of transformers, right? (read/write × decoder/encoder)

@BurntSushi
Contributor Author

I meant zeroing if you had to implement Read on top of UnicodeRead.

Ah, gotcha!

Even with conventionally valid UTF-8 there are still four combinations of transformers, right? (read/write × decoder/encoder)

I guess implicitly, sure. But you build them yourself and everything still satisfies io.Writer/io.Reader (there's no analog to std::fmt::Write or a hypothetical UnicodeRead for example).

@SimonSapin

(Ah, I see. Since the type is the same on both "sides" (I/O vs Unicode) in Go, encoders and decoders can implement the same Transformer API.)

SimonSapin added a commit to SimonSapin/encoding_rs that referenced this issue Mar 9, 2017
@SimonSapin

I’ve been thinking about the wrapper types I proposed above, so I wrote them down: #9.

@hsivonen
Owner

Below is my current thinking on this. I still don't know if this is the right thing. My main problem is that I've been writing so little I/O code in Rust that I don't have solid experience to draw expectations from.

It's not clear to me that it's worthwhile to do a wrapper type for writing. In general, software these days should use UTF-8 for interchange, so if your program is generating something other than UTF-8 for interchange, chances are that you are doing something wrong. Dealing with legacy encodings on the input side means dealing with someone else doing something wrong, or with content that was written a long time ago. Furthermore, due to the heavy Firefox bias of encoding_rs and libxul size being a concern especially on Android, right now apps that want to use encoding_rs for output in scenarios that lack the mitigating factors that make legacy encoder performance not really matter in Firefox would have a very bad time. (I feel tempted, just to be able to avoid repeating this caveat every time, to add some kind of win-microbenchmarks cargo feature that would make encode fast.)

Anyway, for today, I'm going to focus on the decoder side.

Instead of having only a new ReadUnicode trait as in #9, it would probably be easier for existing Rust code bases that want to add support for input in legacy encodings if encoding_rs provided an implementation of BufRead, potentially with the addition of a method for reading into &str as in ReadUnicode in #9.

BufRead already provides methods like fn read_line(&mut self, buf: &mut String) -> Result<usize> and fn lines(self) -> Lines<Self>, which I expect would make sense to implement directly in a way that avoids copying. Also, making it clear that the wrapper type includes a buffer would be good, so that users don't feel the need to add a useless BufReader.

Providing an implementation of BufRead is a bit problematic, because it has some things that are hard to implement without a second buffer, so either the implementation wouldn't be a full implementation of the trait (panicking if you try to use the wrong things) or it would have performance cliffs (slower if you use the wrong things). Still, it would seem nice, from the perspective of being a drop-in replacement, to use the standard library trait for the good parts, specifically line-oriented I/O.

If going the route where the things that don't quite fit don't panic but implement the API faithfully by means of more internal complexity, the implementation and especially the testing of this feature could become more time-consuming than it would seem on the surface. Unless there is a use case in Firefox, it would probably be hard for me to justify spending a lot of time on this right now. Still, I think it's worthwhile to at least write my thoughts down.

I'm thinking that the wrapper type would heap-allocate a buffer of the same capacity as BufReader. (If the API let the caller specify the capacity, then there would probably need to be a documented minimum size and a panic on insufficient size.) The preferred mode of operation would be that this buffer is passed to the underlying Read to fill, and then Decoder would convert directly from this buffer to the buffers provided by callers, or created for each line for the methods that return heap-allocated lines.

Unfortunately, there is plenty of opportunity to use the Read and BufRead traits in byte-oriented ways that would break this copy-minimization scheme. When such a thing occurs, the wrapper type would heap-allocate another buffer internally, use that buffer as the output buffer of Decoder, and then serve callers from that buffer until it is exhausted, at which point the wrapper type would go back to the copy-avoidance mode.
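A minimal shape for such a wrapper (the type name DecodedBufRead is hypothetical, and the decode step is stubbed out as an identity pass, as it effectively would be for pure-ASCII input) might look like:

```rust
use std::io::{self, BufRead, Read};

/// Sketch of the wrapper type: owns one internal buffer and serves Read and
/// BufRead from it. The marked line is where Decoder::decode_to_utf8 would
/// run; here the bytes pass through unchanged.
struct DecodedBufRead<R: Read> {
    inner: R,
    buf: Vec<u8>,
    pos: usize,
    cap: usize,
}

impl<R: Read> DecodedBufRead<R> {
    fn new(inner: R) -> Self {
        DecodedBufRead { inner, buf: vec![0; 8192], pos: 0, cap: 0 }
    }
}

impl<R: Read> Read for DecodedBufRead<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let avail = self.fill_buf()?;
        let n = avail.len().min(out.len());
        out[..n].copy_from_slice(&avail[..n]);
        self.consume(n);
        Ok(n)
    }
}

impl<R: Read> BufRead for DecodedBufRead<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        if self.pos == self.cap {
            // Real code: fill a raw buffer from `inner`, then run
            // Decoder::decode_to_utf8 into `buf`. Identity pass for now.
            self.cap = self.inner.read(&mut self.buf)?;
            self.pos = 0;
        }
        Ok(&self.buf[self.pos..self.cap])
    }

    fn consume(&mut self, amt: usize) {
        self.pos += amt;
    }
}
```

With this shape, line-oriented methods such as lines() come for free from BufRead, which is the drop-in property described above.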

As far as I can tell, the various things that could trigger the slower (double-buffered) mode would include:

  • Calling read() with an argument whose length is 1, 2, or 3, when the next code unit takes more bytes to represent in UTF-8. (AFAICT, when the argument has greater than zero length, the API contract requires the trait implementation to produce at least one byte into the buffer.)
  • Calling read_exact() (except in very lucky cases where the Decoder happens to fill the buffers exactly).
  • After a call to bytes(), when the first non-ASCII or ISO-2022-JP non-ASCII-state input is encountered (or immediately if the encoding is UTF-16BE or UTF-16LE).
  • After a call to chars(), when the first non-ASCII or ISO-2022-JP non-ASCII-state input is encountered (or immediately if the encoding is UTF-16BE or UTF-16LE).
  • Calling fill_buf().
  • Calling read_until() with a byte that has its high bit set or with an ASCII byte whose interpretation can differ in the non-ASCII states of ISO-2022-JP.
  • Same thing on split().

@SimonSapin

Could this second buffer be [u8; 4] without heap allocation?

@hsivonen
Owner

Could this second buffer be [u8; 4] without heap allocation?

Yes, but then the cases that trigger its use would be even slower, because there would be per-code point entry and exit to the decoder calls. That is, making the secondary buffer tiny would make the performance cliff worse.

@bobkare

bobkare commented May 12, 2018

It seems to me like it would be a better match for the byte-oriented Read and BufRead APIs if the buffer held the decoded data, rather than decoding when fetching from it. It also seems a lot easier to implement :)

I tried implementing a prototype: https://pastebin.com/bjzeLmda. Probably plenty of rough edges, but it seems to work, and the following trivial driver program seems to outperform the recode_rs example by a few percent:

extern crate decode_reader;
extern crate encoding_rs;

use std::io::{self, Read, Write, BufRead};

use decode_reader::DecodeReader;
use encoding_rs::WINDOWS_1252;

fn main() {
    let stdin = io::stdin();
    let in_rdr = stdin.lock();
    let mut rdr = DecodeReader::new(in_rdr, WINDOWS_1252);

    let stdout = io::stdout();
    let mut out_wr = stdout.lock();

    loop {
        let written = {
            let buf = rdr.fill_buf().unwrap();
            if buf.is_empty() {
                break;
            }
            out_wr.write(buf).unwrap()
        };
        rdr.consume(written);
    }
}

Does this look like a reasonable way to implement this?

@BurntSushi
Contributor Author

I've documented and polished the transcoder I wrote for ripgrep and extracted it to a separate crate: https://github.com/BurntSushi/encoding_rs_io I also summarized some possible future work (mostly outlined by @SimonSapin above) in the crate docs as a possible path forward. My transcoder, I believe, does handle all of the corner cases implied by io::Read. A notable simplification is that the implementation may write partial UTF-8 sequences to the caller-provided &mut [u8]. I claim that this is OK because this particular transcoder's goal isn't to enforce valid UTF-8 at all costs. Some other streaming transcoders that might be added to this crate may want that.

@hsivonen What do you think about closing this issue given that there is an external place for this particular problem to develop?

@BurntSushi
Contributor Author

@bobkare I believe your implementation has an extra copy in it that is avoided by transcoding into the caller provided buffer directly.

@hsivonen
Owner

hsivonen commented Aug 8, 2018

@bobkare, sorry about your comment slipping off the top of my GitHub issues to reply to in May.

@BurntSushi,

I've documented and polished the transcoder I wrote for ripgrep and extracted it to a separate crate

This is great! Thank you!

I claim that this is OK because this particular transcoder's goal isn't to necessarily enforce valid UTF-8 by any means necessary.

I find it surprising, though, that by default it has different REPLACEMENT CHARACTER behavior for BOMless UTF-8 and BOMful UTF-8.

A couple of other minor observations:

  • It seems that the API doesn't allow the semantics of the "decode" algorithm from the spec for non-UTF-8 encodings. That is, it seems it's not possible to specify "Use windows-1252, but if there's a BOM, honor the BOM".
  • It would probably be worthwhile to implement Read::read_to_string on top of Decoder::decode_to_string as an optimization that would avoid UTF-8 re-validation in the case that presumably might become common (outside ripgrep).

What do you think about closing this issue given that there is an external place for this particular problem to develop?

Let's do that.

(I'll add a pointer to encoding_rs_io in the documentation of encoding_rs.)

@BurntSushi
Contributor Author

BurntSushi commented Aug 8, 2018

@hsivonen Awesome, thanks for checking it out and giving feedback!

I find it a surprising, though, that by default, it has different REPLACEMENT CHARACTER behavior for BOMless UTF-8 and BOMful UTF-8.

The thing is, though, that BOMless UTF-8 isn't necessarily UTF-8. Basically, in the absence of a BOM or an otherwise explicitly specified encoding, DecodeReaderBytes is effectively a no-op and doesn't do any work. This is important to permit users of the decoder to handle possibly invalid UTF-8 in their own way with as little overhead as possible. My feeling is that if users want to explicitly state that they need UTF-8 in the absence of a BOM, then they should explicitly specify an encoding. If you explicitly specify UTF-8, then I think the replacement character behavior should be the same as if a UTF-8 BOM were found.

You can also go the reverse direction, where BOMful UTF-8 passes the bytes through untouched just as if there were no BOM via the utf8_passthru option.

This API choice probably doesn't make sense for other types of decoders, e.g., ones that write into a &mut str via some hypothetical new UnicodeRead trait.

It seems that the API doesn't allow the semantics of the "decode" algorithm from the spec for non-UTF-8 encodings. That is, it seems it's not possible to specify "Use windows-1252, but if there's a BOM, honor the BOM".

Yeah I believe that's not possible. I'm definitely open to exploring adding this, but I didn't have a specific use case in mind for myself, so I just went without it. I created an issue for it here: BurntSushi/encoding_rs_io#3

It would probably be worthwhile to implement Read::read_to_string on top of Decoder::decode_to_string as an optimization that would avoid UTF-8 re-validation in the case that presumably might become common (outside ripgrep).

Ah yeah, nice idea! Added an issue for that here: BurntSushi/encoding_rs_io#4

@hsivonen
Owner

hsivonen commented Aug 8, 2018

Thanks for getting the issues on file. I added links to encoding_rs_io from encoding_rs documentation.
