
RFC: Text/Unicode oriented streams #57

Closed
wants to merge 4 commits

Conversation

SimonSapin
Contributor

No description provided.

@SimonSapin
Contributor Author

```rust
fn write_str(&mut self, buf: &str) -> IoResult<()>;

// These are similar to Writer, but based on `write_str` instead of `write`.
fn write_char(&mut self, c: char) -> IoResult<()> { ... }
```
Member


I would (perhaps naively) expect that this would be the fundamental method. Why would write_str be it instead?

Contributor Author


It really could be either. See the paragraph below about rust-lang/rust#7771

@chris-morgan
Member

Would it not be the case with a multi-byte encoding that a text reader needs a tiny internal buffer (three bytes for UTF-8), in case the underlying byte-oriented stream only gives you part of a character in a read? Blending it in with Buffer in some way might make sense for such a thing, allowing one to have a single buffer rather than two.

(I speak with no experience of trying to actually implement such a thing, and I haven’t thought it through thoroughly by any means.)
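The partial-character buffering described here can be sketched in today's Rust syntax (the `Utf8Decoder` type and its methods are hypothetical, not part of any proposal): `Utf8Error::valid_up_to` marks the last complete character boundary, and at most three trailing bytes are carried over to the next read.

```rust
use std::str;

/// Hypothetical sketch: turn arbitrary byte chunks into UTF-8 text,
/// carrying the trailing bytes of an incomplete code point between calls.
struct Utf8Decoder {
    carry: Vec<u8>, // holds at most the tail of one unfinished character
}

impl Utf8Decoder {
    fn new() -> Self {
        Utf8Decoder { carry: Vec::new() }
    }

    /// Decode one chunk, returning the complete UTF-8 prefix.
    /// (A real decoder would also report truly invalid sequences
    /// via `error_len()`; this sketch only handles truncation.)
    fn feed(&mut self, chunk: &[u8]) -> String {
        self.carry.extend_from_slice(chunk);
        match str::from_utf8(&self.carry) {
            Ok(s) => {
                let out = s.to_string();
                self.carry.clear();
                out
            }
            Err(e) => {
                // valid_up_to() is the last complete character boundary;
                // everything after it waits for the next read.
                let valid = e.valid_up_to();
                let out = str::from_utf8(&self.carry[..valid]).unwrap().to_string();
                self.carry.drain(..valid);
                out
            }
        }
    }
}

fn main() {
    let bytes = "héllo".as_bytes(); // 'é' is two bytes: 0xC3 0xA9
    let mut d = Utf8Decoder::new();
    // Split the read in the middle of 'é':
    let first = d.feed(&bytes[..2]);
    let second = d.feed(&bytes[2..]);
    assert_eq!(first, "h");
    assert_eq!(second, "éllo");
}
```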

@lifthrasiir
Contributor

If this ever gets into the standard library, we may also want a separate IoError variant for encoding conversion errors.

@SimonSapin
Contributor Author

@chris-morgan Yes, multi-byte decoders in rust-encoding do have some buffering for this reason, but they do it with custom code (e.g. a couple of struct fields) rather than something like Buffer. We may still want something like Buffer to make handling one code point at a time efficient.

@SimonSapin
Contributor Author

@lifthrasiir Yes, that sounds like the right thing to do for decoding or encoding stream wrappers.

@netvl

netvl commented May 6, 2014

Addition of text streams working over byte streams is something I have wanted for a long time. I'm currently working on an XML processing library for Rust, and I miss Java's separation between byte and character streams.

BTW, the Java concept of byte/character streams is really worth noting. Apart from the somewhat bloated API, the basic idea is really great. It resembles our current composition of iterators, where top layers add features by wrapping around lower layers.

@lilyball
Contributor

I'm cautiously in favor of this proposal. I would like text-oriented streams, but I'm worried about how they would interact with our existing Writer-focused stuff. Would Show be rebuilt on top of text-oriented streams? Would Writer become a supertrait of TextWriter (because all of the methods from TextWriter also exist in Writer)? My hope is that the answer to both of these is "yes". And presumably we could write a Writer adaptor for a TextWriter that e.g. does a lossy utf-8 transformation of written bytes (and buffers incomplete codepoints until flush() or Drop).

Just today in IRC it was asked how to append a formatted string onto a StrBuf without the intermediate allocation. As far as I can tell, you can't. And I was suggesting that a Writer adaptor would need to be written that does the lossy utf-8 conversion I just suggested. But this would work much better as a general adaptor for TextWriter.

@SimonSapin
Contributor Author

Would Show be rebuilt on top of text-oriented streams?

I think that makes sense.

Would Writer become a supertrait of TextWriter (because all of the methods from TextWriter also exist in Writer)?

So any byte writer is also implicitly a Unicode writer that encodes as UTF-8? I’m conflicted about this. On one hand, I would love to be in a world where we can assume everything is UTF-8 and nobody uses any legacy encodings anymore. On the other hand, this feels a bit sloppy: blurring the distinction between Unicode text and bytes could lead to mojibake and related bugs. As the Zen of Python says, "Explicit is better than implicit."

So I think we should either take your suggestion, or go the opposite direction: remove the Writer and Reader methods that assume UTF-8, and have explicit UTF-8 adaptors.

And presumably we could write a Writer adaptor for a TextWriter that e.g. does a lossy utf-8 transformation of written bytes (and buffers incomplete codepoints until flush() or Drop).

Yeah, that’s part of rust-encoding. I suppose that at some point we’ll want to distribute rust-encoding with the compiler, although maybe before that we’ll only import parts of it such as the adaptor you describe here.


```rust
pub trait TextReader {
    fn read(&mut self, buf: &mut StrBuf, max_bytes: uint) -> IoResult<uint>;
```


It would be nice if read did not necessarily have to write to a StrBuf (and incur the related heap allocations at some point). It could be parameterized over some trait that can append &strs (or over trait objects, if TextReader will largely be used as a trait object). It could even be a TextWriter, although that may be a broader trait than is necessary.

Also, is max_bytes the maximum number of utf8 bytes that can be read, or the maximum number of bytes from the underlying encoded data?

Contributor Author


max_bytes is meant to be the number of UTF-8 bytes added to buf, so that you could pre-allocate with StrBuf::reserve_additional before calling read.

I’m not very happy with the design of TextReader::read here, but I don’t know what else would be better. Suggestions welcome.

@lilyball
Contributor

@SimonSapin

So I think we should either take your suggestion, or go the opposite direction: remove the Writer and Reader methods that assume UTF-8, and have explicit UTF-8 adaptors.

After considering the options, I think this is the best. From Writer, we remove write_str(), write_line(), write_char(), write_uint(), and write_int(). We then introduce

```rust
pub struct UTF8Writer<W>(pub W);

impl<W: Writer> Writer for UTF8Writer<W> { ... }

impl<W: Writer> TextWriter for UTF8Writer<W> { ... }
```

And do the equivalent for Reader (although the only string-based method I see there is read_to_str(), so that's easier; maybe we should just remove it and skip the adaptor, as anyone who simply wants to read to the end and interpret it as a string can use TextReader directly).


Addendum: It occurs to me that this still doesn't handle flush() and by_ref() being defined on both Writer and TextWriter. We can fix that for concrete types like UTF8Writer easily, by just defining those methods on the UTF8Writer directly as that takes precedence over the traits. But for anyone that wants to be generic over both Writer and TextWriter, the solution isn't as clear. We could define a trait that inherits from both and redeclare those methods in that supertrait, which works but is awkward (plus, what do you call it?). The alternative is to merely rename them in TextWriter to something that doesn't conflict, but the question is what? Of course, if we had UFCS this wouldn't be a big deal.
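A minimal sketch, in modern syntax with hypothetical names, of the shape under discussion: a `TextWriter` trait whose fundamental method is `write_str`, and a `Utf8Writer` adaptor that implements both it and the byte-writer trait by forwarding to the wrapped writer, so byte and string writes can interleave on the same stream.

```rust
use std::io::{self, Write};

// Hypothetical TextWriter trait: write_str is fundamental,
// write_char has a default implementation on top of it.
trait TextWriter {
    fn write_str(&mut self, s: &str) -> io::Result<()>;
    fn write_char(&mut self, c: char) -> io::Result<()> {
        self.write_str(c.encode_utf8(&mut [0u8; 4]))
    }
}

struct Utf8Writer<W>(W);

// Keeping Writer: byte writes pass straight through to the inner writer.
impl<W: Write> Write for Utf8Writer<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.write(buf)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.0.flush()
    }
}

// Text writes are encoded as UTF-8 bytes on the same stream.
impl<W: Write> TextWriter for Utf8Writer<W> {
    fn write_str(&mut self, s: &str) -> io::Result<()> {
        self.0.write_all(s.as_bytes())
    }
}

fn main() -> io::Result<()> {
    let mut w = Utf8Writer(Vec::new());
    w.write_str("héllo ")?;
    w.write_char('✓')?;
    w.write(b"!")?; // Writer and TextWriter interleave on the same output
    assert_eq!(w.0, "héllo ✓!".as_bytes());
    Ok(())
}
```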

@SimonSapin
Contributor Author

@kballard That’s what I had in mind, except without impl<W: Writer> Writer for UTF8Writer<W>: a text writer is not also a byte writer. If you want to write non-UTF-8 bytes to the underlying stream of UTF8Writer, unwrap it or keep a reference to the underlying Writer.

@lilyball
Contributor

@SimonSapin There's no reason to not implement Writer if the underlying W is a Writer. All that does is impose extra burden on any user if they need to do both byte and string writing to the same output. If the wrapped W is not itself a Writer then the UTF8Writer will not be one either.

@lilyball
Contributor

Regarding read(), here's my thoughts on a new design (presented here instead of as a line comment because of length):


This is a bit complicated. I suspect "read N bytes and interpret as a string" is more useful than "read and produce N utf-8 bytes", so the meaning of max_bytes should be changed. Similarly, we should also support a way to just return a StrBuf (for convenience, when the user doesn't care about allocation) as well as appending to an existing StrBuf. Similarly it should be possible to read directly onto the stack.

It's also complicated because any sort of reading like this (for anything besides single-byte encodings i.e. ASCII) is going to require at least a small internal buffer, in case the read bytes don't constitute an entire character. The alternative is requiring the read methods to take a precise byte count and error out if this doesn't end up hitting a character boundary, and that kind of sucks. Using character counts instead of byte counts doesn't really help either.

I think we have to take it as a given that implementations of this may require at least a small internal buffer (e.g. a UTF-8 reader requires at least a [u8, ..3]). And we should probably stick with the idea of a "max" count instead of an "exact" count. And finally, I think we need both maximum-bytes-read and maximum-utf8-bytes-emitted, depending on the method in question. And this means we may need (uint, uint) to return both bytes-read and bytes-emitted, again, depending on the method in question.

Given all that, I think the fundamental reading method would be

```rust
fn read_to_utf8(&mut self, v: &mut [u8], max_bytes_read: uint) -> IoResult<(uint, uint)>;
```

This reads to a fixed-size u8 slice. The length of the slice is taken as the max-bytes-emitted count. The max_bytes_read is, of course, the maximum number of bytes to read from the source. It only emits full utf-8 subsequences, and it returns both the number of bytes read and the number of bytes emitted (we may want a struct to be able to name those two fields).

The biggest issue I see with this method is that, because it's reading to a &[u8], there's nothing beyond the documentation of the method that ensures the result is valid utf-8, which means the caller can't skip the utf-8 check. Unfortunately there's no way to take a &mut str instead (because those aren't mutable, and even if you want to unsafely convert to a &[u8] you need to somehow come up with a "zero-filled" &str to begin with).

This could possibly be mitigated by taking some sort of trait that a) produces stronger guarantees about UTF-8 data (e.g. by using unsafe to access the underlying &mut [u8]), and b) allows for using a stack array as the destination. But there's no obvious design for this trait that makes much sense. I suppose we could do something like

```rust
pub trait TextReaderDestination {
    /// Returns the receiver as a `&mut [u8]`.
    ///
    /// Data written to this slice must be valid UTF-8.
    unsafe fn as_utf8_dest_slice<'a>(&'a mut self) -> &'a mut [u8];
}

impl<'l> TextReaderDestination for &'l mut [u8] {
    unsafe fn as_utf8_dest_slice<'a>(&'a mut self) -> &'a mut [u8] {
        self.as_mut_slice()
    }
}
```

and then we can declare our function

```rust
/// Returns (bytes read, bytes written to `v`)
fn read_to_utf8<T: TextReaderDestination>(&mut self, v: T, max_bytes_read: uint) -> IoResult<(uint, uint)>;
```

This is functionally equivalent, but requires using unsafe to extract the &mut [u8] that can be written to. This still doesn't guarantee the written data is actually UTF-8, but the use of unsafe tells the implementer that they're taking safety into their own hands. With something like this I'm more comfortable eliding the UTF-8 check later.

With this function, we can then provide some convenience methods:

```rust
/// Returns bytes read
fn read_to_strbuf(&mut self, buf: &mut StrBuf, max_bytes_read: uint) -> IoResult<uint> {
    unsafe {
        let v = buf.as_mut_vec();
        v.reserve_additional(max_bytes_read);
        let l = v.len();
        v.set_len(l + max_bytes_read);
        let s = v.mut_slice_from(l);
        v.set_len(l);
        let (read, count) = try!(self.read_to_utf8(s, max_bytes_read));
        debug_assert!(count <= max_bytes_read);
        v.set_len(l + count);
        Ok(read)
    }
}
```

```rust
/// Reads exactly `bytes_to_read` bytes (this may involve several calls to
/// `read_to_utf8()`).
///
/// Returns an error if the result does not lie on a character
/// boundary.
///
/// This is intended to be used when you know precisely how many bytes
/// should comprise the string. If you want to allow smaller reads or
/// partial character reads, use `read_to_strbuf()`.
fn read_strbuf(&mut self, bytes_to_read: uint) -> IoResult<StrBuf> {
    let mut buf = StrBuf::new();
    let mut read = 0;
    while read < bytes_to_read {
        let remaining = bytes_to_read - read;
        read += try!(self.read_to_strbuf(&mut buf, remaining));
    }
    Ok(buf)
}
```

If you want to optimize slightly, we can also have

```rust
/// Returns the estimated number of UTF-8 bytes that may be emitted
/// as a consequence of reading `byte_count` bytes.
///
/// This is a guess; the actual value may be higher or lower.
fn estimated_utf8_count_for_byte_count(&self, byte_count: uint) -> uint;
```

Then we can use this for both the capacity of the vector in read_to_strbuf() and the initial capacity of the StrBuf in read_strbuf(). The implementation for a UTF-8 TextReader would of course just return its byte_count unchanged. This could also be adjusted to either be maximum or minimum, but I suspect average is more generally useful.
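To illustrate why a non-trivial estimate is needed at all, here is a toy ISO-8859-1 decoder (modern syntax, hypothetical helper function): every input byte is exactly one character, but bytes at or above 0x80 expand to two UTF-8 bytes, so bytes read and UTF-8 bytes emitted genuinely differ.

```rust
// Toy sketch: decode ISO-8859-1 (Latin-1) to UTF-8, returning both
// the number of source bytes consumed and the decoded string.
fn latin1_to_utf8(input: &[u8]) -> (usize, String) {
    // In Latin-1, the byte value equals the Unicode scalar value,
    // so `b as char` is an exact conversion.
    let s: String = input.iter().map(|&b| b as char).collect();
    (input.len(), s)
}

fn main() {
    let (read, s) = latin1_to_utf8(&[b'c', b'a', b'f', 0xE9]); // "café"
    assert_eq!(read, 4);    // bytes read from the source
    assert_eq!(s.len(), 5); // UTF-8 bytes emitted: 'é' takes two
    assert_eq!(s, "café");
}
```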

@SimonSapin
Contributor Author

There's no reason to not implement Writer if the underlying W is a Writer.

There is: IMO cleanly separating the different types of streams, to avoid accidentally treating one as the other, is worth the inconvenience in the rare cases where you do need to mix them and know what you’re doing. And wrapping the result of .by_ref() to keep the original byte stream doesn’t seem so bad.

@lilyball
Contributor

A separate idea is to have a function like

```rust
fn read_to_writer<W: TextWriter>(&mut self, w: W, max_bytes_read: uint) -> IoResult<uint>;
```

as the only mechanism for reading. It would require using a TextWriter adaptor for the desired destination (e.g. StrBuf, [u8, ..N]), but is a bit simpler for the implementer.

The downside, of course, is that it can't read directly into the destination but must use a separate allocation first, which it then extracts a &str from. For this reason, I think this is sub-par (despite the simpler implementation) and recommend going with the more complex approach outlined above.

@lilyball
Contributor

And wrapping the result of .by_ref() to keep the original byte stream doesn’t seem so bad.

This doesn't work. It borrows the original stream, so you have to throw away your UTF8Writer to be able to use the Writer again.


```rust
// These are similar to Writer
fn flush(&mut self) -> IoResult<()> { ... }
fn by_ref<'a>(&'a mut self) -> RefTextWriter<'a, Self> { ... }
```
Contributor


This should just return RefWriter instead of defining a new type. RefWriter can be made to implement TextWriter when its parameter does.

Below in TextReader you already do this, by returning RefReader. That may have been a mistake, of course, but I think it's the right action.

Contributor Author


RefWriter can be made to implement TextWriter when its parameter does.

Oh, I didn’t think of that and thought I needed a separate type. I’ll change this as you suggest.

@SimonSapin
Contributor Author

And wrapping the result of .by_ref() to keep the original byte stream doesn’t seem so bad.

This doesn't work. It borrows the original stream, so you have to throw away your UTF8Writer to be able to use the Writer again.

Uhm, indeed. Then an unwrap() method on UTF8Writer should do (assuming UTF8Writer owns the wrapped writer).

RefWriter<W> can implement TextWriter when W does, no need for a separate type.
@lilyball
Contributor

Yes, it should have unwrap() (and perhaps get_ref() and get_mut_ref()). I'm still not convinced that there's any benefit whatsoever to intentionally removing Writer functionality when you wrap it in UTF8Writer. I have a Writer to begin with, I want to be able to use TextWriter methods (with UTF-8 as the output) so I wrap it in UTF8Writer. I feel like this should still be a Writer. UTF8Writer doesn't need an internal buffer, so there's no worry about causing corruption issues by interleaving Writer writes and TextWriter writes. Requiring the use of unwrap() or get_mut_ref() in order to use Writer functionality is just overly restrictive.

@zkamsler

Note that using Ref{Writer,Reader} does lose which implementation of TextWriter it is using, so re-wrapping it may not be possible if using a trait object or in a generic method.

Plus, there may be internal state that is buffered between calls to the underlying Reader or Writer.

@lilyball
Contributor

@zkamsler I'm not sure what you mean. RefWriter doesn't lose anything. It's generic over the type parameter. If its embedded type is a Writer, it implements Writer and delegates the calls to the internal type. Adding the same thing for TextWriter will work just as well as it does for Writer.

Also, RefWriter doesn't even have an unwrap() method (what would be the point?).

@zkamsler

@kballard I had missed the implementing TextWriter for RefWriter bit, and thought you were suggesting re-wrapping it. Sorry about that.

@SimonSapin
Contributor Author

This is a bit complicated. I suspect "read N bytes and interpret as a string" is more useful than "read and produce N utf-8 bytes", so the meaning of max_bytes should be changed.

That only makes sense when wrapping a byte reader and decoding with a given character encoding. There could be TextReader implementations that do not involve a character encoding, such as a MemTextReader that wraps a &str, or a ChanTextReader that wraps a Receiver<StrBuf>.

Similarly, we should also support a way to just return a StrBuf (for convenience, when the user doesn't care about allocation) as well as appending to an existing StrBuf. Similarly it should be possible to read directly onto the stack.

Like Reader, there should be one "most fundamental" method that implementations have to implement (and whose design is still uncertain), and a bunch of convenience methods that have a default implementation based on the first method.

It's also complicated because any sort of reading like this (for anything besides single-byte encodings i.e. ASCII) is going to require at least a small internal buffer, in case the read bytes don't constitute an entire character.

Again, only when decoding from bytes. In rust-encoding, this is specific to the implementation and not necessarily stored as bytes. UTF8Decoder maintains a few bytes while UTF16LEDecoder maintains a pair of 16-bit units.

I don’t think "bytes to read" makes sense in the TextReader trait.

```rust
fn read_to_utf8(&mut self, v: &mut [u8], max_bytes_read: uint) -> IoResult<(uint, uint)>;
```

&mut [u8] does seem to be the "most fundamental" way to give a writable buffer allocated (in many possible ways) by the caller, but as you say:

The biggest issue I see with this method is that, because it's reading to a &[u8], there's nothing beyond the documentation of the method that ensures the result is valid utf-8, which means the caller can't skip the utf-8 check

Actually you could skip the check with the unsafe str::raw::from_utf8 (which is made "safe" by the promise that TextReader indeed writes valid UTF-8), but I’m uncomfortable leaving it to every user to get this unsafe code right. Maybe instead of uint the return value could be a &'a str sub-slice at the beginning of the given &'a mut [u8]?
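The `&'a str` return idea could look roughly like this (modern syntax; `read_into` is a hypothetical name, not from the RFC): the implementation fills a whole-character prefix of the caller's buffer and hands back a string slice over exactly that prefix, so safe callers never touch the raw bytes.

```rust
use std::str;

// Sketch: copy as many *whole characters* of `src` as fit in `buf`,
// returning a &str over the written prefix plus the byte count.
fn read_into<'a>(src: &str, buf: &'a mut [u8]) -> (&'a str, usize) {
    let mut end = 0;
    for (i, c) in src.char_indices() {
        if i + c.len_utf8() > buf.len() {
            break; // next character would not fit
        }
        end = i + c.len_utf8();
    }
    buf[..end].copy_from_slice(&src.as_bytes()[..end]);
    // The implementation guarantees validity, so this check cannot fail;
    // an implementation wanting to skip it would need `unsafe`.
    (str::from_utf8(&buf[..end]).unwrap(), end)
}

fn main() {
    let mut buf = [0u8; 4];
    let (s, n) = read_into("héllo", &mut buf);
    assert_eq!(s, "hél"); // 'é' is two bytes, so 4 buffer bytes hold 3 chars
    assert_eq!(n, 4);
}
```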

fn estimated_utf8_count_for_byte_count(&self, byte_count: uint) -> uint;

See my Stream size hints RFC. If both get accepted, text streams would naturally have size hints too.

@lilyball
Contributor

That only makes sense when wrapping a byte reader and decoding with a given character encoding. There could be TextReader implementations that do not involve a character encoding, such a a MemTextReader that wraps a &str, or a ChanTextReader that wraps a Receiver<StrBuf>.

In these cases, "read N bytes and interpret as a string" and "read and produce N utf-8 bytes" are identical, so they don't care which way the count is interpreted. But for the readers that do perform some sort of conversion, "read N bytes and interpret as a string" is much more useful. Besides, my proposed read_to_utf8() actually provides both counts.

Like Reader, there should be one "most fundamental" method that implementations have to implement

That method is read_to_utf8(). The other methods read_strbuf() and read_to_strbuf() are implemented in terms of it (and I already gave default implementations).

It's also complicated because any sort of reading like this (for anything besides single-byte encodings i.e. ASCII) is going to require at least a small internal buffer, in case the read bytes don't constitute an entire character.

Again, only when decoding from bytes.

Which every single TextReader has to do at some level. Even your suggested MemTextReader that reads from a &str still takes a byte count. Although in that case the &str itself actually doubles as the internal buffer. But every other reader is either going to be wrapping a Reader or directly taking some source of bytes. Still, this is a rather moot point, I was just calling out a reason why implementing TextReader may take more work than implementing Reader.

I don’t think "bytes to read" makes sense in the TextReader trait.

I think TextReader is completely unusable in a lot of cases without it.

Trivial example: I have a packed format that embeds strings as a u16 byte count followed by that many encoded bytes. I need to do something like

```rust
fn read_packed_string<R: Reader>(r: &mut R) -> IoResult<StrBuf> {
    let count = try!(r.read_le_u16());
    UTF8Reader(r).read_strbuf(count) // count is exact for read_strbuf()
}
```

This relies on the count being the "bytes to read". Obviously in this case it works either way, but if this is ISOLatin1Reader(r) instead, then it definitely needs to be "bytes to read".
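For concreteness, the same length-prefixed pattern written against today's std::io (a sketch, not the RFC's API): the u16 count is unambiguously "bytes to read from the source", and only afterwards is the buffer interpreted as UTF-8.

```rust
use std::io::{self, Cursor, Read};

// Sketch: read a little-endian u16 byte count, then exactly that many
// bytes, then interpret them as UTF-8.
fn read_packed_string<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u16::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    let s = "café!";
    // Build a packed record: length prefix, then the UTF-8 bytes.
    let mut data = (s.len() as u16).to_le_bytes().to_vec();
    data.extend_from_slice(s.as_bytes());
    let mut cur = Cursor::new(data);
    assert_eq!(read_packed_string(&mut cur)?, "café!");
    Ok(())
}
```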

Similarly, if I'm trying to process a file as fast as I can, the recommendation (put forth by libuv and used in read_to_end()) is to read data in 64k chunks. This is specifically the number of bytes to read from the file descriptor, not the number of bytes to produce after whatever re-encoding I've done on it.

Basically, the only situation in which "bytes to emit" is useful is when I'm trying to read into a fixed-length area, e.g. what read_to_utf8() does, and that value is already provided as part of the slice. If I'm reading into a growable data structure (e.g. read_strbuf() and read_to_strbuf()), "bytes to emit" is completely useless. Knowing how many bytes it will write is useful from an optimization standpoint (hence the suggested estimated count fn), but absolutely useless from a behavioral and semantics standpoint.

&mut [u8] does seem to be the "most fundamental" way to give a writable buffer allocated (in many possible ways) by the caller

There's nothing more fundamental than &mut [u8] if you don't want to own the data structure yourself.

Actually you could skip the check with the unsafe str::raw::from_utf8 (which is made "safe" by the promise that TextReader indeed writes valid UTF-8)

Well yes, but that misses the point. I should not be able to write code without unsafe that accidentally violates the utf-8 guarantee of StrBuf. Which means the support code here should not run str::raw::from_utf8 on the &[u8] that I produced without using unsafe. You said it's made "safe" by the promise that TextReader makes, but that's not actually enforced anywhere. I can violate that promise rather trivially, which should only be possible by using unsafe.

I’m uncomfortable leaving it to every user to get this unsafe code right

Except users won't be implementing TextReaders. They'll just be using them. The unsafe is only necessary for the few people that have to implement them, and I think we cannot even consider skipping the UTF-8 check on the results unless the implementation is required to use unsafe.

Maybe instead of uint the return value could be a &'a str sub-slice at the beginning of the given &'a mut [u8]?

Interesting suggestion. It still requires unsafe to elide the UTF-8 check though. And there's one major flaw, which is that this would allow the implementation to write to a slice of the &[u8] that doesn't start at the beginning, and that's absolutely incorrect.

See my Stream size hints RFC. If both get accepted, text streams would naturally have size hints too.

The size hint proposed there is the wrong hint. For Reader you're proposing "an estimated range of how many bytes are remaining to be read". That's not useful here. What estimated_utf8_count_for_byte_count() is doing is providing the best guess of how many utf-8 bytes will be emitted if this TextReader reads precisely count bytes from its source. This doesn't care how many bytes are remaining in the Reader.

@SimonSapin
Contributor Author

@kballard I feel that "bytes to read" blows through the abstraction too much. I’d be OK with "an integer that is somehow related to the amount of data to be read", with the TextReader trait saying nothing about that relationship, which would be defined and documented by every implementation. So, same thing, just with a different name and semantics :)

I’m uncomfortable leaving it to every user to get this unsafe code right

Except users won't be implementing TextReaders. They'll just be using them. The unsafe is only necessary for the few people that have to implement them

Yeah, that’s what I meant. Users of the trait having to write unsafe is bad. Implementers of the trait having to may be acceptable.

Maybe instead of uint the return value could be a &'a str sub-slice at the beginning of the given &'a mut [u8]?

Interesting suggestion. It still requires unsafe to elide the UTF-8 check though.

In the trait implementations.

And there's one major flaw, which is that this would allow the implementation to write to a slice of the &[u8] that doesn't start at the beginning, and that's absolutely incorrect.

We’d have to trust trait implementations to get this right (just like the coercion to str).

What estimated_utf8_count_for_byte_count() is doing is providing the best guess of how many utf-8 bytes will be emitted if this TextReader reads precisely count bytes from its source. This doesn't care how many bytes are remaining in the Reader.

Oh, I see, sorry for the confusion. Yeah, that would probably be useful too.

@lilyball
Contributor

@SimonSapin

I feel that "bytes to read" blows through the abstraction too much. I’d be OK with "an integer that is somehow related to the amount of data to be read", with the TextReader trait saying nothing about that relationship, which would be defined and documented by every implementation. So, same thing, just with a different name and semantics :)

How is that any better? Certainly no adaptor built on top of generic TextReaders could ever define what the count is, because it depends entirely on the underlying concrete TextReader. And any function that operates on generic TextReaders would then have no idea what count means, because it doesn't know the concrete implementation. If my above fn read_packed_string() took a <R: Reader + TextReader> (thereby allowing the encoding to be specified externally), then it could not possibly be implemented because it doesn't know how the count works.

I can't come up with a single scenario where "number of bytes to emit" is useful, except in the case of writing to a fixed-size buffer, where &[u8] already provides the length. Can you come up with anything?

If you want to allow read_to_utf8() to attempt to fill its buffer without specifying a count of bytes to read, you could just pass uint::MAX. Alternatively, we could redefine the function as taking an Option<uint>, but that provides no benefit over merely using uint::MAX as the count. And of course read_strbuf() and read_to_strbuf() still need the counts, because they don't use fixed-size buffers.

If there is in fact some use-case for "read 5 characters from this stream, regardless of how many bytes it takes", then I would say that calls for a BufferedTextReader object that maintains an internal buffer and doles out the requested characters.
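Such a `BufferedTextReader` might look like this (a hypothetical sketch in modern syntax): it keeps decoded text in an internal buffer and doles out exactly `n` characters, however many bytes those characters occupy.

```rust
// Sketch: an object that hands out N characters at a time,
// regardless of how many bytes they take.
struct BufferedTextReader {
    buf: String,  // already-decoded text waiting to be handed out
    pos: usize,   // byte offset of the next unread character
}

impl BufferedTextReader {
    fn new(text: &str) -> Self {
        BufferedTextReader { buf: text.to_string(), pos: 0 }
    }

    /// Return the next `n` characters (fewer at end of stream).
    fn read_chars(&mut self, n: usize) -> &str {
        let rest = &self.buf[self.pos..];
        // Byte offset of the (n+1)-th character, or the end of the buffer.
        let end = rest.char_indices().nth(n).map_or(rest.len(), |(i, _)| i);
        self.pos += end;
        &rest[..end]
    }
}

fn main() {
    let mut r = BufferedTextReader::new("déjà vu");
    assert_eq!(r.read_chars(4), "déjà"); // 4 characters, 6 bytes
    assert_eq!(r.read_chars(3), " vu");
}
```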

Users of the trait having to write unsafe is bad. Implementers of the trait having to may be acceptable.

Good news, users don't have to write unsafe.

The proposed TextReader trait exposes a safe API, and the only unsafe method is on the separate TextReaderDestination trait used by implementers, not users. And I think implementers must be required to use unsafe if we want to elide the UTF-8 check of the written data (and we'd really like to elide that; not only is there no clear behavior if it fails, besides fail!(), but it's also an unnecessary performance cost).

And there's one major flaw, which is that this would allow the implementation to write to a slice of the &[u8] that doesn't start at the beginning, and that's absolutely incorrect.
We’d have to trust trait implementations to get this right (just like the coercion to str).

Trust them to get something right that doesn't require unsafe to do incorrectly, and that, when done incorrectly, will lead to horribly broken behavior elsewhere? I'd rather not. It's simple enough to just return uint, which enforces that the written data must be at the start of the &[u8].

@SimonSapin
Contributor Author

Trust them to get something right that doesn't require unsafe to do incorrectly, and that, when done incorrectly, will lead to horribly broken behavior elsewhere? I'd rather not. It's simple enough to just return uint, which enforces that the written data must be at the start of the &[u8].

That would leave the same burden on the users of the trait instead of just the implementors.

If you want to write to &mut [u8] and then get &str or StrBuf out of it, you either have to pay the cost of the UTF-8 check or trust someone to write unsafe code and get it right.

@SimonSapin
Contributor Author

I can't come up with a single scenario where "number of bytes to emit" is useful, except in the case of writing to a fixed-size buffer, where &[u8] already provides the length.

Yes, if there is a &mut [u8] parameter to write to, "number of bytes to emit" is redundant with the length of the slice. I feel that many misunderstandings in this conversation are caused by conflating different proposals :(

@lilyball
Contributor

@SimonSapin Ok, you've convinced me that fn read_to_utf8<T: TextReaderDestination>(&mut self, dest: T, max_bytes_read: uint) -> IoResult<(&str, uint)> is easier for end-users to use. It makes the implementation more complicated, though, especially because it can't actually look like that but has to look like

```rust
fn read_to_utf8<'a, T: TextReaderDestination<'a>>(&mut self, dest: T, max_bytes_read: uint) -> IoResult<(&'a str, uint)>;
```

Experimentally, it's a little more awkward to have the lifetime on the TextReaderDestination (although this would get easier post-DST when we can implement that on [u8] directly and take a dest: &mut T instead).

I'm also still not entirely comfortable with the fact that the implementer could return a &str that's taken from somewhere besides the beginning of the &[u8].

Theoretically, we could keep the existing read_to_utf8() and provide a convenience wrapper that returns IoResult<(&str,uint)> implemented in terms of read_to_utf8(), but besides the naming issue, this would require an auxiliary trait implemented in terms of TextReader, so as to prevent any implementation from just implementing that method directly. Which is to say, this helper method is a bad idea.
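For illustration, the shape under debate can be transcribed into modern Rust. `TextReaderDestination`, `read_to_utf8`, and `SliceDest` are this thread's hypothetical names, not any shipped API, and the "reader" below just copies from an in-memory source; the point is to show why the lifetime lands on the trait: the returned `&str` must borrow from the destination, not from the reader.

```rust
use std::str;

// Hypothetical destination trait: only the reader's own (possibly unsafe)
// code gets to reinterpret the written bytes as &str.
trait TextReaderDestination<'a> {
    fn take_bytes(&mut self) -> &'a mut [u8];
}

struct SliceDest<'a>(&'a mut [u8]);

impl<'a> TextReaderDestination<'a> for SliceDest<'a> {
    fn take_bytes(&mut self) -> &'a mut [u8] {
        std::mem::take(&mut self.0) // leaves an empty slice behind
    }
}

// Toy "identity" reader: copies a UTF-8 prefix into the destination and
// returns the written prefix as &str plus the byte count.
fn read_to_utf8<'a, T: TextReaderDestination<'a>>(
    src: &str,
    mut dest: T,
    max_bytes: usize,
) -> (&'a str, usize) {
    let buf = dest.take_bytes();
    // Back off to a char boundary so the prefix stays valid UTF-8.
    let mut n = max_bytes.min(src.len()).min(buf.len());
    while !src.is_char_boundary(n) {
        n -= 1;
    }
    buf[..n].copy_from_slice(&src.as_bytes()[..n]);
    // This implementation did the copy itself, so it skips the check;
    // a cautious one would use str::from_utf8 instead.
    (unsafe { str::from_utf8_unchecked(&buf[..n]) }, n)
}

fn main() {
    let mut storage = [0u8; 8];
    let (s, n) = read_to_utf8("héllo", SliceDest(&mut storage), 4);
    println!("{} ({} bytes)", s, n);
}
```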

Yes, if there is a &mut [u8] parameter to write to, "number of bytes to emit" is redundant with the length of the slice.

So what are you arguing against, then, when you seem to be arguing against the idea of the count as "number of bytes to read"? In the convenience methods read_strbuf() and read_to_strbuf(), "number of bytes to read" has value but "number of bytes to emit" does not.

@SimonSapin
Contributor Author

FWIW I’m not a fan of TextReaderDestination and its complexity compared to just &mut [u8], but I can’t come up with more reasoning than that right now.

So what are you arguing against, then, when you seem to be arguing against the idea of the count as "number of bytes to read"?

I’m arguing to rename it to something less specific, with implementation-defined semantics.

@SimonSapin
Contributor Author

fn read_to_utf8<'a, T: TextReaderDestination<'a>>(&mut self, dest: T, max_bytes_read: uint) -> IoResult<(&'a str, uint)>;

uint in the return value is the number of bytes actually read? How is it useful?

@lilyball
Contributor

FWIW I’m not a fan of TextReaderDestination

It's ugly, but it's the only way to require unsafe to get a &mut [u8], which is important because the caller is relying on the written data to be utf-8.

That said, it's a bit less important if read_to_utf8() returns a &str, because then the implementation is responsible for interpreting the &[u8] as a string, which means if it wants to elide the utf-8 check it needs unsafe anyway. If we end up going that way, then maybe we should ditch the trait.

I’m arguing to rename it to something less specific, with implementation-defined semantics.

Which, as I have said above, completely breaks TextReader for a lot of use cases. And there's absolutely no benefit. The only argument you made in favor of this was that you consider defining the semantics of count to be somehow breaking the abstraction of TextReader, but you didn't offer any justification for that and I disagree. If the TextReader does no conversion, then "number of bytes to read" is identical to "number of bytes to emit" (which I assume you think is not breaking the abstraction). If TextReader does do conversion, then by its very nature it's dealing with bytes in some known format, and since it's dealing with bytes, "number of bytes to read" is a valid thing to talk about.

If you want an abstraction that doesn't come down to bytes in the end, then you should be working with char. But of course you're not, because converting between &str and sequences or iterators of char is unnecessarily inefficient.
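A small modern-Rust illustration of that cost: `char` is a fixed 4-byte scalar value, so routing text through chars either pays per-char dispatch or inflates intermediate storage and re-allocates on the way back.

```rust
fn main() {
    let s = "stream"; // 6 ASCII bytes in &str form
    // Going through chars: 6 bytes become 24 bytes of char storage.
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(s.len(), 6);
    assert_eq!(chars.len() * std::mem::size_of::<char>(), 24);
    // Round-tripping back to a string allocates again.
    let back: String = chars.into_iter().collect();
    assert_eq!(back, s);
}
```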

uint in the return value is the number of bytes actually read? How is it useful?

Because various users that read text need to know how many bytes they just read. Look at read_strbuf() for an example; it loops until it's read precisely as many bytes as it asked for (it's the equivalent of Reader::read_exact()).

Incidentally, when talking about methods that read repeatedly (like read_strbuf(), or whatever the method is that is the equivalent of read_to_end()), you should be aware of my PR #13127 that implements read_at_least() on Reader. The reason being that read() is allowed to return 0 bytes, and so the PR introduces a way to handle Readers that return 0 indefinitely (either because they're broken, or because they're non-blocking and no data has come in).
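That retry loop can be sketched in today's std::io terms (read_exact_utf8 and the zero-read cutoff are illustrative, not the actual API of this RFC or of PR #13127):

```rust
use std::io::{self, Read};

// Read exactly `n` bytes, looping over short reads, and give up if the
// source keeps returning 0 bytes without making progress -- the failure
// mode a read_at_least()-style helper guards against.
fn read_exact_utf8<R: Read>(src: &mut R, n: usize) -> io::Result<String> {
    let mut buf = vec![0u8; n];
    let mut filled = 0;
    let mut zero_reads = 0;
    while filled < n {
        match src.read(&mut buf[filled..])? {
            0 => {
                zero_reads += 1;
                if zero_reads > 3 {
                    return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "no progress"));
                }
            }
            k => {
                filled += k;
                zero_reads = 0;
            }
        }
    }
    String::from_utf8(buf)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "not UTF-8"))
}

fn main() -> io::Result<()> {
    let mut src: &[u8] = b"hello world";
    let s = read_exact_utf8(&mut src, 5)?;
    println!("{}", s);
    Ok(())
}
```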

@brson
Contributor

brson commented Jun 11, 2014

Discussed at https://github.com/mozilla/rust/wiki/Meeting-weekly-2014-06-10.

Although higher-level text handling is important, we don't want to merge this at this time. Primarily, we are concerned that we don't understand this domain well enough to avoid making irreversible mistakes (we're already uncertain about the existing I/O). As this sort of API is a pure addition, we would love to see this used successfully outside of the main repo (maybe in Servo).

I'm marking this as postponed.

@brson brson closed this Jun 11, 2014
@SimonSapin
Contributor Author

Alright. I agree that this is still too uncertain. I’ll see with @lifthrasiir about doing this in rust-encoding and get some practical use, possibly in Servo.

pcwalton added a commit that referenced this pull request Sep 7, 2014
withoutboats pushed a commit to withoutboats/rfcs that referenced this pull request Jan 15, 2017
@Centril Centril added A-input-output Proposals relating to std{in, out, err}. A-string Proposals relating to strings. A-traits-libstd Standard library trait related proposals & ideas labels Nov 26, 2018