Amend to RFC 517: add subsection on string handling #575

Merged
merged 2 commits on Jan 23, 2015
275 changes: 272 additions & 3 deletions text/0517-io-os-reform.md
@@ -43,7 +43,10 @@ follow-up PRs against this RFC.
* [Platform-specific opt-in]
* [Proposed organization]
* [Revising `Reader` and `Writer`] (stub)
* [String handling]
* [Key observations]
* [The design: `os_str`]
* [The future]
* [Deadlines] (stub)
* [Splitting streams and cancellation] (stub)
* [Modules]
@@ -452,7 +455,224 @@ counts, arguments to `main`, and so on).
## String handling
[String handling]: #string-handling

The fundamental problem with Rust's full embrace of UTF-8 strings is that not
all strings taken or returned by system APIs are Unicode, let alone UTF-8
encoded.

In the past, `std` has assumed that all strings are *either* in some form of
Unicode (Windows), *or* are simply `u8` sequences (Unix). Unfortunately, this is
wrong, and the situation is more subtle:

* Unix platforms do indeed work with arbitrary `u8` sequences (without interior
nulls) and today's platforms usually interpret them as UTF-8 when displayed.

* Windows, however, works with *arbitrary `u16` sequences* that are roughly
interpreted as UTF-16, but may not actually be valid UTF-16 -- an "encoding"
often called UCS-2; see http://justsolve.archiveteam.org/wiki/UCS-2 for a bit
more detail.

What this means is that all of Rust's platforms go beyond Unicode, but they do
so in different and incompatible ways.

The current solution of providing both `str` and `[u8]` versions of
APIs is therefore problematic for multiple reasons. For one, **the
`[u8]` versions are not actually cross-platform** -- even today, they
panic on Windows when given non-UTF-8 data, a platform-specific
behavior. But they are also incomplete, because on Windows you should
be able to work directly with UCS-2 data.

### Key observations
[Key observations]: #key-observations

Fortunately, there is a solution that fits well with Rust's UTF-8 strings *and*
offers the possibility of platform-specific APIs.

**Observation 1**: it is possible to re-encode UCS-2 data in a way that is also
compatible with UTF-8. This is the
[WTF-8 encoding format](http://simonsapin.github.io/wtf-8/) proposed by Simon
Sapin. This encoding has some remarkable properties:

* Valid UTF-8 data is valid WTF-8 data. When decoded to UCS-2, the result is
exactly what would be produced by going straight from UTF-8 to UTF-16. In
other words, making up some methods:

```rust
my_utf8_data.to_wtf8().to_ucs2().as_u16_slice() == my_utf8_data.to_utf16().as_u16_slice()
```

At the very least this needs a rewording. UCS-2 inherently can not represent certain USVs, while UTF-8 can represent arbitrary USVs. This means that a real to_ucs2() implementation can fail. This paragraph is clearly assuming that to_ucs2() is implemented as to_utf16() (i.e. creates surrogate pairs). This makes it quite unsurprising that the result after chaining it with a nop (to_wtf_8()) is the same.


Member

WTF-8 treats valid surrogate pairs in UCS-2 as if they were UTF-16, while invalid surrogates are encoded as themselves in WTF-8. This allows for the UCS-2 to represent all of Unicode and for WTF-8 to represent any possible sequence of WCHAR.


Because notions like surrogate pairs and supplementary planes didn't exist until Unicode 2.0 defined UTF-16 to obsolete UCS-2 (correct me if I'm wrong), it's better to avoid using the term "UCS-2" here...

Contributor

What @aturon calls “UCS-2” here is arbitrary [u16] that is interpreted as UTF-16 when surrogates happen to be paired. http://justsolve.archiveteam.org/wiki/UCS-2 (linked from the RFC) has more background and details.

“Potentially ill-formed UTF-16” is IMO the most accurate term for it, but it’s a bit unwieldy.

Contributor

That said, I agree avoiding “UCS-2” entirely would be better, since it means different things to different people. I’m +0 on “wide string”, proposed below, if it’s properly defined.


* Valid UTF-16 data re-encoded as WTF-8 produces the corresponding UTF-8 data:

```rust
my_utf16_data.to_wtf8().as_bytes() == my_utf16_data.to_utf8().as_bytes()
```

These two properties mean that, when working with Unicode data, the WTF-8
encoding is highly compatible with both UTF-8 *and* UTF-16. In particular, the
conversion from a Rust string to a WTF-8 string is a no-op, and the conversion
in the other direction is just a validation.
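These two properties can be exercised with the UTF-16 helpers in today's `std` (`encode_utf16`, `String::from_utf16`, `String::from_utf16_lossy`); a small sketch, using an unpaired surrogate as the ill-formed case:

```rust
fn main() {
    let s = "héllo 🌍";
    // Valid UTF-16 round-trips losslessly to and from UTF-8:
    let wide: Vec<u16> = s.encode_utf16().collect();
    assert_eq!(String::from_utf16(&wide).unwrap(), s);

    // An unpaired surrogate is ill-formed UTF-16: strict validation
    // rejects it, while lossy decoding substitutes U+FFFD.
    let bad = [0xD83Cu16, 0x0041]; // lone high surrogate, then 'A'
    assert!(String::from_utf16(&bad).is_err());
    assert_eq!(String::from_utf16_lossy(&bad), "\u{FFFD}A");
}
```

WTF-8 goes further than the lossy decoding shown here: it preserves the lone surrogate losslessly instead of replacing it.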

**Observation 2**: all platforms can *consume* Unicode data (suitably
re-encoded), and it's also possible to validate the data they produce as
Unicode and extract it.

**Observation 3**: the non-Unicode spaces on various platforms are deeply
incompatible: there is no standard way to port non-Unicode data from one to
another. Therefore, the only cross-platform APIs are those that work entirely
with Unicode.

### The design: `os_str`
[The design: `os_str`]: #the-design-os_str

The observations above lead to a somewhat radical new treatment of strings,
first proposed in the
[Path Reform RFC](https://github.com/rust-lang/rfcs/pull/474). This RFC proposes
to introduce new string and string slice types that (opaquely) represent
*platform-sensitive strings*, housed in the `std::os_str` module.

The `OsString` type is analogous to `String`, and `OsStr` is analogous to `str`.
Their backing implementation is platform-dependent, but they offer a
cross-platform API:

```rust
pub mod os_str {
/// Owned OS strings
struct OsString {
inner: imp::Buf
}
/// Slices into OS strings
struct OsStr {
inner: imp::Slice
}

// Platform-specific implementation details:
#[cfg(unix)]
mod imp {
type Buf = Vec<u8>;
type Slice = [u8];

...
}

#[cfg(windows)]
mod imp {
type Buf = Wtf8Buf; // See https://github.com/SimonSapin/rust-wtf8
type Slice = Wtf8;
...
}

impl OsString {
pub fn from_string(String) -> OsString;
pub fn from_str(&str) -> OsString;
pub fn as_slice(&self) -> &OsStr;
pub fn into_string(Self) -> Result<String, OsString>;
Contributor

Same as OsStr::as_str, shouldn't be made available.

pub fn into_string_lossy(Self) -> String;

// and ultimately other functionality typically found on vectors,
// but CRUCIALLY NOT as_bytes
}

impl Deref<OsStr> for OsString { ... }

impl OsStr {
pub fn from_str(value: &str) -> &OsStr;
pub fn as_str(&self) -> Option<&str>;
Contributor

I think this method should not be made available. All legitimate uses of this function are converted by to_string_lossy and it encourages failing on non-UTF8 strings.

Contributor

I disagree with this. It is legitimate for a function to fail or select a fallback behavior if it has good reason to expect Unicode data from a system call, and that's not what it gets. (It may not be a good idea to panic in most cases, but returning an error or special value is legitimate.)

In fact, for any situation that emphasizes correctness over robustness, I would have the opposite worry. Specifically, that to_string_lossy will end up being used when non-Unicode data should be rejected entirely, or when non-Unicode data is actually expected and needs to be handled losslessly. Admittedly, in the latter case, the user should deal with the platform-specific u8/u16 representation (or their own custom type for the expected encoding) instead of converting to str.

Contributor

@quantheory I can't think of an example of an application that would need this to_string instead of the to_string_lossy.


I'm under the impression that this would lock us into using WTF-8 on Windows. Is that intentional?

Member

@Florob Why do you think so?

Contributor

@quantheory My feeling about your use case is that you need to drop down to platform-specific calls anyway if you want to support filenames as you suggest ("spelling out the bytes"), because e.g. Windows paths cannot be represented as bytes in a meaningful way.

Otherwise your program would just drop out on handling non-Unicode paths which would be very unfortunate IMO.

Contributor

I'm not talking about filenames. I'm talking about information that's going to end up in Rust code, XML documents, and possibly other Unicode file formats that are not always for human consumption. (Purely for debugging purposes you may still want them to be as readable as reasonably possible.) At the moment (though as likely as not this will change) the sending/calling process knows ahead of time whether or not a sequence of bytes is guaranteed to be valid Unicode, and is responsible for doing any necessary translation to ensure that the receiving process always gets Unicode, as long as it gets the message at all.

But this is really getting more specific than it needs to be. My real point is that processes send information to co-designed programs or other instances of themselves in ways that incidentally pass through the OS. You can receive a string from a system call that you know actually originates within (another process of) your own application, or a specific other application that uses a known encoding. If what you get is somehow malformed anyway, that's not a normal situation, and the receiving process has no way to continue normally without knowing what went wrong on the other end.

(Also, unless the program just uses unwrap/expect on everything it gets, a None value doesn't necessarily mean that it will "just drop out" if it doesn't get Unicode. We're not looking at a function that just panics here.)

Contributor

@quantheory Note that this OsString is not used for file contents or other stuff, but just for file names.

Contributor

@tbu-

I'm not talking about file contents, I'm talking about, for instance, the proposed std::env and std::process, as used to communicate short strings between processes. These will apparently use OsStr/OsString.

File names are Path/PathBuf, not OsStr/OsString. (They will have the same internal representation, I think, but they have different interfaces. Path will also implement as_str, though, according to RFC 474.)

Contributor

So maybe this should be called to_string_strict instead of to_string?

pub fn to_string_lossy(&self) -> CowString;

// and ultimately other functionality typically found on slices,
// but CRUCIALLY NOT as_bytes
}

trait IntoOsString {
fn into_os_str_buf(self) -> OsString;
}

impl IntoOsString for OsString { ... }
impl<'a> IntoOsString for &'a OsStr { ... }

...
}
```

These APIs make OS strings appear roughly as opaque vectors (you
cannot see the byte representation directly), and can always be
produced starting from Unicode data. They make it possible to collapse
functions like `getenv` and `getenv_as_bytes` into a single function
that produces an OS string, allowing the client to decide how (or
whether) to extract Unicode data. It will be possible to do things
like concatenate OS strings without ever going through Unicode.
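For reference, an API very close to this one eventually stabilized as `std::ffi::OsString` and `std::ffi::OsStr` (with `as_str` renamed to `to_str` and `from_string` subsumed by `From<String>`). A sketch against those stabilized names rather than the RFC's exact signatures:

```rust
use std::ffi::{OsStr, OsString};

fn main() {
    // Unicode data can always become an OS string.
    let os = OsString::from("café");
    // Extracting Unicode back out is a validation that can fail...
    assert_eq!(os.to_str(), Some("café"));
    // ...and the owned conversion returns the original data on failure
    // instead of discarding it.
    assert_eq!(os.clone().into_string(), Ok(String::from("café")));

    // Concatenation works without ever going through Unicode.
    let mut joined = OsString::from("get");
    joined.push(OsStr::new("env"));
    assert_eq!(joined, OsString::from("getenv"));
}
```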

It will also likely be possible to do things like search for Unicode
substrings. The exact details of the API are left open and are likely
to grow over time.

In addition to APIs like the above, there will also be
platform-specific ways of viewing or constructing OS strings that
reveal more about the space of possible values:

```rust
pub mod os {
#[cfg(unix)]
pub mod unix {
trait OsStringExt {
fn from_vec(Vec<u8>) -> Self;
fn into_vec(Self) -> Vec<u8>;
}

impl OsStringExt for os_str::OsString { ... }

trait OsStrExt {
fn as_byte_slice(&self) -> &[u8];
fn from_byte_slice(&[u8]) -> &Self;
}

impl OsStrExt for os_str::OsStr { ... }

...
}

#[cfg(windows)]
pub mod windows {
// The following extension traits provide a UCS-2 view of OS strings

trait OsStringExt {
fn from_wide_slice(&[u16]) -> Self;
}

impl OsStringExt for os_str::OsString { ... }

trait OsStrExt {
fn to_wide_vec(&self) -> Vec<u16>;
}

impl OsStrExt for os_str::OsStr { ... }

...
}

...
}
```

By placing these APIs under `os`, using them requires a clear *opt in*
to platform-specific functionality.

### The future
[The future]: #the-future

Introducing an additional string type is a bit daunting, since many
existing APIs take and consume only standard Rust strings. Today's
solution demands that strings coming from the OS be assumed or turned
into Unicode, and the proposed API continues to allow that (with more
explicit and finer-grained control).

In the long run, however, robust applications are likely to work
opaquely with OS strings far beyond the boundary to the system to
avoid data loss and ensure maximal compatibility. If this situation
becomes common, it should be possible to introduce an abstraction over
various string types and generalize most functions that work with
`String`/`str` to instead work generically. This RFC does *not*
propose taking any such steps now -- but it's important that we *can*
do so later if Rust's standard strings turn out to not be sufficient
and OS strings become commonplace.
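One such generalization, which `std` ultimately adopted for APIs like `std::env::var`, is making functions generic over `AsRef<OsStr>` so that `&str`, `String`, and OS strings are all accepted. A minimal sketch (the `describe` helper is hypothetical):

```rust
use std::ffi::{OsStr, OsString};

// Generic over anything viewable as an OS string.
fn describe<S: AsRef<OsStr>>(s: S) -> usize {
    s.as_ref().len() // length in bytes of the internal representation
}

fn main() {
    assert_eq!(describe("abc"), 3);                   // &str
    assert_eq!(describe(String::from("abcd")), 4);    // String
    assert_eq!(describe(OsString::from("abcde")), 5); // OsString
}
```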

## Deadlines
[Deadlines]: #deadlines
@@ -547,4 +767,53 @@ principles or visions) are outside the scope of this RFC.
# Unresolved questions
[Unresolved questions]: #unresolved-questions

> To be expanded in follow-up PRs.

## Wide string representation

(Text from @SimonSapin)

Rather than WTF-8, `OsStr` and `OsString` on Windows could use
potentially-ill-formed UTF-16 (a.k.a. "wide" strings), with a

What is called «potentially-ill-formed UTF-16 (a.k.a. "wide" strings)» here is referred to as «UCS-2» everywhere else in this document. From a correctness standpoint I'd prefer if the term «wide strings» was introduced upfront and consistently used throughout.

different cost trade off.

Upside:
* No conversion between `OsStr` / `OsString` and OS calls.

Downsides:
* More expensive conversions between `OsStr` / `OsString` and `str` / `String`.
* These conversions have inconsistent performance characteristics between platforms. (Need to allocate on Windows, but not on Unix.)

Personally I'd argue that the inconsistent performance characteristics exist either way. The difference is whether they exist on OS calls, or on OsStrBuf creation. It seems to me that it is not unlikely that the same OsStr would be used to call a function multiple times, or to call multiple different functions. To me this strongly suggests putting the cost at creation time, not call time.

* Some of them return `Cow`, which has some ergonomic hit.

The API (only parts that differ) could look like:

```rust
pub mod os_str {
#[cfg(windows)]
mod imp {
type Buf = Vec<u16>;
type Slice = [u16];
...
}

impl OsStr {
pub fn from_str(&str) -> Cow<OsString, OsStr>;
pub fn to_string(&self) -> Option<CowString>;
pub fn to_string_lossy(&self) -> CowString;
}

#[cfg(windows)]
pub mod windows {
trait OsStringExt {
fn from_wide_slice(&[u16]) -> Self;
fn from_wide_vec(Vec<u16>) -> Self;
fn into_wide_vec(self) -> Vec<u16>;
}

trait OsStrExt {
fn from_wide_slice(&[u16]) -> Self;
fn as_wide_slice(&self) -> &[u16];
}
}
}
```
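The cost asymmetry can be made concrete with today's `std`, which chose the byte/WTF-8-style backing: borrowing an `&OsStr` from a `&str` is a free view, while the wide representation sketched above would need an allocating re-encode at the same boundary (which is why `from_str` returns a `Cow` there). A sketch using stabilized APIs:

```rust
use std::ffi::OsStr;

fn main() {
    let s = "héllo";
    // Byte/WTF-8 backing: &str -> &OsStr is a free borrow, no allocation.
    let os: &OsStr = OsStr::new(s);
    assert_eq!(os.to_str(), Some(s));

    // Wide (u16) backing: the same step must allocate and re-encode,
    // and again in the other direction.
    let wide: Vec<u16> = s.encode_utf16().collect();
    assert_eq!(String::from_utf16(&wide).unwrap(), s);
}
```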