
Document why the changes made in that fork couldn't be done in encoding directly #3

Closed
nox opened this issue Jan 4, 2017 · 2 comments

Comments

@nox

nox commented Jan 4, 2017

I still don't understand why the ecosystem needs two crates, even after reading the design document and whatnot.

@hsivonen
Owner

hsivonen commented Jan 5, 2017

Whether the ecosystem needs two crates and whether the changes made in encoding_rs_compat could be made in rust-encoding directly are two distinct questions.

The ecosystem might not need two crates that have the same scope if one is clearly better than the other on all metrics that people care about. At present, rust-encoding and encoding_rs have different scope and neither is clearly better than the other on all metrics.

As far as scope goes, encoding_rs is narrower than rust-encoding in the sense that encoding_rs is strictly an implementation of the Encoding Standard, whereas rust-encoding also implements non-Encoding Standard encodings and names things in a way that suggests rust-encoding does not treat the Encoding Standard as the only thing that exists. (HZ used to be in the Encoding Standard, and US-ASCII and ISO-8859-1 have a special trivial algorithmic relationship with Unicode, so their inclusion in rust-encoding might not alone be a definitive indication of broader scope, but see the recent merge of ARMSCII-8, as well as the open TODO list and the open CESU-8 issue.)

In contrast, encoding_rs's CONTRIBUTING.md is very clear about non-Encoding Standard encodings being out of scope.

On the other hand, encoding_rs is broader in scope in the sense that FFI-friendliness and catering to UTF-16 as an in-RAM Unicode representation are in scope.

On one hand, considering the needs of Gecko, an implementation of the Encoding Standard with FFI-friendliness and catering to UTF-16 as an in-RAM Unicode representation is necessary. On the other hand, while an encoding library for software designed to interact with the Web (not just browsers but also other software that wishes to interact with the Web in a browser-compatible way) should not expose non-Encoding Standard encodings to a Web context, it would be unreasonable to take the position that the Rust ecosystem should not have crates that cater to non-Web legacy character encoding conversion use cases.

Aside: for example, an email client likely needs to encode UTF-8, decode the Encoding Standard encodings plus UTF-7, and both decode and encode IMAP-modified UTF-7. Exposing the protocol-internal IMAP-modified UTF-7 via a general interface seems like a bad idea, since it shouldn't be possible to use that encoding outside the protocol. On the other hand, you'd want to share code with normal UTF-7 and, therefore, put those two in one crate. So for encoding_rs, trying to support the seemingly single additional encoding needed to bring email in scope would actually mean scope creep beyond UTF-7 itself. Furthermore, UTF-7 doesn't fit as nicely into the streaming model of encoding_rs as the Encoding Standard encodings do, and an email client may be happy with non-streaming conversion anyway. So a separate crate that implements non-streaming UTF-7, delegates the Encoding Standard encodings (non-streaming mode only) to encoding_rs, and has a separate entry point (sharing a back end with UTF-7) for IMAP-modified UTF-7 would likely be a happier solution for everyone involved than trying to grow the scope of one crate.

The question of whether the changes made in encoding_rs_compat could be made in rust-encoding directly still makes sense to ask for the overlap of the scopes of encoding_rs and rust-encoding. Ultimately, answering that question will be up to @lifthrasiir. However, I would prefer that @lifthrasiir not evaluate encoding_rs in the light of that question at this time, because encoding_rs is still immature.

Specifically:

  • encoding_rs is completely broken on big-endian platforms. (I intend to fix this when I gain access to a working big-endian Rust environment. Currently, this is blocked by linker problems on ppc64 on one hand and MIPS builds not being properly staged for download on the other, as well as emulated ppc32 firmware being too hard to set up with qemu as shipped on Ubuntu.)

  • If you look at the release notes for the last two point releases, encoding_rs has had very serious bugs even post-0.3.0:

    • There was a pointer-alignment bug that made it into a release because I was developing on x86_64, which is lenient about load/store alignment.

    • There was a bug that made UTF-8 to UTF-16 conversion completely bogus for an entire class of byte sequences.

  • The legacy encoders in encoding_rs are slow by design, but the CJK encoders are extremely slow, and slower than I expected. However, I have ideas for making the Simplified Chinese, Korean and Japanese data tables smaller in a way that I expect to make encode faster as a side effect. I want to get that done before any serious evaluation of encoder performance, or of whether encoding_rs's approach to trading encoder performance for data size should be changed.

  • FFI being part of the encoding_rs crate itself prevents LTO from discarding the UTF-16-related code in the context of pure Rust usage. It seems to me that splitting FFI out into a separate crate would make encoding_rs better for pure-Rust scenarios.

Regarding your concern about working with rust-encoding, it is worth noting that I have offered rust-encoding a couple of small fixes that arose as a side effect of developing encoding_rs_compat. Also, just now, I created a PR changing the ByteWriter and StringWriter traits to optimize, in a non-breaking way, the common case (Vec<u8> and String) in the context of an encoding_rs-style back end.

Upstreaming this change would allow future use of the upstream encoding-types crate, once published. Still, the rust-encoding types becoming a distinct crate in the future doesn't address the reason encoding_rs_compat exists: to allow Gecko to depend on crates from the Rust ecosystem without shipping duplicate data tables or converter code. The ecosystem depends on rust-encoding 0.2.x, so 0.2.x is what encoding_rs_compat will need to be a drop-in replacement for in Gecko in the foreseeable future. For the time being, encoding_rs_compat includes its own copy of the rust-encoding types (patched with the same changes I've submitted upstream PRs for).

Anyway, encoding_rs maturity concerns aside, I'd be quite OK with code from encoding_rs_compat being adopted by the upstream in order to delegate to encoding_rs where the scopes overlap and encoding_rs internals are deemed better on relevant metrics, but, as noted, that's really @lifthrasiir's call, and one I'd prefer not to be made based on the current immature state of encoding_rs.

(Considering this comment to be the requested documentation.)

@hsivonen hsivonen closed this as completed Jan 5, 2017
@nox
Author

nox commented Jan 5, 2017

Very nice! Thanks for the wall of text!
