Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust-encoding support #13

Closed
BurntSushi opened this issue Dec 18, 2014 · 7 comments
Closed

rust-encoding support #13

BurntSushi opened this issue Dec 18, 2014 · 7 comments

Comments

@BurntSushi
Copy link
Owner

It might be a good idea to think about directly supporting rust-encoding, which is a great library that converts between Unicode strings and raw bytes for a variety of different encodings.

Currently, the CSV parser operates at the byte level, which means that it assumes the source text is ascii compatible. That is, an ASCII byte always corresponds to a character in correspondence with the ASCII character. This is OK for regular ASCII text, UTF-8 and latin-1 I think. But for other encodings, this will fail (and it will probably fail silently by providing an incorrect parse). For this reason, decoding text can't be done after the CSV parser has its way, because it could corrupt the text.

I think the obvious implementation path is to implement std::io::Reader for all types that satisfy encoding::Encoding. Then it can be trivially used with the CSV parser because the raw bytes will be UTF-8 encoded. Come to think of it, @lifthrasiir, does an impl for Reader (and Writer) sound like something that belongs in encoding proper? One possible downside of this approach is that the caller will pay for checking that the string is Unicode when they call records.

An alternative is to demand that users run their CSV data through iconv or something similar.

@lifthrasiir
Copy link

Text-oriented stream is indeed a big concern of rust-encoding (lifthrasiir/rust-encoding#20, rust-lang/rfcs#57). I'm still figuring out how to do that (for months :S), but I basically agree to your points.

@ehiggs
Copy link
Contributor

ehiggs commented Jan 12, 2015

With the changes to rustc-encodable in the 1.0 Alpha, does this make these changes more pressing?

@BurntSushi
Copy link
Owner Author

@ehiggs I don't understand what you mean? What is rustc-encodable? What has changed that makes the handling of character encodings different?

@ehiggs
Copy link
Contributor

ehiggs commented Jan 13, 2015

I had some code that I was struggling with. I asked on IRC and was told the library might be not up to date with regard to rustc-encodable:

16:07 < > so, Rust's decodable stuff has changed
16:07 < > and i think that's your issue
16:08 < > i think the README might be wrong, and it's leading you a bit astray
16:08 < > basically, you need to depend on rustc-encodable
16:08 < > and use deriving RustcEncodable instead

To be honest I didn't understand the situation myself but was aware that there were changes in Rust 1.0 alpha that affected the encoding. I have since fixed my code so it's working.

Sorry for the confusion!

@BurntSushi
Copy link
Owner Author

@ehiggs It's not rustc-encodable :-) It's rustc-serialize. This particular issue is entirely unrelated, as it is for character encodings, not the serialization infrastructure.

The README is right except for one small nit: d12983f --- Otherwise, this crate is updated to work with rustc-serialize.

@steveklabnik
Copy link

I was the mystery IRC person here.

@BurntSushi
Copy link
Owner Author

I think it's probably beyond the scope of this crate to explicitly support encodings other than ASCII compatible encodings. It's more likely that, as an ecosystem, we'll want generic adapters that allow us to plug in transcoders anywhere Read or Write impls are expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants