Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose an UTF-8 checking function that returns the index of the error #12168

Closed
ghost opened this issue Feb 10, 2014 · 12 comments
Closed

Expose an UTF-8 checking function that returns the index of the error #12168

ghost opened this issue Feb 10, 2014 · 12 comments

Comments

@ghost
Copy link

ghost commented Feb 10, 2014

Use case:

  • reporting the position of the error to the user

Possible candidates:

  • mark first_non_utf8_index() as public
  • change from_utf8{_owned}'s return type from Option to Result<_, uint>

Since it's possible to recover from such an error, it would also be nice if these returned all encoding errors.

@lilyball
Copy link
Contributor

If we have this, we'd probably also want to return the number of bytes corresponding to the invalid codepoint (e.g. the number of bytes starting from that index that should be replaced by \uFFFD, and is by from_utf8_lossy()). This would require some wrapper around first_non_utf8_index() though, as it doesn't bother trying to determine that information in the name of making is_utf8() as fast as it can.

@lifthrasiir
Copy link
Contributor

For from_utf8, it might make sense to return Result<&'a str, &'a str> where the Err variant returns a slice to the prefix of given [u8] which is a valid UTF-8.

@huonw
Copy link
Member

huonw commented Feb 11, 2014

from_utf8_owned needs to have the ~[u8] in the error return too, since it's consumed by the call.

@pnkfelix
Copy link
Member

Related bug: #12113

@pnkfelix
Copy link
Member

@lifthrasiir I like your idea, except that I suspect the Err variant will probably need to be a struct with named fields, so that it can carry both a valid_prefix field along with other contextual data. If nothing else, using a struct with meaningful field names will make one's code more self-documenting, IMO.

@lifthrasiir
Copy link
Contributor

@pnkfelix Agreed. I'm not sure how to handle the same error case for from_utf8_owned though (should we have a same struct for both from_utf8 and from_utf8_owned then? then what type of that struct should be?).

@ghost
Copy link
Author

ghost commented Feb 11, 2014

I'm thinking of something like

pub struct Utf8Error {
    // if there's an error, produce a sanitized output with replacement chars
    recovered: ~str, 

    // the byte position and exact sequence of each error
    errors: ~[(uint, ~[u8])], 
}

pub fn from_utf8_owned_explain(vv: ~[u8]) -> Result<~str, Utf8Error> { ... }

pub fn from_utf8_explain<'a>(v: &'a [u8]) -> Result<&'a str, Utf8Error> { ... }

where the _explain suffix signals that we care more about error reporting than speed. The originals could be left as they are.

@lilyball
Copy link
Contributor

I think this is far too complex. Why does from_utf8_owned need to carry all this baggage?

If you need to be able to recover from invalid utf-8 and either know the valid utf8 prefix, or keep the ~[u8], then you should be using something like the original proposed API that returns the index of the first invalid character. With that API you can produce the valid prefix and the remainder yourself. But most uses of from_utf8_owned don't seem to have any need for this extra baggage, and thus shouldn't be complicated by a crazy return value, especially not one that allocates. from_utf8_owned consumes the ~[u8] and does not allocate, and this error changes that.

@lifthrasiir
Copy link
Contributor

I agree to @kballard: if more control on the encoding error is required, you need to use a separate library for that. I thought returning a valid prefix does not cause overhead (since it is either a slice or set_len call) while it's useful for salvaging a partial information (e.g. caf[encoding error, remaining line omitted]) for stdlib, hence the suggestion.

@alexcrichton
Copy link
Member

cc @aturon, we were discussing this in the recent string stabilization meeting

@steveklabnik
Copy link
Member

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#860

@aturon
Copy link
Member

aturon commented Feb 16, 2015

Note: this has actually been resolved.

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 25, 2022
internal: Remove unqualified_path completions module
flip1995 pushed a commit to flip1995/rust that referenced this issue Jan 25, 2024
[`default_numeric_fallback`]: improve const context detection

Fixes rust-lang#12159

The lint didn't actually recognize any of the associated consts (in the linked issue), because in those cases the parent is an `ImplItem` and not an `Item`, but it only actually emitted a lint for i32 and f64 because the other cases failed the very last check here
https://github.com/rust-lang/rust-clippy/blob/bb2d4973648b1af18d7ba6a3028ed7c92fde07fb/clippy_lints/src/default_numeric_fallback.rs#L91-L96

A better check for detecting constness would be using `body_const_context`, which is what this PR does.

changelog: [`default_numeric_fallback`]: recognize associated consts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants