Expose an UTF-8 checking function that returns the index of the error #12168

ghost · 2014-02-10T22:43:48Z

Use case:

reporting the position of the error to the user

Possible candidates:

mark first_non_utf8_index() as public
change from_utf8{_owned}'s return type from Option to Result<_, uint>

Since it's possible to recover from such an error, it would also be nice if these returned all encoding errors.

The text was updated successfully, but these errors were encountered:

lilyball · 2014-02-11T07:20:44Z

If we have this, we'd probably also want to return the number of bytes corresponding to the invalid codepoint (e.g. the number of bytes starting from that index that should be replaced by \uFFFD, and is by from_utf8_lossy()). This would require some wrapper around first_non_utf8_index() though, as it doesn't bother trying to determine that information in the name of making is_utf8() as fast as it can.

lifthrasiir · 2014-02-11T08:08:37Z

For from_utf8, it might make sense to return Result<&'a str, &'a str> where the Err variant returns a slice to the prefix of given [u8] which is a valid UTF-8.

huonw · 2014-02-11T12:20:44Z

from_utf8_owned needs to have the ~[u8] in the error return too, since it's consumed by the call.

pnkfelix · 2014-02-11T13:17:01Z

Related bug: #12113

pnkfelix · 2014-02-11T13:21:41Z

@lifthrasiir I like your idea, except that I suspect the Err variant will probably need to be a struct with named fields, so that it can carry both a valid_prefix field along with other contextual data. If nothing else, using a struct with meaningful field names will make one's code more self-documenting, IMO.

lifthrasiir · 2014-02-11T14:22:11Z

@pnkfelix Agreed. I'm not sure how to handle the same error case for from_utf8_owned though (should we have a same struct for both from_utf8 and from_utf8_owned then? then what type of that struct should be?).

ghost · 2014-02-11T15:06:48Z

I'm thinking of something like

pub struct Utf8Error {
    // if there's an error, produce a sanitized output with replacement chars
    recovered: ~str, 

    // the byte position and exact sequence of each error
    errors: ~[(uint, ~[u8])], 
}

pub fn from_utf8_owned_explain(vv: ~[u8]) -> Result<~str, Utf8Error> { ... }

pub fn from_utf8_explain<'a>(v: &'a [u8]) -> Result<&'a str, Utf8Error> { ... }

where the _explain suffix signals that we care more about error reporting than speed. The originals could be left as they are.

lilyball · 2014-02-11T18:14:18Z

I think this is far too complex. Why does from_utf8_owned need to carry all this baggage?

If you need to be able to recover from invalid utf-8 and either know the valid utf8 prefix, or keep the ~[u8], then you should be using something like the original proposed API that returns the index of the first invalid character. With that API you can produce the valid prefix and the remainder yourself. But most uses of from_utf8_owned don't seem to have any need for this extra baggage, and thus shouldn't be complicated by a crazy return value, especially not one that allocates. from_utf8_owned consumes the ~[u8] and does not allocate, and this error changes that.

lifthrasiir · 2014-02-12T06:47:54Z

I agree to @kballard: if more control on the encoding error is required, you need to use a separate library for that. I thought returning a valid prefix does not cause overhead (since it is either a slice or set_len call) while it's useful for salvaging a partial information (e.g. caf[encoding error, remaining line omitted]) for stdlib, hence the suggestion.

alexcrichton · 2014-09-29T17:18:55Z

cc @aturon, we were discussing this in the recent string stabilization meeting

steveklabnik · 2015-02-15T18:48:31Z

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#860

aturon · 2015-02-16T18:27:55Z

Note: this has actually been resolved.

internal: Remove unqualified_path completions module

[`default_numeric_fallback`]: improve const context detection Fixes rust-lang#12159 The lint didn't actually recognize any of the associated consts (in the linked issue), because in those cases the parent is an `ImplItem` and not an `Item`, but it only actually emitted a lint for i32 and f64 because the other cases failed the very last check here https://github.com/rust-lang/rust-clippy/blob/bb2d4973648b1af18d7ba6a3028ed7c92fde07fb/clippy_lints/src/default_numeric_fallback.rs#L91-L96 A better check for detecting constness would be using `body_const_context`, which is what this PR does. changelog: [`default_numeric_fallback`]: recognize associated consts

alexcrichton added the A-libs label Sep 29, 2014

steveklabnik mentioned this issue Feb 15, 2015

Expose an UTF-8 checking function that returns the index of the error rust-lang/rfcs#860

Closed

steveklabnik closed this as completed Feb 15, 2015

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 25, 2022

Auto merge of rust-lang#12168 - Veykril:completion-rev, r=Veykril

dd3f5e0

internal: Remove unqualified_path completions module

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose an UTF-8 checking function that returns the index of the error #12168

Expose an UTF-8 checking function that returns the index of the error #12168

ghost commented Feb 10, 2014

lilyball commented Feb 11, 2014

lifthrasiir commented Feb 11, 2014

huonw commented Feb 11, 2014

pnkfelix commented Feb 11, 2014

pnkfelix commented Feb 11, 2014

lifthrasiir commented Feb 11, 2014

ghost commented Feb 11, 2014

lilyball commented Feb 11, 2014

lifthrasiir commented Feb 12, 2014

alexcrichton commented Sep 29, 2014

steveklabnik commented Feb 15, 2015

aturon commented Feb 16, 2015

Expose an UTF-8 checking function that returns the index of the error #12168

Expose an UTF-8 checking function that returns the index of the error #12168

Comments

ghost commented Feb 10, 2014

lilyball commented Feb 11, 2014

lifthrasiir commented Feb 11, 2014

huonw commented Feb 11, 2014

pnkfelix commented Feb 11, 2014

pnkfelix commented Feb 11, 2014

lifthrasiir commented Feb 11, 2014

ghost commented Feb 11, 2014

lilyball commented Feb 11, 2014

lifthrasiir commented Feb 12, 2014

alexcrichton commented Sep 29, 2014

steveklabnik commented Feb 15, 2015

aturon commented Feb 16, 2015