Relaxed parsing mode #483

vorner · 2019-02-11T11:52:27Z

Hello

I understand the library tries to follow the standard as closely as possible.

However, there are URLs in the wild that exist, work, and are rejected by this library as invalid. As an example, http://canada-region-70-24-.static-apple-com.center/ is rejected because of the trailing - in the first label, yet if you point a browser there, you'll get content (not very useful content, granted).

Would it be possible to have some relaxed parsing mode (if there is one, I haven't found it in the documentation) where I would still get the methods to extract the host, path and query from the URL and canonicalize it, while allowing certain violations of what a good URL looks like? I have little choice over the URLs that I need to handle: basically, whatever happens in the wild might come my way, and I have to do something meaningful with it.

nox · 2019-02-11T11:58:53Z

The spec for parsing hosts says:

Let asciiDomain be the result of running domain to ASCII on domain.

The spec for the domain to ASCII algorithm says:

Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to beStrict, CheckHyphens set to false, CheckBidi set to true, CheckJoiners set to true, Transitional_Processing set to false, and VerifyDnsLength set to beStrict.

The spec for Unicode ToASCII refers to the validity criteria section which says:

If CheckHyphens, the label must neither begin nor end with a U+002D HYPHEN-MINUS character.

But as mentioned earlier, the Unicode ToASCII algorithm is to be invoked with CheckHyphens set to false.

Seems like a rust-url bug to me.
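To make the effect of the flag concrete, here is a minimal sketch of the one validity criterion quoted above. The function names are hypothetical and are not part of the idna crate; the sketch only looks at leading and trailing hyphens and ignores every other UTS #46 rule.

```rust
/// Hypothetical helper (not the idna crate's API): applies only the
/// "must neither begin nor end with U+002D" criterion to a single label.
fn label_passes_hyphen_rule(label: &str, check_hyphens: bool) -> bool {
    if !check_hyphens {
        // With CheckHyphens set to false the criterion is skipped entirely.
        return true;
    }
    !(label.starts_with('-') || label.ends_with('-'))
}

/// Applies the rule to every dot-separated label of a host.
/// Empty labels are ignored here for simplicity (an assumption of this sketch).
fn host_passes_hyphen_rule(host: &str, check_hyphens: bool) -> bool {
    host.split('.')
        .filter(|label| !label.is_empty())
        .all(|label| label_passes_hyphen_rule(label, check_hyphens))
}
```

Under this sketch, the host from the report passes when check_hyphens is false and fails when it is true, because the label canada-region-70-24- ends with a hyphen.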

nox · 2019-02-11T12:01:07Z

rust-url/idna/src/uts46.rs

Lines 253 to 262 in a1d8c88

    
    // NOTE: Spec says that the label must not contain a HYPHEN-MINUS character in both the
    // third and fourth positions. But nobody follows this criteria. See the spec issue below:
    // https://github.com/whatwg/url/issues/53
    //
    // TODO: Add *CheckHyphens* flag.
    // V3: neither begin nor end with a U+002D HYPHEN-MINUS
    else if label.starts_with("-") || label.ends_with("-") {
        errors.push(Error::ValidityCriteria);
    }

nox · 2019-02-11T12:05:11Z

Ouch. idna::uts46::Flags is a public struct with public fields, so adding one for the CheckHyphens flag is a breaking change for the idna crate, and idna::uts46::Errors is exposed in the url crate through the From<idna::uts46::Errors> for url::ParseError, meaning the breaking change should be propagated as a breaking bump in url too. 😕

nox · 2019-02-11T12:08:12Z

I suggest we add a type idna::uts46::Config with builder methods similar to the public fields of idna::uts46::Flags and two methods to_ascii and to_unicode.

We can then reimplement idna::uts46::to_ascii (and similarly do the same thing for to_unicode) as:

pub fn to_ascii(domain: &str, flags: Flags) -> Result<String, Errors> {
    Config::from(flags).to_ascii(domain)
}
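A rough, self-contained sketch of what such a builder could look like follows. Everything in it is a stand-in written for illustration: the Flags field names are assumed, Errors is reduced to a unit struct, and to_ascii only lowercases ASCII input and applies the hyphen check rather than running the real UTS #46 processing. Having From<Flags> turn check_hyphens on, so that the legacy function keeps its old behaviour, is likewise an assumption of the sketch.

```rust
/// Stand-in for the existing public struct (field names are assumed).
#[derive(Clone, Copy, Debug)]
pub struct Flags {
    pub use_std3_ascii_rules: bool,
    pub transitional_processing: bool,
    pub verify_dns_length: bool,
}

/// Stand-in error type; the real one carries more detail.
#[derive(Debug, PartialEq)]
pub struct Errors;

/// Private fields plus builder methods: new options can be added later
/// without breaking callers, unlike a struct with public fields.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, Default)]
pub struct Config {
    use_std3_ascii_rules: bool,
    transitional_processing: bool,
    verify_dns_length: bool,
    check_hyphens: bool,
}

impl Config {
    pub fn use_std3_ascii_rules(mut self, value: bool) -> Self {
        self.use_std3_ascii_rules = value;
        self
    }

    pub fn transitional_processing(mut self, value: bool) -> Self {
        self.transitional_processing = value;
        self
    }

    pub fn verify_dns_length(mut self, value: bool) -> Self {
        self.verify_dns_length = value;
        self
    }

    pub fn check_hyphens(mut self, value: bool) -> Self {
        self.check_hyphens = value;
        self
    }

    /// Simplified stand-in: ASCII-only, lowercases the input and applies
    /// the hyphen check. Non-ASCII input is rejected because Punycode
    /// encoding is out of scope for this sketch.
    pub fn to_ascii(self, domain: &str) -> Result<String, Errors> {
        if !domain.is_ascii() {
            return Err(Errors);
        }
        let lowered = domain.to_ascii_lowercase();
        if self.check_hyphens
            && lowered
                .split('.')
                .any(|label| label.starts_with('-') || label.ends_with('-'))
        {
            return Err(Errors);
        }
        Ok(lowered)
    }
}

impl From<Flags> for Config {
    fn from(flags: Flags) -> Self {
        Config {
            use_std3_ascii_rules: flags.use_std3_ascii_rules,
            transitional_processing: flags.transitional_processing,
            verify_dns_length: flags.verify_dns_length,
            // Assumption: the legacy entry point keeps checking hyphens.
            check_hyphens: true,
        }
    }
}

/// Legacy free function, reimplemented on top of Config as suggested above.
pub fn to_ascii(domain: &str, flags: Flags) -> Result<String, Errors> {
    Config::from(flags).to_ascii(domain)
}
```

With this shape, a caller that wants the URL-spec behaviour would write something like Config::default().check_hyphens(false).to_ascii(host), while existing callers of to_ascii(domain, flags) compile unchanged.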

SimonSapin · 2019-02-11T12:28:54Z

It is increasingly time for url 2.0: #463


mixalturek · 2019-02-12T11:31:11Z

Hi, the changes fixed the reported issue, all our unit tests are passing. Thank you very much for your extremely quick feedback! Do you have any ETA for release of the crate?

[patch.crates-io]
url = { git = "https://github.com/servo/rust-url.git", branch="hyphens" }

SimonSapin · 2019-02-14T18:09:25Z

In general, I think we probably shouldn’t have parsing modes. If we ever do, they should be precisely specified. Just calling it “relaxed” doesn’t say what exactly is accepted or not.

This library is intended to be used (among others) in a browser implementation, so if its behavior differs from interoperable behavior in other browsers, that’s a bug either in this implementation or in the specification https://url.spec.whatwg.org/.

In this case, it looks like the specification has changed and we hadn’t been keeping up. #484 fixes this.

nox added a commit that referenced this issue Feb 11, 2019

Bump idna to 0.1.6 (fixes #483)

cdc286a

nox added a commit that referenced this issue Feb 11, 2019

Don't check hyphens in domain_to_ascii and domain_to_unicode (fixes #483)

fb3b957

nox mentioned this issue Feb 14, 2019

Implement CheckHyphens #484

Merged

mixalturek mentioned this issue Mar 1, 2019

URL crate is failing to parse these existing URLs #489

Open

bors-servo closed this as completed in #484 Jul 17, 2019
