Refusing a mix of numeric-only and BIDI domains #543

Open
vorner opened this issue Sep 13, 2020 · 39 comments

Comments

@vorner

vorner commented Sep 13, 2020

Hello

Some time ago I was trying to figure out why the domains below were rejected by the Rust url crate; it is tracked here. It seems this is maybe accidentally disallowed by the standard. I was recommended to raise it here.

It's a bit old, so I don't remember the exact details and would have to dig them up, but I tried to describe it in this comment. I think the issue was the combination of a numeric-only label and a BIDI label.

Now, my question is, should these be valid URLs? They certainly are valid domains, even though it might be discouraged to allow them, and the URLs are (or at least were when it was reported; I could provide new ones if needed) alive and reachable. Note that they are considered malware URLs, so be careful when handling them.

Parsing failed: invalid international domain name, http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php
Parsing failed: invalid international domain name, http://shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/sitemap.html
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/bvv
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/index.php
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php
@TRowbotham
Contributor

These domains are considered invalid because they don't meet the criteria from RFC 5893 Section 2. Specifically, the label "163" fails criterion 1, which requires the first character of a label to have a Bidi property of L, R, or AL. According to DerivedBidiProps, the digits 0-9 have a Bidi property of EN (European Number): 0030..0039 ; EN # Nd [10] DIGIT ZERO..DIGIT NINE.
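For illustration, criterion 1 can be checked with Python's standard `unicodedata` module. This is only a minimal sketch, not what the url crate or any browser does; the helper name `first_char_passes_criterion_1` is made up for this example, and treating an empty label as a failure is an assumption.

```python
import unicodedata

# Bidi classes that RFC 5893 Section 2, criterion 1 allows for the
# first character of a label.
ALLOWED_FIRST = {"L", "R", "AL"}


def first_char_passes_criterion_1(label: str) -> bool:
    """Hypothetical helper: checks only criterion 1 of the Bidi rule."""
    if not label:
        # The rule does not say what to do with an empty label;
        # treating it as a failure is an assumption of this sketch.
        return False
    return unicodedata.bidirectional(label[0]) in ALLOWED_FIRST


# Print the Bidi class of the first character of each label and
# whether that label passes criterion 1.
for label in ["mail", "163", "com"]:
    print(label, unicodedata.bidirectional(label[0]),
          first_char_passes_criterion_1(label))
```

Only the "163" label fails here, because its first character has class EN.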

The domain to ASCII algorithm sets the CheckBidi option to true, so Step 2 returns a failure value for not meeting the above criteria. That failure is rejected in Step 3, which leads to the host parser returning failure, which in turn causes the URL parser to abort.

RFC 3490 Section 4 states:

If any step of the ToASCII operation
fails on any label in a domain name, that domain name MUST NOT be
used as an internationalized domain name.

So, the URL spec is doing the right thing here. The only 2 options for making these domains valid in terms of this spec, as far as I can tell, would be setting the CheckBidi option to false or allowing the options for the Unicode ToASCII steps to be user configurable.

@vorner
Author

vorner commented Sep 14, 2020

I agree that the current spec disallows these URLs as invalid.

My question was more along the lines of „Was it the spec's/author's intention to disallow them, or did the spec get written in a way that disallows them by accident?“

Several sections in there, pointed out in this comment, seem to suggest that allowing such URLs for compatibility with existing deployments was considered.

So, in other words, my question is not „Are they invalid“, but „Should they be invalid“?

@domenic
Member

domenic commented Sep 30, 2020

This might be the same underlying issue as #438.

@annevk
Member

annevk commented Jan 10, 2023

I think @TRowbotham has the correct analysis here and indeed it very much depends on how CheckBidi is used.

To simplify from OP:

  • xn--mhb is fine in all browsers.
  • 1.xn--mhb errors (though does not error in Gecko, presumably it has CheckBidi set to false; also doesn't error in Chromium, presumably due to an erroneous ASCII fast path).
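To make the two test cases concrete, here is a small sketch that decodes the A-label with Python's built-in `punycode` codec and prints the Bidi class of each label's first character. The helper name `decode_a_label` is hypothetical, and the sketch skips everything else UTS46 does (mapping, normalization, validity checks).

```python
import unicodedata


def decode_a_label(label: str) -> str:
    """Hypothetical helper: turn an xn-- label into its Unicode form."""
    prefix = "xn--"
    if label.lower().startswith(prefix):
        # The part after the prefix is plain Punycode (RFC 3492).
        return label[len(prefix):].encode("ascii").decode("punycode")
    return label


for domain in ["xn--mhb", "1.xn--mhb"]:
    labels = [decode_a_label(part) for part in domain.split(".")]
    # Bidi class of the first character of every label.
    first_classes = [unicodedata.bidirectional(label[0]) for label in labels]
    print(domain, first_classes)
```

The decoded label is U+064A (class AL), while the "1" label starts with class EN, which is what subrule 1 objects to.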

However, https://www.rfc-editor.org/rfc/rfc5893.html#section-2 (which UTS46 invokes) also says:

In a domain name consisting of only LDH labels (as defined in the
Definitions document [RFC5890]) and labels that satisfy the rule,
the requirements of Section 3 are satisfied as long as a label
that starts with an ASCII digit does not come after a
right-to-left label.

But that seems contradictory as a label that starts with an ASCII digit can never fulfill The Bidi Rule due to ASCII digits not having the correct Bidi property (they have EN according to https://unicode.org/reports/tr9/):

The first character must be a character with Bidi property L, R,
or AL. If it has the R or AL property, it is an RTL label; if it
has the L property, it is an LTR label.

I'm not sure what to make of this.

I would appreciate input from @achristensen07 @valenting @markusicu @macchiati @alvestrand. I would be somewhat inclined to set CheckBidi to false given that it matches most implementations, is more likely to match deployed content, and the bidi requirements appear contradictory, but I'm open to suggestions.

@annevk
Member

annevk commented Jan 10, 2023

This might also be more complicated still as, e.g., يa is rejected by all browsers. Which I think is due to mixing L and R character properties and would not be rejected without CheckBidi being true?

And then 1.ي is only rejected in Chromium and WebKit. So Chromium can again be explained through an erroneous ASCII fast path. And it seems that Gecko has a different CheckBidi behavior when it comes to ASCII digits at least, perhaps due to the above contradiction. Or perhaps there is another check related to character properties unrelated to CheckBidi.

(All of this is only concerned with the ToASCII code path, for what it's worth.)

@alvestrand

alvestrand commented Jan 10, 2023

This logic took quite a while to work out, including actually coding up the BIDI rule and running it through all the possible combinations of directions to make sure I had them all covered.....

The "bidi rule" in RFC 5893 section 2 applies to a single label. So a label (not a domain name) can either obey the rule or not.

The guarantees in the last two paragraphs are about the properties of a whole domain name. They are not part of the rule.

The practical consequence is that if you want sanity in your display, you can never have <RTL-label>.3com.com - because that would probably display as 3.<RTL-label>com.com, which is confusing.

So 1.ي should not be rejected, but 1.ي.3com should be. (inspect the order of the characters in that one!)
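A rough sketch of that domain-level check, assuming the RFC 5893 definition of an RTL label (a label containing at least one character of class R, AL, or AN). The function names are made up for illustration, and the input is assumed to be already in Unicode form.

```python
import unicodedata


def is_rtl_label(label: str) -> bool:
    """RTL label per RFC 5893: contains a character of class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN")
               for ch in label)


def digit_label_follows_rtl(domain: str) -> bool:
    """Hypothetical check: does a label starting with an ASCII digit
    come after a right-to-left label?"""
    seen_rtl = False
    for label in domain.split("."):
        first = label[:1]
        if seen_rtl and first.isascii() and first.isdigit():
            return True
        if is_rtl_label(label):
            seen_rtl = True
    return False


# "\u064a" is the Arabic letter used in the examples in this thread.
print(digit_label_follows_rtl("1.\u064a"))       # digit label comes first
print(digit_label_follows_rtl("1.\u064a.3com"))  # digit label after RTL
```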

@annevk
Member

annevk commented Jan 10, 2023

@alvestrand so do I understand it correctly that IDNA2008 doesn't take a stance as to whether all labels in a domain need to obey The Bidi Rule?

https://www.unicode.org/reports/tr46/#Validity_Criteria does which might explain the difference.

https://www.rfc-editor.org/rfc/rfc5891.html#section-4.2.3.4 seems to only enforce The Bidi Rule upon individual labels containing characters whose property is R whereas UTS46 enforces it upon all labels in a domain as long as CheckBidi is true.
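The difference in scope can be sketched as follows. This is my reading of the six conditions of RFC 5893 Section 2 written out in Python, not code from any implementation. The function names are hypothetical, "RTL label" is taken to mean a label with at least one R, AL, or AN character, and failing an empty label is an assumption.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label: str) -> bool:
    """The six conditions of RFC 5893 Section 2 for a single label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False  # assumption: the RFC does not define this case
    # Conditions 3 and 6 look at the last character before trailing NSMs.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    if end == 0:
        return False
    last = classes[end - 1]
    seen = set(classes)
    if classes[0] in ("R", "AL"):  # condition 1: this is an RTL label
        return (seen <= RTL_ALLOWED                   # condition 2
                and last in ("R", "AL", "EN", "AN")   # condition 3
                and not {"EN", "AN"} <= seen)         # condition 4
    if classes[0] == "L":  # condition 1: this is an LTR label
        return seen <= LTR_ALLOWED and last in ("L", "EN")  # conditions 5, 6
    return False  # condition 1 fails, e.g. the label starts with a digit


def is_rtl_label(label: str) -> bool:
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN")
               for ch in label)


def valid_per_label_scope(labels: list[str]) -> bool:
    """Enforce the rule only on RTL labels (the RFC 5891 reading)."""
    return all(satisfies_bidi_rule(l) for l in labels if is_rtl_label(l))


def valid_whole_domain_scope(labels: list[str]) -> bool:
    """Enforce the rule on every label once any label is RTL
    (the UTS46 reading with CheckBidi set to true)."""
    if not any(is_rtl_label(l) for l in labels):
        return True
    return all(satisfies_bidi_rule(l) for l in labels)


labels = ["1", "\u064a"]
print(valid_per_label_scope(labels), valid_whole_domain_scope(labels))
```

With these two scopes the same input is accepted by the first function and rejected by the second, purely because of the numeric label.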

I do think The Bidi Rule is somewhat confusing if that is the case as it itself states

The following rule, consisting of six conditions, applies to labels
in Bidi domain names.

which easily leads one to think it applies to all labels and has to be obeyed.

Also, it's not clear to me how from The Bidi Rule enforced only upon labels containing characters whose property is R the guarantee follows that labels starting with an ASCII digit do not come after the RTL label.

@alvestrand

You are correct. IDNA2008 states only rules about single labels - this was a result of discussing the various ways in which labels can be put together into domain names.

There is a very explicit discussion of "what can happen if you concatenate labels into domain names" in https://www.rfc-editor.org/rfc/rfc5893#section-5 - it ends with "Rather than trying to suggest rules that disallow all such undesirable situations, this document merely warns about the possibility, and leaves it to application developers to take whatever measures they deem appropriate to avoid problematic situations."

TR46 was written by people who have far less DNS experience than the people who were involved in RFC 5891. The two groups did not agree at the time TR46 was first written, and while my impression is that TR46 has been revised to be more in line with IDNA2008 over time, I am not surprised that there are still cases where trying to interpret the two as saying exactly the same thing will fail.

@annevk
Member

annevk commented Jan 10, 2023

@alvestrand okay, but that still doesn't address my last paragraph about the purported guarantees from IDNA2008.

@alvestrand

The point is that no single entity can make that guarantee, as described in section 5.
Remember that IDNA2008 intends to impose requirements on people registering labels; it does not impose requirements on those who use domain names.

If you want to require that a certain application rejects domain names that don't obey the requirements, that's an application spec, not a DNS spec. Section 5 (and the "it follows" parts of section 2) are intended to give guidance on how to decide to reject such names.

(I suspect that I'm a victim of knowing what I intended when I wrote it, and being unable to see where it's unclear; to me, I'm just repeating what I already wrote in the RFC. But I still hope it's understandable.)

@annevk
Member

annevk commented Jan 11, 2023

The point is that no single entity can make that guarantee, as described in section 5.

So why say they are guarantees?

Given that IDNA2003 was implemented by user agents it does seem somewhat irresponsible that IDNA2008 didn't try to address them at all, but I guess that's water under the bridge.

I guess I need input from @achristensen07 @valenting @markusicu @macchiati as to what exactly we'd like to enforce here. Banning numeric labels in domains containing RTL labels seems bad so I assume we want to change that part of UTS46.

Enforcing The Bidi Rule for labels containing a character whose property is R seems realistic and implemented by all user agents.

We could additionally try to enforce the second "guarantee" by not allowing a numeric label after an RTL label, but not sure.

@zackw

This comment was marked as off-topic.

@annevk

This comment was marked as off-topic.

@annevk
Member

annevk commented Jan 16, 2023

My plan is to submit feedback to Unicode's April meeting to get this addressed. Draft:

Please change the processing model of CheckBidi to allow for more right-to-left domains.

Currently when CheckBidi is set to true and the input is determined to be a Bidi domain name it enforces all six subrules of The Bidi Rule https://www.rfc-editor.org/rfc/rfc5893.html#section-2 for each label of a domain. This has a couple of issues:

  • As discussed in Refusing a mix of numeric-only and BIDI domains #543 subrule 1 alone ends up disallowing EN code point labels in such domain names (e.g., 1.ي is a fatal error). This seems unnecessarily constraining.
  • Subrule 1 also creates undefined behavior for empty string labels (e.g., for a domain such as ي.), as it imposes requirements upon a character that is not there. (If the expectation is that trailing dots are removed before ToASCII is invoked that could use clearer documentation or an assert somewhere.)
  • As discussed in the URL Standard issue referenced one of the editors of IDNA2008 asserts The Bidi Rule was not aimed at client implementations, but rather at registries. While browsers have been enforcing it to varying degree nevertheless as suggested by UTS46, it's probably worth another close review to ensure this is actually what we want.

I don't have a recommendation here unfortunately as this is not my area of expertise. It's my hope Bidi experts on the committee can help out. One solution might be to not enforce subrule 1 for left-to-right labels, but do enforce that a label that starts with an EN code point cannot follow a right-to-left label.

If anyone here has suggestions for how to make this more concrete I'm all ears.

cc @ricea

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 16, 2023
As per whatwg/url#543. Not ready to merge until that has concluded.
@alvestrand

The point on empty labels is actually not right.
The string "a.b.c." (trailing dot) does not represent a DNS name with an empty label; it is a syntactic convention saying "we know c is a top level domain, don't try to append your search path elements to it in order to find it".

It's largely fallen out of use.

My suggestion for a solution would be to add text in the URL standard as follows:

The IDNA2008 standard, in RFC 5893, gives a rule for evaluating whether or not a single label is suitable for use in a BIDI domain name, and some advice for applications.

RFC 5893 defines the terms "RTL label", "LTR label", "Bidi domain name", and "Bidi rule".

Based on this advice, the following domain names will be accepted by the URL standard:

  • Domains containing only labels that obey the Bidi rule
  • Domains containing RTL labels followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

Those should be the necessary and sufficient rules for ensuring that the display of domain names using the Unicode bidi algorithm doesn't contain characters that "jump the dot".

@annevk
Member

annevk commented Jan 17, 2023

It's correct per how browsers and the URL Standard deal with it. The domain name is "x.". That is passed to UTS46, which splits on "." to get labels. At that point they have an empty label they need to deal with. Currently they pass it as-is to IDNA2008's The Bidi Rule, where it goes wrong. (And as I suggest, we could instead omit the empty label before passing it on to UTS46.)

Thanks for suggesting a set of rules. I'll incorporate that in the feedback.

@alvestrand

This is the problematic statement in UTS#46: https://unicode.org/reports/tr46/#ProcessingStepBreak
"Break the string into labels at U+002E"

The problem is that a.b.c. is using the "preferred name syntax" from RFC 1035 section 2.3.1, where empty labels are disallowed - and UTS#46 is ignoring that.

The grammar rule is "<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]" - this was relaxed to allow leading digits in RFC 1123 section 2.1, but there was never a relaxation of the rule that there should be at least one character.

A competent DNS name processor should:

a) disallow any domain name with two consecutive dots
b) interpret a trailing dot as "this domain name is rooted at the DNS root", not as a trailing empty label
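As a sketch of what a) and b) could look like in code (the function name `split_dns_name` and its return shape are made up for illustration):

```python
def split_dns_name(name: str) -> tuple[list[str], bool]:
    """Hypothetical helper: return (labels, rooted) for a dotted name.

    A single trailing dot means "rooted at the DNS root" and does not
    produce an empty label. Any other empty label is an error.
    """
    rooted = name.endswith(".")
    if rooted:
        name = name[:-1]
    labels = name.split(".")
    # An empty label here comes from two consecutive dots, a leading
    # dot, or a bare "."; rejecting the last two as well is a choice
    # made by this sketch.
    if any(label == "" for label in labels):
        raise ValueError("empty label in domain name")
    return labels, rooted


print(split_dns_name("a.b.c."))
print(split_dns_name("a.b.c"))
```

With this shape, no empty label ever reaches a per-label check such as The Bidi Rule.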

@annevk
Member

annevk commented Jan 17, 2023

This is all way before DNS gets involved and also has other applications (such as the same-origin policy) so it's not quite that simple, but it might well be better if UTS46 is not invoked with a trailing dot. They just need to make that clear I think.

@annevk
Member

annevk commented Jan 23, 2023

@alvestrand on reflection, it's not clear to me how your suggestion ends up allowing cases such as 1.ي. Not all labels there obey The Bidi Rule as discussed. Perhaps it needs to be something like this:

  • Bidi domain names need to obey these rules:
    • Their RTL labels need to obey The Bidi Rule.
    • If they contain an LTR label starting with an EN code point, that LTR label cannot follow an RTL label.

@alvestrand

The example of 1.ي is not covered by my suggested rule:

Domains containing RTL labels followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

since the RTL domain is a top level domain in this case.
It is covered if reformulated as

Domains containing RTL labels where each RTL label is either the top level domain or directly followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

(the word "directly" makes it more obvious that 1.ي.foo.3.tld is allowed too)

@annevk
Member

annevk commented Jan 26, 2023

@alvestrand but for that second now-reformulated case would the RTL labels still need to obey "The Bidi Rule"? And when you say "domains" do you mean "Bidi domain names" or all of them?

@annevk
Member

annevk commented Feb 6, 2023

@alvestrand if you could give this another look that would help. Otherwise I'll submit feedback without a specific recommendation.

@alvestrand

alvestrand commented Feb 6, 2023

The two paragraphs in my suggested rule are AND, not OR; domain names need to satisfy both.

All domain names with RTL labels are Bidi domain names.
Quoting RFC 5893 again:

A "Bidi domain name" is a domain name that contains at least one RTL
label. (Note: This definition includes domain names containing only
dots and right-to-left characters. Providing a separate category of
"RTL domain names" would not make this specification simpler, so it
has not been done.)

Domain names that don't contain RTL labels are out of scope for this recommendation.

@annevk
Member

annevk commented Feb 6, 2023

@alvestrand how does AND work for LTR labels solely consisting of EN code points? They would violate The Bidi Rule.

@macchiati

macchiati commented Feb 7, 2023 via email

@annevk
Member

annevk commented Feb 7, 2023

Yeah, I guess we could accept it, but it seems unnecessarily constraining for RTL users and developers, and ends up rejecting domains known to exist (see OP). At the moment it also only matches WebKit, but unfortunately I haven't been able to get @ricea (Chromium) and @valenting (Gecko) to chime in thus far.

@alvestrand

alvestrand commented Feb 7, 2023

OP had mail.163.com.xn----9mcjf9b4dbm09f.com - here, the RTL label is followed by an ASCII label that does not start with a digit. I don't see how that would fail the rule I suggested.

To @annevk : All-numeric labels start with a number. No need to consider anything more about them; if they follow an RTL label, they make the domain name fail the rule.

(Note: 1.ي.3.tld (the 3 is actually the subdomain of .tld) is an example of an all-numeric label. It will be rare for users to actually comprehend that.)

@annevk
Member

annevk commented Feb 7, 2023

@alvestrand because the 163 label violates The Bidi Rule subrule 1 as we've said repeatedly in this thread. Mark just mentioned it again just now in his first paragraph.

@alvestrand

alvestrand commented Feb 7, 2023

The 163 label does not follow an RTL label, so while it violates the bidi rule for a label, it doesn't violate the domain name rule I proposed. I think I've said that several times too. Quoting RFC 5893 again:

o In a domain name consisting of only LDH labels (as defined in the
Definitions document [RFC5890]) and labels that satisfy the rule,
the requirements of Section 3 are satisfied as long as a label
that starts with an ASCII digit does not come after a
right-to-left label.

Satisfying the requirements of section 3 should be the goal of a domain name verification filter.

@annevk
Member

annevk commented Feb 7, 2023

In #543 (comment) you suggested two rules and later clarified they are AND. One of the rules is that all labels adhere to The Bidi Rule.

Could you please restate your rules in clearer terms?

@alvestrand

Adding in the AND and "immediately" from the suggested clarifications gives the following text:

The IDNA2008 standard, in RFC 5893, gives a rule for evaluating whether or not a single label is suitable for use in a BIDI domain name, and some advice for applications.

RFC 5893 defines the terms "RTL label", "LTR label", "Bidi domain name", and "Bidi rule".

Based on this advice, the following domain names will be accepted by the URL standard:

  • Domains containing only labels that obey the Bidi rule, AND
  • Domains containing RTL labels immediately followed by an LTR label consisting only of ASCII characters, where the first character is not a digit.

The AND means that both kinds of domain will be accepted, it is "accept AND accept".

I don't understand where the comprehension difficulty is, but then English is not my first language.

@annevk
Member

annevk commented Feb 7, 2023

Well, you also said:

The two paragraphs in my suggested rule are AND, not OR; domain names need to satisfy both.

But now you are saying domain names only need to satisfy one of the rules, right? (Which brings me back to my question about the lack of enforcement of The Bidi Rule on RTL labels with the second rule.)

@alvestrand

I was wrong when I said "domain names need to satisfy both". I wasn't reading my own proposed text. Sorry!

Try N:

The IDNA2008 standard, in RFC 5893, gives a rule for evaluating whether or not a single label is suitable for use in a BIDI domain name, and some advice for applications.

RFC 5893 defines the terms "RTL label", "LTR label", "Bidi domain name", and "Bidi rule".

Based on this advice, the following domain names will be accepted by the URL standard:

  • Domains containing only labels that obey the Bidi rule, AND
  • Domains containing RTL labels immediately followed by an LTR label consisting only of ASCII characters, where the first character is not a digit, and where all labels are either LDH labels or obey the Bidi rule.

@annevk
Member

annevk commented Feb 7, 2023

Thank you, that seems like an improvement, but LDH labels per https://datatracker.ietf.org/doc/html/rfc5890#section-2.3.1 contain A-labels and it seems that A-labels that are RTL labels should obey The Bidi Rule. So maybe LDH there should be LTR?

@alvestrand

No, LDH should be LDH, because there are LTR labels that don't obey the Bidi rule, and we need to not permit those. (See bullets 5 and 6 of the Bidi rule).

I don't have proposed surrounding text for this rule, but it should probably say "this rule is evaluated after all A-labels have been converted to U-labels for testing" - meaning that xn-- labels should be decoded before evaluating; if we don't do that, explicit xn-- labels offer a way to sneak in bidi domains into unsuspecting places.

@annevk
Member

annevk commented Feb 7, 2023

Okay that makes sense. I think that precondition means that LTR in your second rule can be LDH as well (which guarantees ASCII).

And to be clear, there is the (unstated) precondition that these domains are Bidi domain names, right? As presumably we will not impose these requirements on non-Bidi domain names.

I think with that we'd recommend these changes to UTS 46:

  1. Remove step 8 of https://unicode.org/reports/tr46/#Validity_Criteria as Validity Criteria only operates on a single label. (Although it somehow claims to have knowledge about the domain_name string as well...)

  2. Add a new step 5 to https://unicode.org/reports/tr46/#Processing. (Note that due to step 4 we will have U-labels.)

    1. If CheckBidi, and the domain_name string is a Bidi domain name, record there was an error if neither of the following conditions is true:
      • All labels in the domain_name string satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
      • RTL labels in the domain_name string are immediately followed by an LDH label whose first code point is not of class EN and all labels in the domain_name string are either LDH labels or satisfy the 6 subrules of The Bidi Rule of RFC 5893, Section 2.
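For reviewers who prefer code, here is how I would read the proposed step as a Python sketch. It is not an implementation of UTS46. The function names are hypothetical, labels are assumed to be U-labels already, and one point is my assumption rather than part of the text above: an RTL label in last position (the top-level domain) has nothing following it and is allowed, following the "either the top level domain or directly followed by" wording used earlier in this thread.

```python
import re
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
# LDH label per RFC 5890 Section 2.3.1: ASCII letters, digits, hyphen,
# no hyphen at either end, at most 63 characters.
LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def satisfies_bidi_rule(label: str) -> bool:
    """The six subrules of The Bidi Rule (RFC 5893 Section 2)."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1  # subrules 3 and 6 ignore trailing NSM characters
    if end == 0:
        return False  # empty or all-NSM label; failing it is an assumption
    first, last, seen = classes[0], classes[end - 1], set(classes)
    if first in ("R", "AL"):
        return (seen <= RTL_ALLOWED and last in ("R", "AL", "EN", "AN")
                and not {"EN", "AN"} <= seen)
    if first == "L":
        return seen <= LTR_ALLOWED and last in ("L", "EN")
    return False


def is_rtl_label(label: str) -> bool:
    return any(unicodedata.bidirectional(ch) in ("R", "AL", "AN")
               for ch in label)


def is_ldh_label(label: str) -> bool:
    return LDH.fullmatch(label) is not None


def passes_proposed_check(labels: list[str]) -> bool:
    """Return True when no error would be recorded for these labels."""
    if not any(is_rtl_label(l) for l in labels):
        return True  # not a Bidi domain name, out of scope
    if all(satisfies_bidi_rule(l) for l in labels):
        return True  # first condition
    if not all(is_ldh_label(l) or satisfies_bidi_rule(l) for l in labels):
        return False
    for i, label in enumerate(labels[:-1]):  # last label has no successor
        if is_rtl_label(label):
            nxt = labels[i + 1]
            if (not is_ldh_label(nxt)
                    or unicodedata.bidirectional(nxt[0]) == "EN"):
                return False
    return True  # second condition


# "\u064a" stands in for the RTL label of the domains in OP.
print(passes_proposed_check(["mail", "163", "com", "\u064a", "com"]))
print(passes_proposed_check(["1", "\u064a", "3com"]))
```

Under this reading a numeric label before the RTL label is accepted and a label starting with a digit directly after it is rejected.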

I'd appreciate your review and that of anyone else still paying attention. 😅

@alvestrand

Thanks for the context!

Yes, I think this is appropriate advice.

@annevk
Member

annevk commented Feb 13, 2023

Thank you @alvestrand for coming up with the recommendation, @vorner for raising this, and everyone else who helped move this along! I submitted the feedback to Unicode for their April 2023 meeting. The final comment can be found at the bottom of OP in #744.

@hsivonen
Member

  • 1.xn--mhb errors (though does not error in Gecko, presumably it has CheckBidi set to false;

The presumption is not correct. Previously (at the time of the quoted comment), CheckBidi was true in Gecko, but Gecko invoked UTS 46 processing on a per-label basis, so the bidiness status of the domain as a whole did not end up affecting LTR labels.

Gecko currently (well after the quoted comment) implements CheckBidi (still true) as described in the Unicode 15.1 version of UTS 46, so if there is RTL anywhere in the domain as a whole, the domain is rejected if there is an LTR label that starts with an ASCII digit. (This also has the effect that domains like 1password.com and 9to5mac.com fall off the fastest path, because at the time a fast-path decision about the first label needs to be made, it's not yet known if there is going to be RTL in subsequent labels.)

(This comment is not meant as disagreement with the feedback relayed to the UTC.)

8 participants