Should we forbid U+226E (≮) and U+226F (≯) in hosts? #733

annevk · 2023-01-12T16:29:20Z

From https://www.unicode.org/reports/tr46/#UseSTD3ASCIIRules:

There are a very small number of non-ASCII characters with the data file status disallowed_STD3_valid:

U+2260 ( ≠ ) NOT EQUAL TO
U+226E ( ≮ ) NOT LESS-THAN
U+226F ( ≯ ) NOT GREATER-THAN

Those characters are disallowed with UseSTD3ASCIIRules=true because the set of characters in their canonical decompositions are not entirely in the valid set (Step 7 of the Table Derivation). However, they are allowed with UseSTD3ASCIIRules=false, because the base characters of their canonical decompositions, U+003D ( = ) EQUALS SIGN, U+003C ( < ) LESS-THAN SIGN, and U+003E ( > ) GREATER-THAN SIGN, are each valid under that option. If an implementation uses UseSTD3ASCIIRules=false but disallows any of these three ASCII characters, then it must also disallow the corresponding precomposed character for its negation.

We allow =, but < and > are forbidden. All of the three non-ASCII code points listed above work fine in WebKit and I personally might not see the problem as strongly as UTS46 does. I added tests for them in web-platform-tests/wpt#37907. (The tests reflect the status quo.)
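The canonical decompositions that the quoted UTS46 text relies on can be inspected directly. This is a minimal sketch using Python's standard `unicodedata` module; the helper name `base_of` is made up for illustration and is not part of UTS46 or the URL Standard.

```python
import unicodedata

# The three precomposed code points named in the quoted UTS46 text.
NEGATED = ["\u2260", "\u226E", "\u226F"]  # ≠ ≮ ≯


def base_of(ch: str) -> str:
    """Return the first code point of the canonical (NFD) decomposition.

    Hypothetical helper, for illustration only.
    """
    return unicodedata.normalize("NFD", ch)[0]


for ch in NEGATED:
    nfd = unicodedata.normalize("NFD", ch)
    # Print each code point of the decomposition in U+XXXX notation.
    parts = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(f"U+{ord(ch):04X} {ch} decomposes to {parts} (base {base_of(ch)!r})")
```

Each of the three decomposes to its ASCII base character followed by U+0338 COMBINING LONG SOLIDUS OVERLAY, which is why UTS46 ties their status to that of =, <, and >.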

Thoughts?

cc @karwa @ricea @achristensen07 @valenting

ricea · 2023-01-13T12:12:03Z

On my computer http://example≯ looks very similar to http://example>/, which is not great. But it's probably not a good enough reason to change the status quo.

karwa · 2023-01-13T13:19:51Z

Fundamentally, I'm not sure why the decomposition of these characters is even relevant. UTS46 normalises them to a composed form and Punycodes that, so none of these characters should ever result in naked ASCII =/</> characters being sent over the wire -- and I think that's all that standards such as STD3, or DNS servers, routers, etc. should care about: that the label doesn't collide with other delimiters and whatnot.

So I see no technical reason why these characters should be disallowed. And I see no non-technical reason why we should disallow characters such as ≯, while allowing all of the following:

http://┴/ - box drawing character. Allowed => http://xn--qxh/
http://∫/ - integral symbol. Allowed => http://xn--jbh/
http://𝜢𝜠𝜰/ - Mathematical bold italic capitals. Allowed => http://xn--qxad7b/
http://𐦖.𓀡.𓀈/ - Ancient Egyptian hieroglyphics. Allowed => http://xn--6n9c.xn--3p7d.xn--ep7d/
http://helpme𓏎/ - Another hieroglyphic. U+133CE POT WITH LEGS. Allowed => http://xn--helpme-gt36b/
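The "normalise to NFC, then Punycode" step described above can be approximated as follows. This is a rough sketch, not the UTS46 ToASCII algorithm: `to_a_label` is a hypothetical helper that skips UTS46 mapping, validity checks, and length limits, and only uses Python's built-in `punycode` codec.

```python
import unicodedata


def to_a_label(label: str) -> str:
    """NFC-normalise a single label and Punycode-encode it if non-ASCII.

    Hypothetical helper: no UTS46 mapping or validity checking is done.
    """
    label = unicodedata.normalize("NFC", label)
    if label.isascii():
        return label
    # The punycode codec returns the encoded label without the ACE prefix.
    return "xn--" + label.encode("punycode").decode("ascii")


for label in ["\u2534", "\u222B", "\u226F", ">\u0338"]:
    encoded = to_a_label(label)
    # Punycode output only uses letters, digits, and hyphens, so no raw
    # '=', '<', or '>' can appear in the encoded label.
    print(repr(label), "=>", encoded)
```

Under these assumptions, the precomposed ≯ and the sequence `>` + U+0338 produce the same A-label, and neither contains a literal `>`.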

annevk · 2023-01-13T13:22:01Z

Thanks! I suppose this is another issue where it would be great to get input from @markusicu @macchiati.

macchiati · 2023-01-13T15:28:38Z

These are good points. Markus, see any good reason to disallow, given that the result has to be NFC?

markusicu · 2023-01-13T17:35:27Z

I am not vested in these three characters, or possible future ones with this behavior. Clearly the UTS46 rule is based on their Decomposition_Mapping, but UTS46 does use NFC compositions, and there are no compositions with other combining marks that could block these.
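The point about NFC composition can be checked with a short sketch using Python's `unicodedata` module. The choice of U+0323 COMBINING DOT BELOW as the "other" combining mark is an arbitrary assumption made for this example.

```python
import unicodedata

SOLIDUS = "\u0338"    # COMBINING LONG SOLIDUS OVERLAY
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, an arbitrary second mark


def nfc(s: str) -> str:
    """Shorthand for NFC normalisation."""
    return unicodedata.normalize("NFC", s)


# The base character plus the overlay always recomposes under NFC.
print([f"U+{ord(c):04X}" for c in nfc("<" + SOLIDUS)])

# U+0338 has canonical combining class 1, so canonical ordering moves it
# ahead of marks with a higher class before composition is attempted.
print(unicodedata.combining(SOLIDUS), unicodedata.combining(DOT_BELOW))
print([f"U+{ord(c):04X}" for c in nfc("<" + DOT_BELOW + SOLIDUS)])
```

In both cases the output of NFC starts with the precomposed U+226E rather than a bare `<`.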

Who decides on these things? Consensus of browser makers?

For a formal request to change this, please use https://www.unicode.org/reporting.html --> UTC / Report Error in Publication/Data

annevk · 2023-01-14T06:52:59Z

Thanks, I'll file feedback as well as for #543 in time for Unicode's April meeting.

In my experience of trying to make IDNA interoperable over the past decade browsers have not been super opinionated on ToASCII. (Now ToUnicode is another matter, but that algorithm isn't directly exposed.) As long as we err on the side of compatibility, i.e., making hosts resolve, I think it should work out.

And apparently the IETF hasn't been opinionated enough either as according to a comment in that other issue they gave up on standardizing the details of client behavior with IDNA2008. So I'm very thankful we have UTS46.

annevk · 2023-01-16T13:46:08Z

Tentative feedback (not submitted yet):

Please change U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.

These code points are not decomposed so they can never conflict with =, <, and >. And they are not inherently more confusing than any of the other allowed code points, which include hieroglyphics and emoji. These code points also work as-is in all browser engines (while < and > are forbidden) and on balance preference ought to be given to retaining compatibility so end users are not prevented from visiting websites or seeing subresources that might use these code points in their domain for one reason or another.

For further background and discussion please see https://github.com/whatwg/url/issues/733.

Thank you!

macchiati · 2023-01-17T01:19:18Z

Sounds reasonable to me; what do you think, Markus?

markusicu · 2023-01-17T19:02:13Z

tentative feedback lgtm

annevk · 2023-01-23T12:38:17Z

Thanks, it's now submitted along with some other items, summarized in #744. I haven't yet submitted feedback on CheckBidi as I'm still not sure what to recommend. See #543.

rmisev · 2024-09-25T17:21:07Z

Please change U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.

This has already been fixed in UTS 46 15.1.0, see https://www.unicode.org/reports/tr46/tr46-31.html#Modifications
So maybe this issue can be closed?

annevk · 2024-09-25T18:02:45Z

I guess we were already testing this? If so, agreed.

rmisev · 2024-09-25T18:51:32Z

Yes, there are tests for these characters, but we test with UseSTD3ASCIIRules=false:
https://github.com/web-platform-tests/wpt/blob/a19eaaf167389a79c8971fbd25c557965541bdfd/url/resources/toascii.json#L163-L175

annevk · 2024-09-25T21:48:09Z

That seems correct, no?

rmisev · 2024-09-26T18:36:25Z

Yes, the tests are correct.

annevk added topic: parser topic: idna labels Jan 12, 2023

This was referenced Jan 12, 2023

Issues with UTS #46 tests #341

Closed

IDNA: add a couple interesting ToASCII cases web-platform-tests/wpt#37907

Merged

annevk mentioned this issue Jan 23, 2023

Meta: UTS46 feedback #744

Open

annevk closed this as completed Sep 26, 2024
