Issues with UTS #46 tests #341

TimothyGu · 2017-08-16T02:03:18Z

Wrote a letter to Unicode through their feedback form on July 30 with the details. Haven't heard back despite their promises that "You can expect an acknowledgement of your report within 2-3 business days."

Subject: Issues with UTS #46's conformance test file

To whoever it may concern,

While developing a product conforming to "UTS #46: Unicode IDNA Compatibility
Processing, Version 10.0.0" [UTS46], we noticed a few issues with the provided
conformance testing file (IdnaTest.txt). These issues are preventing us from
implementing UTS #46 in tr46.js [TR46JS-ISSUE].

The IdnaTest.txt file is formatted as a list of semicolon-separated values. The
meanings of the specific columns are given in UTS #46 Section 8.1, an excerpt
of which is hereby reproduced [UTS46]:

> No Field      Description
> ...
> 3  toUnicode  The result of applying toUnicode to the source, using
>               "nontransitional".
>               A blank value means the same as the source value; a value in
>               [...] is a set of error codes.
> 4  toASCII    The result of applying toASCII to the source, using the
>               specified type: T, N, or B.
>               A blank value means the same as the toUnicode value; a value in
>               [...] is a set of error codes.
>
> ...
>
> An error in toUnicode or toASCII is indicated by an error list of the form
> [...]. In such a case, the contents of that list are error codes based on the
> step numbers in UTS46 and IDNA2008:
>
>     ...
>     An for Section 4.2 ToASCII, step n
>     ...

Given that "An" applies only to the ToASCII algorithm, not the ToUnicode
algorithm, it seems appropriate for field "toUnicode" in IdnaTest.txt to never
have an error code of form An. Yet, in the published IdnaTest.txt file
corresponding to version 10.0.0 [IDNA-TEST], there exist 305 entries in
IdnaTest.txt where an "An" error code appears under "toUnicode". In particular,
there exist 36 entries with _only_ an "An" error code under "toUnicode" --
which, in other words, means that the only justification for erroring on those
entries from ToUnicode is not actually in ToUnicode.

This is particularly troubling, since while the Standard allows for ADDITIONAL
error cases than ones already specified in IdnaTest.txt, a product conforming
to UTS #46 must produce an error on ALL error cases in IdnaTest.txt, per lines
68-72 of IdnaTest.txt, again reproduced below:

> ... Thus to then verify conformance for the toASCII and toUnicode columns:
>
> - If the file indicates an error, the implementation must also have an error.
> - If the file does not indicate an error, then the implementation must either
>   have an error, or must have a matching result.

A close examination of the 36 entries mentioned above reveals that:

- 9 of the 36 entries have only "[A3]" error code under ToUnicode, which
  corresponds to the Punycode-encoding step in ToASCII. The source domains all
  have one label with invalid Punycode-encoding though, so they would in fact
  have already recorded an error in no. 4 of Processing Steps, which is called
  upon by ToUnicode as well. In other words, these entries merely have a faulty
  error code; ToUnicode would still record an error for these entries, just one
  at a different step than advertised.

  Some samples from these 9 entries are:

  Line 313: B;	xn--0.pt;	[A3];	[A3]
  Line 315: B;	xn--a-Ä.pt;	[A3];	[A3]
  Line 316: B;	xn--a-A\u0308.pt;	[A3];	[A3]

- The other 27 entries have only a "[A4_2]" error code under ToUnicode, which
  corresponds to the DNS length verification step under ToASCII. Some of them
  are:

  Line 201: B;	。;	[A4_2];	[A4_2]
  Line 202: B;	.;	[A4_2];	[A4_2]
  Line 434: B;	a..c;	[A4_2];	[A4_2]
  Line 439: B;	ä.\u00AD.c;	[A4_2];	[A4_2]

  While these domain names are all rather unlikely to be allowed by real-world
  UTS #46 implementations, most (if not all) of them are still strictly allowed
  by ToUnicode as defined in UTS #46.
  
  Take line 201, for example. Step 1 of ToUnicode call into the Processing
  Steps, whose step 1 will map '。' to '.', and which will then pass through
  the rest of Processing Steps without recording an error. Step 2 of ToUnicode
  will then produce a "converted Unicode string" of '.', and signal there was
  no error.

The 27 entries in IdnaTest.txt with [A4_2] are the real worrying ones, since
they seem to go against the algorithms defined in UTS #46, and prevent us from
creating a strict implementation of UTS #46 without passing its own conformance
tests.

To resolve these issues, I would like to see the following:

- A clarification whether the aforementioned 27 entries should record an error
  in ToUnicode.
- Corresponding changes to IdnaTest.txt or UTS #46 that accompany that
  clarification.
- There be no entries in IdnaTest.txt with a ToUnicode error code that point to
  steps in ToASCII.

Sincerely,

Timothy Gu

[UTS46]: http://www.unicode.org/reports/tr46/tr46-19.html
[IDNA-TEST]: http://www.unicode.org/Public/idna/10.0.0/IdnaTest.txt
[TR46JS-ISSUE]: https://github.com/Sebmaster/tr46.js/pull/13

The text was updated successfully, but these errors were encountered:

annevk · 2017-08-16T09:24:49Z

Thanks @TimothyGu. I don't think there's need to worry just yet, they replied to my May 24 email (about changing processing_option to a boolean) on Aug 11. It might be that they first need to have another meeting to figure out what to do.

As for your bug report, if you always applied ToASCII before ToUnicode (which I think is the intent), would you still run into issues?

TimothyGu · 2017-08-16T09:56:32Z

I see. Well by this point I just hope I wrote my email address correctly in the feedback form…

In the URL Standard we do use ToASCII before ToUnicode. But Unicode tests don't do that, I don't think.

annevk · 2018-03-14T13:34:04Z

An update was pushed out. http://www.unicode.org/draft/reports/tr46/tr46.html#Format lists changes to the test format including questions they'd like feedback on.

I'm told that if we can get back to them quickly that'd be good.

TimothyGu · 2018-03-14T13:57:16Z

I won’t be able to look at this until late next week.

annevk · 2020-05-06T16:55:03Z

@TimothyGu any chance you could poke at this again?

TimothyGu · 2020-05-07T01:42:39Z

An update is that the new tests look good, and we were able to adopt them for tr46 successfully. On the other hand, the newest version of UTS 46 tests (for Unicode 13) introduced new errors it looks like, but we don't need to track that here.

annevk · 2020-05-07T03:55:34Z

Well, let me reopen for a bit, is there a way we could import these tests into web-platform-tests? Happy to track that as a new issue. Just wondering.

karwa · 2022-05-03T11:23:10Z

I apologise for the bump, and I'd be happy to open a new issue if that would be better. But I've also encountered an issue with the UTS46 tests.

Is anybody successfully using the latest version of the tests?

For example, I've been getting error "P1". That corresponds to processing step 1, the mapping table. The specific test that was failing was "≠.", which should flag an error at this step to signal a codepoint with the state "disallowed", but for some reason it wasn't doing that.

Looking at the section on mapping, it appears that this specific codepoint is marked disallowed_STD3_valid - in other words:

the status is disallowed if UseSTD3ASCIIRules=true (the normal case); implementations that allow UseSTD3ASCIIRules=false would treat the code point as valid.

We apparently use the not-normal case, so my implementation is behaving correctly for a URL context, but I don't have the ability to ignore only those errors that are due to this difference.

I managed to find the offending code, and sure enough, it is considering code-points disallowed and throwing out error P1's regardless of the value of UseSTD3ASCIIRules.

I was about to file a bug, but this is the first time I'm encountering this algorithm so I'm not entirely sure if I messed up. It all seems to line up and the evidence appears to corroborate the story. But it obviously raises questions about what everybody else is doing to test conformance. Why did nobody else hit this before me?

annevk · 2022-05-03T12:17:30Z

I'm not sure. According to https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly/iiaAuLw==&base=YWJvdXQ6Ymxhbms= it seems that browsers and the reference implementation agree for that input (it becomes "xn--1ch.").

@macchiati @srl295 @markusicu could you perhaps help us out with these tests or providing guidance on how to best provide feedback on them? Thanks!

domenic · 2022-05-03T14:41:39Z

Note that per jsdom/whatwg-url#239 (also by @karwa) it seems we might not be able to trust the reference implementation for this sort of thing 😢.

https://jsdom.github.io/whatwg-url/#url=aHR0cDovL3huLS1sczhoPT09Lw==&base=YWJvdXQ6Ymxhbms= (from that issue) is another interesting case, where Chrome/Firefox/reference implementation all differ.

karwa · 2022-05-06T01:58:32Z

Hmm, I also found what I believe to be an ambiguity in UTS46, which is causing actual implementation divergence, so I sent a Unicode Error Report for that to be clarified.

I think it underscores why integrating these tests in to the WPT is important, and aligns with the overall goal of minimising divergence and promoting interoperability. As we well know, the standards are not perfect, implementations are not perfect, but having everybody use the same test-suite and ensuring that they all run it is a good way to catch imperfections in both the standards and the implementations early.

The issues that I'm seeing make me think that perhaps not everybody is running the tests, or they are running older versions, or they have hacks to exclude certain buggy tests, or (and this one can be really subtle) they are testing a different code-path (via a different set of flags) to what is actually used by the URL Standard. Integrating with the WPT would help guard against those issues, and make it simpler even for non-WPT implementations to match web behaviour.

Or I'm missing something. Unicode is complex, and when you notice an issue, I find it's either because (a) you're just starting to understand how things work, or (b) because you totally misunderstood/forgot something. It's hard to tell. It would be easier to tell if there were a single, correct way of running the tests, and I could see proof that nobody else has a problem with it.

Anyway, here's the report I sent on the (possible) UTS46 ambiguity, for the curious.

> Expand: Unicode Error Report

UTS 46
Version 14.0.0
Date 2021-08-24
Revision 27
https://www.unicode.org/reports/tr46/

I only just started writing my own implementation of this recently, so apologies if I'm misunderstanding, but there are two locations where code-points are checked. Using the same format as the IdnaTestV2.txt file for describing those locations, they would be P1 and V6.

P1 is applied to the entire domain, as given. So it may see (decoded) Unicode text, or Punycode. It takes the value of UseSTD3ASCIIRules in to account, so a domain like "≠ᢙ≯.com" triggers the error at P1 only if UseSTD3ASCIIRules=true. "xn--jbf911clb.com" will never trigger the error at this location, regardless of UseSTD3ASCIIRules, because it is encoded as ASCII.
V6 is applied to the result of Punycode-decoding a domain label, so it will only see decoded Unicode text. As written, it would appear not to take UseSTD3ASCIIRules in to consideration, meaning that both "≠ᢙ≯.com" and "xn--jbf911clb.com" would trigger errors at this location, regardless of UseSTD3ASCIIRules.

Here is the text of Section 4.1, Validity Criteria (https://www.unicode.org/reports/tr46/#Validity_Criteria), Step 6:

Each code point in the label must only have certain status values according to Section 5, IDNA Mapping Table:

For Transitional Processing, each value must be valid.

For Nontransitional Processing, each value must be either valid or deviation.

It is not clear whether these status values are supposed to take the value of UseSTD3ASCIIRules in to account. As described above, if V6 does not consider UseSTD3ASCIIRules, "≠ᢙ≯.com" and "xn--jbf911clb.com" will always be invalid domains. It does not matter that P1 considers UseSTD3ASCIIRules, because it will be caught by V6 later anyway.
This leads me to believe that it should respect UseSTD3ASCIIRules (otherwise the parameter would be meaningless). But it's not clear.

I'll have to apologise again because I am not very familiar with the codebases I am about to cite, but from what I can glean this is leading to divergence in the wild:

Unicode-org implementation of IDNA not appear to consider UseSTD3ASCIIRules here:
https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/idna/Uts46.java#L610-L625
This seems to be confirmed by the IdnaTestV2 file. For example, Version 14.0.0 (Date: 2021-08-17, 19:34:01 GMT) lines 571 and 573:
```
[571] xn--jbf911clb.xn----p9j493ivi4l; ≠ᢙ≯.솣-ᡴⴀ; [V6]; xn--jbf911clb.xn----p9j493ivi4l; ; ;  # ≠ᢙ≯.솣-ᡴⴀ
[573] xn--jbf911clb.xn----6zg521d196p; ≠ᢙ≯.솣-ᡴႠ; [V6]; xn--jbf911clb.xn----6zg521d196p; ; ;  # ≠ᢙ≯.솣-ᡴႠ
```
"V6" is not an optional validation step tied to any parameter; implementations cannot decide whether or not it applies to them. It always applies, and these tests are saying these domains should always be considered invalid IIUC.
JSDOM implementation does consider UseSTD3ASCIIRules here, thinks this domain is valid:
https://github.com/jsdom/tr46/blob/e937be8d9c04b7938707fc3701e50118b7c023a5/index.js#L100
Browsers effectively do the same in URLs. Safari 15 and JSOM both consider "http://≠ᢙ≯.com.xn--jbf911clb" to be a perfectly fine URL:
https://jsdom.github.io/whatwg-url/#url=aHR0cDovL+KJoOGimeKJry5jb20ueG4tLWpiZjkxMWNsYg==&base=YWJvdXQ6Ymxhbms=

annevk · 2022-05-06T13:36:38Z

Thank you @karwa. Really appreciate the time you are putting into this.

annevk · 2023-01-10T13:09:12Z

How do people feel about our own coverage in resources/toascii.json? Can we expand upon that instead?

Identified in whatwg/url#341 by karwa.

annevk · 2023-01-12T16:34:48Z

I created #733 and web-platform-tests/wpt#37907 based on the more recent discussion here.

I guess we'll keep this open for people to report more issues they run into with UTS46 tests and then hopefully they get clarified upstream, but if not we'll cover them in WPT somehow.

Identified in whatwg/url#341 by karwa.

…ases, a=testonly Automatic update from web-platform-tests IDNA: add a couple interesting ToASCII cases Identified in whatwg/url#341 by karwa. -- wpt-commits: 3d997a3ff43a545b3d36e12d55478fb264e6d0df wpt-pr: 37907

For whatwg/url#341.

annevk · 2023-01-20T11:14:37Z

I now created a test runner again for IdnaTestV2.txt: web-platform-tests/wpt#38080. I also wrote down a bunch of feedback there in a separate file I plan to submit to Unicode. Feedback appreciated!

…ases, a=testonly Automatic update from web-platform-tests IDNA: add a couple interesting ToASCII cases Identified in whatwg/url#341 by karwa. -- wpt-commits: 3d997a3ff43a545b3d36e12d55478fb264e6d0df wpt-pr: 37907

macchiati · 2023-01-20T17:09:42Z

It appears (at a first glance!) that most of the issues are mislabeling of the reasons for rejection. Is that correct?

…

On Fri, Jan 20, 2023 at 3:14 AM Anne van Kesteren ***@***.***> wrote: I now created a test runner again for IdnaTestV2.txt: web-platform-tests/wpt#38080 <web-platform-tests/wpt#38080>. I also wrote down a bunch of feedback there in a separate file I plan to submit to Unicode. Feedback appreciated! — Reply to this email directly, view it on GitHub <#341 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJLEMEPKY35SURGV3YWFN3WTJXSTANCNFSM4DXBJUXQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

annevk · 2023-01-20T17:31:24Z

Yeah, mislabeling and conflicts with the URL parser:

Hosts that end with an ASCII digit label will get parsed as an IPv4 address. And thus fail whereas some of these are expected to not fail.
Some inputs contain ? which I could perhaps percent-encode first.

With that a lot of the remaining failures in WebKit are #733. I haven't gone through all of them yet as there are >100 failures still.

macchiati · 2023-01-20T17:51:37Z

I didn't think you were done yet!

…

On Fri, Jan 20, 2023 at 9:31 AM Anne van Kesteren ***@***.***> wrote: Yeah, mislabeling and conflicts with the URL parser: - Hosts that end with an ASCII digit label will get parsed as an IPv4 address. And thus fail whereas some of these are expected to not fail. - Some inputs contain ? which I could perhaps percent-encode first. With that a lot of the remaining failures in WebKit are #733 <#733>. I haven't gone through all of them yet as there are >100 failures still. — Reply to this email directly, view it on GitHub <#341 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJLEMD3CWEZD5A4HEXNFKLWTLDXRANCNFSM4DXBJUXQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

annevk · 2023-01-23T12:42:34Z

I analyzed the remaining WebKit failures and they all seemed to be about UseSTD3ASCIIRules in one way or another. So mainly a status annotation problem, coupled with #733. I went ahead and submitted feedback to Unicode on the IdnaTestV2.txt file. My feedback is in #744.

Per https://unicode.org/timesens/calendar.html it looks like it will be discussed at the end of April so we'll have to wait a bit before we get a resolution here.

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py. For whatwg/url#341.

Automatic update from web-platform-tests URL: run a subset of IdnaTestV2.txt in WPT This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py. For whatwg/url#341. -- wpt-commits: 9216115f5621b04a27e0f2e9bbf1ce44dd7d3b9e wpt-pr: 38080

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py. For whatwg/url#341.

Automatic update from web-platform-tests URL: run a subset of IdnaTestV2.txt in WPT This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py. For whatwg/url#341. -- wpt-commits: 9216115f5621b04a27e0f2e9bbf1ce44dd7d3b9e wpt-pr: 38080

annevk · 2024-11-29T11:38:10Z

Let's fold this into the aforementioned issues and web-platform-tests/wpt#48301. I don't think there's anything actionable remaining here.

TimothyGu added the topic: idna label Aug 16, 2017

annevk added the i18n-tracking label Oct 13, 2017

himorin mentioned this issue Nov 12, 2019

Issues with UTS #46 tests w3c/i18n-activity#815

Open

annevk removed the i18n-tracking label Nov 21, 2019

TimothyGu closed this as completed May 7, 2020

annevk reopened this May 7, 2020

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 12, 2023

IDNA: add a couple interesting ToASCII cases

ac58dc2

Identified in whatwg/url#341 by karwa.

annevk mentioned this issue Jan 12, 2023

IDNA: add a couple interesting ToASCII cases web-platform-tests/wpt#37907

Merged

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 12, 2023

IDNA: add a couple interesting ToASCII cases

3d997a3

Identified in whatwg/url#341 by karwa.

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 20, 2023

URL: run IdnaTestV2.txt in WPT

4708e9a

For whatwg/url#341.

annevk mentioned this issue Jan 20, 2023

URL: run IdnaTestV2.txt in WPT web-platform-tests/wpt#38080

Merged

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 24, 2023

URL: run a subset of IdnaTestV2.txt in WPT

9216115

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py. For whatwg/url#341.

marcoscaceres pushed a commit to web-platform-tests/wpt that referenced this issue Mar 28, 2023

URL: run a subset of IdnaTestV2.txt in WPT

4e6bdcc

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py. For whatwg/url#341.

annevk closed this as completed Nov 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with UTS #46 tests #341

Issues with UTS #46 tests #341

TimothyGu commented Aug 16, 2017

annevk commented Aug 16, 2017

TimothyGu commented Aug 16, 2017

annevk commented Mar 14, 2018

TimothyGu commented Mar 14, 2018

annevk commented May 6, 2020

TimothyGu commented May 7, 2020

annevk commented May 7, 2020

karwa commented May 3, 2022 •

edited

Loading

annevk commented May 3, 2022

domenic commented May 3, 2022

karwa commented May 6, 2022 •

edited

Loading

annevk commented May 6, 2022

annevk commented Jan 10, 2023

annevk commented Jan 12, 2023

annevk commented Jan 20, 2023

macchiati commented Jan 20, 2023 via email

annevk commented Jan 20, 2023

macchiati commented Jan 20, 2023 via email

annevk commented Jan 23, 2023

annevk commented Nov 29, 2024

Issues with UTS #46 tests #341

Issues with UTS #46 tests #341

Comments

TimothyGu commented Aug 16, 2017

annevk commented Aug 16, 2017

TimothyGu commented Aug 16, 2017

annevk commented Mar 14, 2018

TimothyGu commented Mar 14, 2018

annevk commented May 6, 2020

TimothyGu commented May 7, 2020

annevk commented May 7, 2020

karwa commented May 3, 2022 • edited Loading

annevk commented May 3, 2022

domenic commented May 3, 2022

karwa commented May 6, 2022 • edited Loading

annevk commented May 6, 2022

annevk commented Jan 10, 2023

annevk commented Jan 12, 2023

annevk commented Jan 20, 2023

macchiati commented Jan 20, 2023 via email

annevk commented Jan 20, 2023

macchiati commented Jan 20, 2023 via email

annevk commented Jan 23, 2023

annevk commented Nov 29, 2024

karwa commented May 3, 2022 •

edited

Loading

karwa commented May 6, 2022 •

edited

Loading