Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with UTS #46 tests #341

Closed
TimothyGu opened this issue Aug 16, 2017 · 20 comments
Closed

Issues with UTS #46 tests #341

TimothyGu opened this issue Aug 16, 2017 · 20 comments

Comments

@TimothyGu
Copy link
Member

Wrote a letter to Unicode through their feedback form on July 30 with the details. Haven't heard back despite their promises that "You can expect an acknowledgement of your report within 2-3 business days."

Subject: Issues with UTS #46's conformance test file

To whoever it may concern,

While developing a product conforming to "UTS #46: Unicode IDNA Compatibility
Processing, Version 10.0.0" [UTS46], we noticed a few issues with the provided
conformance testing file (IdnaTest.txt). These issues are preventing us from
implementing UTS #46 in tr46.js [TR46JS-ISSUE].

The IdnaTest.txt file is formatted as a list of semicolon-separated values. The
meanings of the specific columns are given in UTS #46 Section 8.1, an excerpt
of which is hereby reproduced [UTS46]:

> No Field      Description
> ...
> 3  toUnicode  The result of applying toUnicode to the source, using
>               "nontransitional".
>               A blank value means the same as the source value; a value in
>               [...] is a set of error codes.
> 4  toASCII    The result of applying toASCII to the source, using the
>               specified type: T, N, or B.
>               A blank value means the same as the toUnicode value; a value in
>               [...] is a set of error codes.
>
> ...
>
> An error in toUnicode or toASCII is indicated by an error list of the form
> [...]. In such a case, the contents of that list are error codes based on the
> step numbers in UTS46 and IDNA2008:
>
>     ...
>     An for Section 4.2 ToASCII, step n
>     ...

Given that "An" applies only to the ToASCII algorithm, not the ToUnicode
algorithm, it seems appropriate for field "toUnicode" in IdnaTest.txt to never
have an error code of form An. Yet, in the published IdnaTest.txt file
corresponding to version 10.0.0 [IDNA-TEST], there exist 305 entries in
IdnaTest.txt where an "An" error code appears under "toUnicode". In particular,
there exist 36 entries with _only_ an "An" error code under "toUnicode" --
which, in other words, means that the only justification for erroring on those
entries from ToUnicode is not actually in ToUnicode.

This is particularly troubling, since while the Standard allows for ADDITIONAL
error cases than ones already specified in IdnaTest.txt, a product conforming
to UTS #46 must produce an error on ALL error cases in IdnaTest.txt, per lines
68-72 of IdnaTest.txt, again reproduced below:

> ... Thus to then verify conformance for the toASCII and toUnicode columns:
>
> - If the file indicates an error, the implementation must also have an error.
> - If the file does not indicate an error, then the implementation must either
>   have an error, or must have a matching result.

A close examination of the 36 entries mentioned above reveals that:

- 9 of the 36 entries have only "[A3]" error code under ToUnicode, which
  corresponds to the Punycode-encoding step in ToASCII. The source domains all
  have one label with invalid Punycode-encoding though, so they would in fact
  have already recorded an error in no. 4 of Processing Steps, which is called
  upon by ToUnicode as well. In other words, these entries merely have a faulty
  error code; ToUnicode would still record an error for these entries, just one
  at a different step than advertised.

  Some samples from these 9 entries are:

  Line 313: B;	xn--0.pt;	[A3];	[A3]
  Line 315: B;	xn--a-Ä.pt;	[A3];	[A3]
  Line 316: B;	xn--a-A\u0308.pt;	[A3];	[A3]

- The other 27 entries have only a "[A4_2]" error code under ToUnicode, which
  corresponds to the DNS length verification step under ToASCII. Some of them
  are:

  Line 201: B;	。;	[A4_2];	[A4_2]
  Line 202: B;	.;	[A4_2];	[A4_2]
  Line 434: B;	a..c;	[A4_2];	[A4_2]
  Line 439: B;	ä.\u00AD.c;	[A4_2];	[A4_2]

  While these domain names are all rather unlikely to be allowed by real-world
  UTS #46 implementations, most (if not all) of them are still strictly allowed
  by ToUnicode as defined in UTS #46.
  
  Take line 201, for example. Step 1 of ToUnicode call into the Processing
  Steps, whose step 1 will map '。' to '.', and which will then pass through
  the rest of Processing Steps without recording an error. Step 2 of ToUnicode
  will then produce a "converted Unicode string" of '.', and signal there was
  no error.

The 27 entries in IdnaTest.txt with [A4_2] are the real worrying ones, since
they seem to go against the algorithms defined in UTS #46, and prevent us from
creating a strict implementation of UTS #46 without passing its own conformance
tests.

To resolve these issues, I would like to see the following:

- A clarification whether the aforementioned 27 entries should record an error
  in ToUnicode.
- Corresponding changes to IdnaTest.txt or UTS #46 that accompany that
  clarification.
- There be no entries in IdnaTest.txt with a ToUnicode error code that point to
  steps in ToASCII.

Sincerely,

Timothy Gu

[UTS46]: http://www.unicode.org/reports/tr46/tr46-19.html
[IDNA-TEST]: http://www.unicode.org/Public/idna/10.0.0/IdnaTest.txt
[TR46JS-ISSUE]: https://github.com/Sebmaster/tr46.js/pull/13
@annevk
Copy link
Member

annevk commented Aug 16, 2017

Thanks @TimothyGu. I don't think there's need to worry just yet, they replied to my May 24 email (about changing processing_option to a boolean) on Aug 11. It might be that they first need to have another meeting to figure out what to do.

As for your bug report, if you always applied ToASCII before ToUnicode (which I think is the intent), would you still run into issues?

@TimothyGu
Copy link
Member Author

I see. Well by this point I just hope I wrote my email address correctly in the feedback form…

In the URL Standard we do use ToASCII before ToUnicode. But Unicode tests don't do that, I don't think.

@annevk
Copy link
Member

annevk commented Mar 14, 2018

An update was pushed out. http://www.unicode.org/draft/reports/tr46/tr46.html#Format lists changes to the test format including questions they'd like feedback on.

I'm told that if we can get back to them quickly that'd be good.

@TimothyGu
Copy link
Member Author

I won’t be able to look at this until late next week.

@annevk
Copy link
Member

annevk commented May 6, 2020

@TimothyGu any chance you could poke at this again?

@TimothyGu
Copy link
Member Author

An update is that the new tests look good, and we were able to adopt them for tr46 successfully. On the other hand, the newest version of UTS 46 tests (for Unicode 13) introduced new errors it looks like, but we don't need to track that here.

@annevk
Copy link
Member

annevk commented May 7, 2020

Well, let me reopen for a bit, is there a way we could import these tests into web-platform-tests? Happy to track that as a new issue. Just wondering.

@annevk annevk reopened this May 7, 2020
@karwa
Copy link
Contributor

karwa commented May 3, 2022

I apologise for the bump, and I'd be happy to open a new issue if that would be better. But I've also encountered an issue with the UTS46 tests.

Is anybody successfully using the latest version of the tests?

For example, I've been getting error "P1". That corresponds to processing step 1, the mapping table. The specific test that was failing was "≠.", which should flag an error at this step to signal a codepoint with the state "disallowed", but for some reason it wasn't doing that.

Looking at the section on mapping, it appears that this specific codepoint is marked disallowed_STD3_valid - in other words:

the status is disallowed if UseSTD3ASCIIRules=true (the normal case); implementations that allow UseSTD3ASCIIRules=false would treat the code point as valid.

We apparently use the not-normal case, so my implementation is behaving correctly for a URL context, but I don't have the ability to ignore only those errors that are due to this difference.

I managed to find the offending code, and sure enough, it is considering code-points disallowed and throwing out error P1's regardless of the value of UseSTD3ASCIIRules.

I was about to file a bug, but this is the first time I'm encountering this algorithm so I'm not entirely sure if I messed up. It all seems to line up and the evidence appears to corroborate the story. But it obviously raises questions about what everybody else is doing to test conformance. Why did nobody else hit this before me?

@annevk
Copy link
Member

annevk commented May 3, 2022

I'm not sure. According to https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly/iiaAuLw==&base=YWJvdXQ6Ymxhbms= it seems that browsers and the reference implementation agree for that input (it becomes "xn--1ch.").

@macchiati @srl295 @markusicu could you perhaps help us out with these tests or providing guidance on how to best provide feedback on them? Thanks!

@domenic
Copy link
Member

domenic commented May 3, 2022

Note that per jsdom/whatwg-url#239 (also by @karwa) it seems we might not be able to trust the reference implementation for this sort of thing 😢.

https://jsdom.github.io/whatwg-url/#url=aHR0cDovL3huLS1sczhoPT09Lw==&base=YWJvdXQ6Ymxhbms= (from that issue) is another interesting case, where Chrome/Firefox/reference implementation all differ.

@karwa
Copy link
Contributor

karwa commented May 6, 2022

Hmm, I also found what I believe to be an ambiguity in UTS46, which is causing actual implementation divergence, so I sent a Unicode Error Report for that to be clarified.

I think it underscores why integrating these tests in to the WPT is important, and aligns with the overall goal of minimising divergence and promoting interoperability. As we well know, the standards are not perfect, implementations are not perfect, but having everybody use the same test-suite and ensuring that they all run it is a good way to catch imperfections in both the standards and the implementations early.

The issues that I'm seeing make me think that perhaps not everybody is running the tests, or they are running older versions, or they have hacks to exclude certain buggy tests, or (and this one can be really subtle) they are testing a different code-path (via a different set of flags) to what is actually used by the URL Standard. Integrating with the WPT would help guard against those issues, and make it simpler even for non-WPT implementations to match web behaviour.

Or I'm missing something. Unicode is complex, and when you notice an issue, I find it's either because (a) you're just starting to understand how things work, or (b) because you totally misunderstood/forgot something. It's hard to tell. It would be easier to tell if there were a single, correct way of running the tests, and I could see proof that nobody else has a problem with it.

Anyway, here's the report I sent on the (possible) UTS46 ambiguity, for the curious.

> Expand: Unicode Error Report

UTS 46
Version 14.0.0
Date 2021-08-24
Revision 27
https://www.unicode.org/reports/tr46/

I only just started writing my own implementation of this recently, so apologies if I'm misunderstanding, but there are two locations where code-points are checked. Using the same format as the IdnaTestV2.txt file for describing those locations, they would be P1 and V6.

  • P1 is applied to the entire domain, as given. So it may see (decoded) Unicode text, or Punycode. It takes the value of UseSTD3ASCIIRules in to account, so a domain like "≠ᢙ≯.com" triggers the error at P1 only if UseSTD3ASCIIRules=true. "xn--jbf911clb.com" will never trigger the error at this location, regardless of UseSTD3ASCIIRules, because it is encoded as ASCII.

  • V6 is applied to the result of Punycode-decoding a domain label, so it will only see decoded Unicode text. As written, it would appear not to take UseSTD3ASCIIRules in to consideration, meaning that both "≠ᢙ≯.com" and "xn--jbf911clb.com" would trigger errors at this location, regardless of UseSTD3ASCIIRules.

Here is the text of Section 4.1, Validity Criteria (https://www.unicode.org/reports/tr46/#Validity_Criteria), Step 6:

Each code point in the label must only have certain status values according to Section 5, IDNA Mapping Table:

  • For Transitional Processing, each value must be valid.
  • For Nontransitional Processing, each value must be either valid or deviation.

It is not clear whether these status values are supposed to take the value of UseSTD3ASCIIRules in to account. As described above, if V6 does not consider UseSTD3ASCIIRules, "≠ᢙ≯.com" and "xn--jbf911clb.com" will always be invalid domains. It does not matter that P1 considers UseSTD3ASCIIRules, because it will be caught by V6 later anyway.
This leads me to believe that it should respect UseSTD3ASCIIRules (otherwise the parameter would be meaningless). But it's not clear.

I'll have to apologise again because I am not very familiar with the codebases I am about to cite, but from what I can glean this is leading to divergence in the wild:

@annevk
Copy link
Member

annevk commented May 6, 2022

Thank you @karwa. Really appreciate the time you are putting into this.

@annevk
Copy link
Member

annevk commented Jan 10, 2023

How do people feel about our own coverage in resources/toascii.json? Can we expand upon that instead?

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 12, 2023
@annevk
Copy link
Member

annevk commented Jan 12, 2023

I created #733 and web-platform-tests/wpt#37907 based on the more recent discussion here.

I guess we'll keep this open for people to report more issues they run into with UTS46 tests and then hopefully they get clarified upstream, but if not we'll cover them in WPT somehow.

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 12, 2023
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Jan 19, 2023
…ases, a=testonly

Automatic update from web-platform-tests
IDNA: add a couple interesting ToASCII cases

Identified in whatwg/url#341 by karwa.
--

wpt-commits: 3d997a3ff43a545b3d36e12d55478fb264e6d0df
wpt-pr: 37907
annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 20, 2023
@annevk
Copy link
Member

annevk commented Jan 20, 2023

I now created a test runner again for IdnaTestV2.txt: web-platform-tests/wpt#38080. I also wrote down a bunch of feedback there in a separate file I plan to submit to Unicode. Feedback appreciated!

jamienicol pushed a commit to jamienicol/gecko that referenced this issue Jan 20, 2023
…ases, a=testonly

Automatic update from web-platform-tests
IDNA: add a couple interesting ToASCII cases

Identified in whatwg/url#341 by karwa.
--

wpt-commits: 3d997a3ff43a545b3d36e12d55478fb264e6d0df
wpt-pr: 37907
@macchiati
Copy link

macchiati commented Jan 20, 2023 via email

@annevk
Copy link
Member

annevk commented Jan 20, 2023

Yeah, mislabeling and conflicts with the URL parser:

  • Hosts that end with an ASCII digit label will get parsed as an IPv4 address. And thus fail whereas some of these are expected to not fail.
  • Some inputs contain ? which I could perhaps percent-encode first.

With that a lot of the remaining failures in WebKit are #733. I haven't gone through all of them yet as there are >100 failures still.

@macchiati
Copy link

macchiati commented Jan 20, 2023 via email

@annevk
Copy link
Member

annevk commented Jan 23, 2023

I analyzed the remaining WebKit failures and they all seemed to be about UseSTD3ASCIIRules in one way or another. So mainly a status annotation problem, coupled with #733. I went ahead and submitted feedback to Unicode on the IdnaTestV2.txt file. My feedback is in #744.

Per https://unicode.org/timesens/calendar.html it looks like it will be discussed at the end of April so we'll have to wait a bit before we get a resolution here.

annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 24, 2023
This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py.

For whatwg/url#341.
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Feb 1, 2023
Automatic update from web-platform-tests
URL: run a subset of IdnaTestV2.txt in WPT

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py.

For whatwg/url#341.
--

wpt-commits: 9216115f5621b04a27e0f2e9bbf1ce44dd7d3b9e
wpt-pr: 38080
jamienicol pushed a commit to jamienicol/gecko that referenced this issue Feb 3, 2023
Automatic update from web-platform-tests
URL: run a subset of IdnaTestV2.txt in WPT

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py.

For whatwg/url#341.
--

wpt-commits: 9216115f5621b04a27e0f2e9bbf1ce44dd7d3b9e
wpt-pr: 38080
marcoscaceres pushed a commit to web-platform-tests/wpt that referenced this issue Mar 28, 2023
This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py.

For whatwg/url#341.
i3roly pushed a commit to i3roly/firefox-dynasty that referenced this issue Jun 1, 2024
Automatic update from web-platform-tests
URL: run a subset of IdnaTestV2.txt in WPT

This excludes various tests for now due to the open issues mentioned at the top of IdnaTestV2-parser.py.

For whatwg/url#341.
--

wpt-commits: 9216115f5621b04a27e0f2e9bbf1ce44dd7d3b9e
wpt-pr: 38080
@annevk
Copy link
Member

annevk commented Nov 29, 2024

Let's fold this into the aforementioned issues and web-platform-tests/wpt#48301. I don't think there's anything actionable remaining here.

@annevk annevk closed this as completed Nov 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants