Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leaving ICU behind (proof of concept) #68

Closed
wants to merge 2 commits into from
Closed

Conversation

lemire
Copy link
Member

@lemire lemire commented Jan 19, 2023

This is not yet quite correct, but it passes 'most' of our tests (it is currently only failing toascii_encoding). It requires much more work but this PR shows how we could do away with the ICU dependency by implementing just what we need. Essentially, we need these components...

  1. Validating exiting punycode.
  2. Transcribing non-ASCII into punycode.
  3. An up-to-data list of code points that are either invalid (must be rejected) or that must be omitted. And other fancy internationalization issues. For Unicode 15, the table is https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt

RFC 3454 tells us that we need to

  • Map characters (see section 3),
  • Normalize (see section 4),
  • Reject forbidden characters,
  • Check for right-to-left characters and if so, check all requirements (see section 6),
  • Optionally reject based on unassigned code points (section 7).

Reference to the standard: https://www.unicode.org/reports/tr46/#ToUnicode

You can verify that it is 'relatively' far along... comment out toascii_encoding() and it works...

index 4dd7dd8..f3d79bc 100644
--- a/tests/wpt_tests.cpp
+++ b/tests/wpt_tests.cpp
@@ -312,7 +312,7 @@ bool urltestdata_encoding() {
 int main() {
   std::cout << "Running WPT tests.\n" << std::endl;
 
-  if (percent_encoding() && setters_tests_encoding() && toascii_encoding() &&
+  if (percent_encoding() && setters_tests_encoding() && //toascii_encoding() &&
       urltestdata_encoding()) {
     std::cout << "WPT tests are ok." << std::endl;
     return EXIT_SUCCESS;

However, I stress that the unicode validation and processing is only a sketch. It needs to be completed. It is not conceptually very hard. I did not look into how people implement these things. But it is mostly a matter of having the right tables. I rolled my own for now.

Whether we want to go down this route is up to us. Having fewer dependencies is clearly a positive. However, rolling our own requires us to do more work and more testing.

while (!input.empty()) {
size_t loc_dot = input.find('.');
size_t loc_full_stop = input.find("\u3002"); // complete as needed
size_t loc = std::min(loc_dot, loc_full_stop);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes 2 find operations and a std::min. We can reduce it. First finding either characters and depending on that finding the other. Wonder it will be faster...

Copy link
Member Author

@lemire lemire Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anonrig

That's the least of your concerns. My code is only an illustration. The specification says...

The domain labels of a domain domain are the result of strictly splitting domain on U+002E (.).

But it is not sufficient, the tests say...

"Ideographic full stop (full-width period for Chinese, etc.) should be treated as a dot. U+3002 is mapped to U+002E (dot)"

And this refers to the following specification...

https://www.unicode.org/reports/tr46/#ToUnicode

I did not implement it, I only sketched it (to show how it might be done).

The ICU implementation is here...

https://github.com/unicode-org/icu/blob/bb0e745e25c99cc57055caf45c81b95ef63b25d4/icu4c/source/common/uts46.cpp#L1298

The list of characters that are mapped to the dot include...

  • U+3002 (ideographic full stop)
  • U+FF0E (fullwidth full stop)
  • U+FF61 (halfwidth ideographic full stop)

I only included 3002 here.

The specification says...

In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:

U+002E ( . ) FULL STOP
U+FF0E ( . ) FULLWIDTH FULL STOP
U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
U+FF61 ( 。 ) HALFWIDTH IDEOGRAPHIC FULL STOP

if (!label_string.has_value()) {
return std::nullopt;
}
if (label_string.value().empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be simplified to label_string.value_or.empty(). Removes the previous branch

// and labels cannot exceed 64 characters. However, wpt_tests seems to want
// to allow more general cases?
// Though wpt_tests does not want limits, let us put one anyhow. If someone
// has a domain with over 1MB, we refuse to work on it (safety!).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this to readme? Sounds like a good feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ICU will check for violation of these conditions...

https://github.com/unicode-org/icu/blob/bb0e745e25c99cc57055caf45c81b95ef63b25d4/icu4c/source/common/uts46.cpp#L32

// A domain name label is longer than 63 bytes.

But the tests (WPT) seem to expect that this be ignored... Search for the tests...

"Label longer than 63 code points",

So it wants to allow longer labels. I am not sure I understand.

auto loc = input.rfind('-');
uint32_t count = 0;
if (loc != std::string_view::npos) {
for(uint8_t c : input.substr(0, loc)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::any_of would make this easier to read

for (auto iterator = input.begin(); iterator != input.end();) {
int start_i = i;
int w = 1;
for (int k = 36;; k += 36) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use simd for this. Nice 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The punycode algorithm is not easily vectorizable. It is designed from the ground up for character-by-character processing. I am not saying it could not be accelerated, but it would be a major challenge.


out.value().resize(length);
out = ada::puny::convert_domain_to_puny(input, be_strict);
if(!out.has_value()) { return false; }
if(std::any_of(out.value().begin(), out.value().end(), ada::unicode::is_forbidden_domain_code_point)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this inside the punycode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this line is probably useless.

return false;
}

bool is_ignorable(uint32_t code_point) noexcept {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a documentation for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the unicode version you want to implement. For Unicode 15, the table is there:

https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt

So you have mapped values (you need to transform it from one code point to another), valid values (you just keep it) and forbidden values (you must produce an error).

@lemire
Copy link
Member Author

lemire commented Jan 19, 2023

I am going to close this for now.

@lemire lemire closed this Jan 19, 2023
@anonrig anonrig deleted the dlemire/leaving_icu branch May 25, 2023 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants