Use ICU4X to run parts of util.unicode.org #1004

Open
sffc opened this issue Jan 26, 2025 · 5 comments

Comments

@sffc
Member

sffc commented Jan 26, 2025

Currently util.unicode.org runs on top of ICU4J. It works fine, but it is sometimes slow or hits the rate limits that we've imposed to cap server costs, as is happening as I write this message:

Image

We should add ICU4X-backed tooling to parts of util.unicode.org via WebAssembly. This has the benefit of reducing latency (all calculations are client-side) and serving costs (the ICU4X wasm file can be cost-efficiently cached and served in a CDN).

The Unicode Tools are designed to run on the latest (even unreleased) version of the Unicode Standard, and so part of this project may involve improving some of the ICU4X tooling so that it can read raw UCD files. See unicode-org/icu4x#4602

CC @josh-hadley @eggrobin

@eggrobin
Member

eggrobin commented Jan 26, 2025

There is much more to it than using draft data. We also expose properties that should never be part of APIs, things that are not properties, etc., and we use an extremely featureful version of UnicodeSet that should not be part of general-purpose libraries (in particular, this allows you to look at past versions of Unicode, or search property values using regular expressions).

In particular, the fact that the properties library is the same that we use to actually generate and test the standard matters when it comes to being confident that we know what we are publishing.

I suspect that the traffic we are seeing is some sort of crawling bot though:
Image

All of these queries are « the characters that have some specific value of some property » (typically with one result), but without much rhyme or reason to what values and properties are queried. Queries of this form are linked from the character.jsp page, so I suspect this is something following the links there.
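
For illustration, the shape of such a query can be sketched as follows. This is a minimal sketch using Python's standard `unicodedata` module, not the ICU4J-based Unicode Tools or ICU4X; the helper name `chars_with_property_value` is hypothetical, and the data version is whatever ships with the Python interpreter rather than the latest or draft UCD.

```python
import sys
import unicodedata


def chars_with_property_value(get_property, value, limit=sys.maxunicode + 1):
    """Return every code point below `limit` whose property equals `value`.

    `get_property` is any function mapping a one-character string to a
    property value, e.g. `unicodedata.category`.
    """
    return [cp for cp in range(limit) if get_property(chr(cp)) == value]


# Example: all characters whose General_Category is Zl (Line_Separator).
# Like the queries described above, this one has a single result.
line_separators = chars_with_property_value(unicodedata.category, "Zl")
print([f"U+{cp:04X}" for cp in line_separators])  # prints ['U+2028']
```

Each such query scans the whole code space, which gives a rough sense of why a crawler issuing many of them against a server could be costly.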

@eggrobin
Member

Here’s the current traffic from one specific (slightly odd) user agent:

Image

@macchiati
Member

macchiati commented Jan 26, 2025 via email

@srl295
Member

srl295 commented Jan 27, 2025

@sffc is this ticket a dup? or have we just talked about it without an issue?

if node is nodejs maybe someone is scraping.

@sffc
Member Author

sffc commented Jan 27, 2025

This issue is to track migrating parts of util.unicode.org to ICU4X, which has been discussed in various forums, but I couldn't find a canonical issue.

Investigation into other server cost mitigation or rate limiting techniques could be discussed elsewhere. That doesn't, however, invalidate the motivation for making popular parts of the site run client-side.
