Two-stage unicode tables for UTF-8 Char format #25653
An important point about the new `Char` format:

In a quick benchmark, the cost of looking something up in utf8proc's 2-stage tables (which operate on the decoded `UInt32` codepoint) seems essentially unchanged.

In 0.6 (no decoding needed):

```julia
julia> @btime Base.UTF8proc.category_code('ϵ');
  3.380 ns (0 allocations: 0 bytes)

julia> @btime Base.UTF8proc.category_code('🍕');
  3.592 ns (0 allocations: 0 bytes)

julia> @btime Base.UTF8proc.category_code('a');
  3.796 ns (0 allocations: 0 bytes)
```

In 0.7 (requires decoding to `UInt32`):

```julia
julia> @btime Base.Unicode.category_code('ϵ');
  3.482 ns (0 allocations: 0 bytes)

julia> @btime Base.Unicode.category_code('🍕');
  3.792 ns (0 allocations: 0 bytes)

julia> @btime Base.Unicode.category_code('a');
  3.482 ns (0 allocations: 0 bytes)
```

Moreover, this is not an entirely fair comparison, because normally such functions would be applied to characters found in a string. In 0.6:

```julia
julia> function sumcats(s)
           i = 0
           for c in s
               i += Base.UTF8proc.category_code(c)
           end
           return i
       end
sumcats (generic function with 1 method)

julia> @btime sumcats($("asciiαβγ∀ 🐨 🍕 ∑"));
  97.354 ns (0 allocations: 0 bytes)
```

In 0.7 (replacing `Base.UTF8proc` with `Base.Unicode`):

```julia
julia> @btime sumcats($("asciiαβγ∀ 🐨 🍕 ∑"));
  120.045 ns (0 allocations: 0 bytes)
```

I'm not sure why it would be 20% slower in 0.7, given that the per-character lookup times above are nearly identical.
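For context on what "decoding to `UInt32`" entails, here is a minimal sketch, assuming the 0.7 representation in which a character's UTF-8 bytes are stored zero-padded (left-justified) in a `UInt32`; Base's actual code also has to deal with malformed sequences, which this ignores, and `codepoint_sketch` is a hypothetical name:

```julia
# Sketch (not Base's actual implementation) of decoding the 0.7-era
# Char representation -- UTF-8 bytes in the high-order bytes of a
# UInt32, zero-padded below -- back to a plain codepoint.
function codepoint_sketch(c::Char)
    u = reinterpret(UInt32, c)
    u < 0x80000000 && return u >> 24       # 1-byte (ASCII) sequence
    n = leading_ones(u)                    # sequence length: 2-4 bytes
    cp = (u >> 24) & (0x7f >> n)           # payload bits of the lead byte
    for i in 1:n-1                         # fold in each continuation byte
        cp = (cp << 6) | ((u >> (24 - 8i)) & 0x3f)
    end
    return cp
end
```

For example, `codepoint_sketch('ϵ')` gives `0x000003f5` (U+03F5). In 0.6, by contrast, a `Char` was simply the `UInt32` codepoint, so no such step was needed before the table lookup.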
In the long run, it would be nice to have things like character-query functions (`isalpha`, grapheme breaks, etcetera) that are based directly on the new `Char` format, rather than requiring conversion/decoding to `UInt32`. Since we already maintain our own Unicode tables in utf8proc, it seems reasonable to switch to "natively" employing the new format at some point.

The standard way to do this is a two-stage table, since a single lookup table of all Unicode code points would be too big/slow.
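As an illustration of the idea, here is a hedged sketch with hypothetical `stage1`/`stage2` arrays (not utf8proc's actual layout or block size):

```julia
# Sketch of a classic two-stage table lookup. `stage1` and `stage2`
# would be generated offline from the Unicode character database;
# identical 256-entry property blocks are stored only once in
# `stage2`, which is what keeps the tables compact.
function category_lookup(cp::UInt32, stage1::Vector{UInt16}, stage2::Vector{UInt8})
    block = Int(stage1[(cp >> 8) + 1])      # high-order bits select the block
    j = Int(cp & 0xff)                      # low-order 8 bits index within it
    return stage2[(block << 8) + j + 1]
end
```

The detail that matters here is the second line of the body: the within-block index is taken from the low-order bits of the value being looked up.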
Before we lock ourselves into the new `Char` format, however, it would be good to think about how it affects two-stage tables. In particular, since two-stage tables are based on dividing codepoints into blocks via the low-order bits, the fact that the encoded `Char` values are zero-padded may be a concern. For many codepoints in this format, the least-significant bits will provide no information. Does that mean that traditional two-stage tables won't work? Is there an easy fix?
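To make the concern concrete (assuming the same zero-padded representation as in the sketches above):

```julia
julia> reinterpret(UInt32, 'a')    # 1-byte sequence: only the top byte is used
0x61000000

julia> reinterpret(UInt32, 'ϵ')    # 2-byte sequence: top two bytes
0xcfb50000

julia> reinterpret(UInt32, '🍕')   # 4-byte sequence: all 32 bits
0xf09f8d95
```

A stage-2 index taken from `u & 0xff`, as in the lookup sketch above, would be zero for every character shorter than four bytes.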
I haven't really thought about this much, but I think it's important to take a look to make sure we aren't creating any headaches for later. cc @StefanKarpinski