Two-stage unicode tables for UTF-8 Char format #25653

Open

stevengj opened this issue Jan 20, 2018 · 2 comments
Labels
performance ("Must go faster"), strings ("Strings!"), unicode ("Related to unicode characters and encodings")

Comments

@stevengj
Member

stevengj commented Jan 20, 2018

In the long run, it would be nice to have things like character-query functions (isalpha, grapheme breaks, etcetera) that are based directly on the new Char format, rather than requiring conversion/decoding to UInt32. Since we already maintain our own Unicode tables in utf8proc, it seems reasonable to switch to "natively" employing the new format at some point.

The standard way to do this is a two-stage table, since a single lookup table of all Unicode code points would be too big/slow.
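
For concreteness, here is a minimal sketch of the idea (my own illustration, not utf8proc's actual data layout; the names build_two_stage, lookup, property, and BLOCK are hypothetical): the first stage maps the high-order bits of a code point to a block id, identical blocks are deduplicated, and the second stage stores the per-codepoint values within each block.

# Hedged sketch of a classic two-stage table with 256-codepoint blocks.
# `property` stands in for whatever per-codepoint value is being tabled,
# e.g. a category code.
const BLOCK = 256

function build_two_stage(property, maxcp::Integer = 0x10FFFF)
    nblocks = div(maxcp + 1, BLOCK)
    seen   = Dict{Vector{UInt8},Int}()   # deduplicate identical blocks
    stage1 = zeros(Int, nblocks)         # block id for each 256-codepoint range
    stage2 = Vector{UInt8}[]             # the deduplicated blocks themselves
    for b in 1:nblocks
        blk = UInt8[property(UInt32((b-1)*BLOCK + i)) for i in 0:BLOCK-1]
        stage1[b] = get!(seen, blk) do
            push!(stage2, blk)
            length(stage2)
        end
    end
    return stage1, stage2
end

# Lookup: the high-order bits select a block, the low-order bits index within it.
lookup(stage1, stage2, cp::UInt32) = stage2[stage1[(cp >> 8) + 1]][(cp & 0xff) + 1]

A query then costs two array indexings plus a little arithmetic, regardless of the code point.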

Before we lock ourselves into the new Char format, however, it would be good to think about how it affects two-stage tables. In particular, since two-stage tables are based on dividing codepoints into blocks via the low-order bits, the fact that the encoded Char values are zero-padded may be a concern.

julia> reinterpret(UInt32, 'a')
0x61000000
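
A few more examples of different UTF-8 lengths (my own illustration, assuming the same left-justified, zero-padded layout as above):

julia> reinterpret.(UInt32, ('a', 'α', '∀', '🍕'))
(0x61000000, 0xceb10000, 0xe2888000, 0xf09f8d95)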

For many codepoints in this format, the least-significant bits will provide no information. Does that mean that traditional two-stage tables won't work? Is there an easy fix?

I haven't really thought about this much, but I think it's important to take a look to make sure we aren't creating any headaches for later. cc @StefanKarpinski

@stevengj added the unicode label on Jan 20, 2018
@StefanKarpinski
Member

> Before we lock ourselves into the new Char format, however, it would be good to think about how it affects two-stage tables.

An important point about the new Char format is that we are not locked into it. #24999 has already broken code that assumed Char is represented in terms of UTF-32 code units. Instead, code should compare characters directly as much as possible, avoid relying on any particular representation of Char, and convert to code points only when necessary. Once code no longer relies on the representation, we can change it in the future to whatever we like, including back to UInt32 code point values, as long as we retain some way of encoding invalid sequences.
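
As a small illustration of that advice (my own sketch, not part of the original comment; the helper names are hypothetical):

# Compare characters directly; this works regardless of how Char is represented:
isasciidigit(c::Char) = '0' <= c <= '9'

# Convert to a code point only when an integer value is genuinely needed;
# this decodes the Char and errors for a malformed character rather than
# silently producing a bogus value:
ascodepoint(c::Char) = UInt32(c)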

@stevengj
Member Author

stevengj commented Jan 20, 2018

In a quick benchmark, the cost of looking something up in utf8proc's 2-stage tables (which operate on the UInt32 codepoint value) appears to completely swamp the cost of decoding the new Char type.
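
For reference, the decoding step being measured here is just a handful of shifts and masks. A hedged sketch of mine (not Base's actual implementation, and with no validity checking):

function decode(c::Char)
    u = reinterpret(UInt32, c)
    u < 0x80000000 && return u >> 24        # 1-byte (ASCII) character
    n = leading_ones(u)                     # UTF-8 length, from the lead byte
    cp = (u >> 24) & (0x7f >> n)            # payload bits of the lead byte
    for i in 1:n-1
        cp = (cp << 6) | ((u >> (24 - 8i)) & 0x3f)   # 6 bits per continuation byte
    end
    return cp
end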

In 0.6 (no decoding needed):

julia> @btime Base.UTF8proc.category_code('ϵ');
  3.380 ns (0 allocations: 0 bytes)
julia> @btime Base.UTF8proc.category_code('🍕');
  3.592 ns (0 allocations: 0 bytes)
julia> @btime Base.UTF8proc.category_code('a');
  3.796 ns (0 allocations: 0 bytes)

In 0.7 (requires decoding to UInt32), I get about the same times, modulo 10% "noise":

julia> @btime Base.Unicode.category_code('ϵ');
  3.482 ns (0 allocations: 0 bytes)
julia> @btime Base.Unicode.category_code('🍕');
  3.792 ns (0 allocations: 0 bytes)
julia> @btime Base.Unicode.category_code('a');
  3.482 ns (0 allocations: 0 bytes)

Moreover, this is not an entirely fair comparison, because such functions are normally applied to characters found in a String, which required a more expensive decoding step in 0.6. The current design can be thought of as simply deferring that decoding (although the decoding may be repeated if you call multiple utf8proc lookups on the same character). Here is a simple function that sums the category codes of a string. In 0.6:

julia> function sumcats(s)
           i = 0
           for c in s
               i += Base.UTF8proc.category_code(c)
           end
           return i
       end
sumcats (generic function with 1 method)

julia> @btime sumcats($("asciiαβγ∀  🐨 🍕 ∑"));
  97.354 ns (0 allocations: 0 bytes)

In 0.7 (replacing UTF8proc with Unicode), I get:

julia> @btime sumcats($("asciiαβγ∀  🐨 🍕 ∑"));
  120.045 ns (0 allocations: 0 bytes)

I'm not sure why it would be 20% slower in 0.7, given that the category_code lookup itself seems to be about the same speed. Is there something else that explains the regression here? I thought character iteration was supposed to be faster.
