Two-stage unicode tables for UTF-8 Char format #25653

Open

stevengj opened this issue Jan 20, 2018 · 2 comments
Labels
performance ("Must go faster"), strings ("Strings!"), unicode ("Related to unicode characters and encodings")

Comments

@stevengj
Member

stevengj commented Jan 20, 2018

In the long run, it would be nice to have things like character-query functions (isalpha, grapheme breaks, etcetera) that are based directly on the new Char format, rather than requiring conversion/decoding to UInt32. Since we already maintain our own Unicode tables in utf8proc, it seems reasonable to switch to "natively" employing the new format at some point.

The standard way to do this is a two-stage table, since a single lookup table of all Unicode code points would be too big/slow.
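
For concreteness, here is a minimal sketch of the idea (my own illustration, not utf8proc's actual data layout; the names build_two_stage, lookup, property, and BLOCK are hypothetical): the first stage maps the high-order bits of a code point to a block id, identical blocks are deduplicated, and the second stage stores the per-codepoint values within each block.

# Hedged sketch of a classic two-stage table with 256-codepoint blocks.
# `property` stands in for whatever per-codepoint value is being tabled,
# e.g. a category code.
const BLOCK = 256

function build_two_stage(property, maxcp::Integer = 0x10FFFF)
    nblocks = div(maxcp + 1, BLOCK)
    seen   = Dict{Vector{UInt8},Int}()   # deduplicate identical blocks
    stage1 = zeros(Int, nblocks)         # block id for each 256-codepoint range
    stage2 = Vector{UInt8}[]             # the deduplicated blocks themselves
    for b in 1:nblocks
        blk = UInt8[property(UInt32((b-1)*BLOCK + i)) for i in 0:BLOCK-1]
        stage1[b] = get!(seen, blk) do
            push!(stage2, blk)
            length(stage2)
        end
    end
    return stage1, stage2
end

# Lookup: the high-order bits select a block, the low-order bits index within it.
lookup(stage1, stage2, cp::UInt32) = stage2[stage1[(cp >> 8) + 1]][(cp & 0xff) + 1]

A query then costs two array indexings plus a little arithmetic, regardless of the code point.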

Before we lock ourselves into the new Char format, however, it would be good to think about how it affects two-stage tables. In particular, since two-stage tables are based on dividing codepoints into blocks via the low-order bits, the fact that the encoded Char values are zero-padded may be a concern.

julia> reinterpret(UInt32, 'a')
0x61000000
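
A few more examples of different UTF-8 lengths (my own illustration, assuming the same left-justified, zero-padded layout as above):

julia> reinterpret.(UInt32, ('a', 'α', '∀', '🍕'))
(0x61000000, 0xceb10000, 0xe2888000, 0xf09f8d95)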

For many codepoints in this format, the least-significant bits will provide no information. Does that mean that traditional two-stage tables won't work? Is there an easy fix?

I haven't really thought about this much, but I think it's important to take a look to make sure we aren't creating any headaches for later. cc @StefanKarpinski

@stevengj added the unicode label on Jan 20, 2018
@StefanKarpinski
Member

> Before we lock ourselves into the new Char format, however, it would be good to think about how it affects two-stage tables.

An important point about the new Char format is that we are not locked into it. #24999 has already broken code that assumed Char is represented in terms of UTF-32 code units. Instead, code should compare characters directly as much as possible, avoid relying on any particular representation of Char, and convert to code points only when necessary. Once code no longer relies on the representation, we can change it in the future to whatever we like, including back to UInt32 code point values, as long as we retain some way of encoding invalid sequences.
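
As a small illustration of that advice (my own sketch, not part of the original comment; the helper names are hypothetical):

# Compare characters directly; this works regardless of how Char is represented:
isasciidigit(c::Char) = '0' <= c <= '9'

# Convert to a code point only when an integer value is genuinely needed;
# this decodes the Char and errors for a malformed character rather than
# silently producing a bogus value:
ascodepoint(c::Char) = UInt32(c)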

@stevengj
Member Author

stevengj commented Jan 20, 2018

In a quick benchmark, the cost of looking something up in utf8proc's 2-stage tables (which operate on the UInt32 codepoint value) appears to completely swamp the cost of decoding the new Char type.
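
For reference, the decoding step being measured here is just a handful of shifts and masks. A hedged sketch of mine (not Base's actual implementation, and with no validity checking):

function decode(c::Char)
    u = reinterpret(UInt32, c)
    u < 0x80000000 && return u >> 24        # 1-byte (ASCII) character
    n = leading_ones(u)                     # UTF-8 length, from the lead byte
    cp = (u >> 24) & (0x7f >> n)            # payload bits of the lead byte
    for i in 1:n-1
        cp = (cp << 6) | ((u >> (24 - 8i)) & 0x3f)   # 6 bits per continuation byte
    end
    return cp
end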

In 0.6 (no decoding needed):

julia> @btime Base.UTF8proc.category_code('ϵ');
  3.380 ns (0 allocations: 0 bytes)
julia> @btime Base.UTF8proc.category_code('🍕');
  3.592 ns (0 allocations: 0 bytes)
julia> @btime Base.UTF8proc.category_code('a');
  3.796 ns (0 allocations: 0 bytes)

In 0.7 (requires decoding to UInt32), I get about the same times, modulo 10% "noise":

julia> @btime Base.Unicode.category_code('ϵ');
  3.482 ns (0 allocations: 0 bytes)
julia> @btime Base.Unicode.category_code('🍕');
  3.792 ns (0 allocations: 0 bytes)
julia> @btime Base.Unicode.category_code('a');
  3.482 ns (0 allocations: 0 bytes)

Moreover, this is not an entirely fair comparison, because such functions are normally applied to characters found in a String, which required a more expensive decoding step in 0.6. The current design can be thought of as simply deferring that decoding (although the decoding may be repeated if you call multiple utf8proc lookups on the same character). Here is a simple function that sums the category codes of a string. In 0.6:

julia> function sumcats(s)
           i = 0
           for c in s
               i += Base.UTF8proc.category_code(c)
           end
           return i
       end
sumcats (generic function with 1 method)

julia> @btime sumcats($("asciiαβγ∀  🐨 🍕 ∑"));
  97.354 ns (0 allocations: 0 bytes)

In 0.7 (replacing UTF8proc with Unicode), I get:

julia> @btime sumcats($("asciiαβγ∀  🐨 🍕 ∑"));
  120.045 ns (0 allocations: 0 bytes)

I'm not sure why it would be 20% slower in 0.7, given that the category_code lookup itself seems to be about the same speed. Is there something else that explains the regression here? I thought character iteration was supposed to be faster.
