Optimize size/speed of Unicode datasets #68232
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
cc @jamesmunns per our discussions about this being helpful for embedded
Turns out adding the case conversions at the last minute, in a way that I thought was 1:1 with the old code, was not a good idea. The bug should be fixed now.
Very nice! @bors r+
📌 Commit efcda04 has been approved by
Not for this series, but looking at the implementation of `escape_debug`: how critical is it (and how much does backward compatibility depend on the fact) that it treats extended graphemes specially, rather than escaping all Unicode or all non-printable Unicode? Omitting that table entirely from most Rust programs seems worthwhile.
That would be
That's today's

The problem being solved by the grapheme table is (in theory, anyway) that the printability of any specific Unicode codepoint depends on its position in a grapheme. As a simple example, ZERO WIDTH JOINER is an "unprintable" that has no meaning and is invisible when alone in the middle of whitespace, but between two emoji it signifies that the user wants a single emoji combining the attributes of both. Similarly, the country code codepoints are "unprintables" on their own, but in pairs they represent country flag emoji. (In practice: Unicode is hard, and you don't even know what's actually printable until you ask the font.)
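To make that concrete, here is a small self-contained demo (not code from this PR) of how `escape_debug` treats ZERO WIDTH JOINER. Standalone, the codepoint gets escaped; between two emoji, the same codepoint requests a single combined glyph, though whether it actually renders as one is up to the font:

```rust
fn main() {
    // U+200D ZERO WIDTH JOINER is in Grapheme_Extend, so char's Debug impl
    // (which uses escape_debug) escapes it when it stands alone:
    println!("{:?}", '\u{200D}'); // prints '\u{200d}'
    println!("{:?}", 'a');        // prints 'a'

    // The same codepoint between two emoji is a ZWJ sequence; a capable font
    // may render it as a single glyph rather than two.
    let joined = "\u{1F469}\u{200D}\u{1F469}"; // WOMAN + ZWJ + WOMAN
    println!("{}", joined);
}
```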
Rollup of 6 pull requests

Successful merges:
- #68123 (Implement Cursor for linked lists. (RFC 2570).)
- #68212 (Suggest to shorten temporary lifetime during method call inside generator)
- #68232 (Optimize size/speed of Unicode datasets)
- #68236 (Add some regression tests)
- #68237 (Account for `Path`s in `is_suggestable_infer_ty`)
- #68252 (remove redundant clones, found by clippy)

Failed merges:

r? @ghost
@CAD97 What I'm suggesting is that for Debug output, it's potentially acceptable to escape characters that might have been printable. And if doing so means we can drop a fairly large table from the majority of Rust binaries, that seems worth doing.
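As a rough illustration of that alternative (a sketch only; `escape_non_ascii` is a hypothetical helper, not std API, and not what the PR implements): escaping every non-ASCII scalar needs no Unicode property tables at all.

```rust
// Escape everything outside printable ASCII using only escape_default,
// which needs no Unicode property tables.
fn escape_non_ascii(s: &str) -> String {
    s.chars()
        .flat_map(|c| c.escape_default()) // escapes all non-ASCII-printable chars
        .collect()
}

fn main() {
    assert_eq!(escape_non_ascii("ab"), "ab");
    assert_eq!(escape_non_ascii("é"), "\\u{e9}");
    println!("{}", escape_non_ascii("a\u{200D}é"));
}
```

The trade-off is exactly the one raised above: some genuinely printable characters come out escaped, in exchange for dropping the Grapheme_Extend table from most binaries.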
…tolnay Shrink Unicode tables (even more)

This shrinks the Unicode tables further, building upon the wins in rust-lang#68232 (the previous counts differ due to an interim Unicode version update, see rust-lang#69929).

The new data structure is slower by around 3x, on the benchmark of looking up every Unicode scalar value in each data set sequentially in every data set included. Note that for ASCII, the exposed functions on `char` optimize with direct branches, so ASCII will retain the same performance regardless of internal optimizations (or the reverse). Also, note that the size reduction due to the skip list (from where the performance losses come) is around 40%, and, as a result, I believe the performance loss is acceptable, as the routines are still quite fast. Anywhere this is hot should probably be using a custom data structure anyway (e.g., a raw bitset) or something optimized for frequently seen values.

This PR updates both the bitset data structure, and introduces a new data structure similar to a skip list. For more details, see the [main.rs] of the table generator, which describes both.

The commits mostly work individually and document size wins. As before, this is tested on all valid chars to have the same results as nightly (and the canonical Unicode data sets); happily, no bugs were found.

[main.rs]: https://github.com/rust-lang/rust/blob/fb4a715e18b/src/tools/unicode-table-generator/src/main.rs

Set             | Previous |  New | % of old | Codepoints | Ranges
----------------|---------:|-----:|---------:|-----------:|------:
Alphabetic      |     3055 | 1599 |      52% |     132875 |    695
Case Ignorable  |     2136 |  949 |      44% |       2413 |    410
Cased           |      934 |  359 |      38% |       4286 |    141
Cc              |       43 |    9 |      20% |         65 |      2
Grapheme Extend |     1774 |  813 |      46% |       1979 |    344
Lowercase       |      985 |  867 |      88% |       2344 |    652
N               |     1266 |  419 |      33% |       1781 |    133
Uppercase       |      934 |  777 |      83% |       1911 |    643
White_Space     |      140 |   37 |      26% |         25 |     10
Total           |    11267 | 5829 |      51% |          - |      -
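The core idea of a run-length ("skip list"-like) representation can be sketched as follows. This is a hypothetical simplification, not the emitted tables: the real structure adds a coarse index so lookups skip ahead rather than walking linearly, and the run values below are made up for illustration.

```rust
// Lengths of alternating out-of-set / in-set runs of codepoints.
// Illustrative values only; even-indexed runs are "out" by assumed convention.
const RUNS: &[u32] = &[9, 5, 18, 1];

fn contains(c: char) -> bool {
    let cp = c as u32;
    let mut end = 0u32;
    for (i, &len) in RUNS.iter().enumerate() {
        end += len;
        if cp < end {
            return i % 2 == 1; // runs alternate membership
        }
    }
    false // past the last run: not in the set (assumed convention)
}
```

Storing run lengths instead of range endpoints is what makes sparse sets like White_Space (25 codepoints in 10 ranges) so small, at the cost of sequential decoding, which is where the ~3x slowdown comes from.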
The overall implementation has the same general idea as the prior approach,
which was based on a compressed trie structure, but modified to use less space
(and, coincidentally, be an overall performance improvement).
Sizes           |   Old |   New | New/current
----------------|------:|------:|-----------:
Alphabetic      |  4616 |  2982 |      64.60%
Case_Ignorable  |  3144 |  2112 |      67.18%
Cased           |  2376 |   934 |      39.31%
Cc              |    19 |    43 |     226.32%
Grapheme_Extend |  3072 |  1734 |      56.45%
Lowercase       |  2328 |   985 |      42.31%
N               |  2648 |  1239 |      46.79%
Uppercase       |  1978 |   934 |      47.22%
White_Space     |   241 |   140 |      58.09%
Total           | 20422 | 11103 |      54.37%

This table shows the size of the old and new tables in bytes. The most important of these tables is "Grapheme_Extend", as it is present in essentially all Rust programs due to being called from `str`'s Debug impl (`char::escape_debug`). In a representative case given by this [blog post] for the embedded world, the shrinking in this PR shrinks the final binary by 1,604 bytes, from 14,440 to 12,836.

[blog post]: https://jamesmunns.com/blog/fmt-unreasonably-expensive/
The performance of these new tables, based on the (rough) benchmark of linearly scanning the entire valid set of chars, querying `is_*` for each, is roughly ~50% better, though in some cases it is either on par or slightly (3-5%) worse. In practice, I believe the size benefits of this PR are the main concern. The new implementation has been tested to be equivalent to the current nightly in terms of returned values on the set of valid chars.
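The rough shape of such a scan (a sketch of the benchmark described above, not the actual harness) is:

```rust
// Linearly scan every valid char and query one of the `is_*` predicates.
fn scan_alphabetic() -> u32 {
    let mut count = 0;
    for cp in 0u32..=0x10FFFF {
        if let Some(c) = char::from_u32(cp) { // from_u32 rejects surrogates
            if c.is_alphabetic() {
                count += 1;
            }
        }
    }
    count
}

fn main() {
    println!("alphabetic chars: {}", scan_alphabetic());
}
```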
A (relatively) high-level explanation of the specific compression scheme used can be found [in the generator].

[in the generator]: https://github.com/Mark-Simulacrum/rust/blob/unicode-tables/src/tools/unicode-table-generator/src/raw_emitter.rs
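At a very high level, the scheme is a chunked bitset: the per-codepoint bitset is cut into 64-bit words, duplicate words are stored once, and a per-chunk index selects the word. The following is a minimal sketch of that idea under those assumptions, with made-up constants, not the emitted tables themselves:

```rust
// Deduplicated 64-bit bitset words, plus one index entry per 64-codepoint chunk.
const WORDS: &[u64] = &[0x0000_0000_0000_0000, 0x0000_0000_0000_03FF];
const CHUNK_IDX: &[u8] = &[1, 0, 0, 1];

fn lookup(c: char) -> bool {
    let cp = c as u32;
    let chunk = (cp >> 6) as usize; // which 64-codepoint chunk
    match CHUNK_IDX.get(chunk) {
        Some(&w) => (WORDS[w as usize] >> (cp & 63)) & 1 == 1,
        None => false, // beyond the table: property not set (assumed convention)
    }
}
```

Deduplicating the words is what wins the space: many 64-codepoint chunks are all-zeros or repeat earlier patterns, so they share a single stored word.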
This is split into three commits -- the first adds the generator which produces
the Rust code for the tables, the second adds support code for the lookup, and
the third actually swaps the current implementation out for the new one.