Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize size/speed of Unicode datasets #68232

Merged
merged 3 commits into from
Jan 15, 2020

Conversation

Mark-Simulacrum
Copy link
Member

@Mark-Simulacrum Mark-Simulacrum commented Jan 14, 2020

The overall implementation has the same general idea as the prior approach,
which was based on a compressed trie structure, but modified to use less space
(and, coincidentally, be an overall performance improvement).

Sizes Old New New/current
Alphabetic 4616 2982 64.60%
Case_Ignorable 3144 2112 67.18%
Cased 2376 934 39.31%
Cc 19 43 226.32%
Grapheme_Extend 3072 1734 56.45%
Lowercase 2328 985 42.31%
N 2648 1239 46.79%
Uppercase 1978 934 47.22%
White_Space 241 140 58.09%
Total 20422 11103 54.37%

This table shows the size of the old and new tables in bytes. The most important
of these tables is "Grapheme_Extend", as it is present in essentially all Rust
programs due to being called from str's Debug impl (char::escape_debug). In
a representative case given by this blog post for the embedded world, the
shrinking in this PR shrinks the final binary by 1,604 bytes, from 14,440 to
12,836.

The performance of these new tables, based on the (rough) benchmark of linearly
scanning the entire valid set of chars, querying for each is_*, is roughly
~50% better, though in some cases is either on par or slightly (3-5%) worse. In
practice, I believe the size benefits of this PR are the main concern. The new
implementation has been tested to be equivalent to the current nightly in terms
of returned values on the set of valid chars.

A (relatively) high-level explanation of the specific compression scheme used
can be found in the generator.

This is split into three commits -- the first adds the generator which produces
the Rust code for the tables, the second adds support code for the lookup, and
the third actually swaps the current implementation out for the new one.

@rust-highfive
Copy link
Collaborator

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 14, 2020
@Mark-Simulacrum
Copy link
Member Author

cc @jamesmunns per our discussions about this being helpful for embedded

@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-7 of your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2020-01-14T21:28:27.7918909Z ##[command]git remote add origin https://github.com/rust-lang/rust
2020-01-14T21:28:27.8009967Z ##[command]git config gc.auto 0
2020-01-14T21:28:27.8094159Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2020-01-14T21:28:27.8162698Z ##[command]git config --get-all http.proxy
2020-01-14T21:28:27.8362080Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/68232/merge:refs/remotes/pull/68232/merge
---
2020-01-14T22:23:43.7230124Z .................................................................................................... 1600/9519
2020-01-14T22:23:49.1675536Z .................................................................................................... 1700/9519
2020-01-14T22:23:58.4061788Z .................................................................................................... 1800/9519
2020-01-14T22:24:07.7904235Z ........i........................................................................................... 1900/9519
2020-01-14T22:24:14.9825123Z ..................................................................................................ii 2000/9519
2020-01-14T22:24:31.4096467Z iii................................................................................................. 2100/9519
2020-01-14T22:24:39.8061543Z .................................................................................................... 2300/9519
2020-01-14T22:24:42.3382941Z .................................................................................................... 2400/9519
2020-01-14T22:24:48.2445277Z .................................................................................................... 2500/9519
2020-01-14T22:25:08.8398786Z .................................................................................................... 2600/9519
---
2020-01-14T22:27:51.2570276Z .........................................i...............i.......................................... 4900/9519
2020-01-14T22:28:00.6625733Z .................................................................................................... 5000/9519
2020-01-14T22:28:07.1660952Z ....................................................................................i............... 5100/9519
2020-01-14T22:28:12.7632732Z .................................................................................................... 5200/9519
2020-01-14T22:28:23.3509875Z .......................................................ii.ii...........i............................ 5300/9519
2020-01-14T22:28:32.9396986Z .................................................................................................... 5500/9519
2020-01-14T22:28:43.5477184Z .................................................................................................... 5600/9519
2020-01-14T22:28:50.2443148Z ........................................i........................................................... 5700/9519
2020-01-14T22:28:57.2595371Z .................................................................................................... 5800/9519
2020-01-14T22:28:57.2595371Z .................................................................................................... 5800/9519
2020-01-14T22:29:08.2740792Z .................................................................................................... 5900/9519
2020-01-14T22:29:18.2862387Z ...............................ii...i..ii...........i............................................... 6000/9519
2020-01-14T22:29:37.6067722Z .................................................................................................... 6200/9519
2020-01-14T22:29:45.9629012Z .................................................................................................... 6300/9519
2020-01-14T22:29:45.9629012Z .................................................................................................... 6300/9519
2020-01-14T22:29:53.8757046Z ...........................................................i..ii.................................... 6400/9519
2020-01-14T22:30:21.9863564Z .................................................................................................... 6600/9519
2020-01-14T22:30:24.2375970Z ...................................i................................................................ 6700/9519
2020-01-14T22:30:26.5710051Z .................................................................................................... 6800/9519
2020-01-14T22:30:29.1810539Z ...................................i................................................................ 6900/9519
---
2020-01-14T22:32:08.9142276Z .................................................................................................... 7500/9519
2020-01-14T22:32:13.4122534Z .................................................................................................... 7600/9519
2020-01-14T22:32:19.6054794Z .................................................................................................... 7700/9519
2020-01-14T22:32:26.8730354Z .................................................................................................... 7800/9519
2020-01-14T22:32:36.9979843Z ....................................................................................iiii............ 7900/9519
2020-01-14T22:32:53.7448148Z ..................i......i.......................................................................... 8100/9519
2020-01-14T22:32:59.1143144Z .................................................................................................... 8200/9519
2020-01-14T22:33:12.7743361Z .................................................................................................... 8300/9519
2020-01-14T22:33:23.0670789Z .................................................................................................... 8400/9519
---
2020-01-14T22:35:53.7772330Z  finished in 7.371
2020-01-14T22:35:53.7983663Z Check compiletest suite=codegen mode=codegen (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:35:54.0083800Z 
2020-01-14T22:35:54.0084045Z running 166 tests
2020-01-14T22:35:57.1173691Z iiii......i........ii..iiii...i....i...........i............i..i..................i....i............ 100/166
2020-01-14T22:35:59.4483678Z i.i.i...iii..iiiiiii.......................iii............ii......
2020-01-14T22:35:59.4490247Z 
2020-01-14T22:35:59.4490348Z  finished in 5.650
2020-01-14T22:35:59.4695103Z Check compiletest suite=codegen-units mode=codegen-units (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:35:59.6466477Z 
---
2020-01-14T22:36:01.6360395Z  finished in 2.166
2020-01-14T22:36:01.6555255Z Check compiletest suite=assembly mode=assembly (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:01.8813210Z 
2020-01-14T22:36:01.8813454Z running 9 tests
2020-01-14T22:36:01.8814305Z iiiiiiiii
2020-01-14T22:36:01.8822973Z 
2020-01-14T22:36:01.8828408Z  finished in 0.226
2020-01-14T22:36:01.9035317Z Check compiletest suite=incremental mode=incremental (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:02.1010658Z 
---
2020-01-14T22:36:22.7261907Z  finished in 20.822
2020-01-14T22:36:22.7492617Z Check compiletest suite=debuginfo mode=debuginfo-gdb+lldb (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:22.9469170Z 
2020-01-14T22:36:22.9469432Z running 116 tests
2020-01-14T22:36:49.0532021Z .iiiii..i.....i..i...i..i.i.i..i..i..ii....i.i....ii..........iiii..........i.....i..i.......ii.i.ii 100/116
2020-01-14T22:36:52.6991584Z .....iiii.....ii
2020-01-14T22:36:52.6994440Z 
2020-01-14T22:36:52.6998914Z  finished in 29.950
2020-01-14T22:36:52.7007332Z Uplifting stage1 rustc (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:52.7007961Z Copying stage2 rustc from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
---
2020-01-14T22:50:08.7696607Z ..................................................................................................
2020-01-14T22:50:08.7697372Z failures:
2020-01-14T22:50:08.7697874Z 
2020-01-14T22:50:08.7698805Z ---- char::test_to_lowercase stdout ----
2020-01-14T22:50:08.7699672Z thread 'char::test_to_lowercase' panicked at 'assertion failed: `(left == right)`
2020-01-14T22:50:08.7700052Z   left: `"\u{0}"`,
2020-01-14T22:50:08.7700596Z  right: `"ß"`', src/libcore/../libcore/tests/char.rs:88:5
2020-01-14T22:50:08.7701381Z 
2020-01-14T22:50:08.7701564Z failures:
2020-01-14T22:50:08.7701741Z     char::test_to_lowercase
2020-01-14T22:50:08.7701923Z 
2020-01-14T22:50:08.7701923Z 
2020-01-14T22:50:08.7702108Z test result: FAILED. 995 passed; 1 failed; 2 ignored; 0 measured; 0 filtered out
2020-01-14T22:50:08.7702267Z 
2020-01-14T22:50:08.7718329Z error: test failed, to rerun pass '-p core --test coretests'
2020-01-14T22:50:08.7743321Z 
2020-01-14T22:50:08.7744276Z command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "-Zconfig-profile" "--target" "x86_64-unknown-linux-gnu" "-Zbinary-dep-depinfo" "-j" "2" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace compiler-builtins-c" "--manifest-path" "/checkout/src/libtest/Cargo.toml" "-p" "core" "--" "--quiet"
2020-01-14T22:50:08.7744640Z expected success, got: exit code: 101
2020-01-14T22:50:08.7744838Z 
---
2020-01-14T22:50:08.7823508Z   local time: Tue Jan 14 22:50:08 UTC 2020
2020-01-14T22:50:08.9454476Z   network time: Tue, 14 Jan 2020 22:50:08 GMT
2020-01-14T22:50:08.9461199Z == end clock drift check ==
2020-01-14T22:50:09.2679907Z 
2020-01-14T22:50:09.2780968Z ##[error]Bash exited with code '1'.
2020-01-14T22:50:09.2833567Z ##[section]Starting: Checkout
2020-01-14T22:50:09.2835478Z ==============================================================================
2020-01-14T22:50:09.2835520Z Task         : Get sources
2020-01-14T22:50:09.2835574Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@Mark-Simulacrum
Copy link
Member Author

Turns out adding in the case conversions at the last minute in a way that I thought was 1-1 with the old code is not a good idea. The bug should be fixed now.

@joshtriplett
Copy link
Member

Very nice!

@bors r+

@bors
Copy link
Contributor

bors commented Jan 15, 2020

📌 Commit efcda04 has been approved by joshtriplett

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 15, 2020
@joshtriplett
Copy link
Member

Not for this series, but looking at the implementation of escape_debug, how critical is it (and how much does backward compatibility depend on) that it treats extended graphemes specially, rather than escaping all Unicode or all non-printable Unicode? Omitting that table entirely from most Rust programs seems worthwhile.

@CAD97
Copy link
Contributor

CAD97 commented Jan 15, 2020

rather than escaping all Unicode

That would be str::escape_default

or all non-printable Unicode

That's today's str::escape_debug (roughly) 😛

The problem being solved by the grapheme table is (in theory, anyway) that the printability of any specific Unicode codepoint depends on its position in a grapheme. Just as a simple example, ZERO WIDTH JOINER is an "unprintable" that has no meaning and is invisible when just in the middle of whitespace. But when between two emoji, it signifies that the user wants a singular emoji containing the attributes of both emoji. Similarly, the country code codepoints on their own are "unprintables", but in pairs they represent country flag emoji. (In practice: Unicode is hard and you don't even know what's actually printable until you ask the font. escape_default is best effort to show all meaningful codepoints in the string while maintaining legibility for strings including non-ascii characters.)

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Jan 15, 2020
…oshtriplett

Optimize size/speed of Unicode datasets

The overall implementation has the same general idea as the prior approach,
which was based on a compressed trie structure, but modified to use less space
(and, coincidentally, be an overall performance improvement).

Sizes | Old | New | New/current
-- | -- | -- | --
Alphabetic | 4616 | 2982 | 64.60%
Case_Ignorable | 3144 | 2112 | 67.18%
Cased | 2376 | 934 | 39.31%
Cc | 19 | 43 | 226.32%
Grapheme_Extend | 3072 | 1734 | 56.45%
Lowercase | 2328 | 985 | 42.31%
N | 2648 | 1239 | 46.79%
Uppercase | 1978 | 934 | 47.22%
White_Space | 241 | 140 | 58.09%
  |   |   |
Total | 20422 | 11103 | 54.37%

This table shows the size of the old and new tables in bytes. The most important
of these tables is "Grapheme_Extend", as it is present in essentially all Rust
programs due to being called from `str`'s Debug impl (`char::escape_debug`). In
a representative case given by this [blog post] for the embedded world, the
shrinking in this PR shrinks the final binary by 1,604 bytes, from 14,440 to
12,836.

The performance of these new tables, based on the (rough) benchmark of linearly
scanning the entire valid set of chars, querying for each `is_*`, is roughly
~50% better, though in some cases is either on par or slightly (3-5%) worse. In
practice, I believe the size benefits of this PR are the main concern. The new
implementation has been tested to be equivalent to the current nightly in terms
of returned values on the set of valid chars.

A (relatively) high-level explanation of the specific compression scheme used
can be found [in the generator].

This is split into three commits -- the first adds the generator which produces
the Rust code for the tables, the second adds support code for the lookup, and
the third actually swaps the current implementation out for the new one.

[blog post]: https://jamesmunns.com/blog/fmt-unreasonably-expensive/
[in the generator]: https://github.com/Mark-Simulacrum/rust/blob/unicode-tables/src/tools/unicode-table-generator/src/raw_emitter.rs
bors added a commit that referenced this pull request Jan 15, 2020
Rollup of 6 pull requests

Successful merges:

 - #68123 (Implement Cursor for linked lists. (RFC 2570).)
 - #68212 (Suggest to shorten temporary lifetime during method call inside generator)
 - #68232 (Optimize size/speed of Unicode datasets)
 - #68236 (Add some regression tests)
 - #68237 (Account for `Path`s in `is_suggestable_infer_ty`)
 - #68252 (remove redundant clones, found by clippy)

Failed merges:

r? @ghost
@bors bors merged commit efcda04 into rust-lang:master Jan 15, 2020
@Mark-Simulacrum Mark-Simulacrum deleted the unicode-tables branch January 23, 2020 20:51
@joshtriplett
Copy link
Member

@CAD97 What I'm suggesting is that for Debug output, it's potentially acceptable to escape characters that might have been printable. And if doing so means we can drop a fairly large table from the majority of Rust binaries, that seems worth doing.

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 28, 2020
…tolnay

Shrink Unicode tables (even more)

This shrinks the Unicode tables further, building upon the wins in rust-lang#68232 (the previous counts differ due to an interim Unicode version update, see rust-lang#69929.

The new data structure is slower by around 3x, on the benchmark of looking up every Unicode scalar value in each data set sequentially in every data set included. Note that for ASCII, the exposed functions on `char` optimize with direct branches, so ASCII will retain the same performance regardless of internal optimizations (or the reverse). Also, note that the size reduction due to the skip list (from where the performance losses come) is around 40%, and, as a result, I believe the performance loss is acceptable, as the routines are still quite fast. Anywhere where this is hot, should probably be using a custom data structure anyway (e.g., a raw bitset) or something optimized for frequently seen values, etc.

This PR updates both the bitset data structure, and introduces a new data structure similar to a skip list. For more details, see the [main.rs] of the table generator, which describes both. The commits mostly work individually and document size wins.

As before, this is tested on all valid chars to have the same results as nightly (and the canonical Unicode data sets), happily, no bugs were found.

[main.rs]: https://github.com/rust-lang/rust/blob/fb4a715e18b/src/tools/unicode-table-generator/src/main.rs

Set             | Previous |  New  |  % of old  | Codepoints | Ranges |
----------------|---------:|------:|-----------:|-----------:|-------:|
Alphabetic      |     3055 |  1599 |        52% |     132875 |    695 |
Case Ignorable  |     2136 |   949 |        44% |       2413 |    410 |
Cased           |      934 |   359 |        38% |       4286 |    141 |
Cc              |       43 |     9 |        20% |         65 |      2 |
Grapheme Extend |     1774 |   813 |        46% |       1979 |    344 |
Lowercase       |      985 |   867 |        88% |       2344 |    652 |
N               |     1266 |   419 |        33% |       1781 |    133 |
Uppercase       |      934 |   777 |        83% |       1911 |    643 |
White_Space     |      140 |    37 |        26% |         25 |     10 |
----------------|----------|-------|------------|------------|--------|
Total           |    11267 |  5829 |        51% |     -      |   -    |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants