Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for the compressed bitset format that rustc uses for some tables in libcore #30

Open
thomcc opened this issue Mar 13, 2020 · 4 comments
Labels
question Further information is requested

Comments

@thomcc
Copy link
Contributor

thomcc commented Mar 13, 2020

Documented here: https://github.com/rust-lang/rust/blob/master/src/tools/unicode-table-generator/src/raw_emitter.rs, added in rust-lang/rust#68232

So I actually experimented a bit with this a bit shortly after I heard about it. Unfortunately, it actually requires a bit of tweaking to behave well for certain tables. You'll note that it says

//! The last byte of this top-level array is pulled out to a separate static
//! and trailing zeros are dropped; this is simply because grapheme_extend and
//! case_ignorable have a single entry in the 896th entry, so this shrinks them
//! down considerably.

Indicating the sort of tweaking you sometimes need to do. Additionally:

  • Some tables seem work better with a group length of 16, others 32, probably not limited to these sizes either.
  • A few work better with a bitset of u32 instead of u64.
  • etc.

And even without that, there are many tables that are obviously way larger than they need to be (anything that characters with both high and low codepoints) -- I ended up with several tables looking like:

static BITSET_CHUNKS_MAP: [u8; 544] = [
    5, 0, 0, 0, 6, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9,
    9, 9, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8,
];

which very clearly can be improved (for example, you could imagine solving it by some sort of wrap-around offset, and then trimming any trailing zeros after the wraparound). I also have very little doubt that if the generation code in rustc had a table like this, the generator would be tweaked so that it did better.

I'm half filing this just to document this investigation, and that I don't think it's as simple as just taking the generator from rustc and dropping it in.

I'm not sure how much it's ucd-generate's place to try to solve these issues, it would be great if it could -- right now there's no compelling reason to switch to ucd-generate for crates that already use a python generator, but support for a format which is both smaller and faster would be that (not to mention -- it would be great to reduce the size of these tables in common rust programs).

So yeah it would be great, and is something I'm interested in, but is subtle and might be more trouble than it's worth...

@Mark-Simulacrum might have some insight if they aren't too busy coming up with further clever methods of encoding data :p

@BurntSushi
Copy link
Owner

Have you looked at and compared this with the --trie-set output? It seems similarish to what the bitset idea is doing. It's also something that folks could potentially switch to, but that doesn't seem to be enticing enough to abandon the Python generators. :-) (There is also the FST format, although I suppose that's largely been a failure outside of needing to store all Unicode codepoint names or something like that.)

In any case, yes, I'm generally in favor of loading ucd-generate up with different formats, even if they aren't ideal in all circumstances. It would be nice to figure out a way to make it a bit easier to add new formats than what we have today. I kind of feel like the printer is, for example, buckling under its own weight. But, I don't think the situation is that bad---I don't see that sort of improvement as a blocker for something like this.

In the worst case, it would be OK for ucd-generate to return an error if it couldn't produce a bitset from a particular set. (From reading the bitset comment, it sounded like this might be possible.)

@BurntSushi BurntSushi added the question Further information is requested label Mar 13, 2020
@thomcc
Copy link
Contributor Author

thomcc commented Mar 13, 2020

It seems similarish to what the bitset idea is doing

Yeah, for the cases where it does well, I believe it's meaningfully smaller for a lot of cases than the trie-set, and faster too. The trie-set is pretty similar to what rustc used to use before switching for this.

They're different data structures too -- trie set is more interested in prefix compression, this is more about multiple levels of deduplication.

In the worst case, it would be OK for ucd-generate to return an error if it couldn't produce a bitset from a particular set. (From reading the bitset comment, it sounded like this might be possible.

I managed to run out of the u8 limit only when I tried to generate a set for "all assigned codepoints" (mostly curiosity...). That said, there's an extent to which input entropy is more relevant than size... That said, it's pretty easy to emit code using u16 if u8 doesn't fit, although it somewhat may defeats the point.

It's also something that folks could potentially switch to, but that doesn't seem to be enticing enough to abandon the Python generators

Some other reasons for that are probably:

  1. have to still have code to download the UCD (ucd-generate can't auto-download it).
  2. requires the whole UCD instead of just a couple files.

But mostly that it's already written and there's no compelling improvement (trie-set is also a bit of a wash since it means adding a dependency for a small amount of code you currently control).

I'm generally in favor of loading ucd-generate up with different formats, even if they aren't ideal in all circumstances

My hesitation is mostly that it's really obviously suboptimal for a lot of these (the current version is very clearly been tuned for just those tables) -- I played around with having the code try a few and emit the one with the smallest size though, which is possibly viable though (it would mean we'd have to emit the code to access it too, but that's probably not that bad since its quite small -- small enough to be inline(always)ed in the libcore version).

It would be nice to figure out a way to make it a bit easier to add new formats than what we have today

100% agree, although IDK what else to do -- I'm not aware of template libs that are good for code generation, even just "comma separate this, newline and indent when it gets too long" seems to be annoying for many, even though that's a ridiculously common thing for this case. That said I haven't looked that closely, I just badly want not to make things worse.

@Mark-Simulacrum
Copy link

As a brief note, this thread sparked my interest in trying to optimize a bit more, and I have some patches that I'm working on that so far shrink overall tables down another 16% :)

I managed to run out of the u8 limit only when I tried to generate a set for "all assigned codepoints" (mostly curiosity...). That said, there's an extent to which input entropy is more relevant than size... That said, it's pretty easy to emit code using u16 if u8 doesn't fit, although it somewhat may defeats the point.

This is a concern of mine as well -- in particular the alphabetic range is at 255 bitset (even after deduplication), which is really close to 256 that we can allow. I've not spent a huge amount of time on this though as in almost all cases my guess is that the easiest scheme is to simply split into input ranges into two or however many pieces you need and encode them separately, with a table at the beginning which branches into each. This should be quite cheap and fairly easy to implement (especially if non-generically).

With regards to "how" to implement generators -- yeah, I think personally I have the most time spent semi-offline thinking about algorithms or looking at generated tables to come up with compression schemes; I don't know that frameworks etc. are all that helpful. The amount of code for formatting lists/writing out the Rust code in the table generator in the Rust project is trivial -- probably 20-30 lines or so (out of ~1000 lines total).


As an aside,

Some tables seem work better with a group length of 16, others 32, probably not limited to these sizes either.

This is actually something I was thinking of when writing the original code, and I've since made the change to automatically choose an optimal size (from 1 to 64, currently) by just bruteforce search and with the criteria of minimal bytes "on disk".

    Alphabetic     : 3036 bytes    (- 19 bytes)
    Case_Ignorable : 2136 bytes
    Cased          : 934 bytes
    Cc             : 32 bytes      (- 11 bytes)
    Grapheme_Extend: 1774 bytes
    Lowercase      : 985 bytes
    N              : 1225 bytes    (- 41 bytes)
    Uppercase      : 934 bytes
    White_Space    : 97 bytes      (- 43 bytes)
    Total table sizes: 11153 bytes (-114 bytes)

A few work better with a bitset of u32 instead of u64.

I've not actually tried this recently! Currently the generator is pretty tightly tied to u64 (though this is mostly a mistake, I should refactor it to be more general).

I very much appreciate more ideas to experiment with, though! One thing that would be super interesting for me to try -- though I'm not sure how applicable/easy it is, as I've not looked into it much -- is to use ANS to encode the bitset and see how slow it is to access and what kind of size we get. My understanding is that we're likely to be pretty slow (there's no random access at all, though some sort of block format could help alleviate that) and I got scared off by the indications that you need some pretty big tables (kilobytes in size -- and most of ours encode down to less than a kilobyte already). But it could be interesting as a generic scheme.

@thomcc
Copy link
Contributor Author

thomcc commented May 12, 2020

Far too many words on applying entropy coding to unicode tables below

@Mark-Simulacrum, earlier you said:

though I'm not sure how applicable/easy it is, as I've not looked into it much -- is to use ANS to encode the bitset and see how slow it is to access and what kind of size we get

So I was a little skeptical of this for libcore's case when you said this -- I have some background in compression, and figured the random access was just too fundamentally important.

But... After seeing your updated table desing (in particular the skiplist) I ended up getting it in my head that maybe you could compress the skiplist offset "runs". [0]

Anyway, this is half for the benefit of others, but half so that I can back to it later (Uh, perhaps I should not use @BurntSushi's github issues as a notebook... 😅)

Compressing "runs"

Anyway, the reason I got stuck thinking about "runs" insided the skiplist OFFSETS is:

  • They appear to have rather low entropy[1].
  • You access runs randomly, but within a run you traverse it linearly.
  • Each symbol a hypothetical decoder spat out could be applied directly, saving you from needing to store things in a temporary buffer (I think nobody would be happy with a solution here that required calling malloc here...).

Furthermore, we can easily control the "symbol alphabet" of the run -- the maximum value that appears in one. Right now we split the run if a u8::try_from fails, but we could easily adjust that if it simplifies things (as it certainly will, as I'll get into).

However, on the downside, the runs are so short that it may be very hard to get any value out of compressing them -- especially using adaptive algorithms like these.

Anyway, this is basically pointing to some form of entropy coding applied to the runs being what we want. In practice this means one of:

  • Huffman
  • rANS (either with adaptive or non-adaptive probabilities).
  • tANS
  • Arithmetic coding.

So, some of these need a table of data themselves. Those tables are smaller, but might be upwards of [u32; 256]... Perhaps they could optimized, but... we're already trying to figure out how to optimize a table of data for size, the last thing I want right now is even more tables of data which themselves need to be optimized for size...[2] So I'm throwing algorithms that need a table out.

That disqualifies Huffman, tANS, and rANS with non-adaptive probabilities, as they all need tables. This leaves us with adaptive rANS or arithmetic coding, which end up being extremely similar both in theory and practice. They even have basically the same drawback: They aren't great with larger alphabets.

Arithmetic coding wishes it only had to compress individual bits... in practice, you represent larger alphabets using a "tree" of models. LZMA (7zip and xz) does this. When you see a symbol, you then need to adjust O(log N) binary models (where N is the alphabet size).

Is rANS viable here?

In general rANS is great at working with larger symbol alphabets -- that's one of the main benefits it offers over arithmetic coding... but if you remove it's lookup table of symbol frequency info, you have to update them as you go. And sadly, instead of just having to update O(log N) values like with arithmetic coding, rANS requires you update all of them.

A solution for this is suggested in this blog post near the end: Use nibbles instead of bytes for your symbols, and (potentially) use a (small, like 2 level) tree[3] of them -- similar to what I mentioned about arithmetic coding above.

This reduces the number of updates you must do to, adapt in response to seeing a new symbol drastically, and makes it SIMD amenable -- your adaptive info would be something like one or two [u16; 16] or [u8; 16]s stored in a handful of __m128is.

Unfortunately, setting the limit to 16... seems to balloon the SHORT_OFFSET_RUNS tables, to the point where it seems very unlikely that compressing the runs could recover the size difference.

I think this kills adaptive-rANS's viability for this use case in practice, although non-adaptive might work if we could amortize the cost of storing a symbol info table by using the same one for different boolean properties.

So... RIP rANS (for now 👻).

What about Arithmetic Coding?

That said, arithmetic coding still lives.

It has one main challenge remaining (aside from performance, which I didn't investigate but have no confidence in) which is that it has a per-entry overhead of some amount. In practice generally 4 bytes[4], and it's surprising that some runs should be considered too short to compress well.

So we need a way of having a per-run is_compressed flag. There aren't... really any free bits in SHORT_OFFSET_RUNS entries we can use.... We could store a separate bitmap for this, or use the padding 0 byte inserted between runs for it, I guess.

Either way, I'm running out of steam, so I'm just assuming that gets solved somehow, and there's a way to only apply compression to the runs which actually benefit from it. Beyond that, I don't think there are any non-performance-related remaining snags for arithmetic coding. In fact, I cobbled together some code (most of which I wrote years ago) to test this theory:

See this codepen[5], which can run arithmetic coding over whatever input you give it. If you choose "byte array" as the input format, you can try pasting example runs from the skiplist in.

I used https://gist.github.com/thomcc/7af9089b4ec77985af799f6d6c81f468 to inspect the runs of various tables, which also includes the dump of the runs for the alphabetic table.

In practice I get around 40% compression for longer runs, and in general fairly good values, especially if I adjust limit offset to 1 << 7 or 1 << 6 (and tune down bits on that page accordingly). That said, doing so adds a few hundred bytes to the table each time, so it would have to be a net benefit.

Closing remarks for now

So, I'm going to say that even despite it being "viable", I expect that in practice the perf will be too bad here with the arithmetic coding... Wish I had tried non-adaptive rANS instead, and may still.

That said, maybe it will be fine -- I'd have thought the skiplist perf would be too much of a hit (I wonder if it's worth making the libcores tests look like: c.is_ascii_foo() || (!c.is_ascii() && unicode_data::foo::lookup(c)), if they don't already...).

All that said, I'm probably going to investigate more, but a lot of my concerns are around reducing the size required for use cases that basically require all the tables (regex impls, high level language runtimes, etc...). This was just one idea I had that gave me a starting point for thingking about compression again, and reading about rANS.

Finally, some thoughts about libcore's existing tables:

  1. I feel like you should be able to get rid of, or at least find a use for, the trailing 0 in each run that the comments say is present for indexing...

  2. Some boolean property tables are actually smaller if you use more traditional range pairs if you use separate tables for (u16, u16) and (u32, u32) ranges (ucd-generate will learn how to do this soon, I have a WIP patch).

    For example Decimal_Number and White_Space are both this way. That said, the savings is small, but maybe more substantial in terms of how much code is required for the searching.

  3. I find the BITSET_CANONICAL arrays weirdly beautiful. Surprising they have that pattern...


Footnotes, because I can't help myself

[0] Note to anybody unfamiliar, the skiplist works similar to the following: do a binary search on one array, which produces a range of indices into a second array. Do a linear search over this range.

This is an oversimplification, but close enough to for this discussion. See this doc comment for a better description.

[1]: To the point that even using RLE almost feels worth it, but in practice genreally doesnt seem to be -- even though there are sequences like 1, 1, 1, 1, 1 and similar, it's way more common to not have repetition -- that said, they are low entropy.

[2]:Note: In reprospect, this was a mistake, as it puts ANS variants at a big disadvantage which could probably be accounted for by using a single table shared by all skiplists for offsets -- as it definitly appears that different tables still have correlated offset frequencies. That said, I still have a bunch of open questions here on how this works in practice...

This blog post talks some about techniques for using rANS with static distributions, but I'm not entirely sure it applies here.

[3]: I don't think we'd need to use the tree, since we control the input data, and so saving the complexity and overhead of initializing the model would be worth it.

Actually, maybe this part is worth investigating again? Obviously just "use nibbles" didn't work great... It still seems like way to much complexity in this area, though...

[4]: This is also true with ANS, but for different reasons (need to read in initial decoder state).

[5]: ... It's in javascript since I had some JS arithmetic coding stuff lying around, and it's a langauge I use for experimenting sometimes. Don't judge me -- I know it being a rust playground link would be much more useful to everybody.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants