Add support for the compressed bitset format that rustc uses for some tables in libcore #30
Have you looked at and compared this with the trie-set? In any case, yes, I'm generally in favor of loading … In the worst case, it would be OK for …
Yeah, for the cases where it does well, I believe it's meaningfully smaller than the trie-set, and faster too. The trie-set is pretty similar to what rustc used to use before switching to this format. They're also different data structures: the trie-set is more interested in prefix compression, while this is more about multiple levels of deduplication.
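For readers following along, here is a minimal sketch of the basic deduplication shape being described, assuming a simple two-level layout (rustc's real format, documented in raw_emitter.rs, has more levels and tricks; names here are illustrative):

```rust
/// Two-level deduplicated bitset: the raw membership bitmap is cut into
/// 64-bit chunks, duplicate chunks are stored only once in `chunks`, and a
/// u8 index table maps each chunk position to its deduplicated word.
fn contains(cp: u32, chunk_idx: &[u8], chunks: &[u64]) -> bool {
    let bit = cp as usize;
    // First level: which deduplicated word covers this codepoint?
    let word = chunks[chunk_idx[bit / 64] as usize];
    // Second level: test the bit within that word.
    (word >> (bit % 64)) & 1 != 0
}
```

The u8 index is also why the 256-distinct-chunks limit discussed below matters.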
I managed to run into the u8 limit only when I tried to generate a set for "all assigned codepoints" (mostly out of curiosity...). That said, there's an extent to which input entropy is more relevant than size. It's also pretty easy to emit code using u16 if u8 doesn't fit, although that may somewhat defeat the point.
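A sketch of that fallback, purely illustrative (the real generator doesn't necessarily do this):

```rust
// Pick the emitted index type from the number of distinct chunks: a u8 index
// can address at most 256 deduplicated chunks, so fall back to u16 (doubling
// the index table's size) only when genuinely needed.
fn index_type(distinct_chunks: usize) -> &'static str {
    if distinct_chunks <= 256 { "u8" } else { "u16" }
}
```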
Some other reasons for that are probably:
But mostly that it's already written and there's no compelling improvement (the trie-set is also a bit of a wash, since it would mean adding a dependency for a small amount of code you currently control).
My hesitation is mostly that it's really obviously suboptimal for a lot of these (the current version has very clearly been tuned for just those tables). I played around with having the code try a few layouts and emit the one with the smallest size, which is possibly viable (it would mean we'd have to emit the code to access it too, but that's probably not that bad, since it's quite small -- small enough to be …)
100% agree, although I don't know what else to do -- I'm not aware of template libraries that are good for code generation; even just "comma-separate this, and newline-and-indent when it gets too long" seems to be annoying for many of them, even though that's a ridiculously common need for this case. That said, I haven't looked that closely; I just badly want to not make things worse.
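For what it's worth, the chore being described doesn't take much code; a rough sketch of a "comma-separate and wrap" helper (illustrative, not from any real generator):

```rust
// Join items with ", ", starting a fresh indented line whenever appending
// the next item would exceed `width` columns.
fn format_list(items: &[&str], width: usize, indent: &str) -> String {
    let mut out = String::from(indent);
    let mut line_len = indent.len();
    for (i, item) in items.iter().enumerate() {
        if i > 0 {
            if line_len + 2 + item.len() > width {
                // Wrap: trailing comma, newline, re-indent.
                out.push_str(",\n");
                out.push_str(indent);
                line_len = indent.len();
            } else {
                out.push_str(", ");
                line_len += 2;
            }
        }
        out.push_str(item);
        line_len += item.len();
    }
    out
}
```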
As a brief note, this thread sparked my interest in trying to optimize a bit more, and I have some patches in progress that so far shrink the overall tables by another 16% :)
This is a concern of mine as well -- in particular, the alphabetic table is at 255 distinct bitset chunks (even after deduplication), which is really close to the 256 that a u8 index can allow. I've not spent a huge amount of time on this, though, as in almost all cases my guess is that the easiest scheme is to simply split the input ranges into two (or however many) pieces you need and encode them separately, with a table at the beginning which branches into each. This should be quite cheap and fairly easy to implement (especially if done non-generically); see the sketch below. With regards to "how" to implement generators -- personally, I've spent most of my time semi-offline, thinking about algorithms or looking at generated tables to come up with compression schemes; I don't know that frameworks etc. are all that helpful. The amount of code for formatting lists and writing out the Rust tables in the Rust project's generator is trivial -- probably 20-30 lines or so (out of ~1000 lines total). As an aside, …
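A minimal sketch of that split-and-branch scheme, under assumed names (`SPLIT`, `lookup_low`, and `lookup_high` are made up; the two lookups stand in for independently generated tables):

```rust
// Split the codepoint space at a boundary and generate a separate bitset
// table for each half, so each half's deduplicated chunk pool stays under
// the 256-entry u8 limit.
const SPLIT: u32 = 0x10000;

fn contains(c: char) -> bool {
    let cp = c as u32;
    if cp < SPLIT {
        lookup_low(cp) // table generated over U+0000..U+FFFF only
    } else {
        lookup_high(cp - SPLIT) // separate table, re-based at zero
    }
}

// Stubs standing in for the two generated tables.
fn lookup_low(_cp: u32) -> bool { false }
fn lookup_high(_cp: u32) -> bool { false }
```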
This is actually something I was thinking about when writing the original code, and I've since made the change to automatically choose an optimal chunk size (from 1 to 64, currently) by brute-force search, with the criterion of minimal bytes "on disk".
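Something like the following, where `encode` is a hypothetical stand-in for the actual table emitter:

```rust
// Brute-force the chunk size: encode the set at every size from 1 to 64 and
// keep whichever size produces the fewest bytes "on disk".
fn best_chunk_size(codepoints: &[u32]) -> usize {
    (1..=64)
        .min_by_key(|&size| encode(codepoints, size).len())
        .unwrap()
}

fn encode(_codepoints: &[u32], _chunk_size: usize) -> Vec<u8> {
    unimplemented!("stand-in for the real bitset emitter")
}
```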
I've not actually tried this recently! Currently the generator is pretty tightly tied to u64 (though this is mostly a mistake; I should refactor it to be more general). I very much appreciate more ideas to experiment with, though! One thing that would be super interesting for me to try -- though I'm not sure how applicable or easy it is, as I've not looked into it much -- is to use ANS to encode the bitset and see how slow it is to access and what kind of size we get. My understanding is that it's likely to be pretty slow (there's no random access at all, though some sort of block format could help alleviate that), and I got scared off by indications that you need some pretty big tables (kilobytes in size -- and most of ours already encode down to less than a kilobyte). But it could be interesting as a generic scheme.
Far too many words on applying entropy coding to Unicode tables below.

@Mark-Simulacrum, earlier you said:
So I was a little skeptical of this for libcore's case when you said it -- I have some background in compression, and figured that random access was just too fundamentally important. But... after seeing your updated table design (in particular the skiplist), I got it in my head that maybe you could compress the skiplist offset "runs".[0]

Anyway, this is half for the benefit of others, but half so that I can come back to it later. (Uh, perhaps I should not use @BurntSushi's GitHub issues as a notebook... 😅)

Compressing "runs"

Anyway, the reason I got stuck thinking about "runs" inside the skiplist …
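(For anyone who hasn't seen it, footnote [0] below describes the skiplist's shape. Here's a rough, hypothetical sketch of the lookup, with illustrative names rather than rustc's exact SHORT_OFFSET_RUNS/OFFSETS layout, and with the simplifying assumption that every run starts "outside" the set:)

```rust
// Binary-search the run starts to find which run covers `cp`, then linearly
// accumulate that run's u8 deltas; each boundary crossed toggles membership.
fn skiplist_contains(run_starts: &[u32], runs: &[&[u8]], cp: u32) -> bool {
    let run = run_starts.partition_point(|&s| s <= cp).saturating_sub(1);
    let mut boundary = run_starts[run];
    let mut inside = false; // simplifying assumption: runs start outside
    for &delta in runs[run] {
        boundary += delta as u32;
        if boundary > cp {
            break;
        }
        inside = !inside;
    }
    inside
}
```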
Furthermore, we can easily control the "symbol alphabet" of the run -- the maximum value that appears in one. Right now we split the run if a u8::try_from fails, but we could easily adjust that if it simplifies things (as it certainly will, as I'll get into). However, on the downside, the runs are so short that it may be very hard to get any value out of compressing them -- especially using adaptive algorithms like these.

Anyway, this is basically pointing to some form of entropy coding applied to the runs being what we want. In practice this means one of:
So, some of these need a table of data themselves. Those tables are smaller, but might be upwards of [u32; 256]... Perhaps they could be optimized, but... we're already trying to figure out how to optimize a table of data for size; the last thing I want right now is even more tables of data which themselves need to be optimized for size...[2]

So I'm throwing out algorithms that need a table. That disqualifies Huffman, tANS, and rANS with non-adaptive probabilities, as they all need tables. This leaves us with adaptive rANS or arithmetic coding, which end up being extremely similar in both theory and practice. They even have basically the same drawback: they aren't great with larger alphabets. Arithmetic coding wishes it only had to compress individual bits... in practice, you represent larger alphabets using a "tree" of binary models. LZMA (7zip and xz) does this. When you see a symbol, you then need to adjust O(log N) binary models (where N is the alphabet size).

Is rANS viable here?

In general rANS is great at working with larger symbol alphabets -- that's one of the main benefits it offers over arithmetic coding... but if you remove its lookup table of symbol frequency info, you have to update the frequencies as you go. And sadly, instead of just having to update O(log N) values as with arithmetic coding, you end up updating O(N) of them.

A solution for this is suggested in this blog post near the end: use nibbles instead of bytes for your symbols, and (potentially) use a (small, like 2-level) tree[3] of them -- similar to what I mentioned about arithmetic coding above. This drastically reduces the number of updates you must do to adapt in response to seeing a new symbol, and makes it SIMD-amenable -- your adaptive info would be something like one or two …

Unfortunately, setting the limit to 16... seems to balloon the SHORT_OFFSET_RUNS tables, to the point where it seems very unlikely that compressing the runs could recover the size difference. I think this kills adaptive rANS's viability for this use case in practice, although non-adaptive might work if we could amortize the cost of storing a symbol info table by using the same one for different boolean properties. So... RIP rANS (for now 👻).

What about Arithmetic Coding?

That said, arithmetic coding still lives. It has one main challenge remaining (aside from performance, which I didn't investigate but have no confidence in), which is that it has a per-entry overhead of some amount -- in practice generally 4 bytes[4] -- so it's unsurprising that some runs should be considered too short to compress well. We therefore need a way of having a per-run is_compressed flag. There aren't... really any free bits in SHORT_OFFSET_RUNS entries we can use... We could store a separate bitmap for this, or use the padding …

Either way, I'm running out of steam, so I'm just assuming that gets solved somehow, and that there's a way to only apply compression to the runs which actually benefit from it. Beyond that, I don't think there are any non-performance-related snags remaining for arithmetic coding. In fact, I cobbled together some code (most of which I wrote years ago) to test this theory: see this codepen[5], which can run arithmetic coding over whatever input you give it. If you choose "byte array" as the input format, you can try pasting example runs from the skiplist in. I used https://gist.github.com/thomcc/7af9089b4ec77985af799f6d6c81f468 to inspect the runs of various tables, which also includes a dump of the runs for the alphabetic table.
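To make the "tree of binary models" point concrete, here is a hedged sketch of the LZMA-style adaptive probability update (constants and layout are illustrative, not LZMA's actual code; the point is the O(log N)-updates-per-symbol shape):

```rust
// One adaptive binary model: a probability (that the next bit is 0) in
// [0, 4096), nudged toward each observed bit.
const PROB_BITS: u32 = 12;
const ADAPT_SHIFT: u32 = 5; // adaptation speed

#[derive(Clone, Copy)]
struct BitModel {
    prob: u16,
}

impl BitModel {
    fn new() -> Self {
        BitModel { prob: 1 << (PROB_BITS - 1) } // start at 50/50
    }
    fn update(&mut self, bit: u32) {
        if bit == 0 {
            self.prob += ((1 << PROB_BITS) - self.prob) >> ADAPT_SHIFT;
        } else {
            self.prob -= self.prob >> ADAPT_SHIFT;
        }
    }
}

// A byte is coded as 8 binary decisions down a 255-node tree, so seeing one
// symbol only touches 8 = O(log N) models -- the cost arithmetic coding
// pays in place of rANS's frequency table.
fn update_byte(models: &mut [BitModel; 255], byte: u8) {
    let mut node = 1usize; // nodes indexed 1..=255, as in LZMA
    for i in (0..8).rev() {
        let bit = ((byte >> i) & 1) as u32;
        models[node - 1].update(bit);
        node = (node << 1) | bit as usize;
    }
}
```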
In practice I get around 40% compression for longer runs, and in general fairly good values, especially if I adjust the offset limit to …

Closing remarks for now

So, I'm going to say that even despite it being "viable", I expect that in practice the perf of arithmetic coding will be too bad here... I wish I had tried non-adaptive rANS instead, and may still. That said, maybe it will be fine -- I'd have thought the skiplist perf would be too much of a hit, too (I wonder if it's worth making libcore's tests look like …).

All that said, I'm probably going to investigate more, but a lot of my concerns are around reducing the size required for use cases that basically require all the tables (regex impls, high-level language runtimes, etc...). This was just one idea I had that gave me a starting point for thinking about compression again, and for reading about rANS.

Finally, some thoughts about libcore's existing tables:
Footnotes, because I can't help myself:

[0]: Note for anybody unfamiliar: the skiplist works similar to the following: do a binary search on one array, which produces a range of indices into a second array; then do a linear search over this range. This is an oversimplification, but close enough for this discussion. See this doc comment for a better description.

[1]: To the point that even using RLE almost feels worth it, but in practice it generally doesn't seem to be -- even though there are sequences like …

[2]: Note: in retrospect, this was a mistake, as it puts the ANS variants at a big disadvantage, which could probably be accounted for by using a single offsets table shared by all skiplists -- as it definitely appears that different tables still have correlated offset frequencies. That said, I still have a bunch of open questions here on how this works in practice... This blog post talks some about techniques for using rANS with static distributions, but I'm not entirely sure it applies here.

[3]: I don't think we'd need to use the tree, since we control the input data, and so saving the complexity and overhead of initializing the model would be worth it. Actually, maybe this part is worth investigating again? Obviously just "use nibbles" didn't work great... It still seems like way too much complexity in this area, though...

[4]: This is also true with ANS, but for different reasons (you need to read in the initial decoder state).

[5]: ... It's in JavaScript, since I had some JS arithmetic coding stuff lying around and it's a language I use for experimenting sometimes. Don't judge me -- I know a Rust playground link would be much more useful to everybody.
Documented here: https://github.com/rust-lang/rust/blob/master/src/tools/unicode-table-generator/src/raw_emitter.rs, added in rust-lang/rust#68232
So I actually experimented with this a bit shortly after I heard about it. Unfortunately, it requires a bit of tweaking to behave well for certain tables. You'll note that it says:
Indicating the sort of tweaking you sometimes need to do. Additionally:
And even without that, there are many tables that are obviously way larger than they need to be (anything containing characters with both high and low codepoints) -- I ended up with several tables looking like:
which can very clearly be improved (for example, you could imagine solving it with some sort of wrap-around offset, and then trimming any trailing zeros after the wraparound -- see the sketch below). I also have very little doubt that if the generation code in rustc had produced a table like this, the generator would have been tweaked so that it did better.
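A sketch of the wrap-around idea under stated assumptions (`ROTATE` and `TABLE_LEN` are made-up values; the point is just that rotating the codepoint space can make sparse high and low regions adjacent so the all-zero tail can be trimmed):

```rust
// Rotate the codepoint space by a fixed offset before indexing the table.
// If the populated regions sit near both ends of 0..0x110000, a suitable
// rotation packs them together at the front, and the table only needs to
// cover the resulting prefix.
const ROTATE: u32 = 0x10000;
const TABLE_LEN: u32 = 0x2000; // rotated table covers only this prefix

fn table_index(cp: u32) -> Option<u32> {
    let rotated = (cp + ROTATE) % 0x110000;
    if rotated < TABLE_LEN {
        Some(rotated) // index into the trimmed table
    } else {
        None // past the trimmed tail: definitely not in the set
    }
}
```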
I'm half filing this just to document the investigation, and to note that I don't think it's as simple as taking the generator from rustc and dropping it in.
I'm not sure how much it's ucd-generate's place to try to solve these issues, but it would be great if it could -- right now there's no compelling reason to switch to ucd-generate for crates that already use a Python generator, and support for a format which is both smaller and faster would be that reason (not to mention, it would be great to reduce the size of these tables in common Rust programs).
So yeah, it would be great, and it's something I'm interested in, but it's subtle and might be more trouble than it's worth...
@Mark-Simulacrum might have some insight if they aren't too busy coming up with further clever methods of encoding data :p