Compress documentation uploaded to S3 #379
Comments
Could you populate decompression times as well? In some sense those are considerably more important for us to decide whether this is viable. Compression times for all of those are not great though :/
@Mark-Simulacrum Why would we need decompression? You can send compressed files as-is (though I think only gzip compression is supported by web browsers?).
That's true, in general, certainly. I guess presumably ~all clients are […]. I do think this shouldn't be that hard to add -- presumably we'd add a "compressed" column to the files table and, if it's set, access files in S3 at […].
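A minimal sketch of that "compressed" column idea, assuming gzip via the flate2 crate; the record type and field names are illustrative placeholders, not docs.rs's actual schema or code:

```rust
use std::io::Read;

use flate2::read::GzDecoder;

/// Illustrative stand-in for a row of the hypothetical `files` table.
struct FileRecord {
    s3_path: String,
    /// The proposed new column: was this blob compressed before upload?
    compressed: bool,
}

/// Turn the bytes fetched from S3 into HTML that docs.rs can post-process
/// (e.g. inject the top bar into) and then serve.
fn body_for_serving(record: &FileRecord, s3_bytes: Vec<u8>) -> std::io::Result<Vec<u8>> {
    if record.compressed {
        // Stored compressed: decompress before rewriting/serving.
        let mut html = Vec::new();
        GzDecoder::new(&s3_bytes[..]).read_to_end(&mut html)?;
        Ok(html)
    } else {
        // Legacy uncompressed upload: serve the bytes as they are.
        Ok(s3_bytes)
    }
}
```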
Added decompression time: gzip is slower than zstd and brotli, and those two are roughly the same. Of course production performance for decompression is going to be way better, as you don't have to write to my hard disk (:sweat_smile:), everything is in memory, the binary is not restarted every time, and we will decompress way fewer files. Considering the decompression time, brotli is probably the best choice?
No, because docs.rs needs to tweak those HTML files (for example to add the top bar).
Given those decompression times, this seems problematic. Even if they were an order of magnitude lower, that's still ~250ms/file we read - and slowing down all requests to docs.rs by that much seems poor. In very unscientific benchmarking (web developer tools), it looks like we respond in about ~60ms on js/css content today and ~150ms on HTML content -- this would significantly increase that.
Let me create a proper benchmark.
By the way, the numbers are for compressing and decompressing 2040 files, not a single file.
Ah, yeah, that makes sense. I did think the numbers were awfully large :) We probably want single-file timing information, as that's what's important for the primary use case of serving them.
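For reference, a rough sketch of how single-file timing could be measured, assuming a locally stored zstd-compressed rustdoc page (the path is a placeholder) and the zstd crate:

```rust
use std::fs;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Placeholder path: any single compressed rustdoc page will do.
    let compressed = fs::read("struct.Client.html.zst")?;

    let start = Instant::now();
    let html = zstd::decode_all(&compressed[..])?;
    let elapsed = start.elapsed();

    println!(
        "decompressed {} -> {} bytes in {:?}",
        compressed.len(),
        html.len(),
        elapsed
    );
    Ok(())
}
```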
Ok, scratch that: I wrote a proper benchmark, and the results are way more accurate: […]
Okay, those numbers look great! We can definitely afford microseconds of decompression time. I'll investigate doing this legwork, though I'll keep the upload going in the meantime. I think re-compressing existing files and such can happen in parallel and over time if needed, not sure.
Did some more brotli benchmarks: […]
Based on that, I think the two options we should consider are […]. I'm sort of preferring […].
Putting the cached content in an iframe avoids decompression and post-processing. It would require an extra HTTP request, but since it's cached, it's possible to give it a unique name and Cache-Control: immutable so browsers never try to reload it if it's in cache. That might well make up for the extra requests.
Preliminary results on just a few hundred crates (specifically, […]):
Unfortunately it isn't this simple; see #679 (comment) for why iframes aren't a good option.
Benchmarks based on those @pietroalbini linked above (code here), adding in […]:
What's the timing of generating dictionaries and how often does it have to happen?
That dictionary took about a minute, though if we get a dictionary trainer that can handle more than 4GB of data, so it can load the entire archive @namibj made, I assume it'll take longer. Preferably we would never generate new dictionaries: as soon as one is used to compress some data, that dictionary needs to be kept by docs.rs forever. Maybe if in the future rustdoc completely changes how it generates documentation it'd be worth training a new dictionary, but for any minor changes the learnt data should hopefully still be relevant (one idea might be to include docs from a range of old rustdoc versions, to reduce overfitting on how the latest version encodes its docs).
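A sketch of how a shared dictionary could be trained once and then reused, using the zstd crate's dictionary support; the sample paths, dictionary size, and output filename are made up, and which corpus to train on is exactly the open question discussed here:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Made-up sample set: rustdoc pages from several crates, ideally spanning
    // a few rustdoc versions to avoid overfitting the latest output format.
    let samples: Vec<Vec<u8>> = [
        "samples/winapi/index.html",
        "samples/reqwest/index.html",
        "samples/serde/index.html",
    ]
    .iter()
    .map(|path| fs::read(path))
    .collect::<Result<_, _>>()?;

    // Train a ~110 KiB dictionary (the size here is an arbitrary choice).
    let dict = zstd::dict::from_samples(&samples, 112_640)?;

    // Once any stored file is compressed with it, this dictionary has to be
    // kept forever, since it is needed to decompress that file later.
    fs::write("docs-rs.dict", &dict)?;

    // Compress one page with the dictionary and compare sizes.
    let page = fs::read("samples/reqwest/index.html")?;
    let mut compressor = zstd::bulk::Compressor::with_dictionary(19, &dict)?;
    let compressed = compressor.compress(&page)?;
    println!("{} -> {} bytes", page.len(), compressed.len());
    Ok(())
}
```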
I think we should also train on more crates than winapi, which is a little special. Maybe we could add an embedded crate like […].
At the moment we don't compress the documentation uploaded to S3, wasting a lot of space. While money for S3 isn't an issue right now, not compressing the docs could hurt docs.rs's sustainability in the future.
I ran some very rough benchmarks on reqwest 0.9.3, compressing each .html file separately:
Bad benchmark
[table of compressed sizes and timings at the three tools' best and fast levels (-9/-1, -19/-1, -9/-0); the numbers were not captured]
Looking at the numbers, on average if we compress the uploaded docs we're going to save 63% of the storage space, which is great from a sustainability point of view. I think we should compress all the uploaded docs going forward, and try to compress (part of) the initial import as well.
For the algorithm choice, I'd say we can go with gzip: there isn't much difference between the resulting sizes, and the compression time delta between gzip's fast and best modes is the smallest. We can compress the initial import with -1 to speed it up, and all the new crates with -9.
cc @Mark-Simulacrum @QuietMisdreavus
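A sketch of the proposed gzip path using the flate2 crate, with the level picked as described above (fast for the one-off initial import, best for newly built docs); the function name and the initial_import flag are placeholders, not docs.rs code:

```rust
use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;

/// Gzip one HTML blob before it is uploaded to S3.
fn compress_for_upload(html: &[u8], initial_import: bool) -> std::io::Result<Vec<u8>> {
    let level = if initial_import {
        Compression::fast() // roughly `gzip -1`: prioritize backfill speed
    } else {
        Compression::best() // roughly `gzip -9`: prioritize storage savings
    };
    let mut encoder = GzEncoder::new(Vec::new(), level);
    encoder.write_all(html)?;
    encoder.finish()
}
```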
Benchmark method
Installed the compression tools on Ubuntu 18.04 LTS.
Downloaded the reqwest documentation locally.
Compressed every .html file separately with find (the exact commands were not captured; a rough Rust equivalent of this step is sketched below).
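As an illustration of that last step, here is a rough Rust equivalent, assuming the flate2 and walkdir crates and a placeholder docs directory: it gzips every .html file separately and reports the overall saving.

```rust
use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;
use walkdir::WalkDir;

fn main() -> std::io::Result<()> {
    let (mut before, mut after) = (0u64, 0u64);

    // Placeholder path: wherever the reqwest documentation was downloaded.
    for entry in WalkDir::new("reqwest-docs").into_iter().filter_map(Result::ok) {
        if entry.path().extension().map_or(false, |ext| ext == "html") {
            let html = std::fs::read(entry.path())?;

            // Equivalent of running `gzip -9` on a single file.
            let mut encoder = GzEncoder::new(Vec::new(), Compression::best());
            encoder.write_all(&html)?;
            let compressed = encoder.finish()?;

            before += html.len() as u64;
            after += compressed.len() as u64;
        }
    }

    println!(
        "{} -> {} bytes ({:.0}% saved)",
        before,
        after,
        100.0 * (1.0 - after as f64 / before as f64)
    );
    Ok(())
}
```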