Compress documentation per-crate, not per-file #1004
Comments
Another idea @pietroalbini had was to have split archives for very large crates: have one file storing an 'index' of the byte offset in the archive for each file. That would allow doing range-requests for that specific file, without having to download the whole archive. This would require compressing each individual file and not compressing the archive, but should make it scalable even to crates with many gigabytes of documentation. For small crates (say, < 3 MB), we could still have the index as part of the archive itself.
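To make the idea concrete, here is a minimal sketch of what the lookup side could look like. The `FileEntry` struct and `range_header` helper are purely illustrative and do not correspond to any existing docs.rs code; they just show how a byte-offset index translates into an HTTP range request for a single file.

```rust
use std::collections::HashMap;

/// Hypothetical index entry: where one individually compressed file lives
/// inside the (otherwise uncompressed) archive object on S3.
struct FileEntry {
    offset: u64, // byte offset of the file's compressed data in the archive
    length: u64, // compressed length in bytes
}

/// Build the HTTP `Range` header needed to fetch just this file from the
/// archive object, instead of downloading the whole archive.
fn range_header(index: &HashMap<String, FileEntry>, path: &str) -> Option<String> {
    let entry = index.get(path)?;
    // HTTP byte ranges are inclusive on both ends.
    Some(format!("bytes={}-{}", entry.offset, entry.offset + entry.length - 1))
}
```

For small crates, as suggested above, the index itself could simply be embedded in (or fetched alongside) the archive.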
I'm not sure it was my idea (I remember reading it on Discord a while ago) :)
Recent stats from the #1019 metrics: the drops are from when the service restarted during a deploy yesterday. I'm not sure what caused the spike; it seems likely to be some kind of crawler active for an hour.

One interesting stat we can draw from this: non-default platforms are used, but relatively rarely. Over the last hour before the screenshots there were 6130 different versions of crates accessed, and 7340 different platforms of those versions, so ~1.19 platforms per version (compared with the 5 platforms per version that are built). That does imply that we definitely want to compress documentation per-platform, since the majority of alternative platforms are unlikely to be loaded and we don't want to waste space caching their indexes locally (maybe also relevant for #343 @jyn514).

The main thing we can draw from these stats: if we can hit a 10k item MRU cache then we get an ~1 hour eviction period, and for 5k items an ~30 minute eviction period.

I've started experimenting with a library + CLI to handle the archive and indexing at https://github.com/Nemo157/oubliette.
From discussion on Discord: …
I was intrigued by the topic and have been digging into this, but I'm not sure about its goals. After talking to @jyn514 and @pietroalbini, I think the primary goals are: …
Secondary goals: …
Coming from these goals, I'm not sure why we chose the approaches in the previous comments. IMHO inventing a custom archive format or deduplicating only serves the purpose of an even smaller size, while we would then need to recompress to offer downloadable docs (or make users deal with a custom format). Wouldn't a simple approach be better to start with?
Wouldn't that give us all the primary goals? (I understand that it would not use much less space, since ZIP only compresses file-by-file, which is what gives us the advantage of downloading single files out of the archive.) Even using … What am I missing?
This was not a primary goal AIUI, just something that this could potentially make possible. I would personally rate improving response times higher on the list than it.
Using …
In particular, if we stored files in a single archive, it would be feasible to re-upload docs for old crates (#464). Right now that costs several thousand dollars.
At least I was right in thinking that I'm hearing conflicting goals on this topic 😄. Or that what I heard from @jyn514 and @pietroalbini didn't match the discussion here on this issue. (If I misunderstood, please correct me.)
When thinking world-wide, by far the biggest lever on site speed is IMHO not the S3 download but using the CDN. I've done multiple setups with Fastly (there is a special open-source program, which CloudFront doesn't have), which (without the OS program) were all cheaper and faster than CloudFront. We could have worldwide, stable response times of <30 ms for …

Even when optimising the hell out of server-side response times, for most of the world it would only make a difference of 10-20% of the response time, while building the server-local caching …

So, to sum it up: …
+1 for focusing on shrinking the number of files rather than improving response times. Improving response times would be nice, but it's not the primary focus. I don't think the difference in size between zstd and DEFLATE is worth giving up range requests. We haven't had issues with storage size in quite a while; it's been between 3 and 4 TB since about a year ago, which seems reasonable (it dropped off quite a bit after we started compressing files).
If I remember the code by @Nemo157 correctly, it also would have allowed range requests, since the files were compressed separately in the archive: it used zstd file-by-file and concatenated the compressed streams into a single archive. With the focus on file numbers we could start with compressing after the build plus range requests, while of course keeping the option to add webserver-local archive caching later.
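For reference, a rough sketch of that kind of layout, assuming the `zstd` crate; the function names and the `(offset, length)` index shape here are illustrative and are not the actual oubliette or docs.rs API:

```rust
use std::collections::HashMap;

/// Compress each file into its own zstd frame and concatenate the frames,
/// recording where each frame lands so it can later be fetched on its own.
fn build_archive(
    files: &[(String, Vec<u8>)],
) -> std::io::Result<(Vec<u8>, HashMap<String, (u64, u64)>)> {
    let mut archive = Vec::new();
    let mut index = HashMap::new();
    for (path, contents) in files {
        // Compress each file into an independent zstd frame (level 0 = default).
        let compressed = zstd::encode_all(contents.as_slice(), 0)?;
        // Record (offset, length) of the frame inside the concatenated archive.
        index.insert(path.clone(), (archive.len() as u64, compressed.len() as u64));
        // Append it. Frames stay independently decompressible, which is what
        // makes byte-range requests for a single file possible.
        archive.extend_from_slice(&compressed);
    }
    Ok((archive, index))
}

/// Decompressing one file only needs the bytes covered by its index entry.
fn read_file(archive: &[u8], (offset, len): (u64, u64)) -> std::io::Result<Vec<u8>> {
    zstd::decode_all(&archive[offset as usize..(offset + len) as usize])
}
```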
Out of curiosity, what are the crates with gigabytes of documentation?
@jsha most of the stm32* crates are enormous
You could also set up a CloudFront distribution for the S3 bucket and serve docs from cached edge locations. Just keep in mind that you'll be increasing complexity on the AWS side of things. When an object in the S3 bucket is replaced, the cache in the CloudFront distribution will persist until the path it resides in is invalidated. Each path invalidation call can take 60-300 seconds; the first 1000 are free, $0.005 USD per path after that. A Lambda job can "push" these path invalidations when an object is updated (nice writeup on this approach here: https://kupczynski.info/2019/01/09/invalidate-cloudfront-with-lambda-s3.html).
Currently docs.rs rewrites the rustdoc HTML, for example to add the header and the footer.
Doing some back-of-the-envelope calculations, according to https://aws.amazon.com/s3/pricing/, uploading costs $0.005 per 1k objects uploaded, and storage is $0.023 per GB per month. So if the average crate is 200 MB and 1000 files, that's $0.0046 in monthly storage costs and $0.005 in per-upload costs. So uploading all crates every 6 weeks would (assuming these numbers, which I don't have a good basis for) approximately double costs.

@syphar is this still something you're interested in working on? I'd love to get #464 unblocked. We've been making a bunch of UI changes to rustdoc output and I'm worried folks will be confused seeing a variety of subtly different interfaces on docs.rs.
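A quick sketch of that arithmetic, using the prices quoted above and the commenter's rough assumption of a 200 MB / 1000-file "average" crate:

```rust
fn main() {
    // S3 prices quoted in the comment above.
    let put_cost_per_1k_objects = 0.005; // USD per 1,000 uploaded objects
    let storage_cost_per_gb_month = 0.023; // USD per GB per month

    // Assumed "average" crate (rough figures, not measured).
    let crate_size_gb = 0.2; // ~200 MB of generated docs
    let files_per_crate = 1_000.0;

    let monthly_storage = crate_size_gb * storage_cost_per_gb_month; // ≈ $0.0046
    let upload_cost = (files_per_crate / 1_000.0) * put_cost_per_1k_objects; // ≈ $0.005

    println!("storage: ${monthly_storage:.4}/month, upload: ${upload_cost:.4} per re-upload");
}
```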
Yes, I'm still working on it, though progress wasn't as fast as I planned over the last few weeks. I have a working prototype which still needs some work, but I would say I'm about halfway there.
Currently, docs.rs stores each generated HTML file individually on S3. This has the advantage that downloading a single page is fast and efficient, but means that it's very expensive to download all files for a crate (cc #174), particularly because some crates can have many thousands of generated files. It also makes uploads more expensive, since S3 charges per file stored.
Docs.rs should instead store a single archive per crate and compress the entire archive. That would decrease storage costs and upload times, and allow retrieving a crate's entire documentation efficiently. It would have the downside that for crates with many gigabytes of documentation, loading a single page would take much longer - perhaps some crates could be exempted from archives if they're over a certain size?
This would also make it more feasible to implement #464, since the upload costs would be greatly decreased.