Compress documentation per-crate, not per-file #1004

Closed
jyn514 opened this issue Aug 25, 2020 · 17 comments · Fixed by #1342
Labels
A-builds Area: Building the documentation for a crate C-enhancement Category: This is a new feature

Comments

@jyn514
Member

jyn514 commented Aug 25, 2020

Currently, docs.rs stores each generated HTML file individually on S3. This has the advantage that downloading a single page is fast and efficient, but means that it's very expensive to download all files for a crate (cc #174), particularly because some crates can have many thousands of generated files. It also makes uploads more expensive, since S3 charges per upload request and every file is a separate request.

Docs.rs should instead store a single archive per crate and compress the entire archive. That would decrease storage costs and upload times, and allow retrieving an entire crate's documentation efficiently. The downside is that for crates with many gigabytes of documentation, loading a single page would take much longer. Perhaps some crates could be exempted from archives if they're over a certain size?

This would also make it more feasible to implement #464, since the upload costs would be greatly decreased.

@jyn514 jyn514 added A-builds Area: Building the documentation for a crate C-enhancement Category: This is a new feature labels Aug 25, 2020
@jyn514
Member Author

jyn514 commented Aug 25, 2020

Another idea @pietroalbini had was to have split archives for very large crates: have one file storing an 'index' of the byte offset in the archive for each file. That would allow doing range-requests for that specific file, without having to download the whole archive.

This would require compressing each individual file and not compressing the archive, but should make it scalable even to crates with many gigabytes of documentation. For small crates (say, < 3 MB), we could still have the index as part of the archive itself.
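The index-plus-range-request idea above can be sketched roughly as follows. This is a minimal illustration, not the docs.rs implementation: the function names are hypothetical, `zlib` stands in for whatever compressor would actually be used, and slicing a byte string stands in for an HTTP range request against S3.

```python
import zlib


def build_archive(files):
    """Compress each file on its own and concatenate the results.

    Returns (archive_bytes, index), where index maps a path to the
    (offset, length) of that file's compressed bytes in the archive.
    """
    archive = bytearray()
    index = {}
    for path, content in files.items():
        compressed = zlib.compress(content)
        index[path] = (len(archive), len(compressed))
        archive += compressed
    return bytes(archive), index


def range_request(archive, offset, length):
    """Stand-in for an HTTP GET with a `Range: bytes=...` header."""
    return archive[offset:offset + length]


def read_file(archive, index, path):
    """Fetch and decompress one file without touching the rest of the archive."""
    offset, length = index[path]
    return zlib.decompress(range_request(archive, offset, length))
```

Because every file is compressed independently, `read_file` only needs the bytes named by the index entry, which is what keeps this workable for very large crates.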

@pietroalbini
Member

I'm not sure it was my idea (I remember reading it on Discord a while ago) :)

@Nemo157
Member

Nemo157 commented Oct 31, 2020

Recent stats from the #1019 metrics:

[three screenshots of the metrics]

The drops are from when the service restarted during a deploy yesterday. I'm not sure what caused the spike, seems likely to be some kind of crawler active for an hour.

One interesting stat we can draw from this: non-default platforms are used, but relatively rarely. Over the last hour before the screenshots there were 6130 different versions of crates accessed, and 7340 different platforms of those versions, so ~1.2 platforms per version (compared with the 5 platforms per version that are built). That implies that we definitely want to compress documentation per-platform, since the majority of alternative platforms are unlikely to be loaded and we don't want to waste space caching their indexes locally (maybe also relevant for #343 @jyn514).

The main thing we can draw from these stats: if we can hit a 10k item MRU cache then we get an ~1 hour eviction period, for 5k items ~30 minute eviction.
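A bounded cache of archive indexes could look roughly like the sketch below. It assumes a least-recently-used eviction policy, which is my reading of the comment rather than something it states; the class name, the key shape (crate, version, target) and the `load` callback are all hypothetical.

```python
from collections import OrderedDict


class IndexCache:
    """Keeps at most `capacity` indexes, evicting the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def __contains__(self, key):
        return key in self._entries

    def get(self, key, load):
        """Return the cached index for `key`, calling `load(key)` on a miss."""
        if key in self._entries:
            # Mark as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        value = load(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            # Drop the entry that has gone unused for the longest time.
            self._entries.popitem(last=False)
        return value
```

With a capacity of 10k entries and the access rate quoted above, an entry that stops being requested would survive for roughly the eviction period mentioned in the comment.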

I've started experimenting with a library + CLI to handle the archive and indexing at https://github.com/Nemo157/oubliette.

@Nemo157
Member

Nemo157 commented Nov 5, 2020

From discussion on Discord:

  • It's likely worth deduping all platforms into the same archive file
  • But that would make it more likely to hit the threshold for caching the full archive locally
  • We could build multiple indexes pointing into the same archive file to avoid bloating the index size
  • We could decide per-crate whether to split across multiple archives based on how close the default target is to the caching threshold
  • Or we could always put the default target into a separate archive, and archive the other platforms together
  • Or a more complex scheme where it isn't deduped against but other targets can point to its files, and allow truncating the archive after one target (this is hairy but is the kind of thing you can do if it's a totally custom format)

    you could imagine the archive format as basically a bunch of concatenated "target archives" one after the other, where a target archive is only allowed to reference files either in its own archive or one before. then if you put the default first, you can produce a valid archive by truncating it on the end
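The "concatenated target archives" idea can be modelled in memory as below. This is only a sketch of the referencing rule: the segment layout, the function names and the use of Python lists instead of a byte format are all assumptions made for illustration.

```python
def build(targets):
    """Build one segment per target, default target first.

    `targets` is an ordered list of (target_name, {path: content}).
    Each segment is (target_name, blobs, table), where table maps a path to
    (segment_number, blob_number). A reference may point into the segment's
    own blobs or into an earlier segment, never a later one.
    """
    segments = []
    seen = {}  # content -> (segment_number, blob_number) of its first copy
    for segment_number, (name, files) in enumerate(targets):
        blobs, table = [], {}
        for path, content in files.items():
            if content not in seen:
                seen[content] = (segment_number, len(blobs))
                blobs.append(content)
            table[path] = seen[content]
        segments.append((name, blobs, table))
    return segments


def read(segments, target, path):
    """Resolve a path for a target, following a reference to an earlier segment."""
    for name, _blobs, table in segments:
        if name == target:
            segment_number, blob_number = table[path]
            return segments[segment_number][1][blob_number]
    raise KeyError(target)


def truncate(segments, keep):
    """Keep only the first `keep` segments; the result is still self-contained."""
    return segments[:keep]
```

Since references only ever point backwards, cutting the list after the default target leaves nothing dangling, which is the property the quoted message describes.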

@syphar
Member

syphar commented Feb 1, 2021

I was intrigued by the topic and have been digging into it. I'm not sure about the goals of this issue.

After talking to @jyn514 and @pietroalbini, I think they are:

primary goals

  • maintenance burden of many files on S3 (coming from @pietroalbini )
  • downloadable docs for offline readers ( coming from Downloadable docs #174 )
  • constraints: not needing more storage space, roughly the current speed (coming from @jyn514 )

secondary goals

  • better response times for rustdoc pages (caching a full archive for the version/target and serving from the local archive)
  • needing less storage space on S3

Coming from these, I'm not sure why we chose these approaches in the previous comments. IMHO inventing a custom archive format or deduping only serves the purpose of even smaller size, while we then need to recompress to offer downloadable docs (or let users use some custom format).

Wouldn't a simple approach be better to start with?

  • a simple archive format which supports range-requests (ZIP for example)
  • just an archive per crate/version/target
  • storing an index (file name + offset) next to it (can be regenerated from the ZIP whenever we want)
  • caching the indexes locally (then we even could answer exist-queries directly from the index)
  • downloadable docs can directly use the ZIP.

Wouldn't that give us all the primary goals? (I understand that it would not use much less space, since ZIP would only compress by-file, to give us the advantage of downloading single files out of the archive).
As a small optimisation, we could even skip storing the index on S3 and instead fetch just the last bytes of the ZIP to read its directory.
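To make the ZIP proposal concrete, here is a rough sketch using Python's standard `zipfile`, `struct` and `zlib` modules. The helper names are hypothetical, slicing a byte string stands in for an S3 range request, and building the index by opening the whole archive is a simplification of reading only the directory at the end of the file.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER_SIZE = 30  # fixed-size part of a ZIP local file header


def build_zip(files):
    """Create a DEFLATE-compressed ZIP in memory; each file is compressed on its own."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def build_index(archive_bytes):
    """Map each file name to (local header offset, compressed size)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {
            info.filename: (info.header_offset, info.compress_size)
            for info in archive.infolist()
        }


def range_request(archive_bytes, offset, length):
    """Stand-in for an HTTP range request against the stored archive."""
    return archive_bytes[offset:offset + length]


def read_file(archive_bytes, index, name):
    """Read one file using only range requests and the index."""
    header_offset, compressed_size = index[name]
    header = range_request(archive_bytes, header_offset, LOCAL_HEADER_SIZE)
    # Bytes 26..30 of the local header hold the name and extra-field lengths.
    name_length, extra_length = struct.unpack("<HH", header[26:30])
    data_offset = header_offset + LOCAL_HEADER_SIZE + name_length + extra_length
    raw = range_request(archive_bytes, data_offset, compressed_size)
    # ZIP stores a raw DEFLATE stream, hence the negative window size.
    return zlib.decompress(raw, wbits=-15)
```

This version needs two range requests per file (header, then data). Storing the data offset directly in the index would cut that to one, at the cost of a slightly larger index.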

Even using zstd and a custom archive format wouldn't benefit much from the overlap between all the HTML files across versions / targets because to be able to access single files without decompressing the whole archive, we always have to compress the files on their own.

What am I missing?

@Nemo157
Member

Nemo157 commented Feb 1, 2021

> downloadable docs for offline readers (coming from #174)

This was not a primary goal AIUI, just something that this could potentially make possible. I would personally rate improving response times higher on the list than it.

> Even using zstd and a custom archive format wouldn't benefit much from the overlap between all the HTML files across versions / targets because to be able to access single files without decompressing the whole archive, we always have to compress the files on their own.

Using zstd with a custom dictionary and per-file compression gives a large space saving, something around 1/4 or 1/5 of the total compressed size (benchmark results here). That doesn't matter so much in terms of S3 usage, but might help a little if data transfer rates from S3 are slowing us down (I assume it's all lookup overhead and the actual data transfer is minuscule). The place it really helps is if it can make some archives small enough that we can trivially cache them locally on the web server and avoid the remote lookup entirely. According to Grafana, that S3 lookup is currently about 82 of the 105ms on average to render a rustdoc page at the 95th percentile; with a locally cached archive I would expect it to be sub-millisecond, reducing the total to about 23ms.
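The effect of a shared dictionary on per-file compression can be illustrated with zlib's preset-dictionary support. Note the substitution: this uses DEFLATE with a `zdict`, not zstd, purely because it is available in the Python standard library; the function names are hypothetical and the actual ratios quoted above come from the linked benchmark, not from this sketch.

```python
import zlib


def compress(data, dictionary=None):
    """Compress one file on its own, optionally primed with a shared dictionary."""
    if dictionary is None:
        compressor = zlib.compressobj()
    else:
        compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress(data, dictionary=None):
    """Inverse of `compress`; the same dictionary must be supplied."""
    if dictionary is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(data) + decompressor.flush()
```

If the dictionary contains markup that every rustdoc page repeats, each file can refer back to it instead of encoding that markup again, so files stay individually addressable while still benefiting from cross-file redundancy.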

@jyn514
Member Author

jyn514 commented Feb 1, 2021

> maintenance burden of many files on S3

In particular, if we stored files in a single archive, it would be feasible to re-upload docs for old crates (#464). Right now that costs several thousand dollars.

@syphar
Member

syphar commented Feb 2, 2021

> > downloadable docs for offline readers (coming from #174)
>
> This was not a primary goal AIUI, just something that this could potentially make possible. I would personally rate improving response times higher on the list than it.

At the least I was right in thinking that I hear conflicting goals on this topic 😄. Or that what I heard from @jyn514 and @pietroalbini didn't match with the discussions here on this issue. (If I misunderstood, please correct me)

> > Even using zstd and a custom archive format wouldn't benefit much from the overlap between all the HTML files across versions / targets because to be able to access single files without decompressing the whole archive, we always have to compress the files on their own.
>
> Using zstd with a custom dictionary and per-file compression gives a large space saving, something around 1/4 or 1/5 of the total compressed size (benchmark results here). That doesn't matter so much in terms of S3 usage, but might help a little if data transfer rates from S3 are slowing us down (I assume it's all lookup overhead and the actual data transfer is minuscule). The place it really helps is if it can make some archives small enough that we can trivially cache them locally on the web server and avoid the remote lookup entirely. According to Grafana, that S3 lookup is currently about 82 of the 105ms on average to render a rustdoc page at the 95th percentile; with a locally cached archive I would expect it to be sub-millisecond, reducing the total to about 23ms.

When thinking world-wide by far the biggest lever on site speed is IMHO not the S3 download, but using the CDN.
For a normal rustdoc page, most of the time is not spent on the server, but on the roundtrip to the US (100-150ms), and content-download (100-600ms). Add another roundtrip for every redirect that users hit, depending on where they come from.

I've done multiple setups with Fastly (it has a special program for open source projects, which CloudFront doesn't have), which (even without the open source program) were all cheaper and faster than CloudFront.

We could have worldwide, stable response times of <30ms for almost all pages, with <1s between a release and the page being updated, with perhaps a day of work (mostly around returning correct caching headers and purging the right parts automatically), while likely saving money. Fastly can also serve stale content and fetch the new page in the background; the new content is still live after 1-2 seconds.

Even when optimising the hell out of server-side response times, for most of the world it would only make a difference of 10-20% of the response time, and we would still have to build the server-local caching.

So to sum it up:
IMHO, making speed improvements a secondary goal for this issue would reduce effort and risk (by using a standard archive format), while still letting us support #464 and #174 and reduce the maintenance burden.

@jyn514
Member Author

jyn514 commented Feb 2, 2021

+1 for focusing on shrinking the number of files rather than improving response times. I think it would be nice to improve response times but not the primary focus. I don't think the difference in size between zstd and DEFLATE is worth giving up range-requests. We haven't had issues with storage size in quite a while, it's been between 3 and 4 TB since about a year ago which seems reasonable (it dropped off quite a bit after we started compressing files).

@syphar
Member

syphar commented Feb 2, 2021

If I remember the code by @Nemo157 correctly, it also would have allowed range requests, since the files were compressed separately in the archive. It was using zstd file-by-file and was concatenating the compressed streams into a single archive.

With the focus on the number of files, we could start with compressing after the build and serving via range requests, while of course keeping the option to add webserver-local archive caching later.

@jsha
Contributor

jsha commented Jun 3, 2021

Out of curiosity, what are the crates with gigabytes of documentation?

@jyn514
Member Author

jyn514 commented Jun 3, 2021

@jsha most of the stm32* crates are enormous

@beautifulentropy

beautifulentropy commented Jun 4, 2021

You could also set up a CloudFront distribution for the S3 bucket and serve docs from cached edge locations. Just keep in mind that you'll be increasing complexity on the AWS side of things. When an object in the S3 bucket is replaced, the cached copy in the CloudFront distribution persists until the path it resides in is invalidated. Each path invalidation call can take 60-300 seconds; the first 1000 are free, and each one after that costs $0.005 USD. A Lambda job can "push" these path invalidations when an object is updated (nice writeup on this approach here: https://kupczynski.info/2019/01/09/invalidate-cloudfront-with-lambda-s3.html).

@syphar
Member

syphar commented Jun 4, 2021

> You could also set up a CloudFront distribution for the S3 bucket and serve docs from cached edge locations.

Currently docs.rs rewrites the rustdoc HTML, for example to add the header and the footer.
Since this happens in the web server and not in the build process, serving directly from S3 through CloudFront won't work for us.

@jsha
Contributor

jsha commented Jun 8, 2021

Doing some back-of-the-envelope calculations, according to https://aws.amazon.com/s3/pricing/, uploading costs $0.005 per 1k objects uploaded, and storage is $0.023 per GB/month. So if the average crate is 200MB and 1000 files, that's $0.0046 in monthly storage costs, and $0.005 in per-upload costs. So uploading all crates every 6 weeks would (assuming these numbers, which I don't have a good basis for) approximately double costs.
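The arithmetic above can be written out as a small calculation. The prices and the 200MB / 1000-file crate are the figures from the comment (which itself flags them as guesses); the function names are hypothetical, and the 1000 MB per GB conversion follows the comment's rounding.

```python
PUT_COST_PER_1000_OBJECTS = 0.005   # USD, figure quoted in the comment
STORAGE_COST_PER_GB_MONTH = 0.023   # USD, figure quoted in the comment


def upload_cost(object_count):
    """Cost in USD of uploading `object_count` objects once."""
    return object_count / 1000 * PUT_COST_PER_1000_OBJECTS


def monthly_storage_cost(size_mb):
    """Cost in USD of storing `size_mb` megabytes for one month."""
    return size_mb / 1000 * STORAGE_COST_PER_GB_MONTH
```

For the assumed average crate, `monthly_storage_cost(200)` comes to about $0.0046 and `upload_cost(1000)` to $0.005. If the same crate were uploaded as a single archive, `upload_cost(1)` would be $0.000005, which is why fewer objects make re-uploading old crates (#464) more feasible.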

@syphar is this still something you're interested in working on? I'd love to get #464 unblocked. We've been making a bunch of UI changes to rustdoc output and I'm worried folks will be confused seeing a variety of subtly different interfaces on docs.rs.

@jyn514
Member Author

jyn514 commented Jun 8, 2021

@jsha see #1342

@syphar
Member

syphar commented Jun 9, 2021

Yes, I'm still working on it, though progress wasn't as fast as I planned it to be in the last weeks.

I have a working prototype which still needs some work, but I would say I'm halfway there.
