feature request: blob level compression #147
Comments
Huge +1 from me. I was talking about this just today! We want to compress data in our in-memory CAS implementation. We have been experimenting with using compression at the gRPC layer but that is, as you said, completely undone when you put it in the CAS. My immediate thought was to model these the same way we model the different digesting algorithms: via capabilities. As I understand things today, you cannot switch between digesting algorithms without restarting everything from scratch (including the CAS, although that can support multiple digests simultaneously per capabilities), so, if we want to support it in v2, we could just add a new capability called Compression. That could be dictated by the server just like digesting function is today. Apart from the inelegance, what would the downsides be there that I have missed? |
Convenient timing - we've just been working through trying to leverage gRPC transport-level compression ourselves, and finding it very difficult to capitalize on in practice, to the point we're giving up on it for now. So +1 to exploring API-level compression instead. What we've gotten to in our exploration thus far is:
Mostyn, to your ideas specifically:
|
One problem with the Capabilities API is that it only describes the server side: the server can't use it to check the capabilities of the client. The client would need to specify its preferences/capabilities separately, eg in the resource_name for the byte stream API, or maybe via new fields in the other protos. (This isn't a problem for digests because they have mostly non-overlapping key sizes, but I would prefer a nicer fix for this in v3 too, see eg #136.) |
+1, I think we should always offer uncompressed blobs, since the cache is oblivious to what is being stored. Only clients have the information about which blobs are likely to benefit from compression, so IMO it should be opt-in from the client side.
I'm also interested in exploring other potential solutions, eg: introduce a new Blob proto which contains a list of one or more Digests with corresponding encodings. That might make seekable compressed byte streams easy, since we could address compressed blobs directly.
Hmm, right. It would be good to avoid needing to look for magic bytes in the response.
+1 to focusing on ByteStream first, and worrying about the other services later. |
A note for implementers: With this change you're likely to need to keep a map from:
The reason being, you're likely sooner or later to want to cache computation by re-compressing files into the algorithm(s) that (a) provide the best compression ratio, and (b) are most provided in the accepted content format list from clients. Kind of like if you're streaming online video to a 4K TV and a phone with low res, the back-end might dynamically recode for the small screen on first call to provide fewer bits on the wire, but keep the result (or kick off a parallel job to create) and store a low res version of the high res content. |
Hi! @Erikma pointed me to this thread. I'm the one responsible for the weird This gets tricky with compression. Some options:
To be clear, I'm not saying "don't do compression because of |
(VSO0 == VsoHash in the digest algorithm list in the RE protobuf) |
Compress each block individually. This could be ok, but it means the
downloader would need to understand that the content is not an uncompressed
stream, nor a compressed stream, but is a concatenation of multiple
compressed streams.
Many compression algorithms, including zstd (my recommendation for folk to
choose), are concatenatable: you can pipe a sequence of individual
compressed streams to a single decompressor. This gives you the option of
uploading or storing individual chunks if you wish, but streaming them out
to a single client in sequence without intermediate recompression.
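
To illustrate the concatenation property described above, here is a minimal sketch. It uses gzip from the Python standard library as a stand-in (zstd is the recommendation in this comment, but it is not in the standard library of every Python version); the chunk contents are made up for illustration.

```python
import gzip

# Hypothetical chunks that were compressed and stored independently.
chunks = [b"first chunk, ", b"second chunk, ", b"third chunk"]
compressed_chunks = [gzip.compress(chunk) for chunk in chunks]

# Stream the stored chunks back to back, with no recompression in between.
stream = b"".join(compressed_chunks)

# A single decompressor call handles the multi-member stream.
restored = gzip.decompress(stream)
```

Here `restored` equals the original chunks joined together, so a client reading the stream does not need to know where the chunk boundaries were.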
at no point does the server need to stream the data to a temporary
location as it hashes it and then only copy it to a final location once the
hash is verified.
Assuming you can rename a file in the destination system, this is not
really a bad thing imho, and frequently desired anyways - streaming to the
*final* location directly causes problems with concurrent uploads, slow
uploads, partial uploads, etc. If you're storing in independent chunks with
a small enough chunk size this is maybe not a problem (10s of kilobytes
say), but that's then bad for disk IO and volume of metadata since you have
highly fragmented files. But even at 1MB chunks I'd still recommend a
temp-file-and-rename pattern. (Though storage systems differ wildly; your
mileage may vary).
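
A minimal sketch of the temp-file-and-rename pattern, assuming a local filesystem CAS keyed by SHA-256 hex digest. The function name `store_blob` and the directory layout are hypothetical, not part of any REAPI implementation.

```python
import hashlib
import os
import tempfile


def store_blob(chunks, expected_sha256: str, cas_dir: str) -> str:
    """Stream chunks to a temp file, hashing as we go, then rename into place."""
    final_path = os.path.join(cas_dir, expected_sha256)
    hasher = hashlib.sha256()
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=cas_dir, prefix="upload-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                hasher.update(chunk)
                tmp.write(chunk)
        if hasher.hexdigest() != expected_sha256:
            raise ValueError("digest mismatch, discarding upload")
        # The blob only becomes visible under its final name once verified.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Concurrent or partial uploads each write to their own temp file, so a reader never observes a half-written blob under the final name.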
That said, there are definitely valid reasons to chunk on upload, including
getting single-file parallelism if you can upload multiple chunks faster
than streaming a whole file in sequence. Not currently supported in the CAS
API directly, but if you want to side-channel data into your storage
through some other API and *expose* it through the CAS for subsequent
interactions, should work well!
|
For single-file parallelism: We should consider a ByteStream change that specifies start offset and size within an uploading file. I have a note in our implementation of ByteStream to go experiment with custom headers for that, as we're often uploading one large file and overlapping chunks already works well for saturating the pipe for downloads (which do support reading from offsets). |
If cache implementations were to expose that to clients somehow, then compressed blobs would be directly addressable and we would not need any changes to the download part of the bytestream API, and I suspect VSO0 would not pose any trouble. If so, then I wonder if we can find a reasonable way to do this in v2? A new rpc in the ContentAddressableStorage service could do this at the cost of an extra roundtrip. Otherwise we could add new fields to existing messages like OutputFile. |
Unfortunately neither Azure Blob nor S3 offer object rename 😕 They do offer partial uploads that handle concurrency well: "When you upload a block to a blob in your storage account, it is associated with the specified block blob, but it does not become part of the blob until you commit a list of blocks that includes the new block's ID" |
John: ahh right, didn't realize that. I took a quick look at the APIs: for
S3 you'd start (potentially multiple independent) multi-part uploads for a
blob, upload bytes as they come hashing along the way, and then when
complete and valid one writer would 'commit' the blob making it visible.
That's similar to the concept of temp file + rename, you just don't have to
pick your own id for the temp file (it's provided by S3 in the form of the
upload ID). For Azure it looks like you would have multiple writers writing
(potentially redundant) blocks to the same logical blob under different
block IDs, and when one of them was done, it'd commit the block list with
all the blocks it uploaded and knew were correct, implicitly discarding the
rest. So you're right, no 'rename', but the same abstract idea of "*stream
bytes in without making them visible, hashing as you go, and then commit
the finalized blob when it's present and valid*" should work fine.
Erik: IIRC, VSO0 is a hash-of-hashes? For single-file parallelism, using
ByteStream to upload non-contiguous chunks of the same file independently
seems tricky: the server couldn't validate anything about the chunks until
all of them were seen, and it might have seen multiple different
candidates for any given byte range (since it does not yet know what
digests to expect for chunks). I guess the server would keep track of what
chunks of the whole had been uploaded as candidates, and when it thought
the whole file was covered, it could try using them to complete the
top-level hash and make the whole blob visible?
I might suggest a different pattern for the same - upload each of those N
chunks as different blobs, then use another (not currently present) API to
synthesize one blob out of multiple - akin to Azure's use of uploading a
chunk list to create a "final" blob. In your case you could have each of
those N blobs hashed and pre-validated server-side with the appropriate
algorithm on upload, such that creating an N-blob synthesis with VSO0 hash
would be quick to stitch (validate all referenced blobs are actually
present; hash the hashes of them to ensure it produces the right overall
hash, then store metadata reflecting that this blob is actually composed of
<X Y Z>).
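
The stitching step described above could be sketched roughly as follows. This is not the real VSO0 algorithm: SHA-256 over the concatenated chunk digests is used as a stand-in hash-of-hashes, and `stitch`, `top_level_digest`, and the dict-based store are hypothetical names for illustration only.

```python
import hashlib


def chunk_digest(chunk: bytes) -> str:
    """Digest under which an individual chunk blob is stored."""
    return hashlib.sha256(chunk).hexdigest()


def top_level_digest(chunk_digests: list) -> str:
    """Stand-in hash-of-hashes: hash the raw bytes of each chunk digest in order."""
    hasher = hashlib.sha256()
    for digest in chunk_digests:
        hasher.update(bytes.fromhex(digest))
    return hasher.hexdigest()


def stitch(store: dict, chunk_digests: list, expected: str) -> list:
    """Validate a synthesis request and return the metadata to record."""
    # 1. All referenced chunk blobs must already be present (and thus pre-validated).
    missing = [d for d in chunk_digests if d not in store]
    if missing:
        raise KeyError(f"missing chunks: {missing}")
    # 2. Hashing the chunk hashes must produce the claimed overall digest.
    if top_level_digest(chunk_digests) != expected:
        raise ValueError("top-level digest mismatch")
    # 3. The stitched blob is stored as a list of chunk references, not copied bytes.
    return list(chunk_digests)
```

Because each chunk was hashed on upload, the stitch itself only touches digests, which keeps it cheap regardless of the blob size.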
|
@EricBurnett This is exactly what my service (https://azure.microsoft.com/en-us/services/devops/artifacts/) does 👍
This works in our service as Azure Blob lets you name each of the chunks - so we name each chunk by the hash of the chunk. The overall flow:
|
Here's an initial proposal, which makes it possible to address blobs by their compressed digests, so read/write offsets still work: |
Thanks Mostyn! I'm quite happy with how the proposal is shaping up, personally - does anyone else have comments on Mostyn's current proposal? (Note that it has been materially revised since first linked 2w ago). I'll note the current proposal does allow offset reads still (a request of @Erikma ), but otherwise doesn't specifically consider block algorithms, and doesn't do anything specific around block uploads or concurrent partial uploads (as @johnterickson floated). I'm personally fine with this - it's consistent with the existing upload API, and (as I noted above) my suggested approach to block uploads would be to upload independent chunks as independent blobs and provide an additional API to stitch them. In that sense I think chunk uploads are out of scope for this Compression discussion, except to the extent anyone feels this proposal precludes features they want to propose. |
I think the proposal sounds good in regard to integration with the bytestream APIs, and I like the idea overall and think it can be useful. I'm a little concerned about the interaction with other CAS RPCs which would presumably require the server to decompress before serving them (although presumably this is the same for a client that requests a blob via the uncompressed bytestream paths, or a compressed one that isn't actually stored compressed). In general (as I noted on the doc) it gives the decision-making power to the client, but the server has better info to make that decision. I also have a bit of a concern on focusing just on bytestream; our artifacts consist mostly of Go objects (which are nearly all compressible and mostly small enough to be batched) or Python and Java (which are often incompressible and large enough to require streaming). I worry a bit that we'd not get a lot of benefit given that profile. Maybe others are in a different situation though? |
Thanks for reviewing Peter.
I hope we can distill the decision making to roughly "prefer compressed", making that a no-op. E.g. if the server is going to store data internally compressed anyways, it's a choice between "server decompresses and then sends" vs "server sends, client decompresses", where the latter is ~always better. Mostyn and I also discussed incompressible data, but we felt it was sufficient to use a faster compressor (e.g. one of the zstd negative levels) and/or a framing-only "compression" (e.g. gzip level 0 if you support gzip, which just wraps the uncompressed data in gzip headers) rather than to try to push knowledge of compressibility into the client.
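
As a rough illustration of the framing-only idea, the sketch below wraps incompressible data with gzip at level 0 using the Python standard library. The 4 KiB random payload is an arbitrary stand-in for an incompressible blob.

```python
import gzip
import os

# Random bytes stand in for an incompressible blob.
payload = os.urandom(4096)

# Level 0 stores the data without compressing it, adding only gzip framing.
framed = gzip.compress(payload, compresslevel=0)

# The cost is a small, roughly constant number of header and trailer bytes.
overhead = len(framed) - len(payload)
```

Any gzip-capable client can read `framed` with an ordinary decompressor, so the server does not need to signal whether a given blob was actually compressed.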
That's a fair point, if your use-case is biased more towards smaller batch reads than bytestream reads, you won't benefit as much right away. We're definitely not closing the door to adding a compressed version of the batch APIs, simply not starting there. Could I ask you to gather stats on how many bytes you move by ByteStream vs batch? I'll be interested to know how much of a concern it is for you in practice. On my server it's ~14:1 for reads, but you may have materially different numbers. |
Sure, I can have a look at that. And good point on the fast or framing-only compression levels, that makes sense. Thanks. |
IMO the client has more context to decide, eg:
Whereas cache servers have at most compressed and uncompressed data sizes.
If it looks like it's worth the effort, we could add a new |
OK so I have a couple of days worth of metrics now and it's approximately 3.5x in favour of streams for reads (more for writes). I don't have a lot of insight into compressibility of them but that definitely supports the idea that streams are indeed more valuable as you say. Thanks for the discussion - this SGTM then! |
Great, thanks for the data Peter!
I don't know what they'll be for you, but on my side I observe compression ratios around 0.4 for writes and 0.333 for reads. If that held ballpark true for your data, and if your streams and batch reads were equally compressible, you'd expect an ideal reduction of about 67% bytes read, and an effective reduction of about 52% with streams alone. That suggests it's definitely worth starting with streams to get the most impact with the least effort, but that you'll probably still desire to follow on with batch APIs too, as you'd expect an incremental 30% reduction going from streams-only compression to streams-and-batch compression. |
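
The arithmetic behind those estimates can be reproduced with a short sketch. It assumes the roughly 3.5:1 stream-to-batch read ratio reported earlier in the thread and the 0.333 read compression ratio quoted above; the function name `byte_reductions` is made up for illustration.

```python
def byte_reductions(stream_to_batch: float, compression_ratio: float):
    """Return (ideal, streams_only, incremental) reductions in bytes read."""
    # Share of all bytes read that travel over ByteStream.
    stream_share = stream_to_batch / (stream_to_batch + 1)
    # Reduction if every byte, stream or batch, were compressed.
    ideal = 1 - compression_ratio
    # Reduction if only the stream bytes are compressed.
    streams_only = stream_share * ideal
    # Further reduction, relative to streams-only, from compressing batch reads too.
    incremental = (ideal - streams_only) / (1 - streams_only)
    return ideal, streams_only, incremental


ideal, streams_only, incremental = byte_reductions(3.5, 0.333)
```

With these inputs the three values come out near 0.67, 0.52, and 0.31, in line with the percentages quoted above.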
It sounds like we're generally in agreement on Mostyn's proposal as written. @mostynb, can you turn your doc into a PR? |
Will do. |
In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs require that implementers provide data in unserialized and uncompressed form. By allowing compressed data at the REAPI level instead, we can avoid re-compressing the same data on each request. The ByteStream API stands to benefit the most from this, with the least amount of effort. Implements bazelbuild#147.
In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs require that implementers provide data in unserialized and uncompressed form. By allowing compressed data at the REAPI level instead, we can avoid re-compressing the same data on each request. The ByteStream API stands to benefit the most from this, with the least amount of effort. Thanks to Eric Burnett and Grzegorz Lukasik for helping with this. Implements bazelbuild#147.
This feature was added some time ago now. |
In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs for the languages I checked require implementors to provide data in unserialized and uncompressed form.
This has two (related) practical downsides for REAPI caches:
Contrast this with Bazel's more primitive HTTP remote cache protocol, where it is trivial for cache servers to store and serve compressed blobs by inspecting the request's Accept-Encoding header.
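
For comparison, the kind of header-based selection an HTTP cache server might do could look like the sketch below. It is a simplified illustration, not Bazel's or any server's actual code: `pick_encoding` is a hypothetical helper, and q-values other than an explicit `q=0` refusal are ignored.

```python
from __future__ import annotations

import gzip


def pick_encoding(accept_encoding: str, stored: dict[str, bytes]) -> tuple[str, bytes]:
    """Return (encoding, body) for the first stored encoding the client accepts."""
    accepted = set()
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        # Treat "q=0" as a refusal; other q-values are ignored in this sketch.
        if token and params.replace(" ", "") != "q=0":
            accepted.add(token)
    for encoding, body in stored.items():
        if encoding != "identity" and encoding in accepted:
            return encoding, body
    # Fall back to the uncompressed representation.
    return "identity", stored["identity"]


# Hypothetical cache entry holding both representations of one blob.
stored = {"gzip": gzip.compress(b"hello"), "identity": b"hello"}
encoding, body = pick_encoding("gzip, deflate", stored)
```

The server can hand out the stored compressed bytes directly whenever the client advertises support, without recompressing per request.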
I think it might be worth investigating adding optional blob level compression to REAPIv3, to avoid these downsides.
This might involve:
If we decide to implement encryption and authenticity checks (#133), then we may need to consider these two features at the same time during the design phase.