Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add blob split and splice API #282

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

roloffs
Copy link

@roloffs roloffs commented Nov 23, 2023

This is a proposal of a conservative extension to the ContentAddressableStorage service, which allows to reduce traffic when blobs are fetched from the remote CAS to the host for local usage or inspection. With this extension it is possible to request a remote-execution endpoint to split a specified blob into chunks of a certain average size. These chunks are then stored in the CAS as blobs and the ordered list of chunk digests is returned. The client can then check, which blob chunks are available locally from earlier fetches and fetch only the missing chunks. By using the digest list, the client can splice the requested blob from the locally available chunk data.

This extension could especially help to reduce traffic if large binary files are created at the remote side and needed locally such as executables with debug information, comprehensive libraries, or even whole file system images. It is a conservative extension, so no client is forced to use it. In our build-system project justbuild, we have implemented this protocol extension for server and client side.

@roloffs roloffs requested a review from bergsieker as a code owner November 23, 2023 13:40
@roloffs roloffs changed the title Adding blob-split API Adding blob split API Nov 23, 2023
@EdSchouten
Copy link
Collaborator

Though this proposal may solve certain annoyances, it does not fix some other issues:

  • The inability to upload files by uploading missing chunks. Given the fact that uploading is often slower than downloading (using typical internet connections provided by ISPs), I would argue that this is more important than downloading.
  • The inability for clients to access files at random, without sacrificing the ability to check file integrity.

I would rather see that we try to solve these issues as part of REv3, by working towards eliminating the existence of large CAS objects in their entirety. See this section in the proposal I've been working on:

https://docs.google.com/document/d/1FxpdOzOhzOCTjjn2loppMlBzjqjU9WpYF4E1K6opxVI/edit#heading=h.3ix2ngn3uabo

@sluongng
Copy link
Contributor

This is quite similar to https://github.com/bazelbuild/remote-apis/pull/233/files. @roloffs have you had a chance to review the PR prior and related discussions?

I think both PRs approaching this by adding a separate RPC, which is good and V2 compatible.
The V3 support for this could be discussed separately and should not block V2 improvement as long as it’s backward compatible?

@roloffs
Copy link
Author

roloffs commented Nov 26, 2023

@EdSchouten, thanks for sharing the link to your REv3 discussion document and your comments about this proposal. Sorry for not being aware of this document. After going through it, I agree with you that this API extension would not make much sense in REv3 given your proposal in the "Elimination of large CAS objects" section. However, as also @sluongng stated, since this extension is conservative and backwards compatible to REv2 and the release of REv3 is very uncertain right now, it would not harm people that do not use it, but would provide advantages already now for people who use it and could also lead to insights for your content-defined chunking ideas for REv3, since we also used such an algorithm to split blobs.

I also agree with your concerns that uploading large files is something not covered by this proposal, while there exist relevant use-cases for this. However, I can think of a symmetric SpliceBlob rpc to allow for splitting a large blob on the client side, uploading only those parts of this blob that are missing on the server side, and then splicing there. This could be added in this PR as well.

@sluongng, thanks for pointing out this PR. Despite the fact they look very similar they actually target complementary goals. Let me explain why. While the PR from @EdSchouten introduces split and combine blobs rpcs, the goal is not to safe traffic but to introduce a blob splitting scheme, which allows to verify the integrity of a blob by validating the digests of its chunks without actually reading the whole chunk data. In order to achieve this, he introduced a new digest function SHA256TREE, which allows recursive digest calculation. I hope I did not completely misunderstood your intention @EdSchouten. In contrast, the presented splitting scheme targets reuse as much as possible with the final goal of traffic reduction between client and server. E.g., if a large binary in the remote CAS was just modified slightly and you want to use it locally, you would have to download it completely. Using the presented extension, only the binary differences between the two versions determined by content-defined chunking would have to be downloaded, which is typically much less than the whole data. As I said both splitting schemes are actually complementary and follow different goals.

@sluongng
Copy link
Contributor

I think what's missing in this PR was a specification regarding how the splitting algorithm would look like, and the ability to choose different algorithms for the job.

In #233 , the chunking algorithm was mixed with the Digest algorithm, which I think is a good start as it's customizable. But I definitely can see cases where the Digest algorithm and Chunking algorithm are separated for different combinations (I.e. reed solomon + blake3, FastCDC + SHA256, delta compression + GITSHA1 etc...). And each combination could serve different purposes (deduplication, download parallelization, etc...).

It would be nice if you could provide a bit more detail regarding your splitting algorithm of choice as an option here.

@roloffs
Copy link
Author

roloffs commented Nov 28, 2023

While the actual choice of the splitting algorithm is mainly an implementation detail of the remote-execution endpoint (which of course affects the quality of the split result), the essential property of a server is to provide certain guarantees to a client if it successfully answers a SplitBlob request:

  1. The blob chunks are stored in CAS.
  2. Concatenating the blob chunks in the order of the digest list returned by the server results in the original blob.

Besides this guarantee, in order to increase the reuse factor as much as possible between different versions of a blob, it makes sense to implement a content-defined chunking algorithm. They typically result in chunks of variable size and are insensitive to the data-shifting problem of fixed-size chunking.

Such content-defined chunking algorithms typically rely on a rolling-hash function to efficiently compute hash values of consecutive bytes at every byte position in the data stream in order to determine the chunk boundaries. Popular algorithms for content-defined chunking are:

I have selected FastCDC as chunking algorithm for the endpoint implementation in our build system, since it has been proven to be very compute efficient and faster than the other rolling-hash algorithms while achieving similar deduplication ratios as the Rabin fingerprint. We already observed reuse factors of 96-98% for small changes, when working with big file-system images (around 800 MB) and also of 75% for a 300 MB executable with debug information.

Maybe, you want to have a look at our internal design document for more information about this blob-splitting API extension.

@sluongng
Copy link
Contributor

sluongng commented Nov 28, 2023

Ah I think I have realized what's missing here. Your design seems to be focusing on splitting the blob on the server side for the client to download large blobs. While I was thinking that blob splitting could happen to both the client side and the server side.

For example: a game designer may work on some graphic assets, say a really large picture. Subsequent versions of a picture may get chunked on the client side. Then the client can compare the chunk list with the chunks that are already available on the server-side, and only upload the parts that are missing.

So in the case where both client and server have to split big blobs for efficient download AND upload, it's beneficial for 2 sides to agree upon how to split (and put back together) big blobs.

@roloffs
Copy link
Author

roloffs commented Nov 28, 2023

Yes, you are right, this design currently focuses on splitting on the server side and downloading large blobs, but as mentioned in a comment above, I am willing to extend this design proposal by a SpliceBlob rpc to handle chunked uploads. This allows splitting a blob on the client side, uploading chunks that are missing at the server, and then splicing the original blob there.

Maybe it is worth to mention that in this case, it is not necessarily required for client and server to agree upon the same splitting algorithm, since after the first round-trip overhead, the chunking algorithm for each direction anyway ensures an efficient reuse.

I will update this proposal to handle uploads for you to review. Thank you very much for your interest and nice suggestions.

@sluongng
Copy link
Contributor

Do keep in mind that there could be mixed usage of clients (a) with chunking support and clients (b) without chunking support.

So I do believe a negotiation via the initial GetCapability RPC, similar to the current Digest and Compressor negotiation, is much desirable. As the server would need to know how to put a split blob upload, from (a), back together to serve it to (b).

I would recommend throwing the design ideas into #178. It's not yet settled whether chunking support needs to be a V3 exclusive feature, or we could do it as part of V2. Discussion to help nudge the issue forward would be much appreciated.

@roloffs roloffs force-pushed the main branch 2 times, most recently from 186240b to b74bac8 Compare November 30, 2023 16:04
@roloffs
Copy link
Author

roloffs commented Nov 30, 2023

@sluongng I have updated the PR with a more sharpened description of what is meant by and what is the goal of this blob-splitting approach and a proposal for the chunked upload of large blobs.

Some thoughts about your hints regarding the capabilities negotiation between client and server:

  • If a client does not support blob splitting, it would not call SpliceBlob at the server, but still can call SplitBlob at the server if the server supports it (an evaluation of the blob_split_support flag is enough to find this out). Then it is up to the server how to split the blob, it just needs to give the guarantees mentioned earlier.
  • If a client supports blob splitting, it could upload chunks to the server and use the SpliceBlob operation if the server supports it (an evaluation of the blob_splice_support flag is enough to find this out). Whereby supporting the splice operation mainly means being able to concatenate chunks of data specified by the uploaded list of chunk digests in the given order. The client could even exploit domain-specific knowledge (that the server does not have) to split its blobs into chunks to improve traffic reduction of subsequent uploads of modified blobs. Furthermore, it can also use SplitBlob at the server if the server supports it.

This means, each side is responsible for its chunking approach without having the other side to know about it. The other side just needs to be able to concatenate the chunks. Furthermore, it would be difficult to agree, e.g., on the same FastCDC algorithm, since this algorithm internally depends on an array of 256 random numbers (generated by the implementer) and thus could result in completely different chunk boundaries for two different implementations preventing any reuse between the chunks on the server and the client.

I will also put a summary of this blob splitting and splicing concept into #178. Would be nice if this concept could find its way into REv2 since it is just an extension free to use and no invasive modification.

@sluongng
Copy link
Contributor

sluongng commented Dec 2, 2023

Do give #272 and my draft PR a read on how client/server could negotiate for a spec leveraging GetCapabilities rpc. Could be useful if you want to have a consistent splitting scheme between client and server.

// The ordered list of digests of the chunks which need to be concatenated to
// assemble the original blob.
repeated Digest chunk_digests = 3;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should design this SpliceBlobRequest as part of BatchUpdateBlobsRequest.

The problem is that we are not pushing the chunk blobs in this request, only sending the CAS server metadata regarding combining some blobs into a larger blob. There could be a delay between the BatchUpdateBlobs RPC call and the SpliceBlob RPC call. That delay could be a few mili-seconds, or it could be weeks, or months after some of the uploaded chunks have expired from the CAS server. There is no transaction guarantee between the 2 RPCs.

So a way to have some form of transaction guarantee would be to send this as part of the same RPC that uploads all the blobs.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we are not pushing the chunk blobs in this request, only sending the CAS server metadata regarding combining some blobs into a larger blob.

But that is the overarching principle behind the whole protocol that we upload blobs and later refer to them by their digests. All that works, because the CAS promises to keep all blobs fresh (i.e., not forget about them for a reasonable amount of time) where its answer implies it knows about them (that could be the answer to a FindMissingBlobs request or a success statement to a blob upload request). The typical workflow for a client using this request would anyway be to split the blob locally, use FindMissingBlobs to find out which blobs are not yet known to the CAS, then (batch) upload only the ones not yet known to the CAS (that's where the savings in traffic come from) and then request a splice of all of them. All this works becasue the promise of the CAS to keep referenced objects alive.

To give a prominent example where the protocol is already relying on that guarantee to keep objects alive, consider the request to execute an action. That request does not upload any blobs, yet still expects them to be there because a recent interaction with the CAS showed they are in the CAS already. In a sense, that example also shows that blob splicing is nothing fundamentally new, but just an optimisation: the client could already now request an action calling cat be executed—in a request that is independent of the blob upload. However, making it explicitly an operation on the CAS gives a huge room for optimization: no need to spawn an action-execution environment, the CAS knows ahead of time hash and size of the blob that is to be stored as a result of that request, if it known the blob in question it does not even have to do anything (apart from keeping that blob fresh), etc.

@roloffs
Copy link
Author

roloffs commented Dec 20, 2023

Hello @sluongng, I have updated the proposal by using the capabilities service as you have proposed. It is now possible for a client to determine the supported chunking algorithms at the server side and select one at a SplitBlob request. By this means, the client can select one that it also uses locally so that both communication directions benefit from the available chunking data on each side. Furthermore, I have added some comments about lifetime of chunks. Thanks for your time reviewing this PR!

@roloffs roloffs closed this Feb 27, 2024
roloffs added a commit to roloffs/remote-apis that referenced this pull request Feb 27, 2024
Depending on the software project, possibly large binary artifacts need to be
downloaded from or uploaded to the remote CAS. Examples are executables with
debug information, comprehensive libraries, or even whole file system images.
Such artifacts generate a lot of traffic when downloaded or uploaded.

The blob split API allows to split such artifacts into chunks at the remote
side, to fetch only those parts that are locally missing, and finally to
locally assemble the requested blob from its chunks.

The blob splice API allows to split such artifacts into chunks locally, to
upload only those parts that are remotely missing, and finally to remotely
splice the requested blob from its chunks.

Since only the binary differences from the last download/upload are
fetched/uploaded, the blob split and splice API can significantly save network
traffic between server and client.
@roloffs roloffs reopened this Feb 27, 2024
@roloffs roloffs changed the title Adding blob split API Add blob split and splice API Feb 27, 2024
@roloffs
Copy link
Author

roloffs commented Feb 27, 2024

Hello all,

after spending quite some time working on this proposal and its implementation, I have finished incorporating all suggestions made by the reviewers and that came up during the working group meeting. Finally, the following high-level features would be added to the REv2 protocol:

  1. Optional split operation at the CAS service to reduce traffic for the download direction.
  2. Optional splice operation at the CAS service to reduce traffic for the upload direction.
  3. Negotiation mechanism between client and server to agree about the used chunking algorithm.

This whole proposal is fully implemented in our own remote-execution implementation in justbuild:

  • Split operation provided at server side
  • Splice operation provided at server side

and used by the just client:

  • Split operation used at client side
  • Splice operation used at client side

From my side, this proposal is finished and ready for final review. What I would like to know from you is what needs now to be done that this proposal finally gets merged into main. I can also summarize it again at the next working group meeting and at best would like to know a decision how to proceed with this proposal. Thank you very much for your efforts.

@EdSchouten, @sluongng, @bergsieker

roloffs added a commit to roloffs/remote-apis that referenced this pull request Feb 27, 2024
Depending on the software project, possibly large binary artifacts need to be
downloaded from or uploaded to the remote CAS. Examples are executables with
debug information, comprehensive libraries, or even whole file system images.
Such artifacts generate a lot of traffic when downloaded or uploaded.

The blob split API allows to split such artifacts into chunks at the remote
side, to fetch only those parts that are locally missing, and finally to
locally assemble the requested blob from its chunks.

The blob splice API allows to split such artifacts into chunks locally, to
upload only those parts that are remotely missing, and finally to remotely
splice the requested blob from its chunks.

Since only the binary differences from the last download/upload are
fetched/uploaded, the blob split and splice API can significantly save network
traffic between server and client.
Copy link
Contributor

@mostynb mostynb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already effectively have the ability to splice blobs by using the bytestream API with read_offset and read_limit?

// The digest of the blob to be splitted.
Digest blob_digest = 2;

// The chunking algorithm to be used. Must be IDENTITY (no chunking) or one of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead reject IDENTITY as an invalid argument? I imagine this would only be used by broken clients?

Copy link
Author

@roloffs roloffs Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about that, I have basically copied the pattern from PR #276 to include a sane default value. I leave that open to your decision, I have no objections to change this.

Copy link
Author

@roloffs roloffs Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this field from IDENTITY to DEFAULT, because as you mentioned IDENTITY does not really makes sense being requested by a client. Instead, to provide a proper default value for the chunking algorithm enum, I have introduced DEFAULT, which means the client does not care, which exact chunking algorithm is used by the server, just use the default implementation. If a client wants to negotiate more explicitly about the used chunking algorithm, it should specify one of the other enum values that are supported and advertised by the server.

I hope, this resolves your concerns? @mostynb

@roloffs
Copy link
Author

roloffs commented Feb 28, 2024

@mostynb, as far as I have understood the protocol, no. While the bytestream API with read_offset and read_limit allows you to partially read the content of a blob, it does not allow you to create a new blob from a batch of other blobs (its chunks) at the remote CAS.

The goal of blob splicing is that if a client regularly uploads slightly different large objects to the remote CAS, only the binary differences between the versions are needed to be uploaded and not the entire block of binary data every time. To achieve this, the client needs to split the large object into reusable chunks (which is typically done by content-defined chunking) and just uploads the chunks (handled as blobs) that are missing at the remote CAS, which are normally a lot when uploading the first time. If the client needs to upload this large object again, but a slightly different version of it (meaning only a percentage of the binary data has been changed), he again splits it into chunks and tests which chunks are missing at the remote CAS. Normally, content-defined chunking splits the binary data that hasn't been changed into the same set of chunks, only where binary differences occur, different chunks will be created. This means, only a fraction of the whole set of chunks need to be uploaded to the remote CAS in order to be able to reconstruct the second version of the large object at the remote CAS. The actual reconstruction of a large blob at the remote side is done using the splice command with a description of which chunks need to be concatenated (a list of chunk digests available at the remote CAS).

The split operation works exactly the other way around, when you regularly download an ever changing large object from remote CAS. Then, the server splits the large object into chunks, the client fetches only the locally missing chunks and reconstructs the large object locally from the locally available chunks.

Finally, to exploit chunking for both directions at the same time, it makes sense that the client and the server agree on a chunking algorithm to allow reusing chunks created on both sides. For this, we added a negotiation mechanism to agree on the chunking algorithm used on both sides.

// (Algorithm 2, FastCDC8KB). The algorithm is configured to have the
// following properties on resulting chunk sizes.
// - Minimum chunk size: 2 KB
// - Average chunk size: 8 KB
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that using small chunk sizes, such as 8 KB, can increase the likelihood of deduplication and may also reduce the risk of disk storage fragmentation. However, have you considered if there is potential performance overhead of having too many fine-grained CAS blobs?

I can envision that a feature like this could also be beneficial in distributing the load more evenly across multiple CAS shards. But for such use cases, it might make sense to use much larger chunks, perhaps 8 MB? Should we somehow accommodate also for larger chunks in this PR?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We experimented with FastCDC on approximately 5TB of real Bazel data from many different codebases and found that 0.5MB is a good trade-off between space savings and metadata overhead. Too small of a chunk size means the metadata for all chunks becomes very large / numerous, while too large of a chunk size means poor space savings.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @luluz66, I think 0.5 MB is more reasonable than 8 KB.

Do you think there is one value that would fit all, or should a size like this be allowed to be tuned in a flexible way?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @luluz66 for your experiments. Indeed, we did not evaluate storage consumption trade-offs since we were mainly interested in traffic reduction. I think, 500 KB of average chunk size is a sane default value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.5 MB was the ideal range for us based on the Bazel-specific data set we were testing against. However, there would be no telling whether that number would be different for a different client/server pair, or a different data set.

So I think we would want a discovery mechanism for the FastCDC configuration on the server side. The client should follow the server's advertised setting in order to achieve the best result. WDYT?

roloffs added a commit to roloffs/remote-apis that referenced this pull request Mar 7, 2024
Depending on the software project, possibly large binary artifacts need to be
downloaded from or uploaded to the remote CAS. Examples are executables with
debug information, comprehensive libraries, or even whole file system images.
Such artifacts generate a lot of traffic when downloaded or uploaded.

The blob split API allows to split such artifacts into chunks at the remote
side, to fetch only those parts that are locally missing, and finally to
locally assemble the requested blob from its chunks.

The blob splice API allows to split such artifacts into chunks locally, to
upload only those parts that are remotely missing, and finally to remotely
splice the requested blob from its chunks.

Since only the binary differences from the last download/upload are
fetched/uploaded, the blob split and splice API can significantly save network
traffic between server and client.
roloffs added a commit to roloffs/remote-apis that referenced this pull request Mar 11, 2024
Depending on the software project, possibly large binary artifacts need to be
downloaded from or uploaded to the remote CAS. Examples are executables with
debug information, comprehensive libraries, or even whole file system images.
Such artifacts generate a lot of traffic when downloaded or uploaded.

The blob split API allows to split such artifacts into chunks at the remote
side, to fetch only those parts that are locally missing, and finally to
locally assemble the requested blob from its chunks.

The blob splice API allows to split such artifacts into chunks locally, to
upload only those parts that are remotely missing, and finally to remotely
splice the requested blob from its chunks.

Since only the binary differences from the last download/upload are
fetched/uploaded, the blob split and splice API can significantly save network
traffic between server and client.
Depending on the software project, possibly large binary artifacts need to be
downloaded from or uploaded to the remote CAS. Examples are executables with
debug information, comprehensive libraries, or even whole file system images.
Such artifacts generate a lot of traffic when downloaded or uploaded.

The blob split API allows to split such artifacts into chunks at the remote
side, to fetch only those parts that are locally missing, and finally to
locally assemble the requested blob from its chunks.

The blob splice API allows to split such artifacts into chunks locally, to
upload only those parts that are remotely missing, and finally to remotely
splice the requested blob from its chunks.

Since only the binary differences from the last download/upload are
fetched/uploaded, the blob split and splice API can significantly save network
traffic between server and client.

// Content-defined chunking using Rabin fingerprints. An implementation of
// this scheme in presented in this paper
// https://link.springer.com/chapter/10.1007/978-1-4613-9323-8_11. The final
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This document is behind a paywall. Any chance we can link to a spec that is freely accessible?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it this paper? http://cui.unige.ch/tcs/cours/algoweb/2002/articles/art_marculescu_andrei_1.pdf

Even though that paper provides a fairly thorough mathematical definition of how Rabin fingerprints work, it's not entirely obvious to me how it translates to an actual algorithm for us to use. Any chance we can include some pseudocode, or link to a publication that provides it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, that is the paper, I can update the link. I really spend some time trying to find a nice algorithmic description for the Rabin fingerprint method, but failed. Only I could find some real implementations on GitHub, but I assume this is not something, we would like to link here. There is also the original paper of the Rabin method http://www.xmailserver.org/rabin.pdf, but that one doesn't seem to help either. The paper above gives a reasonable introduction to how Rabin fingerprints works and even some thoughts about how to implement it, so I thought, that is the best source to link here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the paper about FastCDC https://ieeexplore.ieee.org/document/9055082 also contains an algorithmic description of the Rabin fingerprinting technique (Algorithm 3. RabinCDC8KB). The only thing that is missing here is a precise description of the precomputed U and T arrays. Even though, they provide links to other papers, where these arrays are supposedly defined, I could not find any definition in these papers.

// - Maximum chunk size: 2048 KB
// The irreducible polynomial to be used for the modulo divisions is the
// following 64-bit polynomial of degree 53: 0x003DA3358B4DC173. The window
// size to be used is 64 bits.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ NOTE: I'm absolutely not an expert on content defined chunking algorithms! ]

Is a window size of 64 bits intentional? If I look at stuff like https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf, it seems they are using 48 bytes, while in their case they are aiming for min=2KB, avg=8KB, max=64KB. Shouldn't this be scaled proportionally?

To me it's also not fully clear why an algorithm like this makes a distinction between a minimum chunk size and a window size. Why wouldn't one simply pick a 128 KB window and slide over that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, that was a mistake from myside, what was meant is 64 bytes window size. I took this value from an existing implementation. Thanks for this finding!

The window size is a general parameter of a rolling hash function and the hash value or also fingerprint for a specific byte position is calculated for that window of bytes. Then, you move forward by one byte and calculate the hash value for this new window of bytes again. Thanks to the rolling-hash property, this process can be done very efficiently. So, the window size influences the hash value for a specific byte position and thus, the locations of actual chunk boundaries. Theoretically, we could use a window size of the minimum chunk size of 128 KB, but it is not common to use such a large window size in the implementations of content-defined chunking algorithms I have seen so far.


// Content-defined chunking using Rabin fingerprints. An implementation of
// this scheme in presented in this paper
// https://link.springer.com/chapter/10.1007/978-1-4613-9323-8_11. The final
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it this paper? http://cui.unige.ch/tcs/cours/algoweb/2002/articles/art_marculescu_andrei_1.pdf

Even though that paper provides a fairly thorough mathematical definition of how Rabin fingerprints work, it's not entirely obvious to me how it translates to an actual algorithm for us to use. Any chance we can include some pseudocode, or link to a publication that provides it?

// implementation of this algorithm should be configured to have the
// following properties on resulting chunk sizes.
// - Minimum chunk size: 128 KB
// - Average chunk size: 512 KB (0x000000000007FFFF bit mask)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that the idea is that you process input byte for byte, compute a fingerprint over a sliding window, and create a chunk if the last 19 bits of the hash are all zeroes or ones (which is it?), are you sure that this will give an average chunk size of 512 KB? That would only hold if there was no minimum size, right? So shouldn't the average chunk size be 128+512 = 640 KB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the 19 lowest bits of the fingerprint need to be zero for the expression (fp & mask) to become 0, in which case you found a chunk boundary. Regarding your question about the average chunk size, you are right, the actual average chunk size = expected chunk size (512 KB) + minimum chunk size (128 KB). I will state this more clearly in the comments, thanks for pointing this out!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did some measurements, and it looks like using fp & mask == 0 is actually a pretty bad choice. The reason being that it's pretty easy to craft windows for which fp == 0, namely a sequence consisting exclusively of 0-bytes. This has also been observed here:

https://ieeexplore.ieee.org/document/9006560

Although using a cut value of zero seems to be a natural choice in theory, it turns out to be a poor choice in practice (Figure 2).

I noticed this especially when creating chunks of a Linux kernel source tarball (decompressed), for the same reason as stated in the article:

It turns out that tar files use zeroes to pad out internal file structures. These padded structures cause an explosion of 4-byte (or smaller) length chunks if the cut value is also zero. In fact, over 98% of the chunks are 4-bytes long (Figure 2, (W16,P15,M16,C0) Table I).

I didn't observe the 4-byte chunks, for the reason that I used a minimum size as documented.

Results look a lot better if I use fp & mask == mask.

// https://link.springer.com/chapter/10.1007/978-1-4613-9323-8_11. The final
// implementation of this algorithm should be configured to have the
// following properties on resulting chunk sizes.
// - Minimum chunk size: 128 KB
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that there is a...

1 - ((2^19-1)/(2^19))^(128*1024) = 22.12...%

... probability that chunks actually contain a cutoff point that wasn't taken because it would violate the minimum chunk size? That probability sounds a lot higher than I'd imagine to be acceptable.

Consider the case where you inject some arbitrary data close to the beginning of a chunk that had a cutoff point within the first 128 KB that wasn't respected. If the injected data causes the cutoff point to be pushed above the 128 KB boundary, that would cause the cutoff point to be respected. This in turn could in its turn cause successive chunks to use different cutoff points as well.

Maybe it makes more sense to pick 16 KB or 32 KB here?

// following properties on resulting chunk sizes.
// - Minimum chunk size: 128 KB
// - Average chunk size: 512 KB (0x000000000007FFFF bit mask)
// - Maximum chunk size: 2048 KB
Copy link
Collaborator

@EdSchouten EdSchouten Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming my understanding of the algorithm is correct, doesn't that mean that there is a...

((2^19-1)/(2^19))^((2048-128)*1024) = 2.352...%

... probability that chunks end up reaching the maximum size? This sounds relatively high, but is arguably unavoidable.

A bit odd that these kinds of algorithms don't attempt to account for this, for example by repeatedly rerunning the algorithm with a smaller bit mask (18, 17, 16, [...] bits) until a match is found. That way you get a more even spread of data across such chunks, and need to upload fewer chunks in case data is injected into/removed from neighbouring chunks that both reach the 2 MB limit.

Comment on lines +2217 to +2219
// - Minimum chunk size: 128 KB
// - Average chunk size: 512 KB
// - Maximum chunk size: 2048 KB
Copy link
Collaborator

@EdSchouten EdSchouten Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at algorithm 2 in that paper, I see that I can indeed plug in these values to MinSize, MaxSize, and NormalSize. But what values should I use for MaskS and MaskL now? (There's also MaskA, but that seems unused by the stock algorithm.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, as far as I have understood the algorithm, it is the normalized chunking technique that they use, which allows to keep these mask values as they are, but change the min/max/average chunk sizes according to your needs.

Copy link
Collaborator

@EdSchouten EdSchouten Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at algorithm 2 in that paper that attempts to compute 8 KB chunks, I think the probability of a random chunk having size at most x should be as follows (link):

desmos-graph

Now if I change that graph to use the minimum/average/maximum chunk sizes that you propose while leaving the bitmasks unaltered, I see this (link):

desmos-graph-2

If I change it so that MaskS has 21 bits and MaskL has 17, then the graph starts to resemble the original one (link):

desmos-graph-3

So I do think the masks need to be adjusted as well.

Copy link
Collaborator

@EdSchouten EdSchouten Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Making the graphs above was meaningful. I think it gives good insight in why FastCDC was designed the way it is. They are essentially trying to mimic a normal distribution (link):

desmos-graph-5

Using a literal normal distribution would not be desirable, because it means that the probability that a chunk is created at a given point not only depends on the data within the window, but also the size of the current chunk. And this is exactly what CDC tries to prevent. So that's why they emulate it using three partial functions.

Yeah, we should make sure to set MaskS and MaskL accordingly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, thanks @EdSchouten for this great analysis. I have to admit, I did not look into these mask values in that detail, because it also was not really explained in detail in the paper, but your concerns are absolutely right and we have to set the mask values accordingly, when the min/max/average chunk sizes are changed. I am just asking myself, since the paper authors mentioned they derived the mask values empirically, how should we adapt them?

@luluz66 , @sluongng since you mentioned you were working with the FastCDC algorithm with chunk sizes of 500 KB, I am wondering how you handled the adaption of the mask values or whether you did it at all. Can you share your experience here? Thank you so much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have our fork of CDC implementation here https://github.com/buildbuddy-io/fastcdc-go/blob/47805a2ecd550cb875f1b797a47a1a648a1feed1/fastcdc.go#L162-L172

This work is very much in progress and not final. We hope that the current API will come with configurable knobs so that downstream implementation could choose what's best for their use cases and data.

Copy link
Collaborator

@EdSchouten EdSchouten Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just asking myself, since the paper authors mentioned they derived the mask values empirically, how should we adapt them?

I think that the authors of the paper just used the following heuristics:

  1. Don't use the bottom 16 bits, because those likely contain low quality data.
  2. Don't use the top 16 bits to get an equal comparison against many common Rabin fingerprinting implementations that use a 48 byte window.

(1) sounds like a good idea, but (2) can likely be ignored. So just pick 21 out of the top 48 bits, and then another 17 bits that are a subset of the former.

@EdSchouten
Copy link
Collaborator

EdSchouten commented Mar 21, 2024

Hi @roloffs,

Earlier today I spent some time experimenting with FastCDC. In addition to implementing Algorithm 2 "FastCDC8KB" from the paper, I also wrote something corresponding to the following:

Input: data buffer, src; buffer length, n
Output: chunking breakpoint bestOffset

/* The same initialization as the original algorithm. */
MinSize = ...;
MaxSize = ...;
fp = 0;
if n <= MinSize then
    return n;
if n >= MaxSize then
    n = MaxSize;

/* Compute the rolling hash at the start of the cutoff range. */
i = MinSize - 64;
for ; i < MinSize; i++; do
    fp = (fp << 1) + Gear[src[i]];

/* Test if the cutoff range contains points for which the rolling hash is higher. */
bestFP = fp
bestOffset = MinSize
for ; i < MaxSize; i++; do
    fp = (fp << 1) + Gear[src[i]];
    if bestFP < fp then
        bestFP = fp;
        bestOffset = i + 1;
return bestOffset;

The following thought has gone into this variant:

  • As we saw above, the expected chunk size of plain FastCDC8KB resembles a normal distribution. I fail to see why this should be preferred over a uniform distribution. It means that if cuts start to diverge due to differing file contents, it also takes potentially longer for the cuts to converge after the file contents become equal again. This due to the existence of a potential period/cadence in cuts.

  • With FastCDC8KB, if no suitable location can be found to make a cut, the algorithm simply returns $MaxSize$. For Rabin fingerprinting the probability of hitting $MaxSize$ tends to be relatively high, but FastCDC's use of $MaskL$ acts somewhat as a safeguard. That said, if we do reach the maximum it stops being "content defined" chunking altogether. The poor selection of the cut could potentially trickle through into subsequent chunks/cuts. By always placing the cut at the point with the highest $fp$ instead of hoping to see a point where the desired number of bits is clear, the chunking remains "content defined".

  • FastCDC8KB only uses the middle 32 bits of $fp$. This is a shame, because the top 16 bits tend to be the most valuable ones (being computed from up to 64 previous bytes of data).

If implemented naively, the algorithm above will have a worse worst-case running time. Namely, we always compute $fp$ over the full $MaxSize$ bytes of input, even though the resulting chunk may be closer to $MinSize$. This means that the worst-case running time of the algorithm above is $O(n \cdot \frac{MaxSize}{MinSize})$ instead of just $O(n)$. Fortunately, this can easily be addressed by making an array of $bestFP$ and $bestOffset$ that is $\lceil \frac{MaxSize}{MinSize} \rceil$ elements in size. This array can preserve rolling hashes, to be carried over to subsequent rounds. Using this approach I managed to end up with an implementation with an amortized worst-case running time of $O(n)$. Its throughput is nearly indistinguishable from plain FastCDC8KB, due to the hot path still being the rolling hash computation, which is unaltered.

With regards to performance of the chunking performed, I downloaded some different versions of the Linux kernel source code. Because the upstream tarballs contain timestamps, I unpacked them and concatenated all of the files contained within, which gave me some ~1.4 GB files, consisting mostly of text. I cut these files into ~10 KB chunks using both FastCDC8KB and the algorithm described above, giving me ~140k chunks per kernel. Comparing Linux 6.7.10 with Linux 6.8.1, I see that:

  • Using FastCDC8KB, 12.29% of the chunks change.
  • Using the algorithm above, only 11.44% of the chunks change.

This means that the algorithm described above performs $1 - 11.44\% / 12.29\% = 6.92\%$ better at eliminating redundancy for this specific dataset. Also using a uniform distribution in chunk size makes it possible to reduce the maximum chunk size significantly. Tuning it to fit your needs should also be easier, as you now only need to adjust $MinSize$ and $MaxSize$. $AvgSize$ and the bit masks are no longer needed.

Would you by any chance be interested in trying to reproduce these results? I'd be interested in knowing whether these savings hold in general.

@sluongng
Copy link
Contributor

@EdSchouten what do you think would be the next step here for this PR?

To me, it seems like we have established the value of using FastCDC as one of the potential chunking algorithms.
FastCDC comes with several configuration knobs that could be finetuned against the data to improve the efficiency of runtime and deduplication hits.

I think this means that we gonna need a mechanism for the server to advertise the desired FastCDC config, and the client to comply accordingly.

As for the modified-FastCDC, if we cannot expose it as an discoverable configuration knobs, then we could add a new chunking algorithm after FastCDC is merged.

WDYT?

@EdSchouten
Copy link
Collaborator

EdSchouten commented Mar 21, 2024

I have no opinion whatsoever what should happen here. As I mentioned during the last working group meeting, I have absolutely no intent to implement any of this on the Buildbarn side as part of REv2. I don't think that there is an elegant way we can get this into the existing protocol without making unreasonable sacrifices. For example, I care about data integrity. So with regards to what the next steps are, that's for others within the working group to decide.

That said, I am more than willing to engage in more discussions how we should address this as part of REv3. First and foremost, I think that the methodology that is used to chunk object should not be part of the lower level storage. Files should be Merkle trees that are stored in the CAS in literal form. What methodology is used to chunk files should only need to be specified by clients to ensure that workers chunk files in a way that is consistent with locally created files. Therefore, the policy to chunk should in REv3 most likely be stored in its equivalent of the Command message. Not in any of the capabilities.

@EdSchouten
Copy link
Collaborator

FYI: If other people want to do some testing in this area, I have just released the source code for the algorithm described above: https://github.com/buildbarn/go-cdc

@sluongng
Copy link
Contributor

Hey @roloffs ,

We had a monthly REAPI meeting this week and the maintainers have concluded that we should push this PR forward.
As folks discussed on the current state of this PR, notable blockers that need some additional work:

  • Clarification on which chunking algorithm should be included and why they should be included. In the specific case of CDC, the maintainers have asked whether we could provide a "sane default" for the set of tuning involved in setting it up.

  • A test vector should be provided for each chunking algorithm. This should help folks verify their implementations against the spec more easily.

  • Some document improvement is needed. Specifically, the maintainers suggested that this new API set should be marked as an "optional extension" of REAPI v2. The purpose is to highlight that this is a temporary solution for the large blobs problem for REAPI v2. I also suspect using the word "Experimental" here would help lower the expectations from the end users and help us merge this PR earlier. The maintainers hope that REAPI v3 will provide a more concrete solution to this problem.

With that said, please let me know if you still have the capacity to work on this PR.
If not, I could give it a try in a few weeks to see if I could help drive it to the finish line.

cc: @buchgr @EdSchouten

@roloffs
Copy link
Author

roloffs commented Jul 9, 2024

Hello @sluongng, sorry for not being responsive for a longer time, I was on parental leave from work for three months and will catch up everything during the next days. I am willing to finish this PR and also have the capacity to do this from now on. Still, if you are willing to support, it would be appreciated since I have to consider and incorporate all great comments from @EdSchouten. Today, there is a Remote Execution API Working Group Meeting, however, I won't attend since there is not much to report. I will do my best to finish everything until the next meeting in August.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants