-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chunked: store cache as binary and use a bloom filter #1870
chunked: store cache as binary and use a bloom filter #1870
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: giuseppe The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
f4f29dd
to
ba6657b
Compare
@kolyshkin @mtrmac @rhatdan some more improvements to the cache file |
fb5a6e7
to
de1dff9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for splitting this into smaller commits. That made the review very enjoyable.
@@ -154,6 +162,19 @@ fallback: | |||
return buf, nil, err | |||
} | |||
|
|||
func getBinaryDigest(stringDigest string) ([]byte, error) { | |||
d, err := digest.Parse(stringDigest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider having this function accept a digest.Digest
instead, and pushing the digest.Parse
to callers; on various code paths that compute the digest, that can avoid a digest.Validate
(regex evaluation).
OTOH then the other code paths must call Parse
or Validate
if dealing with untrusted data.
pkg/chunked/cache_linux.go
Outdated
if err != nil { | ||
return nil, err | ||
} | ||
digest := append([]byte(d.Algorithm()+":"), digestBytes...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(More space could be saved by using a single byte to index into a table of algorithm names, or something like that, at the cost of even more complexity. Possibly not worth it, and definitely not blocking this PR.)
bloomFilter: bloomFilter, | ||
digestLen: int(digestLen), | ||
fnames: fnames, | ||
fnamesLen: int(fnamesLen), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Nit: I’d aesthetically prefer if the type declaration, and the two constructors, were all listing the fields in the same order — and if that order were somehow consistent. Here we have (data, len) for file names, but (len, data) for tags, for example.)
pkg/chunked/cache_linux_test.go
Outdated
@@ -164,7 +167,8 @@ func TestWriteCache(t *testing.T) { | |||
if digest != r.ChunkDigest { | |||
t.Error("wrong digest found") | |||
} | |||
expectedLocation := generateFileLocation(r.Name, uint64(r.ChunkOffset), uint64(r.ChunkSize)) | |||
expectedLocation, err := generateFileLocation(0, uint64(r.ChunkOffset), uint64(r.ChunkSize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: Testing this with at least two distinct paths, to truly exercise the indexing logic, would be nice.
de1dff9
to
9f2b6a3
Compare
the PR is ready for review |
@mtrmac needs another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m sorry about the delayed response.
9f2b6a3
to
d986219
Compare
I've fixed your comments, except #1870 (comment). What would you like me to do here? |
// are stored. $DIGEST has length digestLen stored in the cache file file header. | ||
func generateTag(digest string, offset, len uint64) string { | ||
return fmt.Sprintf("%s%.20d@%.20d", digest, offset, len) | ||
func appendTag(digest []byte, offset, len uint64) ([]byte, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Absolutely non-blocking: the error value is always nil
now.)
https://github.com/containers/storage/pull/1870/files#r1557856169 , or perhaps I’m missing something. |
@giuseppe This is waiting on you now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some very minor comments. Overall it looks like a good case for protobuf, or is that an overkill?
It looks like appendBinaryDigest
's first argument is always []byte{}
. First, it could be nil
, second, if it's not ever used maybe drop it (and rename the function to getBinaryDigest
).
pkg/chunked/cache_linux.go
Outdated
var vdata bytes.Buffer | ||
var tagsBuffer bytes.Buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
var vdata, tagsBuffer bytes.Buffer
return "", 0, 0 | ||
} | ||
|
||
nElements := len(cacheFile.tags) / cacheFile.tagLen | ||
|
||
i := sort.Search(nElements, func(i int) bool { | ||
d := byteSliceAsString(cacheFile.tags[i*cacheFile.tagLen : i*cacheFile.tagLen+cacheFile.digestLen]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: with this change, func byteSliceAsString
(defined above) can also be removed.
pkg/chunked/cache_linux.go
Outdated
func generateTag(digest string, offset, len uint64) string { | ||
return fmt.Sprintf("%s%.20d@%.20d", digest, offset, len) | ||
func generateTag(digest []byte, offset, len uint64) []byte { | ||
tag := append(digest[:], []byte(fmt.Sprintf("%.20d@%.20d", offset, len))...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain the need for [:]
after digest
here? I am staring at it and can't see why it's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, []byte()
conversion is redundant here, in Go you can append a string... to a []byte
slice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is replaced by a next commit. I'll squash the two patches so it will just disappear
sort.Slice(tags, func(i, j int) bool { | ||
return bytes.Compare(tags[i], tags[j]) == -1 | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easier to use slices.SortFunc(tags, bytes.Compare)
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had to revert this change as we are using go 1.20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW we are now agreed on updating to Go 1.21 (around containers/skopeo#2297 ). That would probably be a separate PR.)
pkg/chunked/cache_linux.go
Outdated
for _, k := range toc { | ||
if k.Digest != "" { | ||
digest, err := appendBinaryDigest([]byte{}, k.Digest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can use nil
instead of []byte{}
here -- appending to nil slice is fine.
pkg/chunked/bloom_filter_test.go
Outdated
var tagsBuffer bytes.Buffer | ||
var vdata bytes.Buffer | ||
var fnames bytes.Buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
var tagsBuffer, vdata, fnames bytes.Buffer
(unless you don't like that style; to me it's less repetition)
pkg/chunked/cache_linux.go
Outdated
nameBytes := []byte(name) | ||
|
||
if err := binary.Write(&fnames, binary.LittleEndian, uint32(len(nameBytes))); err != nil { | ||
return 0, err | ||
} | ||
if _, err := fnames.Write(nameBytes); err != nil { | ||
return 0, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid []byte
conversion here, using name
directly (as in len(name)
and fnames.WriteString(name)
)
Signed-off-by: Giuseppe Scrivano <[email protected]>
d986219
to
0820a2a
Compare
thanks @mtrmac and @kolyshkin. I've addressed your last comments and pushed a new version |
0820a2a
to
6d8fc8e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise.
use the binary representation for a given digest, it helps reducing the file size by ~25%. Signed-off-by: Giuseppe Scrivano <[email protected]>
it helps reducing the cache file size by ~25%. Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
use a bloom filter to speed up lookup of digests in a cache file. The biggest advantage is that it reduces page faults with the mmap'ed cache file. Signed-off-by: Giuseppe Scrivano <[email protected]>
so that the same file path is stored only once in the cache file. After this change, the cache file measured on the fedora:{38,39,40} images is in average ~6% smaller. Signed-off-by: Giuseppe Scrivano <[email protected]>
it reduces the cache file size by ~3%. Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
6d8fc8e
to
065a2f3
Compare
/lgtm Thanks! |
The bloom filter itself is useful to reduce page faults with the mmap'ed cache files, as it reduces lookups.
Storing the file as a binary instead reduces the file size considerably, with the
quay.io/giuseppe/zstd-chunked:fedora-{38,39,40}{,-updated}
images I see:before:
after:
so it is ~50% size reduction