
pkg/compression: new zstd variant zstd:chunked #1084

Merged (12 commits) on Jul 7, 2021

Conversation

@giuseppe (Member) commented Nov 19, 2020

Add a new custom variant of the zstd compression that permits retrieving each file separately.

The idea is based on CRFS and its stargz format for having seekable and indexable tarballs.


One disadvantage of the stargz format is that a custom file is added to the tarball to store the layer metadata, and the metadata file is part of the image itself. Clients that are not aware of the stargz format will propagate the metadata file inside of the containers.

The zstd compression supports embedding additional data as part of the stream that the zstd decompressor will ignore (skippable frame), so the issue above with CRFS can be solved directly within the zstd compression format.

Besides this minor advantage, zstd is much faster and compresses better than gzip, so let's take this opportunity to push the zstd format further.

The zstd compression has been supported by the OCI image spec since August 2019 (opencontainers/image-spec#788), and by containers/image since then.

Clients that are not aware of the zstd:chunked format won't notice any difference when handling a blob that uses the variant.

Since access to host files is required for the deduplication, the untar doesn't happen in a separate process running in a chroot; it happens in the original mount namespace. To mitigate possible breakages, openat2(2) is used where available.

Requires: containers/storage#775

Signed-off-by: Giuseppe Scrivano [email protected]

@giuseppe (Member, Author)

@cyphar what do you think about this idea? :-)

I've implemented (still hacky) deduplication with host files as well; e.g. pulling fedora:33 on a Fedora 33 host gives this result:

$ skopeo copy docker://docker.io/gscrivano/zstd-chunked:fedora33 containers-storage:fedora:33
Getting image source signatures
Copying blob 39e269c6c24a [========++++++++++++++++++++++++++++++] 68.4MiB / 68.4MiB (deduplicated: 54.0MiB = 78.87%)
Copying config 7985d34067 done  
Writing manifest to image destination
Storing signatures

@giuseppe force-pushed the zstd-chunked branch 3 times, most recently from 13435c2 to 9d39fae on November 25, 2020
@mtrmac (Collaborator) left a review


Just an extremely partial skim, I have read < 10% of the code, just to get a basic idea of the broad structure.

  • At a first uninformed glance, I’d prefer all of the filesystem knowledge to be in c/storage.
  • Should this wait until the c/image/storage parallel pull code (not necessarily the more complex “concurrent pull detection” part) is revived?
  • Is there a relationship to Support layer deltas #902 ? Is this strictly more powerful, or are the two independent optimizations, each with its own advantages, that can live alongside each other?

Review threads on copy/copy.go, docker/errors.go, storage/chunked_zstd.go, types/types.go, and storage/storage_image.go (all resolved).
@mtrmac (Collaborator) commented Nov 25, 2020

Oh, and: how does digest verification happen with this, or what replaces it to obtain equivalent security?

@giuseppe (Member, Author)

  • Is there a relationship to Support layer deltas #902 ? Is this strictly more powerful, or are the two independent optimizations, each with its own advantages, that can live alongside each other?

They are independent. #902 requires deltas to be generated, and those are useful when e.g. pulling an updated image. This PR instead improves pulling of a generic layer, deduplicating against all the files in the storage.

Another advantage of this PR is that it performs local deduplication as well. If a file is present locally, it is deduplicated with reflinks when the underlying file system supports them.

The two stages "prepare staging directory" & "commit from staging directory" break the c/storage abstraction that everything is a tarball and can be handled transparently by different drivers (in that case the PutBlob() logic is still fine), but I think it is an improvement when using overlay: the Commit() is basically an os.Rename() and can easily be done in parallel.

@giuseppe (Member, Author)

  • Should this wait until the c/image/storage parallel pull code (not necessarily the more complex “concurrent pull detection” part) is revived?

I think it will simplify the "parallel pull", as each "commit to staging directory" can be done separately without any interference among the different layers. The Commit() can still be left sequential; being a very cheap operation, that won't be a problem. @vrothberg What do you think?

@giuseppe (Member, Author) commented Nov 26, 2020

Oh, and: how does digest verification happen with this, or what replaces it to obtain equivalent security?

This part is still not solved, and it's where I'll need your help :-) Do you think it is reasonable to include signatures as part of the metadata (it can be a separate chunk)? Only the manifest must be signed.

Since the manifest has the checksum for each file in the layer, we can validate each file individually.

@vrothberg (Member)

I think it will simplify the "parallel pull" as each "commit to staging directory" can be done separately without any interference among the different layers. The Commit() can still be left sequential, but being a very cheap operation, that won't be a problem. @vrothberg What do you think?

I only briefly skimmed this PR and do not yet understand how the two PRs relate to one another. In case an optimized commit would facilitate this work, let me know; we can prioritize it. If others want to tackle it earlier, I am more than happy :)

At the moment, I channel all my mental resources on getting the podman-cp rewrite done but I hope to have it finished by Monday. Thanksgiving buys me uninterrupted time in the EU :^)

@giuseppe force-pushed the zstd-chunked branch 2 times, most recently from eb4d32b to b7bcbf6 on November 26, 2020
@mtrmac (Collaborator) commented Nov 26, 2020

Another advantage of this PR is that it performs local deduplication as well. If a file is present locally, it is deduplicated with reflinks when the underlying file system supports them.

BTW if this is letting the creator of the image cause accesses to files on the host filesystem, this worries me a lot. It very likely allows a network observer to determine whether files exist at attacker-directed paths on the host filesystem, and in the extreme (but probably fixable) cases it risks hardware operations (IIRC some files in /dev cause hardware to do things (rewind tapes?) just on open(2)).

@mtrmac (Collaborator) commented Nov 26, 2020

Oh, and: how does digest verification happen with this, or what replaces it to obtain equivalent security?

this part is still not solved and is where I'll need your help :-) Do you think it is reasonable to include signatures as part of the metadata (it can be a separate chunk)? Only the manifest must be signed.

Since the manifest has the checksum for each file in the layer, we can validate each file individually.

There needs to be some kind of “chain of authentication” going from the manifest (which is validated against a signature or user-specified digest) all the way to individual on-disk files. I really haven’t read the code; maybe that chain of authentication could be an annotation on the layer in the manifest (which would restrict this feature to manifest schemas that have annotations, but this is probably already OCI-only), with that annotation containing some kind of hash of the “top-level” metadata item in the layer (is there even such a thing?), which further authenticates the individual files / layer chunks, or some groups of individual files / layer chunks.

@giuseppe force-pushed the zstd-chunked branch 3 times, most recently from 393baa8 to 2efaee5 on November 26, 2020
@giuseppe (Member, Author) commented Jul 1, 2021

Fixed in the last version.

add a new custom variant of the zstd compression that permits
retrieving each file separately.

The idea is based on CRFS and its stargz format for having seekable
and indexable tarballs.

One disadvantage of the stargz format is that a custom file is added
to the tarball to store the layer metadata, and the metadata file is
part of the image itself.  Clients that are not aware of the stargz
format will propagate the metadata file inside of the containers.

The zstd compression supports embedding additional data as part of
the stream that the zstd decompressor will ignore (skippable frame),
so the issue above with CRFS can be solved directly within the zstd
compression format.

Besides this minor advantage, zstd is much faster and compresses better
than gzip, so take this opportunity to push the zstd format further.

The zstd compression has been supported by the OCI image spec since
August 2019 (opencontainers/image-spec#788), and
by containers/image since then.

Clients that are not aware of the zstd:chunked format won't notice
any difference when handling a blob that uses the variant.

Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe (Member, Author) commented Jul 2, 2021

Rebased again. @vrothberg, are you OK with the last version of the progress bar?

@vrothberg (Member)

@giuseppe, I am cool to merge as is (see below), thanks! The race is something I wanted to tackle for a long time but never found/took the time until now.

skopeo (main) $ ./bin/skopeo copy docker://docker.io/gscrivano/zstd-chunked:fedora3 containers-storage:fedora-chunked
INFO[0000] Not using native diff for overlay, this may cause degraded performance for building images: opaque flag erroneously copied up, consider update to kernel 4.8 or later to fix
Getting image source signatures
Copying blob 8f75b7c47cae [======================================] 82.0MiB / 82.2MiB (skipped: 0.0b = 0.00%)
Copying config 0a485308c4 done
Writing manifest to image destination
Storing signatures
skopeo (main) $ ./bin/skopeo copy docker://docker.io/gscrivano/zstd-chunked:fedora3 containers-storage:fedora-chunked
INFO[0000] Not using native diff for overlay, this may cause degraded performance for building images: opaque flag erroneously copied up, consider update to kernel 4.8 or later to fix
Getting image source signatures
Copying blob 8f75b7c47cae [--------------------------------------] 0.0b / 0.0b
Copying config 0a485308c4 done
Writing manifest to image destination
Storing signatures

@giuseppe (Member, Author) commented Jul 3, 2021

So anything more left before we can finally merge this? :)

@giuseppe (Member, Author) commented Jul 3, 2021

@rhatdan are you fine with it?

@rhatdan (Member) commented Jul 7, 2021

LGTM

@rhatdan rhatdan merged commit eeb9819 into containers:main Jul 7, 2021
return err
}

// FIXME: what to do with the uncompressed digest?
Collaborator

@giuseppe Why is this (and the related s.blobDiffIDs[blobDigest] = blobDigest assignment), using the compressed values as the supposedly-unique uncompressed ones, correct and safe?

6 participants