Fix estargz compression loses the original tar metadata #2352
Conversation
@@ -91,26 +105,14 @@ func getConverter(desc ocispecs.Descriptor, compressionType compression.Type) (c

 type conversion struct {
 	target     compression.Type
-	decompress func(io.Reader) (cdcompression.DecompressReadCloser, error)
+	decompress func(context.Context, ocispecs.Descriptor) (io.ReadCloser, error)
Change it to `ReaderAt` or `ReadSeeker` instead. We should be able to define conversions for all supported compressions as a single array and then unit test them with random (tarball) data, making sure that compression is lossless.
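A minimal sketch of what such a table-driven lossless test could look like. The `conversionPair` type, `roundTrip` helper, and the table itself are hypothetical names for illustration; only stdlib gzip is wired up, where the real table would also hold estargz, zstd, etc.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/rand"
	"fmt"
	"io"
)

// conversionPair is a hypothetical shape for the suggested array: each
// supported compression gets a compress/decompress pair, and one test
// feeds every pair random bytes to verify the round trip is lossless.
type conversionPair struct {
	name       string
	compress   func(io.Writer) io.WriteCloser
	decompress func(io.Reader) (io.ReadCloser, error)
}

var pairs = []conversionPair{
	{
		name:       "gzip",
		compress:   func(w io.Writer) io.WriteCloser { return gzip.NewWriter(w) },
		decompress: func(r io.Reader) (io.ReadCloser, error) { return gzip.NewReader(r) },
	},
	// estargz, zstd, uncompressed, ... would be added here.
}

// roundTrip compresses then decompresses data with the given pair and
// reports whether the original bytes came back unchanged.
func roundTrip(p conversionPair, data []byte) bool {
	var buf bytes.Buffer
	w := p.compress(&buf)
	if _, err := w.Write(data); err != nil {
		return false
	}
	if err := w.Close(); err != nil {
		return false
	}
	r, err := p.decompress(&buf)
	if err != nil {
		return false
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	return err == nil && bytes.Equal(out, data)
}

func main() {
	data := make([]byte, 1<<16) // random bytes standing in for a tarball
	rand.Read(data)
	for _, p := range pairs {
		fmt.Printf("%s lossless: %v\n", p.name, roundTrip(p, data))
	}
}
```

A real test would generate random tar archives rather than raw random bytes, but the structure of the loop is the same.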
was there an issue with this?
> change it to ReaderAt or ReadSeeker instead.

IIUC `cdcompression.DecompressReadCloser` is not seekable. So if we want to make it seekable, we need to change the current `decompress` implementation to store the decompressed blob somewhere seekable (e.g. the content store) and then return the ReadSeeker. Does this change SGTY?
Force-pushed from 43b45b4 to d90d0fa
Fixed to use the lossless compressor/decompressor of estargz (ktock/stargz-snapshotter@f5999ba).
You don't seem to be using the tar-split parser there for the encode phase. So are you sure it works? It might work fine if the initial tarball was created by the (same version of) Go, as the encoding in estargz and the original tar would match. But try it with a tarball from GNU tar, for example. Tar allows padding etc., so there are many ways the same files could end up encoded into a tarball.
It just streams the original tar into the destination writer. It's tested against three formats of tar (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/testutil.go#L621), but that uses the Go library, so I'll try with tools outside of Go.
Interesting. It looks like the gzip writer is switched in the parser function https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L804 but the actual write happens in the …
Fixed to use tar-split (https://github.com/ktock/stargz-snapshotter/blob/aac1cee508136d135d55d9f51629bf9b493e048f/estargz/estargz.go#L817). |
Force-pushed from 04e0307 to ce62cbb
Force-pushed from ce62cbb to 5ee07ef
cache/estargz.go (Outdated)

		return false
	}
	defer r.Close()
	_, _, err = estargz.OpenFooter(io.NewSectionReader(r, 0, r.Size()))
Could we leave a label on the blob after this has been determined for the first time, and then avoid opening the blob after that?
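A minimal in-memory sketch of that memoization, assuming hypothetical names (`estargzCache`, `isEstargz`); BuildKit would persist the result as a label on the blob in the content store rather than in a map, and the `open` callback stands in for the real `estargz.OpenFooter` check:

```go
package main

import (
	"fmt"
	"sync"
)

// estargzCache remembers the result of the (expensive) footer check per
// blob digest so each blob is only opened once.
type estargzCache struct {
	mu    sync.Mutex
	known map[string]bool // digest -> is-estargz
	opens int             // counts how often the blob was actually opened
}

func (c *estargzCache) isEstargz(digest string, open func() bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.known[digest]; ok {
		return v // cached: the blob is never reopened
	}
	c.opens++
	v := open()
	c.known[digest] = v
	return v
}

func main() {
	c := &estargzCache{known: map[string]bool{}}
	open := func() bool { return true } // pretend the footer check succeeded
	c.isEstargz("sha256:abc", open)
	c.isEstargz("sha256:abc", open)
	fmt.Println("blob opened", c.opens, "time(s)") // prints "blob opened 1 time(s)"
}
```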
cache/manager_test.go (Outdated)

	require.NoError(t, err, testName)
}

// Check the uncompresed digest is the same as the original
uncompressed
// Note that we don't support eStragz compression for tar that contains a file named
// `stargz.index.json` because we cannot create eStargz in loseless way for such blob
// (we must overwrite stargz.index.json file).
if err := w.AppendTarLossLess(pr); err != nil {
Bit worried that other tools can still create unsupported tars. Can that other API be removed? What is the behavior when such tarballs are pulled? Maybe they should fail the isEstargz check?
The cases where an unsupported tar (i.e. one that contains `stargz.index.json`) reaches here are the following:

- invalid eStargz: the source blob is a broken (or malformed) eStargz that contains `stargz.index.json` but has an invalid footer. In this case, the blob is recognized as a normal gzip blob and `decompressEStargz` is not applied.
- name conflict: the source blob is a non-eStargz blob (tar/gzip/zstd) but contains a `stargz.index.json` that is not a TOC.

(If the source blob is an eStargz, the decompressor (`decompressEStargz`) removes the TOC entry before appending, so `AppendTarLossLess` won't return the error.)

I think these cases are rare, thus it's reasonable to return an error here.
Yes, I wasn't thinking of the error case, just that `AppendTarLossLess` is an additional API and `AppendTar` still remains. So other tools can create estargz that does not decompress properly. Can we make this a requirement for the stargz lib, and what is the behavior when we pull such a blob that was not created with a lossless encoder?
> other tools can create estargz that does not decompress properly.

If the client converts a plain-gzip image A to eStargz B on their side with tools other than BuildKit, B doesn't hold the original bits, so it's treated as a different image than A when pulled into BuildKit. When BuildKit converts B into a tar image, the produced tar is different from what is contained in A.

> what is the behavior when we pull such blob that was not created with a lossless encoder.

Could you elaborate on the problem with the conversion happening outside of BuildKit? Tools other than BuildKit mostly need to change the tar bits to enable optimization, which sorts tar entries in priority order and adds some landmark dummy files. I think the BuildKit cache should treat such blobs as different from the original.
Is there any reason other tools couldn't also just use the lossless method? Why can't this be the default? And is it still correct that isEstargz detects lossy estargz as estargz, or should it treat such a blob as regular gzip and try to recompress it with the lossless method?
go.mod (Outdated)

	gotest.tools/v3 v3.0.3 // indirect
)

require github.com/vbatts/tar-split v0.11.2 // indirect
this looks to be outside the block
go.mod (Outdated)

	github.com/golang/protobuf v1.5.2
	// snappy: updated for go1.17 support
Is this deliberate?
`make vendor` automatically does this. Maybe we need a `replace` directive if we want to lock the snappy version.
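For reference, a `replace` directive along these lines would pin the module so dependency resolution cannot float it (the version shown here is illustrative, not taken from this PR):

```
// go.mod (sketch): pin snappy regardless of what `make vendor` resolves
replace github.com/golang/snappy => github.com/golang/snappy v0.0.3
```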
hmm. What's the difference?
I believe this will be fixed by #2348.
Force-pushed from 971e364 to cb61cda
Overall LGTM but I think we need to wait for #2348 to sort out the go.mod issues and rebase then.
@ktock needs rebase
Force-pushed from cb61cda to 63576c2
Currently, eStargz compression doesn't preserve the original tar metadata (header bytes and their order). This causes failure of `TestGetRemote` because an uncompressed blob converted from a gzip blob yields a different digest than the one converted from an eStargz blob, even when their original tars (computed by the differ) are the same. This commit solves this issue by fixing eStargz to preserve the original tar metadata that eStargz would otherwise modify. Signed-off-by: Kohei Tokunaga <[email protected]>
Force-pushed from 63576c2 to da821a4
@AkihiroSuda sgty?
Currently, eStargz compression doesn't preserve the original tar metadata (header bytes and their order).

This causes failure of `TestGetRemote` because an uncompressed blob converted from a gzip blob yields a different digest than the one converted from an eStargz blob, even when their original tars (computed by the differ) are the same.

This commit solves this issue by fixing the eStargz compressor to preserve the original tar metadata that eStargz would otherwise modify.

The metadata is saved on the content store and used for decompression when the eStargz is converted into another compression type. (EDIT: fixed not to use the content store)