Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix estargz compression loses the original tar metadata #2352

Merged
merged 1 commit into from
Sep 24, 2021

Conversation

ktock
Copy link
Collaborator

@ktock ktock commented Sep 8, 2021

Currently, eStargz compression doesn't preserve the original tar metadata (header bytes and their order).
This causes failure of TestGetRemote because an uncompressed blob converted from a gzip blob provides different digset
against the one converted from eStargz blob even if their original tar (computed by differ) are the same.

This commit solves this issue by fixing eStargz compressor to preserve original tar's metadata that is modified by eStargz. The metadata is saved on the content store and used for decompression when the eStargz is converted into another compressionType. (EDIT: fixed not to use content store)

@@ -91,26 +105,14 @@ func getConverter(desc ocispecs.Descriptor, compressionType compression.Type) (c

type conversion struct {
target compression.Type
decompress func(io.Reader) (cdcompression.DecompressReadCloser, error)
decompress func(context.Context, ocispecs.Descriptor) (io.ReadCloser, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change it to ReaderAt or ReadSeeker instead. We should be able to define conversions for all supported compressions as a single array and then unit test them with random (tarball) data making sure that compression is lossless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was there an issue with this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change it to ReaderAt or ReadSeeker instead.

IIUC cdcompression.DecompressReadCloser is not seekable. So if we want to make it seekable we need to change the current decompress implementation to store the decompressed blob to somewhere seekable (e.g. content store) then return the readseeker. Does this change SGTY?

cache/estargz.go Outdated Show resolved Hide resolved
@ktock ktock force-pushed the esgzcvt-preserve-tar branch from 43b45b4 to d90d0fa Compare September 8, 2021 15:05
@ktock
Copy link
Collaborator Author

ktock commented Sep 8, 2021

Fixed to use lossless compressor/decompressor of estargz (ktock/stargz-snapshotter@f5999ba).

@tonistiigi
Copy link
Member

Fixed to use lossless compressor/decompressor of estargz

You don't seem to be using tar-split parser there for encode phase. Therefore are you sure it works? It might work fine if the initial tarball was created by (same version of) go as the encoding in estargz and original tar matches. But try it with a tarball from GNU tar for example. Tar allows padding etc so there are many ways how the same files could end up encoded into a tarball.

@ktock
Copy link
Collaborator Author

ktock commented Sep 8, 2021

@tonistiigi

You don't seem to be using tar-split parser there for encode phase. Therefore are you sure it works?

It just streams the original tar into the destination writer using io.TeeReader with adding gzip headers + compressing as gzip for each file (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L762) so it doesn't encode the tar headers anymore.

It might work fine if the initial tarball was created by (same version of) go as the encoding in estargz and original tar matches. But try it with a tarball from GNU tar for example. Tar allows padding etc so there are many ways how the same files could end up encoded into a tarball.

It's tested against three formats of tar (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/testutil.go#L621) but it uses go library so I'll try with tools outside golang.

@tonistiigi
Copy link
Member

It just streams the original tar into the destination writer using io.TeeReader with adding gzip headers + compressing as gzip for each file (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L762) so it doesn't encode the tar headers anymore.

Interesting. It looks like the gzip writer is switched in the parser function https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L804 but the actual write happens in the tee. I don't think this is safe. The tar parser may read more than the header, writing it to the gzip writer, then return some of it, leaving the rest to the internal buffer. The intention should be to gzip all records precisely with individual gzip writer, but atm I don't think the tar block and gzip blocks match.

@ktock
Copy link
Collaborator Author

ktock commented Sep 8, 2021

I don't think this is safe. The tar parser may read more than the header, writing it to the gzip writer, then return some of it, leaving the rest to the internal buffer. The intention should be to gzip all records precisely with individual gzip writer, but atm I don't think the tar block and gzip blocks match.

Fixed to use tar-split (https://github.com/ktock/stargz-snapshotter/blob/aac1cee508136d135d55d9f51629bf9b493e048f/estargz/estargz.go#L817).

@ktock ktock force-pushed the esgzcvt-preserve-tar branch from ce62cbb to 5ee07ef Compare September 10, 2021 10:51
@ktock ktock marked this pull request as ready for review September 10, 2021 11:29
@ktock ktock requested a review from tonistiigi September 13, 2021 01:51
cache/estargz.go Outdated
return false
}
defer r.Close()
_, _, err = estargz.OpenFooter(io.NewSectionReader(r, 0, r.Size()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we leave a label on the blob after this has been determined for the first time and then avoid opening blob after that

require.NoError(t, err, testName)
}

// Check the uncompresed digest is the same as the original
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uncompressed

// Note that we don't support eStragz compression for tar that contains a file named
// `stargz.index.json` because we cannot create eStargz in loseless way for such blob
// (we must overwrite stargz.index.json file).
if err := w.AppendTarLossLess(pr); err != nil {
Copy link
Member

@tonistiigi tonistiigi Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit worried that other tools can still create unsupported tars. Can that other API be removed? What is the behavior when such tarballs are pulled? Maybe they should fail the isEstargz check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cases where the unsupported tar (= contains stargz.index.json) reach here will be the following

  • invalid eStargz: the source blob is a broken (or malformed) eStargz that contains stargz.index.json but has an invalid footer. In this case, the blob is recognized as a normal gzip blob and decompressEStargz is not applied.
  • name conflict: the source blob is a non-eStargz blob (tar/gzip/zstd) but contains stargz.index.json which is not TOC.

(If the source blob is an eStargz, the decompressor (decompressEStargz) removes the TOC entry before appending so AppendTarLossLess won't return the error.)

I think these cases are rare thus it's reasonable to return an error here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't think the error case but just that AppendTarLossLess is an additional API and AppendTar still remains. So other tools can create estargz that does not decompress properly. Can we make this requirement for stargz lib, and what is the behavior when we pull such blob that was not created with a lossless encoder.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tonistiigi

other tools can create estargz that does not decompress properly.

If the client converts a plain-gzip image A to eStargz B on their side with tools other than BuildKit, B doesn't hold the original bits so it's treated as a different image than A when pulled to BuildKit. When BuildKit converts B into a tar image, the produced tar is different from what is contained in A.

what is the behavior when we pull such blob that was not created with a lossless encoder.

Could you elaborate on the problem of the conversion happening outside of BuildKit? When it comes to tools other than BuildKit, they mostly need to change tar bits for enabling optimization which sorts tar entries in priority order and adds some landmark dummy files. I think BuildKit cache should treat such blobs as different one from the original.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason other tools couldn't just also use the lossless method. Why can't this be default?

Is it still correct that isEstargz detects lossy estargz as estargz or should it use it as regular gzip and try to recompress with lossless method?

go.mod Outdated
gotest.tools/v3 v3.0.3 // indirect
)

require github.com/vbatts/tar-split v0.11.2 // indirect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks to be outside the block

go.mod Outdated
github.com/golang/protobuf v1.5.2
// snappy: updated for go1.17 support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this deliberate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make vendor automatically do this. Maybe we need a replace if we want to lock the snappy version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. What's the difference?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will be fixed by #2348.

@@ -91,26 +105,14 @@ func getConverter(desc ocispecs.Descriptor, compressionType compression.Type) (c

type conversion struct {
target compression.Type
decompress func(io.Reader) (cdcompression.DecompressReadCloser, error)
decompress func(context.Context, ocispecs.Descriptor) (io.ReadCloser, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was there an issue with this?

@ktock ktock force-pushed the esgzcvt-preserve-tar branch 2 times, most recently from 971e364 to cb61cda Compare September 14, 2021 07:38
Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM but I think we need to wait for #2348 to sort out the go.mod issues and rebase then.

@tonistiigi
Copy link
Member

@ktock needs rebase

@ktock ktock force-pushed the esgzcvt-preserve-tar branch from cb61cda to 63576c2 Compare September 22, 2021 01:42
Currently, eStargz compression doesn't preserve the original tar metadata
(header bytes and their order). This causes failure of `TestGetRemote` because
an uncompressed blob converted from a gzip blob provides different digset
against the one converted from eStargz blob even if their original tar (computed
by differ) are the same.
This commit solves this issue by fixing eStargz to preserve original tar's
metadata that is modified by eStargz.

Signed-off-by: Kohei Tokunaga <[email protected]>
@ktock ktock force-pushed the esgzcvt-preserve-tar branch from 63576c2 to da821a4 Compare September 22, 2021 01:51
@tonistiigi
Copy link
Member

@AkihiroSuda sgty?

@crazy-max crazy-max added this to the v0.10.0 milestone Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants