Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redunant definition of layers #115

Closed
s-urbaniak opened this issue Jun 2, 2016 · 9 comments
Closed

Redunant definition of layers #115

s-urbaniak opened this issue Jun 2, 2016 · 9 comments

Comments

@s-urbaniak
Copy link
Collaborator

We have two definitions of layers:

  1. https://github.com/opencontainers/image-spec/blob/master/manifest.md#image-manifest-property-descriptions
  2. Layer DiffIDs specified in https://github.com/opencontainers/image-spec/blob/master/serialization.md#terminology

Which list should we consider for unpacking/oci-runtime-bundling?

@philips
Copy link
Contributor

philips commented Jun 2, 2016

I would assume 1. is the correct one and that 2. is deprecated

@vbatts @stevvooe seems like we should drop 2, right?

@stevvooe
Copy link
Contributor

stevvooe commented Jun 2, 2016

@philips I'm not sure saying we have redundant definitions of layers is correct. We have two different identifiers depending on the context. With the manifest, we have a simple, byte-stream digest. Within image config, DiffID was added to protect against hash instability introduced during compression. DiffID is used as part of the ChainID to ensure correct assembly. It is necessary if implementations prefer reassembly to reproduce artifacts over saving them in a cache.

We can look at these as two components, numbered according to the above:

  1. This is the generic content description. Collecting the content requires no knowledge about whats inside. After processing and verifying, the result should be all the elements required to assemble the image, without having to understand how to actually assemble it.
  2. This is the application component. At this stage, content is verified but image assembly may need to further process the content to correctly identify it. In this case, the raw tar stream, without compression, is hashed.

The benefit of this distinction is that images can be fetched without having to understand anything about a container runtime or actual image format. It also simplifies content storage, in that we no longer need to couple storage with the content format.

Most of this can be avoided if we opt to store artifacts, rather than try to reassemble. DiffID can be deprecated, if we converge it with content digest, but it imposes an architecture on implementations that may not be ideal for disk space usage.

@wking
Copy link
Contributor

wking commented Jun 2, 2016

On Thu, Jun 02, 2016 at 12:59:05PM -0700, Stephen Day wrote:

Within image config, DiffID was added to protect against hash
instability introduced during compression.

I haven't wrapped my head around DiffID, so this may be off target.
But isn't hash-instability a reconstructor issue? And DiffID has to
be populated (and signed) by the original image author? How does
“reconstruction-instability protection” work in a verified, Merkle-DAG
context?

Most of this can be avoided if we opt to store artifacts, rather
than try to reassemble. DiffID can be deprecated, if we converge
it with content digest, but it imposes an architecture on
implementations that may not be ideal for disk space usage.

This sounds like the best approach to me. Disk space is cheap and
reconstruction is difficult. How many unpacked bundles do you expect
to have lying around at once? And as a later improvement, we can
increase blob-storage efficiency by adjusting the Merkle
representation to use smaller, more-likely-to-be-shared objects
(asymptotically approaching IPFS and it's sub-file, potentially
Rabin-fingerprinted blobs ;).

@stevvooe
Copy link
Contributor

stevvooe commented Jun 2, 2016

I haven't wrapped my head around DiffID, so this may be off target.
But isn't hash-instability a reconstructor issue?

Ideally, but with compressed layers, there is little that can be done to ensure this. Including the DiffID just ensures that internally these IDs remain as stable as possible. This design was done to secure image identification in a manner that was least disruptive to users.

This sounds like the best approach to me. Disk space is cheap and
reconstruction is difficult.

👃 👈

I'd love to resolve this in OCI, but doing this in the 1.0 time frame is unrealistic. We need to be careful not to impose an artifact store on implementations, as there are cases where that is untenable or impractical.

@wking
Copy link
Contributor

wking commented Jun 2, 2016

On Thu, Jun 02, 2016 at 01:29:52PM -0700, Stephen Day wrote:

This sounds like the best approach to me. Disk space is cheap and
reconstruction is difficult.

I'd love to resolve this in OCI, but doing this in the 1.0 time
frame is unrealistic. We need to be careful not to impose an
artifact store on implementations, as there are cases where that is
untenable or impractical.

If we drop DiffID in favor of descriptors for layers, the only
downside seems to be potentially inefficient blobs (e.g. recompressing
tar stream 12 to blob 56, when you could have recycled an existing
blob 34 compression of tar stream 12). Background discussion starting
at 1. So if we go with descriptors, I'd recommend publishers used a
blob store to avoid compression instability, but I don't think we'd
need to require it. If your publishers are pushing many copies of the
same tarball with different hashes due to recompression, and you have
a problem with that, then you need to incentivize your publishers to
be more efficient (e.g. by charging them for adding new blobs to your
registry).

I'm not clear on the security angle (although I have a guess 2).

@philips
Copy link
Contributor

philips commented Jun 3, 2016

@stevvooe So, if I were to add a clarifying point to help out @s-urbaniak and other people new to the spec the config diffID is the ungzipped hash? While the layers digest is gzipd?

@stevvooe
Copy link
Contributor

stevvooe commented Jun 6, 2016

@philips That is a fair differentiation but not binding.

The digest is the hash of the unprocessed content. It can be verified without understand tar or layers or images. The diffID must be the ungzipped hash.

@philips
Copy link
Contributor

philips commented Jun 7, 2016

@stevvooe what do you mean by non-binding?

I will try and get some language together to close out this bug.

@stevvooe
Copy link
Contributor

stevvooe commented Jun 7, 2016

This is what I meant:

The digest is the hash of the unprocessed content. It can be verified without understand tar or layers or images. The diffID must be the ungzipped hash.

A digest reference can be verified without understanding the content, a diffID cannot.

philips pushed a commit to philips/image-spec that referenced this issue Jun 15, 2016
DiffIDs and Manifest list digests were a bit confusing. Explain the
difference.

Fixes: opencontainers#115
philips pushed a commit to philips/image-spec that referenced this issue Jun 15, 2016
DiffIDs and Manifest list digests were a bit confusing. Explain the
difference.

Fixes: opencontainers#115
Signed-off-by: Brandon Philips <[email protected]>
philips pushed a commit to philips/image-spec that referenced this issue Jun 15, 2016
DiffIDs and Manifest list digests were a bit confusing. Explain the
difference.

Fixes: opencontainers#115
Signed-off-by: Brandon Philips <[email protected]>
philips pushed a commit to philips/image-spec that referenced this issue Jun 17, 2016
DiffIDs and Manifest list digests were a bit confusing. Explain the
difference.

Fixes: opencontainers#115
Signed-off-by: Brandon Philips <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants