Make the order more clear #330

coolljt0725 · 2016-09-19T12:08:02Z

Signed-off-by: Lei Jitang [email protected]
order could be ascending order of descending order,
to avoid confusing, make it clear about the order of layer.

wking · 2016-09-19T17:50:36Z

config.md

@@ -244,7 +244,7 @@ This specification uses the following terms:
            Implementations MUST generate an error if they encounter a unknown value while verifying or unpacking an image.
          </li>
          <li>
-            <code>diff_ids</code> is an array of layer content hashes (<code>DiffIDs</code>), in order from bottom-most to top-most.
+            <code>diff_ids</code> is an array of layer content hashes (<code>DiffIDs</code>), in order from bottom-most to top-most and bottom-most at index 0.


#318 is proposing consolidated language for the manifest's layers. Having layers in one JSON blob and diff_ids in another is awkward, but I'd suggest leaning on the layers definition with something like:

<code>diff_ids</code> has the same semantics as the <a href="manifest.md#image-manifest-property-descriptions">manifest's `layers`</a>, except that the digests MUST all be taken after removing any compression.

Without layers, diff_ids seems pretty useless. How do you recover the type of the referenced blob? Can we land the +gzip suffix being discussed in #316 and convert diff_ids to an array of descriptors? Then you could tell that one entry pointed at anapplication/vnd.oci.image.layer.tar while another pointed at an application/vnd.oci.image.layer.nondistributable.tar and another pointed at an application/vnd.example.image.extension. Then diff_ids would have a shot at being useful in its own right (in cases where the consumer had access to a CAS engine holding the uncompressed blobs).

@wking Frow what I understanding, layers and diff_ids are different concept at some point. layers is defined in manifest to reference layers in image layout, and diff_ids is defined in config.json to reference layers in image local storage for engine like docker.

@wking Be very careful: I have explained the concept of diff_ids several times and you keep keeping it absolutely wrong. I would prefer if you could take the time to understand the difference before commenting further on this matter. You are causing a lot of confusion and wasting a lot of time of the maintainers and contributors.

This literally has nothing to do with #316, #318 or descriptors. The CAS system has no meaning when inside the image config. This is for calculating the secure ChainID which is used as the image id.

wking · 2016-09-20T08:14:35Z

On Mon, Sep 19, 2016 at 07:06:46PM -0700, Lei Jitang wrote:

       <code>diff_ids</code> is an array of layer content hashes (<code>DiffIDs</code>), in order from bottom-most to top-most.
       <code>diff_ids</code> is an array of layer content hashes (<code>DiffIDs</code>), in order from bottom-most to top-most and bottom-most at index 0.
@wking Frow what I understanding, layers and diff_ids are
different concept at some point. layers is defined in manifest to
reference layers in image layout, and diff_ids is defined in
config.json to reference layers in image local storage for engine
like docker.

There's no fundamental distinction between CAS blobs in image-layout
and CAS blobs in local storage. You could certainly use image-layout
for local storage. My understanding of diff_ids (based on #115) is
that the uncompressed application/vnd.oci.image.layer.tar are
reasonably easy to round-trip (with tar-split or similar) while the
compressed application/vnd.oci.image.layer.tar+gzip are hard to
round-trip (because compression is unstable between, and sometimes
within, compressor implementations). So hashing over the uncompressed
application/vnd.oci.image.layer.tar gives you stable hashes (without
having to preserve the original blobs), which in turn give you more
stable image hashes 1. Using descriptors:

"diff_ids": [
{
"digest": "sha256:abc…",
"mediaType": "application/vnd.oci.image.layer.tar",
"size": 3072
},
…
]

is just as stable (especially with a SHOULD canonical JSON, #259) as
the current:

"diff_ids": [
"sha256:abc…",
…
]

and the descriptor version has the added benefit that you know which
media type you're pointing at without having to cross-reference the
manifest's ‘layers’.

And with the descriptor form, the only difference between ‘layers’ and
‘diff_ids’ is that you should use uncompressed media types and blobs
(which is what my suggested wording was focusing on 2).

If we stick with the digest-only array, it's a bit harder to lean on
‘layers’. But I still think we want to avoid two parts of the spec
using two different phrasings to express the same ordering
requirements.

stevvooe · 2016-09-20T21:38:42Z

Not LGTM: this addition makes it less clear.

The statement says it is "in order from bottom-most to top-most", which implies index 0. Adding the extra sentence makes the clause very confusing.

wking · 2016-09-20T21:52:46Z

On Tue, Sep 20, 2016 at 02:37:31PM -0700, Stephen Day wrote:

       <code>diff_ids</code> is an array of layer content hashes (<code>DiffIDs</code>), in order from bottom-most to top-most.
       <code>diff_ids</code> is an array of layer content hashes (<code>DiffIDs</code>), in order from bottom-most to top-most and bottom-most at index 0.
@wking Be very careful: I have explained the concept of diff_ids
several times and you keep keeping it absolutely wrong. I would
prefer if you could take the time to understand the difference
before commenting further on this matter. You are causing a lot of
confusion and wasting a lot of time of the maintainers and
contributors.

I'm trying very hard to understand it, but am unclear on why having a
CAS meaning is a bad idea.

This literally has nothing to do with #316, #318 or descriptors. The
CAS system has no meaning when inside the image config. This is for
calculating the secure ChainID which is used as the image id.

I understood the image ID to be a hash over the config JSON 1. That
does not include the ChainID (computed by hashing diffIDs 2). As
far as image IDs go, the important thing seems to be that it includes
some round-trip-able representation of the layer array, which you can
accomplish with either an array of digests (like we have now) or an
array of descriptors (like I'm proposing 3), as long as the digests
are over uncompressed content. If you're using descriptors, you have
the added benefit that you could use them to grab CAS objects
(e.g. from a local cache) and know how to unpack them to construct the
rootfs without needing a manifest to tell you the media types. The
only downside I can see is a one-time bump in imageIDs when folks
translate their config from the old array-of-digests to the new
array-of-descriptors. That seems like a small enough price to pay,
and folks who don't want to pay it can just stick with their existing
application/vnd.docker.container.image.v1+json config (or whatever
they're using now).

coolljt0725 · 2016-09-21T02:00:32Z

@stevvooe

The statement says it is "in order from bottom-most to top-most", which implies index 0.

I don't know if implies is good for a spec, I just think we should make everything explicit in a spec without any ambiguous. :)

stevvooe · 2016-09-21T02:52:58Z

@wking You're continuing to confuse two concepts. There is no CAS at the image config level. By the time the system is processing the diff ids, CAS doesn't matter. If you couple CAS at this layer, you make for a very bad design. These are only to be used in the calculation of the ChainID. The fact that these share structure with the uncompressed layer artifacts is just a convenience. It is not supposed to be used for fetching content or integrated with some other ill-conceived contraption.

Even worse, this has not a thing to do with this PR. You've taken yet another discussion and made it completely off topic and completely unproductive.

The fact that you responded within 10 minutes of my earlier comments tells me that you've put no further effort into understanding the concept. Taking time to understand the problem doesn't mean crapping out another paragraph of nonsense. It means reading the docker code and trying to understand the behavior and intent. Try to calculate an image ID using the image config and chain ID and match it with the docker implementation. Then, look at the issues faced when pulling images that arise when you couple remote identification (ie digests) with local verification values.

That will build true understanding.

wking · 2016-09-21T06:39:11Z

On Tue, Sep 20, 2016 at 07:52:59PM -0700, Stephen Day wrote:

There is no CAS at the image config level.

All you need for CAS is a digest, and we have that now. It's also
useful to know what to do with the object you've pulled out of CAS,
and for that you'll want a media type. Which is why I'm proposing we
use descriptors here.

By the time the system is processing the diff ids, CAS doesn't
matter. If you couple CAS at this layer, you make for a very bad
design. These are only to be used in the calculation of the ChainID.

Then just include a ChainID and skip diff_ids altogether.

But I don't buy CAS-coupling here being a very bad idea. Say I store
my uncompressed tarballs on a filesystem that does implicit
compression (e.g. btrfs). And I refuse to store gzipped objects in my
local store; I just store the uncompressed versions (with btrfs
compressing them before they hit the block level). A diff_ids full of
descriptors gives me the digests I need to fetch those uncompressed
layers and the media types I need to interpret those uncompressed
layers. That may not be the most efficient way to store and use a
large body of layers, but it doesn't seem like it would be “very bad”.

The fact that these share structure with the uncompressed layer
artifacts is just a convenience. It is not supposed to be used for
fetching content or integrated with some other ill-conceived
contraption.

You certainly want to be transmitting compressed layers over the wire.
But I don't see why that requires the CAS blobs to be explicitly
compressed inside the image-layout. And #332 seems to be getting some
positive review and will allow for uncompressed layer blobs in
image-layout (so users can use them in situtations where they make
sense).

Even worse, this has not a thing to do with this PR.

If ‘diff_ids’ hold descriptors, it becomes very easy to define it
clearly be leaning on ‘layers’ (or vice versa). I'm trying to avoid
having two parts of the spec using two different phrasings to express
the same ordering requirements 1. So my suggestion here is that we
resolve the diff_ids ambiguity by DRYing up the diff_ids/layers split,
and that seems on-topic to me.

The fact that you responded within 10 minutes of my earlier comments
tells me that you've put no further effort into understanding the
concept.

“you keep keeping it absolutely wrong” didn't give me a lot to go on
2 ;). You have some more-constructive advice below:

Try to calculate an image ID using the image config and chain ID and
match it with the docker implementation.

docker --version

Docker version 1.11.2, build b9f10c9

docker images --no-trunc debian

REPOSITORY TAG IMAGE ID CREATED SIZE
debian 7 sha256:dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2 7 weeks ago 84.9 MB
debian latest sha256:1b01529cc499d51767c62f9fc8083043610546b4e050898809ec54e00dbb1a34 7 weeks ago 125.1 MB

sha256sum /var/lib/docker/image/btrfs/imagedb/content/sha256/dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2

dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2 /var/lib/docker/image/btrfs/imagedb/content/sha256/dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2

jq . /var/lib/docker/image/btrfs/imagedb/content/sha256/dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2

{
"architecture": "amd64",
…
"rootfs": {
"type": "layers",
"diff_ids": [
"sha256:10a32ce5555eb1ad981a08014840a6e28de60506fd07a979d74b148774316c05"
]
}
}

So it's pretty clear that the image ID is a hash over the config JSON,
and does not involve the ChainID. That matches my earlier
understanding 3, which is not surprising because the spec is very
clear on this point 4.

For the ChainID:

$ cat x.go
package main

import (
"fmt"

      "github.com/docker/docker/layer"

)

func main() {
chainID := layer.CreateChainID([]layer.DiffID{
"sha256:10a32ce5555eb1ad981a08014840a6e28de60506fd07a979d74b148774316c05",
})
fmt.Println(chainID)
}
$ go run x.go
sha256:10a32ce5555eb1ad981a08014840a6e28de60506fd07a979d74b148774316c05

Which is also unsurprising because the specs clearly say the chain ID
of the initial layer is its diff ID 5.

Then, look at the issues faced when pulling images that arise when
you couple remote identification (ie digests) with local
verification values.

I'm missing the issues you're pointing at here. Is this “we can't use
on-the-wire-only layer compression because of the risk of
compression-bombs”? I'm not suggesting we drop ‘layers’, so folks who
will be shipping compressed content are still free to use descriptors
pointing at the compressed objects, and they can sign that manifest or
something further up the Merkle tree so I know I can compare the
compressed content with the trusted descriptor before decompressing.
I'm suggesting we use descriptors for diff_ids, which would mean that
I could notice diff_ids[0] is layer tarball with a digest of
sha256:10…, fetch it from my local store, and unpack it without
looking at the manifest at all.

I fail to see how adding media type information about the uncompressed
content will trigger issues with pulling content from remote
repositories. You still have all the information you had before, so
you can still take whatever steps you were previously taking to
protect yourself from whatever the remote identification / local
verification issue was. But if you have a more targetted link for
that issue, I'm happy to do more reading.

Signed-off-by: Lei Jitang <[email protected]>

wking · 2016-10-02T21:52:20Z

Obsolete after #349?

coolljt0725 force-pushed the order branch from 58d139f to 43bfb14 Compare September 19, 2016 12:10

coolljt0725 changed the title ~~Make it clear about the order~~ Make the order more clear Sep 19, 2016

wking reviewed Sep 19, 2016

View reviewed changes

wking mentioned this pull request Sep 20, 2016

media-types: Define '+gzip' structured syntax suffix #332

Closed

coolljt0725 force-pushed the order branch from 43bfb14 to e758a9e Compare September 28, 2016 02:29

Make the order more clear

e758a9e

Signed-off-by: Lei Jitang <[email protected]>

coolljt0725 closed this Oct 3, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the order more clear #330

Make the order more clear #330

coolljt0725 commented Sep 19, 2016

wking Sep 19, 2016

coolljt0725 Sep 20, 2016

stevvooe Sep 20, 2016

wking commented Sep 20, 2016

stevvooe commented Sep 20, 2016

wking commented Sep 20, 2016

coolljt0725 commented Sep 21, 2016

stevvooe commented Sep 21, 2016

wking commented Sep 21, 2016

wking commented Oct 2, 2016 via email

Make the order more clear #330

Make the order more clear #330

Conversation

coolljt0725 commented Sep 19, 2016

wking Sep 19, 2016

Choose a reason for hiding this comment

coolljt0725 Sep 20, 2016

Choose a reason for hiding this comment

stevvooe Sep 20, 2016

Choose a reason for hiding this comment

wking commented Sep 20, 2016

stevvooe commented Sep 20, 2016

wking commented Sep 20, 2016

coolljt0725 commented Sep 21, 2016

stevvooe commented Sep 21, 2016

wking commented Sep 21, 2016

docker --version

docker images --no-trunc debian

sha256sum /var/lib/docker/image/btrfs/imagedb/content/sha256/dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2

jq . /var/lib/docker/image/btrfs/imagedb/content/sha256/dc81f1af026c21c1b2eb9861da188a154027ef11a46df42986165ce7b8e952c2

wking commented Oct 2, 2016 via email