
Number of layers inconsistent with kaniko builds that copy/copy the same files multiple times #251

Closed
bobcatfish opened this issue Jul 24, 2018 · 18 comments · Fixed by #273 or #289
Labels: kind/bug (Something isn't working)

Comments

@bobcatfish (Contributor)

The integration test run for #249 failed even though the PR is fixing a typo in a README.

Seems like we either have a flaky test or the test is broken! :O

++ go test
--- FAIL: TestLayers (49.18s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_reproducible (2.70s)
    	integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_reproducible and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_reproducible: expected 0 but got 1
FAIL
@bobcatfish bobcatfish self-assigned this Jul 24, 2018
@priyawadhwa priyawadhwa added the kind/bug Something isn't working label Jul 24, 2018
@bobcatfish (Contributor, Author)

Looking back through Kokoro history, I can see a bunch of similar failures:

#249:

++ go test
--- FAIL: TestLayers (49.18s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_reproducible (2.70s)
    	integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_reproducible and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_reproducible: expected 0 but got 1

#243 (twice, both with same commit):

++ go test
--- FAIL: TestLayers (49.67s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy (2.74s)
    	integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy and gcr.io/kaniko-test/kaniko-dockerfile_test_copy: expected 0 but got 1
++ go test
--- FAIL: TestLayers (48.30s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_reproducible (2.55s)
    	integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_reproducible and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_reproducible: expected 0 but got 1

#244:

++ go test
--- FAIL: TestLayers (53.94s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy (2.91s)
    	integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy and gcr.io/kaniko-test/kaniko-dockerfile_test_copy: expected 0 but got 1

#238:

++ go test
--- FAIL: TestLayers (49.92s)
    --- FAIL: TestLayers/test_layer_dockerfiles/Dockerfile_test_copy_bucket (2.61s)
    	integration_test.go:250: incorrect offset between layers of gcr.io/kaniko-test/docker-dockerfile_test_copy_bucket and gcr.io/kaniko-test/kaniko-dockerfile_test_copy_bucket: expected 0 but got 1

@priyawadhwa (Collaborator)

Hmm weird, yah, I think it might be a flaky test -- does Docker always build images with the same number of layers? That's the only reason I can think of as to why this might be happening.

@bobcatfish (Contributor, Author)

[screenshot]

Another example from #256 - I'LL GET TO YOU YET

YOUR DAYS ARE NUMBERED JUST LIKE YOUR LAYERS

@bobcatfish (Contributor, Author)

In a recent run, this failed twice:

[screenshot]

Super weird tho b/c the "reproducible" build, which uses the same image (and actually, no other changes) didn't fail 🤔

@bobcatfish (Contributor, Author)

But sometimes the reproducible image fails:

[screenshot]

It's always the images testing copying tho.

bobcatfish added a commit to bobcatfish/kaniko that referenced this issue Jul 31, 2018
In GoogleContainerTools#251 we are investigating test flakes due to layer offsets not
matching, this change will give us a bit more context so we can be sure
which image has which number of layers.

Also updated reproducible Dockerfile to be built with reproducible flag,
which I think was the original intent (without this change, there is no
difference between how `kaniko-dockerfile_test_copy_reproducible` and
`kaniko-dockerfile_test_copy` are built).
@bobcatfish (Contributor, Author)

Managed to reproduce it locally, so it's not a docker version issue:

[screenshot]

Interestingly, it looks like it's the kaniko-built image that is short a layer:

--- FAIL: TestLayers (59.86s)
    --- FAIL: TestLayers/test_layer_Dockerfile_test_copy_bucket (19.43s)
    	integration_test.go:210: Difference in number of layers in each image is 1 but should be 0. us.gcr.io/christiewilson-catfactory/docker-dockerfile_test_copy_bucket has 18 layers and us.gcr.io/christiewilson-catfactory/kaniko-dockerfile_test_copy_bucket has 17 layers

I added some extra debug output, and the correct number of layers for that image should be 18.

bobcatfish added a commit to bobcatfish/kaniko that referenced this issue Jul 31, 2018
In GoogleContainerTools#251 we are investigating test flakes due to layer offsets not
matching, this change will give us a bit more context so we can be sure
which image has which number of layers, and it will also include the
digest of the image, since kaniko always pushes images to a remote repo,
so if the test fails we can pull the digest and see what is up.

Also updated reproducible Dockerfile to be built with reproducible flag,
which I think was the original intent (without this change, there is no
difference between how `kaniko-dockerfile_test_copy_reproducible` and
`kaniko-dockerfile_test_copy` are built).
@bobcatfish (Contributor, Author)

Looking back at one of the images kaniko built that only had 17 layers: when I compare it to one with 18 layers using container-diff, container-diff reports no differences!! It seems like for some reason two layers are being squashed. Either that or there is a layer that does nothing?

@bobcatfish (Contributor, Author) commented Jul 31, 2018

thought I had reproduced this with some other images

lol nope, that was a bug I introduced into the test XD

@bobcatfish (Contributor, Author)

When reproduced on my machine, it looks like the difference is that in the 17-layer case, we get this:

time="2018-07-31T23:21:46Z" level=info msg="cmd: copy [context/b*]"
time="2018-07-31T23:21:46Z" level=info msg="dest: /baz/"
time="2018-07-31T23:21:46Z" level=info msg="Creating directory /baz"
time="2018-07-31T23:21:46Z" level=info msg="Creating directory /baz/bam"
time="2018-07-31T23:21:46Z" level=info msg="Copying file /workspace/context/bar/bam/bat to /baz/bam/bat"
time="2018-07-31T23:21:46Z" level=info msg="Copying file /workspace/context/bar/bat to /baz/bat"
time="2018-07-31T23:21:46Z" level=info msg="Copying file /workspace/context/bar/baz to /baz/baz"
time="2018-07-31T23:21:46Z" level=info msg="Taking snapshot of files [/baz/ /baz/bam /baz/bam/bat /baz/bat /baz/baz]..."
time="2018-07-31T23:21:46Z" level=info msg="No files were changed, appending empty layer to config."
time="2018-07-31T23:21:46Z" level=info msg="cmd: copy [context/foo context/bar/ba?]"

And in the 18-layer case, we get:

time="2018-07-31T23:22:05Z" level=info msg="cmd: copy [context/b*]"
time="2018-07-31T23:22:05Z" level=info msg="dest: /baz/"
time="2018-07-31T23:22:05Z" level=info msg="Creating directory /baz"
time="2018-07-31T23:22:05Z" level=info msg="Creating directory /baz/bam"
time="2018-07-31T23:22:05Z" level=info msg="Copying file /workspace/context/bar/bam/bat to /baz/bam/bat"
time="2018-07-31T23:22:05Z" level=info msg="Copying file /workspace/context/bar/bat to /baz/bat"
time="2018-07-31T23:22:05Z" level=info msg="Copying file /workspace/context/bar/baz to /baz/baz"
time="2018-07-31T23:22:05Z" level=info msg="Taking snapshot of files [/baz/ /baz/bam /baz/bam/bat /baz/bat /baz/baz]..."
time="2018-07-31T23:22:05Z" level=info msg="cmd: copy [context/foo context/bar/ba?]"

@priyawadhwa (Collaborator)

Hmm, so I guess sometimes kaniko thinks files have changed and sometimes not -- it's probably because the two COPY commands are doing the same thing (files probably shouldn't have changed after running the second command, which I guess is why the Docker images have 17 layers as well).

I'd probably take a look and make sure snapshotting is happening correctly; the MaybeAdd function is where kaniko decides if a file has changed and should be added!

@bobcatfish (Contributor, Author) commented Aug 1, 2018

Thanks @priyawadhwa ! I think that MaybeAdd function you linked to might be the culprit 🤔

I've found a way to reproduce this pretty reliably, with this Dockerfile:

FROM alpine@sha256:5ce5f501c457015c4b91f91a15ac69157d9b06f1a75cf9107bf2b62e0843983a
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/
COPY context/bar /baz/

The weirdest thing is that it's not consistent; the number of missing layers varies, e.g. between these two runs with the above Dockerfile:

--- FAIL: TestLayers (18.52s)
    --- FAIL: TestLayers/test_layer_Dockerfile_test_copy_fail (18.52s)
    	integration_test.go:220: Difference in number of layers in each image is 10 but should be 0. Image 1: Image: [us.gcr.io/christiewilson-catfactory/docker-dockerfile_test_copy_fail] Digest: [50ab99551cba99eb8b5a825198a050eff931b5b0128cd4b4588e75c8c49a626a] Number of Layers: [15], Image 2: Image: [us.gcr.io/christiewilson-catfactory/kaniko-dockerfile_test_copy_fail] Digest: [adab18e80865003107b1b251913ec026ba300579f559919f2468b072507b13fa] Number of Layers: [5]
FAIL
--- FAIL: TestLayers (17.29s)
    --- FAIL: TestLayers/test_layer_Dockerfile_test_copy_fail (17.29s)
    	integration_test.go:220: Difference in number of layers in each image is 1 but should be 0. Image 1: Image: [us.gcr.io/christiewilson-catfactory/docker-dockerfile_test_copy_fail] Digest: [50ab99551cba99eb8b5a825198a050eff931b5b0128cd4b4588e75c8c49a626a] Number of Layers: [15], Image 2: Image: [us.gcr.io/christiewilson-catfactory/kaniko-dockerfile_test_copy_fail] Digest: [f002e2b9f606e631bb3c3282ce97be2f50f7e88c03a4f087bbdf50799f08db40] Number of Layers: [14]
FAIL

It seems like copying a dir multiple times is the problem.

I also tried copying just one file multiple times:

FROM alpine@sha256:5ce5f501c457015c4b91f91a15ac69157d9b06f1a75cf9107bf2b62e0843983a
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo
COPY context/foo foo

And similarly, the number of layers was different (b/c MaybeAdd would have correctly determined that no new layer should be added, I guess), but it was consistent, whereas in the dir case it's inconsistent.
I tried the same thing again with more COPY statements, and it turns out the results were just as inconsistent; it just doesn't happen as often.

@priyawadhwa does the log message "No files were changed, appending empty layer to config" mean that an actual layer should be added in that case? That doesn't seem to be the behavior (no layer is added from what I can tell)

@priyawadhwa (Collaborator)

> does the log message "No files were changed, appending empty layer to config" mean that an actual layer should be added in that case? That doesn't seem to be the behavior (no layer is added from what I can tell)

Yah that's correct, that log message means that an actual layer won't be added, but an "empty layer" will be added to the config just to show that a command was run.

@bobcatfish (Contributor, Author) commented Aug 2, 2018

Okay so it looks like what is happening is:

  1. When we check if we should add a file to a layer
  2. We call MaybeAdd
  3. Which, if the file has already been added, by default compares the hashes of the files using the files' mtimes as an input

It turns out that the mtime can lag. At first I thought this was a Docker thing, or specific to the base images we are using, but I was also able to reproduce it on my Ubuntu host with this one-liner (thanks @dlorenc):

for i in $(seq 100); do echo "hey $i" > foo2 && stat foo2 | grep Modify; done;

Sometimes the mtime will lag. This means the contents of the file may have actually changed, but the mtime will make it look like the files are the same. This behaviour is not consistent, which is why the number of layers kaniko was building would vary.

However if you introduce a call to sync before each stat, the mtime seems to be consistently updated:

for i in $(seq 100); do echo "hey $i" > foo2 && sync && stat foo2 | grep Modify; done;

This does make the complete for loop take roughly an order of magnitude longer; however, if we call this once per layer, I think we're looking at only <100 ms of extra time per layer.

@bobcatfish bobcatfish changed the title TestLayers failed on typo PR Number of layers inconsistent with kaniko builds that copy/copy the same files multiple times Aug 2, 2018
bobcatfish added a commit to bobcatfish/kaniko that referenced this issue Aug 2, 2018
The default hashing algorithm used by kaniko to determine if two files
are the same uses the files' mtime (the inode's modification time). It
turns out that this time is not always up to date, meaning that a file
could be modified but when you stat the file, the modification time may
not yet have been updated.

The copy integration tests were adding the same directory twice, the
second instance being to test copying a directory with a wildcard '*'.
Since the mtime is sometimes not updated, this caused kaniko to
sometimes think the files were the same, and sometimes think they were
different, varying the number of layers it created.

Now we will update those tests to use a completely different set of
files instead of copying the same files again, and we add a new test
(`Dockerfile_test_copy_same_file`) which intentionally copies the same
file multiple times, which would reliably reproduce the issue.

We fix the issue by calling `sync` before we start comparing mtimes.
This will slow down layer snapshotting - on my personal machine it costs
~30 ms to call, and added ~4 seconds to running all of the
`Dockerfile_test_copy*` tests. I'm assuming that adding 30ms per layer
is okay, but it's a potential place to speed things up later if we need
to.

Fixes GoogleContainerTools#251

_Interesting note, if you build this same Dockerfile with devicemapper,
you end up with only 2 layers! `¯\_(ツ)_/¯` _
@bobcatfish (Contributor, Author)

Unfortunately even after adding a sync before checking hashes, it looks like the issue is still happening! This might have something to do with the fact that sync may return before the writes are actually completed (https://linux.die.net/man/2/sync).

Other ideas for how to fix this:

  1. Add more sync calls 😅 e.g. when kaniko starts up, and/or before the tests run (maybe all the test setup is causing a lot of data to get flushed when sync is called)
  2. Radical idea: always include files that are COPY/ADD-ed: the way COPY is implemented, the file will always be truncated and re-written, even if it hasn't changed, so we should really be including a layer for this every time (unless we want to start looking at file contents, which I'm guessing we don't b/c it's much too slow). See the sketch below.

What do you think @priyawadhwa ?

@bobcatfish (Contributor, Author)

From the latest failures in #273:

time="2018-08-02T23:53:10Z" level=info msg="Copying file /workspace/context/foo to /foo"
time="2018-08-02T23:53:10Z" level=info msg="Taking snapshot of files [/foo]..."
time="2018-08-02T23:53:10Z" level=info msg="File / Mode drwxr-xr-x ModTime 2018-08-02 23:53:10.775057552 +0000 UTC IsRegular %!s(bool=false)\n"
time="2018-08-02T23:53:10Z" level=info msg="File /foo Mode -rwxrwxr-x ModTime 2018-08-02 23:53:10.843057552 +0000 UTC IsRegular %!s(bool=true)\n"
time="2018-08-02T23:53:10Z" level=info msg="cmd: copy [context/foo]"
time="2018-08-02T23:53:10Z" level=info msg="dest: /foo"
time="2018-08-02T23:53:10Z" level=info msg="Copying file /workspace/context/foo to /foo"
time="2018-08-02T23:53:10Z" level=info msg="Taking snapshot of files [/foo]..."
time="2018-08-02T23:53:10Z" level=info msg="File / Mode drwxr-xr-x ModTime 2018-08-02 23:53:10.775057552 +0000 UTC IsRegular %!s(bool=false)\n"
time="2018-08-02T23:53:10Z" level=info msg="File /foo Mode -rwxrwxr-x ModTime 2018-08-02 23:53:10.843057552 +0000 UTC IsRegular %!s(bool=true)\n"
time="2018-08-02T23:53:10Z" level=info msg="No files were changed, appending empty layer to config."

The modtime is the same :''''(

@bobcatfish (Contributor, Author) commented Aug 3, 2018

Talked with @priyawadhwa and we're gonna try out option 2!

@priyawadhwa (Collaborator)

As discussed let's go for option 2!! 😃

@bobcatfish (Contributor, Author)

Aw yeah, eventually consistent github commenting 😎

bobcatfish added a commit to bobcatfish/kaniko that referenced this issue Aug 3, 2018
The default hashing algorithm used by kaniko to determine if two files
are the same uses the files' mtime (the inode's modification time). It
turns out that this time is not always up to date, meaning that a file
could be modified but when you stat the file, the modification time may
not yet have been updated.

The copy integration tests were adding the same directory twice, the
second instance being to test copying a directory with a wildcard '*'.
Since the mtime is sometimes not updated, this caused kaniko to
sometimes think the files were the same, and sometimes think they were
different, varying the number of layers it created.

Now we will update those tests to use a completely different set of
files instead of copying the same files again.

In a later commit (which will hopefully fix GoogleContainerTools#251) we will add a fix for
this and a new test case that will intentionally exercise this
functionality. In the meantime we'll prevent noisy test failures for
submitters.
@dlorenc dlorenc reopened this Aug 6, 2018
bobcatfish added a commit to bobcatfish/kaniko that referenced this issue Aug 24, 2018
Kaniko uses mtime (as well as file contents and other attributes) to
determine if files have changed. COPY and ADD commands should _always_
update the mtime, because they actually overwrite the files. However it
turns out that the mtime can lag, so kaniko would sometimes add a new
layer when using COPY or ADD on a file, and sometimes would not. This
leads to a non-deterministic number of layers.

To fix this, we have updated the kaniko commands to be more
authoritative in declaring when they have changed a file (e.g. WORKDIR
will now only create the directory when it doesn't exist) and we will
trust those files and _always_ add them, instead of only adding them if
they haven't changed.

It is possible for RUN commands to also change the filesystem, in which
case kaniko has no choice but to look at the filesystem to determine
what has changed. For this case we have added a call to `sync`; however,
we still cannot guarantee that the mtime will not lag, which could cause the
number of layers to be non-deterministic. However, when I tried to cause
this behaviour with the RUN command, I couldn't.

This changes the snapshotting logic a bit; before this change, the last
command of the last stage in a Dockerfile would always scan the whole
file system and ignore the files returned by the kaniko command. Instead
we will now trust those files and assume that the snapshotting
performed by previous commands will be adequate.

Docker itself seems to rely on the storage driver to determine when
files have changed and so doesn't have to deal with these problems
directly.

An alternative implementation would use `inotify` to track which files
have changed. However that would mean watching every file in the
filesystem, and adding new watches as files are added. Not only is there
a limit on the number of files that can be watched, but according to the
man pages a) this can take a significant amount of time b) there is
complication around when events arrive (e.g. by the time they arrive,
the files may have changed) and lastly c) events can be lost, which
would mean we'd run into this non-deterministic behaviour again anyway.

Fixes GoogleContainerTools#251