Number of layers inconsistent with kaniko builds that copy the same files multiple times #251
Looking back through kokoro history I can see a bunch of similar failures: #249:
#243 (twice, both with same commit):
#244:
#238:
|
Hmm weird, yeah, I think it might be a flaky test -- does docker always build images with the same number of layers? That's the only reason I can think of as to why this might be happening. |
Another example from #256 - I'LL GET TO YOU YET YOUR DAYS ARE NUMBERED JUST LIKE YOUR LAYERS |
In GoogleContainerTools#251 we are investigating test flakes due to layer offsets not matching; this change will give us a bit more context so we can be sure which image has which number of layers. Also updated the reproducible Dockerfile to be built with the reproducible flag, which I think was the original intent (without this change, there is no difference between how `kaniko-dockerfile_test_copy_reproducible` and `kaniko-dockerfile_test_copy` are built).
Managed to reproduce it locally, so it's not a Docker version issue. Interestingly, it looks like it's the kaniko-built image that is short a layer:
I added some extra debug output, and the correct number of layers for that image should be 18. |
In GoogleContainerTools#251 we are investigating test flakes due to layer offsets not matching; this change will give us a bit more context so we can be sure which image has which number of layers, and it will also include the digest of the image -- since kaniko always pushes images to a remote repo, if the test fails we can pull the digest and see what is up. Also updated the reproducible Dockerfile to be built with the reproducible flag, which I think was the original intent (without this change, there is no difference between how `kaniko-dockerfile_test_copy_reproducible` and `kaniko-dockerfile_test_copy` are built).
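A rough sketch of the kind of check that extra context enables, using go-containerregistry -- the image names below are made up, and this is an illustration rather than the actual integration test code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

// layerInfo pulls an image's manifest and returns its digest and layer count,
// so a failing test can report exactly which pushed image to go inspect.
func layerInfo(image string) (v1.Hash, int, error) {
	ref, err := name.ParseReference(image)
	if err != nil {
		return v1.Hash{}, 0, err
	}
	img, err := remote.Image(ref)
	if err != nil {
		return v1.Hash{}, 0, err
	}
	digest, err := img.Digest()
	if err != nil {
		return v1.Hash{}, 0, err
	}
	layers, err := img.Layers()
	if err != nil {
		return v1.Hash{}, 0, err
	}
	return digest, len(layers), nil
}

func main() {
	// Hypothetical image names: compare the docker-built and kaniko-built
	// variants of the same Dockerfile.
	for _, image := range []string{
		"gcr.io/my-project/docker-dockerfile_test_copy",
		"gcr.io/my-project/kaniko-dockerfile_test_copy",
	} {
		digest, n, err := layerInfo(image)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d layers (digest %s)\n", image, n, digest)
	}
}
```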
Looking back at one of the images kaniko built which only had 17 layers, when I compare that image to one with 18 layers using |
lol nope, that was a bug I introduced into the test XD |
When reproduced on my machine, it looks like the difference is that in the 17 layer case, we get this:
And in the 18 layer case, we get:
|
Hmm, so I guess sometimes kaniko thinks files have changed and sometimes not. It's probably because this command and this command are doing the same thing (files probably shouldn't have changed after running the second command, which I guess is why the Docker images have 17 layers as well). I'd take a look and make sure snapshotting is happening correctly; here is where kaniko decides if a file has changed and should be added! |
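For context, here's a rough sketch in Go of what an mtime-based "has this file changed?" check looks like -- this is an illustration of the idea, not kaniko's actual hasher:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
	"time"
)

// fileKey holds the attributes an mtime-based snapshotter might compare.
// If the key matches the one recorded in the previous snapshot, the file is
// treated as unchanged and left out of the new layer.
type fileKey struct {
	mode     os.FileMode
	size     int64
	mtime    time.Time
	uid, gid uint32
}

func keyFor(path string) (fileKey, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return fileKey{}, err
	}
	stat := fi.Sys().(*syscall.Stat_t) // Linux-specific
	return fileKey{
		mode:  fi.Mode(),
		size:  fi.Size(),
		mtime: fi.ModTime(),
		uid:   stat.Uid,
		gid:   stat.Gid,
	}, nil
}

func main() {
	before, err := keyFor("/etc/hostname")
	if err != nil {
		log.Fatal(err)
	}
	// ... imagine a COPY of identical content landing here ...
	after, err := keyFor("/etc/hostname")
	if err != nil {
		log.Fatal(err)
	}
	// If the kernel hasn't surfaced the new mtime yet, this reports "unchanged"
	// even though the file was just rewritten -- which is the flake in this issue.
	fmt.Println("changed:", before != after)
}
```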
Thanks @priyawadhwa ! I think that I've found a way to reproduce this pretty reliably, with this Dockerfile:
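(The Dockerfile itself didn't survive the copy-paste above; a hypothetical one along these lines, copying the same directory several times, shows the shape of the repro -- the `context/` path is made up for illustration.)

```dockerfile
# Illustrative only -- not the exact Dockerfile from this comment.
FROM scratch
# Each COPY rewrites the same files; whether kaniko emits a layer for the
# 2nd and 3rd COPY depends on whether it sees an updated mtime afterwards.
COPY context/ /foo
COPY context/ /foo
COPY context/ /foo
```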
The weirdest thing is that it's not consistent: the number of missing layers varies, e.g. between these two runs with the above Dockerfile:
It seems like copying a dir multiple times is the problem. I also tried copying just one file multiple times:
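(Again hypothetically, the single-file version would look something like this, with `context/foo` as a placeholder name:)

```dockerfile
# Illustrative only.
FROM scratch
COPY context/foo /foo
COPY context/foo /foo
COPY context/foo /foo
```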
And similarly, the number of layers was different. @priyawadhwa does the log message "No files were changed, appending empty layer to config" mean that an actual layer should be added in that case? That doesn't seem to be the behavior (no layer is added from what I can tell). |
Yeah, that's correct -- that log message means that an actual layer won't be added, but an "empty layer" will be added to the config just to show that a command was run. |
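In OCI-config terms, that "empty layer" is just a history entry with `empty_layer` set and no corresponding entry in the rootfs diff IDs. A small illustration using go-containerregistry types (not kaniko's actual code):

```go
package main

import (
	"fmt"
	"time"

	v1 "github.com/google/go-containerregistry/pkg/v1"
)

func main() {
	cfg := v1.ConfigFile{}

	// A command that changed no files still gets a history entry, but no layer
	// is produced, so nothing is appended to RootFS.DiffIDs.
	cfg.History = append(cfg.History, v1.History{
		Created:    v1.Time{Time: time.Now()},
		CreatedBy:  "COPY foo /foo", // hypothetical command
		Comment:    "No files were changed, appending empty layer to config",
		EmptyLayer: true,
	})

	fmt.Printf("history entries: %d, layers: %d\n", len(cfg.History), len(cfg.RootFS.DiffIDs))
}
```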
Okay so it looks like what is happening is:
It turns out that the mtime isn't always up to date by the time we stat the file.
Sometimes the mtime reflects the change and sometimes it doesn't. However, if you introduce a call to `sync` before comparing mtimes, the mtimes appear to be reliably updated.
This does make the complete for loop take about an order of magnitude longer; however, if we call this once per layer, I think we're looking at only < 100 ms of extra time per layer. |
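In Go, that idea boils down to flushing filesystem metadata once before the per-layer walk. A minimal sketch (Linux-only, and not the actual kaniko snapshotter):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

func main() {
	// Flush dirty filesystem metadata so the stat() calls below reflect the
	// writes the previous command just made. This is the extra per-layer cost
	// being discussed. Linux-only: syscall.Sync wraps sync(2).
	start := time.Now()
	syscall.Sync()
	fmt.Println("sync took", time.Since(start))

	// Now walk the filesystem and compare mtimes against the previous snapshot
	// (the comparison itself is elided here).
	err := filepath.Walk("/tmp", func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip unreadable paths
		}
		_ = info.ModTime() // would be compared to the recorded value
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```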
The default hashing algorithm used by kaniko to determine if two files are the same uses the file's mtime (the inode's modification time). It turns out that this time is not always up to date, meaning that a file could be modified but when you stat the file, the modification time may not yet have been updated.

The copy integration tests were adding the same directory twice, the second instance being to test copying a directory with a wildcard '*'. Since the mtime is sometimes not updated, this caused kaniko to sometimes think the files were the same and sometimes think they were different, varying the number of layers it created.

Now we update those tests to use a completely different set of files instead of copying the same files again, and we add a new test (`Dockerfile_test_copy_same_file`) which intentionally copies the same file multiple times and reliably reproduces the issue.

We fix the issue by calling `sync` before we start comparing mtimes. This will slow down layer snapshotting - on my personal machine it costs ~30 ms to call, and added ~4 seconds to running all of the `Dockerfile_test_copy*` tests. I'm assuming that adding 30 ms per layer is okay, but it's a potential place to speed things up later if we need to.

Fixes GoogleContainerTools#251

_Interesting note: if you build this same Dockerfile with devicemapper, you end up with only 2 layers! `¯\_(ツ)_/¯`_
Unfortunately, even after adding a call to `sync`, the flake still shows up. Other ideas for how to fix this:
What do you think @priyawadhwa ? |
From the latest failures in #273:
The modtime is the same :''''( |
Talked with @priyawadhwa and we're gonna try out option 2! |
As discussed let's go for option 2!! 😃 |
Aw yeah, eventually consistent github commenting 😎 |
The default hashing algorithm used by kaniko to determine if two files are the same uses the file's mtime (the inode's modification time). It turns out that this time is not always up to date, meaning that a file could be modified but when you stat the file, the modification time may not yet have been updated. The copy integration tests were adding the same directory twice, the second instance being to test copying a directory with a wildcard '*'. Since the mtime is sometimes not updated, this caused kaniko to sometimes think the files were the same and sometimes think they were different, varying the number of layers it created.

Now we update those tests to use a completely different set of files instead of copying the same files again. In a later commit (which will hopefully fix GoogleContainerTools#251) we will add a fix for this and a new test case that will intentionally exercise this functionality. In the meantime we'll prevent noisy test failures for submitters.

Fixes GoogleContainerTools#251
Kaniko uses mtime (as well as file contents and other attributes) to determine if files have changed. COPY and ADD commands should _always_ update the mtime, because they actually overwrite the files. However, it turns out that the mtime can lag, so kaniko would sometimes add a new layer when using COPY or ADD on a file, and sometimes would not. This leads to a non-deterministic number of layers.

To fix this, we have updated the kaniko commands to be more authoritative in declaring when they have changed a file (e.g. WORKDIR will now only create the directory when it doesn't exist) and we will trust those files and _always_ add them, instead of only adding them if they appear to have changed.

It is possible for RUN commands to also change the filesystem, in which case kaniko has no choice but to look at the filesystem to determine what has changed. For this case we have added a call to `sync`; however, we still cannot guarantee that the mtime will not lag, which could cause the number of layers to be non-deterministic. That said, when I tried to cause this behaviour with the RUN command, I couldn't.

This changes the snapshotting logic a bit; before this change, the last command of the last stage in a Dockerfile would always scan the whole filesystem and ignore the files returned by the kaniko command. Instead we will now trust those files and assume that the snapshotting performed by previous commands will be adequate.

Docker itself seems to rely on the storage driver to determine when files have changed and so doesn't have to deal with these problems directly. An alternative implementation would use `inotify` to track which files have changed. However, that would mean watching every file in the filesystem, and adding new watches as files are added. Not only is there a limit on the number of files that can be watched, but according to the man pages a) this can take a significant amount of time, b) there is complication around when events arrive (e.g. by the time they arrive, the files may have changed), and lastly c) events can be lost, which would mean we'd run into this non-deterministic behaviour again anyway.

Fixes GoogleContainerTools#251
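Very roughly, the shape of that change looks like the sketch below -- the `snapshotter` type and its methods are invented for illustration and don't match kaniko's real interfaces:

```go
package main

import (
	"fmt"
	"syscall"
)

// snapshotter records which files go into the next layer.
// The type and methods here are invented for illustration.
type snapshotter struct {
	layerFiles []string
}

// takeSnapshot is called after each Dockerfile command. If the command knows
// exactly which files it wrote (COPY, ADD, WORKDIR, ...), those paths are added
// unconditionally -- no mtime comparison, so no race with the kernel. If files
// is nil (e.g. after RUN), fall back to scanning the filesystem, calling
// sync(2) first to give mtimes the best chance of being up to date.
func (s *snapshotter) takeSnapshot(files []string) {
	if files != nil {
		s.layerFiles = append(s.layerFiles, files...)
		return
	}
	syscall.Sync()
	s.layerFiles = append(s.layerFiles, s.scanForChangedFiles()...)
}

// scanForChangedFiles stands in for the full mtime/contents walk.
func (s *snapshotter) scanForChangedFiles() []string { return nil }

func main() {
	snap := &snapshotter{}
	snap.takeSnapshot([]string{"/app/foo"}) // COPY declared exactly what it changed
	snap.takeSnapshot(nil)                  // RUN: scan the filesystem instead
	fmt.Println(snap.layerFiles)
}
```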
The integration test run for #249 failed even though the PR is fixing a typo in a README.
Seems like we either have a flaky test or the test is broken! :O