Cache final images instead of build layers #161
Conversation
Nice idea. We are getting pretty close to a registry 😁
The fingerprinting might have bugs. As we all know, cache invalidation is hard :)
Does S3 support a PUT-like operation? We could use that to have the user supply the cache key, much like we would supply the name of the docker image for a registry.
What do you think?
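For reference, S3 does expose plain PUT semantics via the AWS CLI, so a caller-supplied key would work; a minimal sketch (bucket and key names are illustrative, not the actual setup):

```sh
# Overwrite the cache object under a caller-chosen key, much like
# pushing a tag to a registry. Bucket/key are hypothetical.
docker image save "${IMAGE_NAME}" | gzip \
  | aws s3 cp - "s3://libp2p-test-cache/${USER_SUPPLIED_KEY}.tar.gz"
```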
I think if we limit the context to the directory (or subdirectory) where the build script is called from, and the builds are reproducible, then this shouldn't be a problem, since we are hashing all the inputs. Is there some bug you had in mind? Note this is an input-hash-addressed scheme rather than a content-hash-addressed scheme. There are trade-offs here, but this is generally a bit more straightforward.
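As a sketch of the input-hash-addressed scheme (the paths and key derivation are illustrative, not the exact code in this PR):

```sh
# Derive the cache key by hashing every input file under the build
# context in a stable order; any change to any input yields a new key.
cd "${BUILD_CONTEXT}"
CACHE_KEY=$(find . -type f -print0 | sort -z \
  | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)
```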
Nothing specifically. What do you think of using an explicit key and overwriting the cache each time? The key could still be computed by the action from the working directory, for example, so as not to increase the config surface area.
I'm confused. Isn't this what we're doing?
The explicit key here is
Maybe I don't follow, but why would we want to do this?
Ah, I think I understand what you mean. Instead of this being part of the wrapper, we have the caller give us a key to use. We can expose helpers to do the correct thing. But this could be useful if the caller wants to hash one file instead of a directory because they know the build is reproducible from that file. E.g., the caller can pass in a commit hash and Dockerfile hash as the key, since it'll be downloading the repo at that commit anyway.
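A sketch of that explicit-key variant (variable names are hypothetical):

```sh
# The caller derives the key from the commit SHA plus a hash of the
# Dockerfile, since the build is reproducible from those inputs alone.
DOCKERFILE_HASH=$(sha256sum interop-tests/Dockerfile | cut -d' ' -f1)
CACHE_KEY="${COMMIT_SHA}-${DOCKERFILE_HASH}"
```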
More or less. I was thinking that if we just supply a key, we could basically emulate a registry. With a registry, I'd push a new version of the
Perhaps I am too influenced by how we do caching in
I don't want to rebuild and republish unnecessarily. One change I'm trying to make to speed things up is to check whether we get a cache hit before downloading the source to build from. This gives us a bit of a speed-up on a hit, since we don't need to fetch the repo source.
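A sketch of that pre-download check (bucket name is illustrative):

```sh
# head-object exits non-zero when the key is absent, so we can test for
# a cache hit without fetching the repo source or the image itself.
if aws s3api head-object \
    --bucket libp2p-test-cache \
    --key "${IMAGE_NAME}-${CACHE_KEY}-${ARCH}.tar.gz" >/dev/null 2>&1; then
  echo "cache hit; skipping source download"
fi
```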
That is getting to a point where referencing a tagged image from a registry here would give you all of that without any code in here. What if we combine two solutions:
First off, thanks for the input @thomaseizinger. I really appreciate it! Here are the constraints I'm trying to solve:
If we did this, and put something like "rust-v0.48.0" here, then previous builds may reproduce incorrectly, because what "rust-v0.48.0" meant back then and what it means after a new patch are different. This would break the first constraint. We could say we'll use something like "rust-v0.48.0-some-cache-key", but then we have to make sure we always update "some-cache-key" properly. This cache key could be a commit, for example, but the downside is that we then have to remember to update the cache key whenever we make a change to a test binary (e.g. update the commit hash in both places). That seems error-prone to me, and it breaks down when we aren't relying on just a commit hash (what's the current hash of all the input files?).
This branch is somewhat trying to do the latter approach, but it derives the cache key from the inputs and builds automatically if the cache misses. But maybe there's some other, better solution? I don't love writing bash scripts (especially when environments differ in subtle ways; what if someone is running an older version of
With "they never change", I meant that there is never a need to update things in here (IIRC, that updates in here require releasing an image somewhere else was part of the criticism back when we discussed this). We would release a new image even for the patch release, meaning the old container would always be the same. We can pin its hash too if we want to be absolutely sure. A registry still breaks constraint 2 but it can never fulfill that. It also doesn't have to IMO. If we go as far as "not downloading the code if the cache doesn't exist" then we are just almost building a registry in S3 but without the support of having docker automatically pull it. |
image.json: rust-libp2p-${commitSha}
	cd rust-libp2p-${commitSha} && IMAGE_NAME=${image_name} ../../../dockerBuildWrapper.sh -f interop-tests/Dockerfile .
	docker image inspect ${image_name} -f "{{.Id}}" | \

ifneq (${shouldUseCache},)
I am not very happy with the fact that we are starting to do actual programming in here. Is there a way we can write this in a proper language?
I am already scared of touching this, and there is a lot of duplication between the versions too...
Each version is considered a separate thing. There's no dependency between them. We can remove duplication at the expense of interlinking these things. I don't think that's worth it.
This code is also relatively straightforward. At a high level, I agree that make and bash are not the best solution. However, rewriting all of this in X would give us no real benefit at the end of the day. If you want to make this better, I'd suggest looking into how to improve the test runner speed: each test takes about 2s. Can we speed that up? Fixing that will give us the most bang for our buck.
@@ -65,12 +65,12 @@ runs:
   - name: Build images
     working-directory: ${{ steps.find-workdir.outputs.WORK_DIR }}
-    run: make
+    run: make -j 2
Is there a reason to still use make here? We could just as well use TypeScript for this step.
Going even further, we could make it a "proper" action that is published to the GitHub marketplace. That would allow us to move away from a composite action to either a Docker one (which would let us write it in Go or Rust) or a JavaScript one.
I'm not opposed to it. Make gives us a lot of stuff out of the box, like this -j flag for parallelization. I might punt on that for now though.
To add another point, this is the root directory of these tests. Make here does nothing but call make in the implementation folders. I wouldn't want to remove this make; it's a good interface point: implementations know they have to provide a Makefile that generates an image.json, and the rest can be done however they want.
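A sketch of that fan-out, assuming one folder per implementation (the layout is illustrative):

```sh
# The top-level make only delegates: each implementation folder must
# provide a Makefile whose image.json target builds its image and
# writes an image.json describing it.
for impl in */ ; do
  ( cd "${impl}" && make image.json )
done
```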
This makes the checks go from 40+ min to ~25 min.
If you break constraint 2 like that, you'll also break constraint 1. You aren't building from a commit; you're building from a commit plus the state of the registry.
Also, this should significantly speed up Nim-libp2p and Zig-libp2p, since they'll now be able to take advantage of the public cache. It should shave ~25 min from those runs (previously builds would take 30 min, but now everyone can use the cache, which means setup should only take <5 min).
@thomaseizinger / @mxinden This system also benefits from a minimal final image size. It would be great if one of you could update the Dockerfile to use multistage builds and only emit the minimal image.
This is a massive amount of code changes that are not well contained. There are a lot of non-obvious steps that have to be inserted into every build file of every image.
Are we certain that massively complicating our setup is worth the speed up?
Specifically, can’t we get most of the benefits by using normal Docker caching when we use multistage builds for the implementations that output a binary?
cacheKey=${shell find . -type f | sort | grep -v image.json | tr "\n" " " | xargs ../../helpers/hashFiles.sh} |
This is a lot of duplicated, non-trivial code. Would it make sense to split out the common parts into a separate script?
Happy to make that change in a follow-up PR. But I don't think this is that complicated. This doesn't actually need the sort call, since that's handled by the helper. We could also have the helper read from stdin instead of the args, which would remove the tr.
Long term, though, this will go away as all the future versions are defined in the implementation repos and the cache key becomes the commit.
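A sketch of that stdin variant (the helper's behavior here is assumed, not its current interface):

```sh
# If hashFiles.sh read a newline-separated file list from stdin (and
# sorted it internally), both the tr and the xargs would go away:
find . -type f | grep -v image.json | ../../helpers/hashFiles.sh
```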
Disagree. Yes, there's a lot of duplication, but that's because each thing is contained. The steps don't have to be inserted; the worst that happens is that the image isn't cached. Could this be changed? Sure. Maybe rewriting it in a simpler scripting language would help and keep things DRY; Makefiles aren't that flexible, unfortunately. I would be open to that change in a future PR. For now, this gives us a pretty big speed improvement that we want for getting the js-libp2p tests in.
Yes. This shaves off 10 min for every run. Even more if you didn't have the AWS S3 keys set up (e.g. nim).
I don’t follow. We were using docker layer caching before. Could you expand? This is blocking getting js-libp2p tests in, which is blocking getting the latest js-libp2p version tested, which is blocking getting js-libp2p-webrtc and browser-to-browser tests in. I’m happy to iterate on this in the future, but we kind of need this sooner rather than later. |
We haven't used multi-stage builds, though. Would using a multi-stage build speed up restoring from the cache, since Docker would then presumably only need to download the layers from the second stage and could skip all the builder-stage layers?
I don’t think so. I have a small change I can make to reduce a lot of the code here. I’ll make that soon |
I opened an issue on the
I think this will make @thomaseizinger very happy ;)
In an effort to speed things up, this changes the caching strategy from relying on Docker's somewhat opaque rules to simply caching the generated image. Here's how it works:
- We check S3 for a cached $IMAGE_NAME-$CACHE_KEY-$ARCH.tar.gz; if we find one, we import that image.
- If the PUSH_CACHE env is set, we can push the built image to S3.

Q: Why not a GH registry?
A: Then we have to deal with GC. This lets us rely instead on S3's object-lifetime behavior to expire old cache entries.
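A sketch of such a lifetime rule via the AWS CLI (bucket name and expiry window are illustrative):

```sh
# Expire cached image tarballs automatically after 30 days; this is the
# GC that a GH registry would otherwise force us to implement ourselves.
aws s3api put-bucket-lifecycle-configuration \
  --bucket libp2p-test-cache \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-image-cache",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }]
  }'
```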
Q: Why not an S3 registry?
A: That requires a bit more work to set up: you need a registry container running in the background. This is simpler. It also supports public read-only access to the cache.
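Putting the pieces together, a sketch of the whole check/import/build/push flow (the bucket and build step are illustrative, not the exact scripts in this PR):

```sh
CACHE_FILE="${IMAGE_NAME}-${CACHE_KEY}-${ARCH}.tar.gz"

# Try to import a cached image; fall back to building from source.
if aws s3 cp "s3://libp2p-test-cache/${CACHE_FILE}" - 2>/dev/null \
    | docker image load; then
  echo "cache hit: imported ${CACHE_FILE}"
else
  ./build.sh  # hypothetical stand-in for the per-implementation build
  if [ -n "${PUSH_CACHE:-}" ]; then
    docker image save "${IMAGE_NAME}" | gzip \
      | aws s3 cp - "s3://libp2p-test-cache/${CACHE_FILE}"
  fi
fi
```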
Benefits
Why didn't we do this from the start?
One step at a time. The build layer caching strategy was convenient and it was correct. I was hoping it would be better, but here we are.
Note that this solution is different from (and IMO better than) each implementation managing its own images and pushing them to a registry. I say better because only this repo needs to worry about pushing images and GC'ing them.
Why now?
@achingbrain blocked my PR until I make interop tests go vroom.