
Cache final images instead of build layers #161

Closed
MarcoPolo wants to merge 25 commits from the marco/image-caching branch

Conversation

Contributor

@MarcoPolo MarcoPolo commented Mar 31, 2023

I think this will make @thomaseizinger very happy ;)

In an effort to speed things up, this changes the caching strategy from relying on Docker's somewhat opaque rules to simply caching the generated image. Here's how it works:

  1. We define some cache key. Generally this is either the commit we're building from or a hash of the input files; it's up to the implementation to decide.
  2. We check S3 directly to see if we have a file named $IMAGE_NAME-$CACHE_KEY-$ARCH.tar.gz. If we do, we import that image.
  3. No? Build as normal.
  4. If we have the PUSH_CACHE env var set, we can push the built image to S3 (a rough shell sketch of this flow follows the list).
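A rough sketch of that flow in shell; the bucket, paths, and variable names ($AWS_BUCKET, $CACHE_KEY, etc.) are illustrative rather than the literal wrapper code:

CACHE_FILE="$IMAGE_NAME-$CACHE_KEY-$ARCH.tar.gz"
CACHE_URL="s3://$AWS_BUCKET/imageCache/$CACHE_FILE"
if aws s3 ls "$CACHE_URL" > /dev/null 2>&1; then
    # cache hit: import the saved image instead of building
    aws s3 cp "$CACHE_URL" - | docker image load
else
    # cache miss: build as normal
    docker build -t "$IMAGE_NAME" .
    # optionally publish the result so later runs can reuse it
    if [ -n "$PUSH_CACHE" ]; then
        docker image save "$IMAGE_NAME" | gzip | aws s3 cp - "$CACHE_URL"
    fi
fi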

Q: Why not a GH registry?
A: Then we have to deal with GC ourselves. With S3 we can rely on object lifecycle rules to expire old cache entries.
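For example, expiration can be configured once on the bucket; the prefix and retention period below are made up for illustration:

aws s3api put-bucket-lifecycle-configuration --bucket "$AWS_BUCKET" \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-image-cache",
      "Filter": { "Prefix": "imageCache/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }]
  }'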

Q: Why not an S3 registry?
A: That requires a bit more work to set up: you need a registry container running in the background. This is simpler. It also supports public read-only access to the cache.

Benefits

  1. Faster CI runs for everyone. Less time is spent building, even compared to a fully layer-cached build (I noticed one js-libp2p PR took 8 minutes before even starting the test).
  2. More predictable caching. The caching strategy is dead simple here, so it's easy to see why a cache miss happened.
  3. Anonymous S3 reads. Everyone can use the cache (see the sketch after this list).
  4. Faster downloads of images. If the final container image is small (it should be), then this is significantly less data than pulling every single layer and rebuilding.
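A sketch of such an anonymous read, assuming the bucket allows public reads; --no-sign-request tells the AWS CLI to skip credentials entirely:

aws s3 cp --no-sign-request "s3://$AWS_BUCKET/imageCache/$CACHE_FILE" - | docker image load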

Why didn't we do this from the start?

One step at a time. The build layer caching strategy was convenient and it was correct. I was hoping it would be better, but here we are.

Note that this solution is different from (and imo better than) each implementation managing its own images and pushing to a registry. I say better since only this repo needs to worry about pushing images and GC'ing them.

Why now?

@achingbrain blocked my PR until I make interop tests go vroom.

Contributor

@thomaseizinger thomaseizinger left a comment


Nice idea. We are getting pretty close to a registry 😁

The fingerprinting might have bugs. As we all know, cache invalidation is hard :)

Does S3 support a PUT-like operation? We could use that to have the user supply the cache key, much like we would supply the name of the docker image for a registry.

What do you think?

@MarcoPolo
Contributor Author

> The fingerprinting might have bugs

I think if we limit the context to the directory (or subdirectory) where the build script is called from and the builds are reproducible, then this shouldn’t be a problem since we are hashing all the inputs. Is there some bug you had in mind?

Note this is an input-hash-addressed scheme rather than a content-hash-addressed scheme. There are trade-offs here, but this is generally a bit more straightforward.
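As an illustration of the input-hash-addressed idea (not the repo's actual helper, and assuming filenames without newlines): hash every input file, then hash the combined list to get a single key.

CACHE_KEY=$(find . -type f -not -name image.json | sort | xargs sha256sum | sha256sum | cut -d' ' -f1)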

@thomaseizinger
Contributor

> The fingerprinting might have bugs

> I think if we limit the context to the directory (or subdirectory) where the build script is called from and the builds are reproducible, then this shouldn’t be a problem since we are hashing all the inputs. Is there some bug you had in mind?

Nothing specifically.

What do you think of using an explicit key and overwriting the cache each time? The key could still be computed by the action through the working directory, for example, to not increase the config surface area.

@MarcoPolo MarcoPolo marked this pull request as ready for review April 3, 2023 17:29
@MarcoPolo
Contributor Author

> What do you think of using an explicit key and overwriting the cache each time? The key could still be computed by the action through the working directory, for example, to not increase the config surface area.

I'm confused. Isn't this what we're doing?

docker image save $IMAGE_NAME | gzip | aws s3 cp - s3://$AWS_BUCKET/imageCache/$BUILD_HASH.tar.gz

The explicit key here is BUILD_HASH.

> overwriting the cache each time

Maybe I don't follow, but why would we want to do this?

@MarcoPolo MarcoPolo marked this pull request as draft April 3, 2023 23:34
@MarcoPolo
Contributor Author

Ah, I think I understand what you mean. Instead of this being part of the wrapper, we have the caller give us a key to use. We can expose helpers to do the correct thing. But this could be useful if the caller wants to hash one file instead of a directory because they know the build is reproducible from that file. e.g. The caller can pass in a commit hash and Dockerfile hash as the key since it'll be downloading the repo at that commit anyways.
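For example (hypothetical variable names), a caller that knows it will download the repo at a given commit could pass something like this as the key, without hashing the whole checkout:

CACHE_KEY="${COMMIT_SHA}-$(sha256sum Dockerfile | cut -d' ' -f1)"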

@thomaseizinger
Contributor

More or less. I was thinking that if we just supply a key, we could basically emulate a registry. With a registry, I'd push a new version of the rust-libp2p-ping-interop-test image every time master changes. I don't really care about the hash of all inputs at this point. The fact that we pushed to master has authority.

Perhaps I am too influenced by how we do caching in rust-libp2p: we treat all caches for PRs as read-only and always update all caches on pushes to master.

@MarcoPolo
Contributor Author

I don't want to rebuild and republish unnecessarily.

One change I'm trying to make to speed things up is to check if we get a cache hit before downloading the source to build from. This will give us a bit of a speed up if we do hit since we don't need to fetch the repo source.
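A sketch of that check (bucket and key names illustrative); head-object exits non-zero when the object is missing, so the source fetch only happens on a miss:

if aws s3api head-object --bucket "$AWS_BUCKET" --key "imageCache/$CACHE_KEY.tar.gz" > /dev/null 2>&1; then
    echo "cache hit: load the cached image and skip fetching the source repo"
else
    echo "cache miss: download the source and build as before"
fi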

@thomaseizinger
Contributor

> I don't want to rebuild and republish unnecessarily.

> One change I'm trying to make to speed things up is to check if we get a cache hit before downloading the source to build from. This will give us a bit of a speed up if we do hit since we don't need to fetch the repo source.

That is getting to a point where referencing a tagged image from a registry here would give you all of that without any code in here.

What if we combine two solutions:

  1. For images of released versions, push them to a registry. They never change unless we make patch releases, and publishing on those can be automated.
  2. For the latest head, we pass in an image hash from an image we just built in the PR.

@MarcoPolo MarcoPolo force-pushed the marco/image-caching branch from a3e3d7f to ee27a53 on April 4, 2023 19:09
@MarcoPolo MarcoPolo force-pushed the marco/image-caching branch from 48920ed to bf78222 on April 4, 2023 19:56
@MarcoPolo
Contributor Author

First off, thanks for the input @thomaseizinger. I really appreciate it!

Here are the constraints I'm trying to satisfy:

  1. Be reproducible for a given commit.
  2. The cache is an optimization. Things should be fine without it.
  3. If we have a cache hit, be fast.

> 1. For images of released versions, push them to a registry. They never change unless we make patch releases, and publishing on those can be automated.

If we did this and put something like "rust-v0.48.0" here, then previous builds may reproduce incorrectly, because what "rust-v0.48.0" meant back then and what it means after a new patch release are different. This would break the first constraint.

We could say we'll use something like "rust-v0.48.0-some-cache-key", but then we have to make sure we always update "some-cache-key" properly. This cache key could be a commit, for example, but the downside is that we then have to remember to update the cache key whenever we make a change to a test binary (e.g. update the commit hash in both places). Seems error-prone to me. And it breaks down when we aren't relying on just a commit hash (what's the current hash of all the input files?).

This branch is somewhat trying to do the latter approach, but it derives the cache key from the inputs and builds automatically if the cache misses. But maybe there's some other, better solution? I don't love writing bash scripts (especially when environments differ in subtle ways: what if someone is running an older version of env or bash or curl? GNU make vs macOS make?).

@thomaseizinger
Contributor

With "they never change", I meant that there is never a need to update things in here (IIRC, that updates in here require releasing an image somewhere else was part of the criticism back when we discussed this).

We would release a new image even for the patch release, meaning the old container would always be the same. We can pin its hash too if we want to be absolutely sure.

A registry still breaks constraint 2, but it can never fulfill that. It also doesn't have to IMO. If we go as far as "not downloading the code if the cache exists", then we are almost building a registry in S3, just without docker being able to pull from it automatically.

image.json: rust-libp2p-${commitSha}
cd rust-libp2p-${commitSha} && IMAGE_NAME=${image_name} ../../../dockerBuildWrapper.sh -f interop-tests/Dockerfile .
docker image inspect ${image_name} -f "{{.Id}}" | \
ifneq (${shouldUseCache},)
Contributor


I am not very happy with the fact that we are starting to do actual programming in here. Is there a way we can write this in a proper language?

I am already scared of touching this and there is a lot of duplication between the versions too ...

Contributor Author


Each version is considered a separate thing. There's no dependency between them. We can remove duplication at the expense of interlinking these things. I don't think that's worth it.

This code is also relatively straightforward. But at a high level, I agree that make and bash are not the best solution. However, rewriting all this in X will give us no real benefit at the end of the day. If you want to make this better, I'd suggest looking into how to improve the test runner speed; each test takes about 2s. Can we speed that up? Updating that will give us the most bang for our buck.

@@ -65,12 +65,12 @@ runs:

   - name: Build images
     working-directory: ${{ steps.find-workdir.outputs.WORK_DIR }}
-    run: make
+    run: make -j 2
Contributor


Is there a reason to still use make here? We could as well just use typescript for this step.

Going even further, we could make it a "proper" action that is published to the GitHub marketplace. That would allow us to move away from a composite one to either Docker (and allow us to write it in go or Rust) or JavaScript.

Contributor Author


I'm not opposed to it. Make gives us a lot of stuff out of the box, like this -j flag for parallelization. I might punt on that for now though.

Contributor Author


To add another point, this is the root directory of these tests. Make here does nothing but call make in the implementation folders. I wouldn't want to remove this make. It's a good interface point: implementations know they have to provide a Makefile that generates an image.json. The rest can be done however they want.
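A hypothetical minimal per-implementation Makefile to illustrate that contract (the exact image.json contents and wrapper path are assumptions, not the repo's actual files; recipe lines are tab-indented):

IMAGE_NAME := my-impl-interop-test

image.json:
	IMAGE_NAME=$(IMAGE_NAME) ../dockerBuildWrapper.sh .
	docker image inspect $(IMAGE_NAME) -f '{"imageID": "{{.Id}}"}' > $@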

@MarcoPolo MarcoPolo marked this pull request as ready for review April 7, 2023 20:50
@MarcoPolo
Contributor Author

This makes the checks go from 40+ min to ~25 min.

@MarcoPolo
Contributor Author

> A registry still breaks constraint 2, but it can never fulfill that. It also doesn't have to IMO.

If you break constraint 2 like that, you'll also break constraint 1. You aren't building from a commit. You're building from a commit + the state of the registry.

@MarcoPolo
Contributor Author

Also, this should significantly speed up the Nim-libp2p and Zig-libp2p runs, since they'll now be able to take advantage of the public cache. It should shave 25 min off those runs. (Previously builds would take 30 min, but now everyone can use the cache, which means setup should only take <5 min.)

@MarcoPolo
Contributor Author

@thomaseizinger / @mxinden This system also benefits from a minimal final image size. It would be great if one of you could update the Dockerfile to use multistage builds and only emit the minimal image.

Contributor

@marten-seemann marten-seemann left a comment


This is a massive amount of code changes that are not well-contained. There are a lot of non-obvious steps that have to be inserted into every build file of every image.

Are we certain that massively complicating our setup is worth the speed up?
Specifically, can’t we get most of the benefits by using normal Docker caching when we use multistage builds for the implementations that output a binary?


cacheKey=${shell find . -type f | sort | grep -v image.json | tr "\n" " " | xargs ../../helpers/hashFiles.sh}
Contributor


This is a lot of duplicated, non-trivial code. Would it make sense to split out the common parts into a separate script?

Contributor Author


Happy to make that change in a follow up PR. But I don’t think this is that complicated. This doesn’t actually need the sort call since that’s handled by the helper.

We could also have the helper read from stdin instead of the args and that would remove the tr.

Long term, though, this will go away, as all future versions get defined in the implementation repos and the cache key becomes the commit.
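A sketch of what that stdin-reading variant might look like (hypothetical, not the current helpers/hashFiles.sh, and assuming newline-free file names):

#!/usr/bin/env bash
# read one file path per line on stdin, hash each file,
# then hash the combined list to produce a single cache key
sort | xargs sha256sum | sha256sum | cut -d' ' -f1

The Makefile call would then reduce to something like: find . -type f | grep -v image.json | ../../helpers/hashFiles.sh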

@MarcoPolo
Contributor Author

> This is a massive amount of code changes that are not well-contained. There are a lot of non-obvious steps that have to be inserted into every build file of every image.

Disagree. Yes, there’s a lot of duplication, but that’s because each thing is contained. The steps don’t have to be inserted. The worst that happens is the image isn’t cached.

Could this be changed? Sure. Maybe rewriting it in a simpler scripting language would help and keep things DRY; Makefiles aren’t that flexible, unfortunately. I would be open to that change in a future PR. For now this gives us a pretty big speed improvement that we want for getting the js-libp2p tests in.

> Are we certain that massively complicating our setup is worth the speed up?

Yes. This shaves off 10 min for every run. Even more if you didn’t have the AWS S3 keys set up (e.g. Nim).

> Specifically, can’t we get most of the benefits by using normal Docker caching when we use multistage builds for the implementations that output a binary?

I don’t follow. We were using docker layer caching before. Could you expand?

This is blocking getting js-libp2p tests in, which is blocking getting the latest js-libp2p version tested, which is blocking getting js-libp2p-webrtc and browser-to-browser tests in. I’m happy to iterate on this in the future, but we kind of need this sooner rather than later.

@marten-seemann
Contributor

> Specifically, can’t we get most of the benefits by using normal Docker caching when we use multistage builds for the implementations that output a binary?

> I don’t follow. We were using docker layer caching before. Could you expand?

We haven't used multi-stage builds though. Would using a multi-stage build speed up restoring from the cache, as Docker now presumably would only need to download the layers from the second stage, and can skip all builder stage layers?

@MarcoPolo
Contributor Author

> Specifically, can’t we get most of the benefits by using normal Docker caching when we use multistage builds for the implementations that output a binary?

> I don’t follow. We were using docker layer caching before. Could you expand?

> We haven't used multi-stage builds though. Would using a multi-stage build speed up restoring from the cache, as Docker now presumably would only need to download the layers from the second stage, and can skip all builder stage layers?

I don’t think so.

I have a small change I can make to reduce a lot of the code here. I’ll make that soon.

@thomaseizinger
Contributor

thomaseizinger commented Apr 10, 2023

> @thomaseizinger / @mxinden This system also benefits from a minimal final image size. It would be great if one of you could update the Dockerfile to use multistage builds and only emit the minimal image.

I opened an issue on the rust-libp2p repository: libp2p/rust-libp2p#3758
