Cache to disk: EBS volumes in EKS #3244

Open
rarecrumb opened this issue Jan 23, 2024 · 5 comments
Labels
community (Community contribution), enhancement (New feature or request), needs triage (Requires review from the maintainers)

Comments

@rarecrumb

What would you like added?

I would like to be able to cache Docker images and other parts of the build process to a disk.

Why is this needed?

To speed up builds and skip repetitive steps.

Additional context

This seems to have been widely discussed in this project, but the newest version of RunnerSet behaves more like a Deployment than a StatefulSet, which makes reusing disks in AWS difficult. I imagine that if the RunnerSet or AutoscalingRunnerSet used a StatefulSet, we could reuse specific disks...
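Roughly what I have in mind, as a minimal sketch only: a StatefulSet whose volumeClaimTemplates give each ordinal pod its own EBS-backed PVC, so the same disk is re-attached when that ordinal scales back up. The runner image and the gp3 StorageClass below are placeholders, not something I've tested with ARC:

```bash
# Minimal sketch: each ordinal pod gets its own PVC (docker-cache-runner-0,
# docker-cache-runner-1, ...) backed by an EBS volume via the EBS CSI driver.
# Image name and StorageClass are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: runner
spec:
  serviceName: runner
  replicas: 3
  selector:
    matchLabels:
      app: runner
  # Keep each PVC (and its EBS volume) around when the set scales down,
  # so the same disk is reused on the next scale-up.
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain
    whenDeleted: Delete
  template:
    metadata:
      labels:
        app: runner
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest   # placeholder image
          volumeMounts:
            - name: docker-cache
              mountPath: /var/lib/docker
  volumeClaimTemplates:
    - metadata:
        name: docker-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3       # assumes an EBS CSI StorageClass named "gp3"
        resources:
          requests:
            storage: 100Gi
EOF
```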

rarecrumb added the community, enhancement, and needs triage labels on Jan 23, 2024
@AlonAvrahami

AlonAvrahami commented Jan 23, 2024

Hi @rarecrumb!

I already looked into this in the past. Unfortunately, given that you are running multiple runners (each with its own Docker daemon) across multiple hosts, a shared image store on a shared volume (mounted at /var/lib/docker) is not possible.

Sharing a single image store between multiple daemons running simultaneously is not something that can be supported; the storage is designed to be accessed exclusively by a single daemon.

However, I do have two ideas for optimizing the build process (a rough command sketch for both follows the list):

  1. Using a shared registry.
    By running some kind of local registry (exposed through a Service object in the cluster so the runners can reach it), you can build your images and tag them with the local registry's prefix.
    The first build will have to produce the full image, but in later builds some of the layers will already exist in the registry.
    This method also uses the Docker BuildKit mentioned in option 2.
    As for image retention, it should be easier to implement, as this is something the registry should already support.

  2. Using Docker BuildKit and a local cache:
    https://docs.docker.com/build/cache/backends/
    You can mount some kind of NFS share (in AWS: EFS / FSx) and use Docker BuildKit's --cache-to/--cache-from against the local (NFS) storage.
    You will also need to think about how to manage the cached layers in terms of data retention, as the cache can fill up without a proper retention policy. Read more about it here: https://docs.docker.com/build/cache/garbage-collection/
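A rough sketch of what the build commands could look like for both options; the registry Service name, image names, and NFS mount path are placeholders, and I haven't tested these exact commands. Note that the registry and local cache exporters generally need a docker-container (or kubernetes) buildx builder rather than the default docker driver:

```bash
# Option 1: share layers through an in-cluster registry
# (placeholder Service address registry.ci.svc.cluster.local:5000).
docker buildx build \
  --cache-from type=registry,ref=registry.ci.svc.cluster.local:5000/myapp:buildcache \
  --cache-to type=registry,ref=registry.ci.svc.cluster.local:5000/myapp:buildcache,mode=max \
  -t registry.ci.svc.cluster.local:5000/myapp:latest \
  --push .

# Option 2: export the BuildKit cache to a shared NFS/EFS mount
# (placeholder path /mnt/efs/buildcache).
docker buildx build \
  --cache-from type=local,src=/mnt/efs/buildcache \
  --cache-to type=local,dest=/mnt/efs/buildcache,mode=max \
  -t myapp:latest .
```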

Let me know if this helped you, and if it did, which approach you chose :)

@rarecrumb
Author

> I already looked into this in the past. Unfortunately, given that you are running multiple runners (each with its own Docker daemon) across multiple hosts, a shared image store on a shared volume (mounted at /var/lib/docker) is not possible.

Wouldn't a StatefulSet help with this? Scaling up would mean that an empty cache is initialized on the first run, but those EBS disks would be reused by the ordered pods... runner-0 would always mount disk-0, for example.

> Sharing a single image store between multiple daemons running simultaneously is not something that can be supported; the storage is designed to be accessed exclusively by a single daemon.

With the StatefulSet example, this is fine. runner-0 creates disk-0 and runs jobs, then scales down, but the disk remains, waiting for runner-0 to spin back up and re-mount it. During a scale-up event, runner-1 / runner-2 / runner-N all create their own fresh disks, but subsequent runs can use those caches. When spinning down during a scale-down event, those disks can be retained until scaling back up to N again.

> However, I do have two ideas for optimizing the build process:
>
>   1. Using a shared registry.
>     By running some kind of local registry (exposed through a Service object in the cluster so the runners can reach it), you can build your images and tag them with the local registry's prefix.
>     The first build will have to produce the full image, but in later builds some of the layers will already exist in the registry.
>     This method also uses the Docker BuildKit mentioned in option 2.
>     As for image retention, it should be easier to implement, as this is something the registry should already support.
>   2. Using Docker BuildKit and a local cache:
>     https://docs.docker.com/build/cache/backends/
>     You can mount some kind of NFS share (in AWS: EFS / FSx) and use Docker BuildKit's --cache-to/--cache-from against the local (NFS) storage.
>     You will also need to think about how to manage the cached layers in terms of data retention, as the cache can fill up without a proper retention policy. Read more about it here: https://docs.docker.com/build/cache/garbage-collection/
>
> Let me know if this helped you, and if it did, which approach you chose :)

What I've done for the moment is implement the s3 backend in --cache-from and --cache-to with mode=max, and it seems to help a lot.
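Roughly like this (bucket, region, and cache name are placeholders; the s3 cache backend also needs a containerized BuildKit builder rather than the default docker driver):

```bash
# Create a docker-container builder so the s3 cache exporter is available.
docker buildx create --name ci --driver docker-container --use

# Push/pull the BuildKit cache to S3 (placeholder bucket/region/name).
docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=my-buildcache-bucket,name=myapp \
  --cache-to type=s3,region=us-east-1,bucket=my-buildcache-bucket,name=myapp,mode=max \
  -t myapp:latest .
```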

I am curious about running a private registry inside of Kubernetes just for caching purposes, and I wonder whether that would be faster than using the ECR registry as a cache...

@danielloader

I've done a reasonable amount of work with external highly cached buildkitd deployments, and for image builds that's working great.

That being said, I'm still hoping we can find a way to cache the _tools directory; pulling and setting up things like setup-node and setup-java all the time chews up a lot of time and bandwidth.

Building a custom runner image and pre-baking it all is one solution, but it's still not as useful as being able to just drop this tools directory onto high-performance block storage.

I considered using EFS, but NFS and node_modules aren't really friends, and I'm not convinced it was faster than sorting it all out from a fresh state.

@rarecrumb
Author

> I've done a reasonable amount of work with external highly cached buildkitd deployments, and for image builds that's working great.

Can you share more about this process and what your setup looks like?

Another approach I'm thinking about here is creating an EBS snapshot and mounting copies of that to my runners on each build...
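One way that could look, as a sketch only (assumes the EBS CSI driver and the snapshot controller are installed; all names are placeholders and I haven't tried this yet): a PVC whose dataSource points at a VolumeSnapshot of a pre-warmed cache disk, so each new runner starts from a copy of the snapshot.

```bash
# Sketch: clone a pre-warmed cache disk from an EBS snapshot by creating a PVC
# with a VolumeSnapshot dataSource. Requires the EBS CSI driver and the
# external snapshot controller; names and sizes are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-cache-from-snapshot
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3             # assumes an EBS CSI StorageClass
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: runner-cache-snapshot     # pre-existing snapshot of a warmed cache disk
  resources:
    requests:
      storage: 100Gi
EOF
```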

@danielloader

Basically this: https://medium.com/vouchercodes-tech/speeding-up-ci-in-kubernetes-with-docker-and-buildkit-7890bc47c21a, and then I configure my docker build jobs to use a remote buildkit rather than the local one.
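For reference, pointing the build jobs at the remote buildkitd looks roughly like this; the Service address is a placeholder for wherever buildkitd is exposed in the cluster:

```bash
# Register the in-cluster buildkitd as a remote buildx builder
# (placeholder address for the buildkitd Service; assumes plain TCP, no mTLS).
docker buildx create \
  --name remote-buildkit \
  --driver remote \
  tcp://buildkitd.buildkit.svc.cluster.local:1234

# Build jobs then use that builder instead of the runner-local daemon.
docker buildx build --builder remote-buildkit -t myapp:latest .
```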
