Cache to disk: EBS volumes in EKS #3244

Open
rarecrumb opened this issue Jan 23, 2024 · 5 comments
Labels
community (Community contribution), enhancement (New feature or request), needs triage (Requires review from the maintainers)

Comments

@rarecrumb

What would you like added?

I would like to be able to cache Docker images and other parts of the build process to a disk.

Why is this needed?

To speed up builds and skip repetitive steps.

Additional context

This seems to have been widely discussed in this project, but the newest version of RunnerSet behaves more like a Deployment than a StatefulSet, which makes reusing disks in AWS difficult. I imagine that if the RunnerSet or AutoscalingRunnerSet used a StatefulSet, we could reuse specific disks...
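Roughly what I have in mind, as a minimal sketch only: a StatefulSet whose volumeClaimTemplates give each ordinal pod its own EBS-backed PVC, so the same disk is re-attached when that ordinal scales back up. The runner image and the gp3 StorageClass below are placeholders, not something I've tested with ARC:

```bash
# Minimal sketch: each ordinal pod gets its own PVC (docker-cache-runner-0,
# docker-cache-runner-1, ...) backed by an EBS volume via the EBS CSI driver.
# Image name and StorageClass are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: runner
spec:
  serviceName: runner
  replicas: 3
  selector:
    matchLabels:
      app: runner
  # Keep each PVC (and its EBS volume) around when the set scales down,
  # so the same disk is reused on the next scale-up.
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain
    whenDeleted: Delete
  template:
    metadata:
      labels:
        app: runner
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest   # placeholder image
          volumeMounts:
            - name: docker-cache
              mountPath: /var/lib/docker
  volumeClaimTemplates:
    - metadata:
        name: docker-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3       # assumes an EBS CSI StorageClass named "gp3"
        resources:
          requests:
            storage: 100Gi
EOF
```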

rarecrumb added the community, enhancement, and needs triage labels on Jan 23, 2024
@AlonAvrahami

AlonAvrahami commented Jan 23, 2024

Hi @rarecrumb!

I already looked into this in the past. Unfortunately, given that you are running multiple runners (each with its own Docker daemon) across multiple hosts, a shared image store on a shared volume (mounted at /var/lib/docker) is not possible.

Sharing a single image store between multiple daemons running simultaneously is not something that can be supported; the storage is designed to be accessed exclusively by a single daemon.

However, I do have two ideas for optimizing the build process (a rough command sketch for both follows the list):

  1. Using a shared registry.
    By running some kind of local registry (exposed through a Service object in the cluster so the runners can reach it), you can build your images and tag them with the local registry's prefix.
    The first build will have to produce the full image, but in later builds some of the layers will already exist in the registry.
    This method also uses the Docker BuildKit mentioned in option 2.
    As for image retention, it should be easier to implement, as this is something the registry should already support.

  2. Using Docker BuildKit and a local cache:
    https://docs.docker.com/build/cache/backends/
    You can mount some kind of NFS share (in AWS: EFS / FSx) and use Docker BuildKit's --cache-to/--cache-from against the local (NFS) storage.
    You will also need to think about how to manage the cached layers in terms of data retention, as the cache can fill up without a proper retention policy. Read more about it here: https://docs.docker.com/build/cache/garbage-collection/
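A rough sketch of what the build commands could look like for both options; the registry Service name, image names, and NFS mount path are placeholders, and I haven't tested these exact commands. Note that the registry and local cache exporters generally need a docker-container (or kubernetes) buildx builder rather than the default docker driver:

```bash
# Option 1: share layers through an in-cluster registry
# (placeholder Service address registry.ci.svc.cluster.local:5000).
docker buildx build \
  --cache-from type=registry,ref=registry.ci.svc.cluster.local:5000/myapp:buildcache \
  --cache-to type=registry,ref=registry.ci.svc.cluster.local:5000/myapp:buildcache,mode=max \
  -t registry.ci.svc.cluster.local:5000/myapp:latest \
  --push .

# Option 2: export the BuildKit cache to a shared NFS/EFS mount
# (placeholder path /mnt/efs/buildcache).
docker buildx build \
  --cache-from type=local,src=/mnt/efs/buildcache \
  --cache-to type=local,dest=/mnt/efs/buildcache,mode=max \
  -t myapp:latest .
```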

Let me know if this helped you, and if it did, which approach you chose :)

@rarecrumb
Author

> I already looked into this in the past. Unfortunately, given that you are running multiple runners (each with its own Docker daemon) across multiple hosts, a shared image store on a shared volume (mounted at /var/lib/docker) is not possible.

Wouldn't a StatefulSet help with this? Scaling up would mean that an empty cache is initialized on the first run, but those EBS disks would be reused by the ordered pods... runner-0 would always mount disk-0, for example.

> Sharing a single image store between multiple daemons running simultaneously is not something that can be supported; the storage is designed to be accessed exclusively by a single daemon.

With the StatefulSet example, this is fine. runner-0 creates disk-0 and runs jobs, then scales down, but the disk remains, waiting for runner-0 to spin back up and re-mount it. During a scale-up event, runner-1 / runner-2 / runner-N all create their own fresh disks, but subsequent runs can use those caches. When spinning down during a scale-down event, those disks can be retained until scaling back up to N again.

> However, I do have two ideas for optimizing the build process:
>
>   1. Using a shared registry.
>     By running some kind of local registry (exposed through a Service object in the cluster so the runners can reach it), you can build your images and tag them with the local registry's prefix.
>     The first build will have to produce the full image, but in later builds some of the layers will already exist in the registry.
>     This method also uses the Docker BuildKit mentioned in option 2.
>     As for image retention, it should be easier to implement, as this is something the registry should already support.
>   2. Using Docker BuildKit and a local cache:
>     https://docs.docker.com/build/cache/backends/
>     You can mount some kind of NFS share (in AWS: EFS / FSx) and use Docker BuildKit's --cache-to/--cache-from against the local (NFS) storage.
>     You will also need to think about how to manage the cached layers in terms of data retention, as the cache can fill up without a proper retention policy. Read more about it here: https://docs.docker.com/build/cache/garbage-collection/
>
> Let me know if this helped you, and if it did, which approach you chose :)

What I've done for the moment is implement the s3 backend in --cache-from and --cache-to with mode=max, and it seems to help a lot.
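Roughly like this (bucket, region, and cache name are placeholders; the s3 cache backend also needs a containerized BuildKit builder rather than the default docker driver):

```bash
# Create a docker-container builder so the s3 cache exporter is available.
docker buildx create --name ci --driver docker-container --use

# Push/pull the BuildKit cache to S3 (placeholder bucket/region/name).
docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=my-buildcache-bucket,name=myapp \
  --cache-to type=s3,region=us-east-1,bucket=my-buildcache-bucket,name=myapp,mode=max \
  -t myapp:latest .
```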

I am curious about running a private registry inside of Kubernetes just for caching purposes, and I wonder whether that would be faster than using the ECR registry as a cache...

@danielloader

I've done a reasonable amount of work with external highly cached buildkitd deployments, and for image builds that's working great.

That being said, I'm still hoping we can find a way to cache the _tools directory; pulling and setting up things like setup-node and setup-java all the time chews up a lot of time and bandwidth.

Building a custom runner image and pre-baking it all is one solution, but it's still not as useful as being able to just drop this tools directory onto high-performance block storage.

I considered using EFS, but NFS and node_modules aren't really friends, and I'm not convinced it was faster than sorting it all out from a fresh state.

@rarecrumb
Author

> I've done a reasonable amount of work with external highly cached buildkitd deployments, and for image builds that's working great.

Can you share more about this process and what your setup looks like?

Another approach I'm thinking about here is creating an EBS snapshot and mounting copies of that to my runners on each build...
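One way that could look, as a sketch only (assumes the EBS CSI driver and the snapshot controller are installed; all names are placeholders and I haven't tried this yet): a PVC whose dataSource points at a VolumeSnapshot of a pre-warmed cache disk, so each new runner starts from a copy of the snapshot.

```bash
# Sketch: clone a pre-warmed cache disk from an EBS snapshot by creating a PVC
# with a VolumeSnapshot dataSource. Requires the EBS CSI driver and the
# external snapshot controller; names and sizes are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-cache-from-snapshot
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3             # assumes an EBS CSI StorageClass
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: runner-cache-snapshot     # pre-existing snapshot of a warmed cache disk
  resources:
    requests:
      storage: 100Gi
EOF
```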

@danielloader

Basically this: https://medium.com/vouchercodes-tech/speeding-up-ci-in-kubernetes-with-docker-and-buildkit-7890bc47c21a, and then I configure my docker build jobs to use a remote buildkit rather than the local one.
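For reference, pointing the build jobs at the remote buildkitd looks roughly like this; the Service address is a placeholder for wherever buildkitd is exposed in the cluster:

```bash
# Register the in-cluster buildkitd as a remote buildx builder
# (placeholder address for the buildkitd Service; assumes plain TCP, no mTLS).
docker buildx create \
  --name remote-buildkit \
  --driver remote \
  tcp://buildkitd.buildkit.svc.cluster.local:1234

# Build jobs then use that builder instead of the runner-local daemon.
docker buildx build --builder remote-buildkit -t myapp:latest .
```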
