
Unable to get dockerVolumeMounts working #452

Closed
Puneeth-n opened this issue Apr 13, 2021 · 31 comments
Comments

@Puneeth-n
Contributor

Hi, I am trying to mount an AWS FSx volume into the docker:dind image with the new dockerVolumeMounts feature, and I am not sure it is working as expected.

I pulled a Docker image from inside one runner and tried to do the same from another runner. The expectation was that the second runner would not pull it again, but it did.

The nodes are in the same AZ as the FSx volume, and all the GitHub Actions runners are running on these nodes.

Chart version: 0.10.5
Controller: v0.18.2

Runner config

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: comtravo-github-actions-deployment
  namespace: ${kubernetes_namespace.ci.metadata[0].name}
spec:
  template:
    spec:
      nodeSelector:
        node.k8s.comtravo.com/workergroup-name: github-actions
      image: harbor/cache/comtravo/actions-runner:v2.277.1
      imagePullPolicy: Always
      repository: ${local.actions.git_repository}
      serviceAccountName: ${local.actions.service_account_name}
      securityContext:
        fsGroup: 1447
      dockerVolumeMounts:
      - name: docker-volume
        mountPath: /var/lib/docker
      volumes:
      - name: docker-volume
        persistentVolumeClaim:
          claimName: ${kubernetes_persistent_volume_claim.actions_docker_volume.metadata[0].name}
      resources:
        limits:
          memory: "4Gi"
        requests:
          memory: "256Mi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: comtravo-github-actions-deployment-autoscaler
  namespace: ${kubernetes_namespace.ci.metadata[0].name}
spec:
  scaleTargetRef:
    name: comtravo-github-actions-deployment
  minReplicas: 4
  maxReplicas: 100
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
      - summerwind/actions-runner-controller
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "1m"

k -n ci describe runner comtravo-github-actions-deployment-8f2gx-5bmhm

Name:         comtravo-github-actions-deployment-8f2gx-5bmhm
Namespace:    ci
Labels:       runner-deployment-name=comtravo-github-actions-deployment
              runner-template-hash=6959d947d9
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         Runner
Metadata:
  Creation Timestamp:  2021-04-13T14:35:10Z
  Finalizers:
    runner.actions.summerwind.dev
  Generate Name:  comtravo-github-actions-deployment-8f2gx-
  Generation:     1
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
        f:generateName:
        f:labels:
          .:
          f:runner-deployment-name:
          f:runner-template-hash:
        f:ownerReferences:
      f:spec:
        .:
        f:dockerdContainerResources:
        f:image:
        f:imagePullPolicy:
        f:nodeSelector:
          .:
          f:node.k8s.comtravo.com/workergroup-name:
        f:repository:
        f:resources:
          .:
          f:limits:
            .:
            f:memory:
          f:requests:
            .:
            f:memory:
        f:securityContext:
          .:
          f:fsGroup:
        f:serviceAccountName:
        f:volumes:
      f:status:
        .:
        f:lastRegistrationCheckTime:
        f:phase:
        f:registration:
          .:
          f:expiresAt:
          f:repository:
          f:token:
    Manager:    manager
    Operation:  Update
    Time:       2021-04-13T15:07:16Z
  Owner References:
    API Version:           actions.summerwind.dev/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RunnerReplicaSet
    Name:                  comtravo-github-actions-deployment-8f2gx
    UID:                   2492f02a-ee74-4777-9df9-9fb07d9b138f
  Resource Version:        69345080
  Self Link:               /apis/actions.summerwind.dev/v1alpha1/namespaces/ci/runners/comtravo-github-actions-deployment-8f2gx-5bmhm
  UID:                     5c7c3de8-15ba-41ee-80ea-a291c0cbada8
Spec:
  Dockerd Container Resources:
  Image:              harbor.infra.comtravo.com/cache/comtravo/actions-runner:v2.277.1
  Image Pull Policy:  Always
  Node Selector:
    node.k8s.comtravo.com/workergroup-name:  github-actions
  Repository:                                comtravo/ct-backend
  Resources:
    Limits:
      Memory:  4Gi
    Requests:
      Memory:  256Mi
  Security Context:
    Fs Group:            1447
  Service Account Name:  actions
  Volumes:
    Name:  docker-volume
    Persistent Volume Claim:
      Claim Name:  actions-docker-volume
Status:
  Last Registration Check Time:  2021-04-13T15:07:16Z
  Phase:                         Running
  Registration:
    Expires At:  2021-04-13T15:34:31Z
    Repository:  comtravo/ct-backend
    Token:       ASS5GHOQCZPOS6FVDRFG2YTAOW5APAVPNFXHG5DBNRWGC5DJN5XF62LEZYANUGERWFUW443UMFWGYYLUNFXW4X3UPFYGLN2JNZ2GKZ3SMF2GS33OJFXHG5DBNRWGC5DJN5XA
Events:
  Type    Reason                    Age                From               Message
  ----    ------                    ----               ----               -------
  Normal  RegistrationTokenUpdated  31m                runner-controller  Successfully update registration token
  Normal  PodCreated                19s (x2 over 31m)  runner-controller  Created pod 'comtravo-github-actions-deployment-8f2gx-5bmhm'
@mumoshu
Collaborator

mumoshu commented Apr 13, 2021

@Puneeth-n Hey! I have no experience with FSx, but is a PVC backed by FSx supposed to mount the "same volume" into multiple pods?

Also, sharing /var/lib/docker isn't how it's supposed to work.

The Docker daemon was explicitly designed to have exclusive access to /var/lib/docker. Nothing else should touch, poke, or tickle any of the Docker files hidden there.
https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/

@mumoshu added the question (Further information is requested) label Apr 13, 2021
@Puneeth-n
Contributor Author

Puneeth-n commented Apr 14, 2021

@mumoshu thanks for reminding me about this article. Then, what is the main purpose of dockerVolumeMounts? To have, say, 50 persistent volumes for 50 runners? Aren't runners short-lived? I.e., don't they die and a new one come up after the action is done?

More importantly, how can I improve docker pull and docker build performance? I'm trying to build huge Docker containers with GitHub Actions and am not happy with the docker build times.

@mumoshu
Collaborator

mumoshu commented Apr 14, 2021

@Puneeth-n I think that's mainly for sharing more files between (1) the runner container that runs your job steps and (2) the Docker containers run within the dind container.

@mumoshu
Collaborator

mumoshu commented Apr 14, 2021

That being said, I would very much appreciate it if you could share more use cases for dockerVolumeMounts, if you find any 😄

@toast-gear
Collaborator

It was added in #439 to resolve #435. Unfortunately, it's another feature that was added without any documentation by the author, so it's not clear how it is expected to be used. Those issues may help @Puneeth-n figure that out.

A PR to add docs would be greatly appreciated, either from yourself or from @asoldino, the original author.

@mumoshu
Collaborator

mumoshu commented Apr 14, 2021

Thanks @toast-gear! I hadn't read @asoldino's original motivation very carefully:

Motivation: Having a persistent volume mounted to /var/lib/docker on the dind sidecar can improve the performance of container jobs (layer caching).

If you used e.g. host volumes, this would work as long as you have only one runner pod per host. In a public cloud like AWS, that implies you may prefer combining smaller EC2 instances with a one-pod-per-host model.

But I think it would be preferable to avoid using a PV just to make docker builds faster. You'd probably have a better experience using the nearest container image registry, e.g. ECR when you're in AWS, together with docker's --cache-from option, or even using some kind of object storage like S3 to back up/restore images and layers after/before the docker build.
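As a rough sketch of the registry-cache idea (the ECR URL and image name below are placeholders, not from this thread), a workflow step along these lines pulls the previous image and reuses its layers via --cache-from:

```yaml
# Hypothetical GitHub Actions job step; replace the ECR URL/repository with your own.
- name: Build with registry layer cache
  run: |
    IMAGE=123456789012.dkr.ecr.eu-west-1.amazonaws.com/myapp
    docker pull "$IMAGE:latest" || true          # best-effort cache warm-up
    docker build --cache-from "$IMAGE:latest" -t "$IMAGE:latest" .
    docker push "$IMAGE:latest"
```

The trade-off @asoldino raises below applies here: a pristine runner still has to download the cached image before it can reuse its layers.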

@asoldino
Contributor

Hi @mumoshu, in my scenario we use private runners mainly because we need powerful machines to run container jobs based on very large images (tens of GB, hosted on ACR/ECR) and complex compilation units. In this scenario, single tenancy of runners is desirable. Initial benchmarks showed an improvement of ~5 minutes per build (on a 10 GB image).
If using --cache-from means that a pristine runner has to download a 10 GB image every time, I don't think it solves the issue we have.
What I'm presenting is probably a corner case, albeit quite a common one where I'm coming from. This functionality has already helped a lot of people.

What do you think?

@mumoshu
Collaborator

mumoshu commented Apr 14, 2021

@asoldino Hey! Your scenario and the use of the feature seem completely valid. To be clear, you aren't concurrently writing to /var/lib/docker from multiple dockerd processes, right?

@asoldino
Contributor

Exactly, that's not supposed to happen

@mumoshu
Collaborator

mumoshu commented Apr 14, 2021

@asoldino Thanks for confirming!

@Puneeth-n Hey! I believe I wasn't clear. I only wanted to say that sharing /var/lib/docker among multiple concurrently running containers is wrong. If you can ensure only one container is writing to /var/lib/docker, it should work fine. That being said, if you'd like to share /var/lib/docker using a host volume, you will likely need to set some pod anti-affinity and/or large resource requests/limits to avoid two or more runner pods being scheduled onto the same host concurrently.
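For the host-volume route, a minimal sketch of the anti-affinity part might look like this, assuming the runner pods carry the runner-deployment-name label shown in the describe output above:

```yaml
# Sketch: prevent two runner pods from landing on the same node, so only
# one dockerd ever writes to that host's /var/lib/docker.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                runner-deployment-name: comtravo-github-actions-deployment
            topologyKey: kubernetes.io/hostname
```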

@mumoshu
Collaborator

mumoshu commented Apr 14, 2021

@Puneeth-n If you still need to use FSx, I think actions-runner-controller needs to be enhanced to let the user specify a PVC template rather than a PVC, like a Kubernetes StatefulSet does.

@Puneeth-n
Contributor Author

@mumoshu Thanks for clarifying.

@callum-tait-pbx
Contributor

@asoldino could you describe your setup a bit more for me, please? You have RunnerDeployments in a k8s cluster tied to a single really big k8s node which isn't shared with any other runners. Do your runners just have huge requests/limits?

@asoldino
Contributor

Sure:

  • I have three node pools in my cluster (one for system pods, one for the platform pods, including the actions-runner-controller, and one for the runners)
  • The runners node pool has a node autoscaler active (using managed components from AKS)
  • I make sure Kubernetes schedules the runners on the dedicated node pool, e.g.
#...
kind: RunnerDeployment
spec:
  template:
    spec:
      nodeSelector:
        agentpool: runners
#...
  • I make sure to cap the resources so only one runner fits per node, e.g. (for nodes with 8 cores and 32 GiB RAM)
#...
resources:
  limits:
    cpu: "4.0"
    memory: "16Gi"
dockerdContainerResources:
  limits:
    cpu: "4.0"
    memory: "16Gi"
#...
  • I have a HorizontalRunnerAutoscaler, e.g.
#...
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: runners
#...

To recap: the resource requests force Kubernetes to schedule one runner pod per node; when the runner autoscaler kicks in, the node autoscaler provisions the extra nodes required, and Kubernetes can eventually run the additional pods.

@Puneeth-n
Contributor Author

@asoldino are you using buildkit or --cache-from to speed up docker builds?

@asoldino
Contributor

@Puneeth-n I'm not actively working on the workflows; I'm "just" a platform provider for my company. I can tell there are a few teams using --cache-from or buildkit, because most of the container jobs are normal jobs executed within a container instead of on the runner directly. For us, it's much faster and easier to manage.

@Puneeth-n
Contributor Author

@mumoshu when do you plan to create a new release? I can't wait to test this feature out :)

@stale

stale bot commented Jun 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label Jun 9, 2021
@toast-gear
Collaborator

We've done another release 🎉

@Puneeth-n
Contributor Author

Puneeth-n commented Jun 11, 2021

@mumoshu @asoldino I can't thank you enough for this feature. It finally enabled me to switch our heavier workloads and huge Docker images from Jenkins to GitHub Actions 😅 This feature + buildkit is just amazing. It is such a liberating feeling to have almost deprecated Jenkins :D

@pratikbin
Contributor

pratikbin commented Sep 17, 2021

@Puneeth-n @callum-tait-pbx @mumoshu Question: I have a 32 GB, 8-CPU single node fully dedicated to 2 runners (I may increase the number of runner replicas in the near future), and I want to use a Docker registry as a pull-through cache pod. Is that architecturally right or wrong?

Yesterday I deployed a registry pod, configured dockerRegistryMirror, and ran 1 or 2 workflows; it was running fine. I haven't had a chance to test it more, and I'm looking forward to using volumeClaimTemplates to mount /var/lib/docker in each runner pod.
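For reference, a pull-through cache along these lines might be wired up as follows (a sketch; the names, namespace, and service URL are hypothetical, and the RunnerDeployment fragment is trimmed to the relevant field):

```yaml
# Sketch: registry:2 acting as a Docker Hub pull-through cache, plus the
# runner side pointing at it via dockerRegistryMirror.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry-mirror
spec:
  replicas: 1
  selector:
    matchLabels: {app: registry-mirror}
  template:
    metadata:
      labels: {app: registry-mirror}
    spec:
      containers:
      - name: registry
        image: registry:2
        env:
        - name: REGISTRY_PROXY_REMOTEURL
          value: https://registry-1.docker.io
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
spec:
  template:
    spec:
      dockerRegistryMirror: http://registry-mirror.default.svc.cluster.local:5000
```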

@pratikbin
Contributor

pratikbin commented Sep 17, 2021

^^ taking this to discussion #813 (comment)

@antodoms

@Puneeth-n @pratikbalar have you got a PVC to mount /var/lib/docker successfully in a RunnerDeployment? I am planning to use EFS to share the image cache between runner pods. Let me know if you had any success with it, and if so, how you got it working.

@mumoshu
Collaborator

mumoshu commented Nov 26, 2021

@antodoms Sharing /var/lib/docker is a big no-no! AFAIK docker isn't designed to work that way.

#452 (comment)

@antodoms

@mumoshu I was thinking more of mounting /var/lib/docker/overlay2 instead of the whole docker folder, since most of the image data is stored in that folder. What do you think of that?

@mumoshu
Collaborator

mumoshu commented Nov 26, 2021

@antodoms I'm not an expert, but I guess docker doesn't maintain an exclusive lock on the whole of /var/lib/docker, so whatever level of the directory you mount, multiple dockerd processes across pods and nodes would try to write to it, and that can break random files under the directory 🤔

@Puneeth-n
Contributor Author

@antodoms I do not recommend EFS for anything. I tried it years back, mounting an EFS volume across multiple Jenkins agents to share the same source code, and I had issues with file consistency across AZs.

@antodoms

Thank you @Puneeth-n and @mumoshu. So I have found this approach better: instead of scaling the replicas down to 0, I am using minReplicas: 2 so there are always some runners running, which helps with cross-build caching.

Multi-stage builds and buildkit definitely helped speed up a few steps in the build, but without image caching, each new runner that gets created still has to fetch fresh postgres and redis service images from Docker Hub, which eats up some time. Having a few runners always running makes sure those images stay cached inside those runners.

@Puneeth-n
Contributor Author

> Thank you @Puneeth-n and @mumoshu. So I have found this approach better: instead of scaling the replicas down to 0, I am using minReplicas: 2 so there are always some runners running, which helps with cross-build caching.
>
> Multi-stage builds and buildkit definitely helped speed up a few steps in the build, but without image caching, each new runner that gets created still has to fetch fresh postgres and redis service images from Docker Hub, which eats up some time. Having a few runners always running makes sure those images stay cached inside those runners.

@antodoms why not mount a local volume into the docker container and pin one runner per node? Or try volumeClaimTemplates and a RunnerSet?
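A hedged sketch of what the RunnerSet + volumeClaimTemplates approach might look like (repository, labels, storage class, and size are placeholders): each pod gets its own PVC, so no /var/lib/docker is ever shared between dockerd processes.

```yaml
# Sketch: RunnerSet with a per-pod PVC mounted at /var/lib/docker in the
# dind ("docker") sidecar. One volume per runner pod, never shared.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-runnerset
spec:
  replicas: 2
  repository: example-org/example-repo
  selector:
    matchLabels: {app: example-runnerset}
  serviceName: example-runnerset
  template:
    metadata:
      labels: {app: example-runnerset}
    spec:
      containers:
      - name: docker
        volumeMounts:
        - name: var-lib-docker
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
      storageClassName: local-path
```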

@prein
Contributor

prein commented Jul 7, 2023

@Puneeth-n @mumoshu @asoldino Wouldn't using subPathExpr address the concern about sharing /var/lib/docker between multiple pods scheduled on the same node?
Consider

      dockerEnv:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
[...]
      - mountPath: /var/lib/docker
        name: docker
        subPathExpr: $(POD_NAME)-docker

@mumoshu
Collaborator

mumoshu commented Jul 10, 2023

@prein Hey! I believe we had only two options so far: mount the host's /var/lib/docker onto the runner pod and ensure there's only one runner per node, or use an emptyDir/dynamic local volume. Neither solution shares /var/lib/docker across pods.

It remains best practice NOT to share it. I'd consider the use of subPathExpr in this context a variant of the latter option, because it gives each pod a unique /var/lib/docker volume, not a shared one.
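For completeness, a sketch of the full volume wiring the subPathExpr snippet above implies (the volume name and host path are assumptions for illustration): each pod writes to its own subdirectory of a shared host volume, so no two dockerd processes touch the same tree.

```yaml
# Sketch: per-pod subdirectory under a shared hostPath volume.
spec:
  template:
    spec:
      dockerEnv:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      dockerVolumeMounts:
      - name: docker
        mountPath: /var/lib/docker
        subPathExpr: $(POD_NAME)-docker
      volumes:
      - name: docker
        hostPath:
          path: /var/lib/docker-cache
          type: DirectoryOrCreate
```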
