Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup slow due to fsGroup with large data volume #14

Closed
raxod502-plaid opened this issue Nov 23, 2020 · 15 comments
Closed

Startup slow due to fsGroup with large data volume #14

raxod502-plaid opened this issue Nov 23, 2020 · 15 comments

Comments

@raxod502-plaid
Copy link
Contributor

Cross-posting from runatlantis/atlantis#342 (comment) for visibility. The Atlantis chart uses the securityContext.fsGroup option to initially set filesystem permissions. I presume this is because persistent volumes are often initialized to have a filesystem owned by root. However, fsGroup causes Kubernetes to run a recursive chown on the entire volume at every startup. For a large data volume with many clones of large repositories, this can take upwards of 10 minutes. Since Atlantis runs as a singleton this equates to 10 extra minutes of downtime on every deploy.

In our private fork of Atlantis I have dealt with this problem by doing the recursive chown in the Docker entrypoint, if and only if the top-level directory is not owned by the Atlantis user. Kubernetes now provides this feature natively, but only in alpha: see fsGroupChangePolicy.

raxod502-plaid added a commit to raxod502-plaid/helm-charts that referenced this issue Dec 21, 2020
raxod502-plaid added a commit to raxod502-plaid/atlantis that referenced this issue Dec 21, 2020
spirosoik pushed a commit to spirosoik/helm-charts that referenced this issue Aug 3, 2021
…ent label (runatlantis#14)

* Updating cluster-overprovisioner label

* Bumb chart version

* Update README
jamengual pushed a commit to runatlantis/atlantis that referenced this issue May 16, 2022
jamengual pushed a commit that referenced this issue May 16, 2022
* Implement runWithUser/fsGroup within entrypoint

Addresses #14

* Bump chart version
@gmaghera
Copy link

This changes has reduced our startup duration from over 25 minutes to 60 seconds. Thank you for the implementation.

However, we've had to temporarily turn it back on for deploying a brand new test instance. Has that edge case been tested?

@raxod502-plaid
Copy link
Contributor Author

I thought I tested this when originally setting up the PR, but it was about a year and a half before it was merged, so it's possible something changed since then, or that I have a bad recollection.

@yuha0
Copy link

yuha0 commented Jun 21, 2022

fsGroupChangePolicy has been a beta feature since 1.20, so I think the merged change is not required anymore?

In fact this change broke our deployment because we enforce a non-root user for atlantis container, which cannot chown

@jamengual
Copy link
Contributor

are we saying that this should be reverted?

@gmaghera
Copy link

Hi @jamengual. I don't think we should revert it, I believe it works well now.

This change moved a values.yaml setting of the Helm chart from runatlantis/helm-charts into the entrypoint defined in runatlantis/atlantis. Apparently the releases across the two repositories weren't sufficiently coordinated. The Helm chart release predated the one of Atlantis core, which resulted in restarts of the Atlantis container immediately speeding up, but resulting in an error during the edge case of deploying a new version of the image.

By the time of releases https://github.com/runatlantis/helm-charts/releases/tag/atlantis-4.0.3 and https://github.com/runatlantis/atlantis/releases/tag/v0.19.4 the changes were properly coordinated. Now deploying a new image captures having to run chown and we incur the performance penalty, but subsequent restarts are lightning fast.

Helm chart version 4.0.0 introduced this change on May 16 but Atlantis core did not release it until June 6 or so. They should have been released together.

@jamengual
Copy link
Contributor

jamengual commented Jun 21, 2022 via email

@yuha0
Copy link

yuha0 commented Jun 21, 2022

The change in the helm chart removed default security context settings in values.yaml. However, the capability of changing securityContext field still exists. If a helm user decides to change that in values.yaml, they will have a broken deployment. I would argue that if running as root became a requirement, it should not be a templated field in the chart.

IMHO:

  • Originally, the change was introduced to address a slow startup problem. But now it is not a problem anymore(as long as the user tells k8s to not to chown, which can be a simple template change in the chart).
  • Atlantis was running fine as a non-root user before. If it does not have a slow startup problem anymore, why does it have to run as root, which increases attack surface?

@jamengual
Copy link
Contributor

jamengual commented Jun 21, 2022

After looking at the change

securityContext:
    fsGroup: 1000
    runAsUser: 100

which removes the RunAsUser it makes atlantis to run as root by default which is definitely a bad idea, we want Atlantis to be secure by default and I agree with @yuha0 on that point.

People that need to run it as root, can change the values on the chart since they are already exposed.

@gmaghera
Copy link

Hmm, I see it run as atlantis inside the pod, and when I deployed a new version of the chart and image I saw file ownership updates indicating the contrary. I saved a screenshot to share with my coworkers that we now have both sides of the implementation.

image

This is from inside the pod from Helm chart 4.0.3 using our own image, based on Atlantis 1.9.4:

bash-5.1# ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 {docker-entrypoi} /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh server --enable-policy-checks
    7 atlantis 34:39 atlantis server --enable-policy-checks
  659 root      0:00 sh -c clear; (bash || ash || sh)
  666 root      0:00 bash
 5674 root      0:00 bash
 5680 root      0:00 ps -ef
32567 root      0:00 /bin/bash

@jamengual
Copy link
Contributor

I just merged and updated the chart, did you use the new chart ?

@gmaghera
Copy link

That screenshot was taken Tuesday on June 14th using chart version 4.0.3.

ysoldak added a commit to ysoldak/runatlantis-helm-charts that referenced this issue Jun 22, 2022
Use fsGroupChangePolicy: "OnRootMismatch" (stable in k8s 1.23) to eliminate unnecessary chmod that can be slow for large volumes.
See runatlantis#14
See https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
lkysow pushed a commit that referenced this issue Jun 27, 2022
Use fsGroupChangePolicy: "OnRootMismatch" (stable in k8s 1.23) to eliminate unnecessary chmod that can be slow for large volumes.
See #14
See https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
@nitrocode
Copy link
Member

can this be closed after merging #156 and #158?

@gmaghera
Copy link

@nitrocode IIRC the improvement was rolled back for some reason.

@nitrocode
Copy link
Member

That was #156. Did you see #158?

@nitrocode
Copy link
Member

Closing due to merging of #158

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants