Startup slow due to fsGroup with large data volume #14

raxod502-plaid · 2020-11-23T17:26:46Z

Cross-posting from runatlantis/atlantis#342 (comment) for visibility. The Atlantis chart uses the securityContext.fsGroup option to initially set filesystem permissions. I presume this is because persistent volumes are often initialized to have a filesystem owned by root. However, fsGroup causes Kubernetes to run a recursive chown on the entire volume at every startup. For a large data volume with many clones of large repositories, this can take upwards of 10 minutes. Since Atlantis runs as a singleton this equates to 10 extra minutes of downtime on every deploy.

In our private fork of Atlantis I have dealt with this problem by doing the recursive chown in the Docker entrypoint, if and only if the top-level directory is not owned by the Atlantis user. Kubernetes now provides this feature natively, but only in alpha: see fsGroupChangePolicy.

The text was updated successfully, but these errors were encountered:

Addresses runatlantis#14

Addresses runatlantis/helm-charts#14

…ent label (runatlantis#14) * Updating cluster-overprovisioner label * Bumb chart version * Update README

Addresses runatlantis/helm-charts#14

* Implement runWithUser/fsGroup within entrypoint Addresses #14 * Bump chart version

gmaghera · 2022-05-26T16:47:02Z

This changes has reduced our startup duration from over 25 minutes to 60 seconds. Thank you for the implementation.

However, we've had to temporarily turn it back on for deploying a brand new test instance. Has that edge case been tested?

raxod502-plaid · 2022-05-26T16:58:26Z

I thought I tested this when originally setting up the PR, but it was about a year and a half before it was merged, so it's possible something changed since then, or that I have a bad recollection.

yuha0 · 2022-06-21T17:17:03Z

fsGroupChangePolicy has been a beta feature since 1.20, so I think the merged change is not required anymore?

In fact this change broke our deployment because we enforce a non-root user for atlantis container, which cannot chown

jamengual · 2022-06-21T18:38:18Z

are we saying that this should be reverted?

gmaghera · 2022-06-21T19:34:37Z

Hi @jamengual. I don't think we should revert it, I believe it works well now.

This change moved a values.yaml setting of the Helm chart from runatlantis/helm-charts into the entrypoint defined in runatlantis/atlantis. Apparently the releases across the two repositories weren't sufficiently coordinated. The Helm chart release predated the one of Atlantis core, which resulted in restarts of the Atlantis container immediately speeding up, but resulting in an error during the edge case of deploying a new version of the image.

By the time of releases https://github.com/runatlantis/helm-charts/releases/tag/atlantis-4.0.3 and https://github.com/runatlantis/atlantis/releases/tag/v0.19.4 the changes were properly coordinated. Now deploying a new image captures having to run chown and we incur the performance penalty, but subsequent restarts are lightning fast.

Helm chart version 4.0.0 introduced this change on May 16 but Atlantis core did not release it until June 6 or so. They should have been released together.

jamengual · 2022-06-21T19:39:55Z

Yes, there were some issues on the releases, now we have a new process and we will continue to improve it.

…

On Tue, Jun 21, 2022 at 12:34 PM Gabor Maghera ***@***.***> wrote: Hi @jamengual <https://github.com/jamengual>. I don't think we should revert it, I believe it works well now. This change moved a values.yaml setting of the Helm chart from runatlantis/helm-charts into the entrypoint defined in runatlantis/atlantis. Apparently the releases across the two repositories weren't sufficiently coordinated. The Helm chart release predated the one of Atlantis core, which resulted in restarts of the Atlantis container immediately speeding up, but resulting in an error during the edge case of deploying a new version of the image. By the time of releases https://github.com/runatlantis/helm-charts/releases/tag/atlantis-4.0.3 and https://github.com/runatlantis/atlantis/releases/tag/v0.19.4 the changes were properly coordinated. Now deploying a new image captures having to run chown and we incur the performance penalty, but subsequent restarts are lightning fast. Helm chart version 4.0.0 introduced this change on May 16 but Atlantis core did not release it until June 6 or so. They should have been released together. — Reply to this email directly, view it on GitHub <#14 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQ3ERDNYIIVNW3GFZWZ2STVQIKNRANCNFSM4T7ZAYAA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

yuha0 · 2022-06-21T20:10:45Z

The change in the helm chart removed default security context settings in values.yaml. However, the capability of changing securityContext field still exists. If a helm user decides to change that in values.yaml, they will have a broken deployment. I would argue that if running as root became a requirement, it should not be a templated field in the chart.

IMHO:

Originally, the change was introduced to address a slow startup problem. But now it is not a problem anymore(as long as the user tells k8s to not to chown, which can be a simple template change in the chart).
Atlantis was running fine as a non-root user before. If it does not have a slow startup problem anymore, why does it have to run as root, which increases attack surface?

jamengual · 2022-06-21T20:26:15Z

After looking at the change

securityContext:
    fsGroup: 1000
    runAsUser: 100

which removes the RunAsUser it makes atlantis to run as root by default which is definitely a bad idea, we want Atlantis to be secure by default and I agree with @yuha0 on that point.

People that need to run it as root, can change the values on the chart since they are already exposed.

gmaghera · 2022-06-21T21:19:32Z

Hmm, I see it run as atlantis inside the pod, and when I deployed a new version of the chart and image I saw file ownership updates indicating the contrary. I saved a screenshot to share with my coworkers that we now have both sides of the implementation.

This is from inside the pod from Helm chart 4.0.3 using our own image, based on Atlantis 1.9.4:

bash-5.1# ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 {docker-entrypoi} /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh server --enable-policy-checks
    7 atlantis 34:39 atlantis server --enable-policy-checks
  659 root      0:00 sh -c clear; (bash || ash || sh)
  666 root      0:00 bash
 5674 root      0:00 bash
 5680 root      0:00 ps -ef
32567 root      0:00 /bin/bash

jamengual · 2022-06-21T21:49:21Z

I just merged and updated the chart, did you use the new chart ?

gmaghera · 2022-06-21T22:09:03Z

That screenshot was taken Tuesday on June 14th using chart version 4.0.3.

Use fsGroupChangePolicy: "OnRootMismatch" (stable in k8s 1.23) to eliminate unnecessary chmod that can be slow for large volumes. See runatlantis#14 See https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods

Use fsGroupChangePolicy: "OnRootMismatch" (stable in k8s 1.23) to eliminate unnecessary chmod that can be slow for large volumes. See #14 See https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods

Addresses runatlantis/helm-charts#14

nitrocode · 2022-12-29T01:28:14Z

can this be closed after merging #156 and #158?

gmaghera · 2022-12-29T01:38:25Z

@nitrocode IIRC the improvement was rolled back for some reason.

nitrocode · 2022-12-29T01:41:54Z

That was #156. Did you see #158?

nitrocode · 2022-12-30T16:06:46Z

Closing due to merging of #158

raxod502-plaid added a commit to raxod502-plaid/helm-charts that referenced this issue Dec 21, 2020

Implement runWithUser/fsGroup within entrypoint

73d15f5

Addresses runatlantis#14

raxod502-plaid added a commit to raxod502-plaid/atlantis that referenced this issue Dec 21, 2020

Implement runWithUser/fsGroup within entrypoint

cd936ce

Addresses runatlantis/helm-charts#14

This was referenced Dec 21, 2020

Implement runWithUser/fsGroup within entrypoint runatlantis/atlantis#1326

Merged

Implement runWithUser/fsGroup within entrypoint #21

Merged

spirosoik pushed a commit to spirosoik/helm-charts that referenced this issue Aug 3, 2021

[stable/cluster-overprovisioner] Upde cluster-overprovisioner deploym…

2343771

…ent label (runatlantis#14) * Updating cluster-overprovisioner label * Bumb chart version * Update README

jamengual pushed a commit to runatlantis/atlantis that referenced this issue May 16, 2022

Implement runWithUser/fsGroup within entrypoint (#1326)

efdb3ff

Addresses runatlantis/helm-charts#14

jamengual pushed a commit that referenced this issue May 16, 2022

Implement runWithUser/fsGroup within entrypoint (#21)

a107789

* Implement runWithUser/fsGroup within entrypoint Addresses #14 * Bump chart version

jamengual mentioned this issue Jun 21, 2022

Revert Security Context Change #156

Merged

ysoldak mentioned this issue Jun 22, 2022

Fix slow startup by doing chmod once #158

Merged

krrrr38 pushed a commit to krrrr38/atlantis that referenced this issue Dec 16, 2022

Implement runWithUser/fsGroup within entrypoint (runatlantis#1326)

5570cae

Addresses runatlantis/helm-charts#14

nitrocode mentioned this issue Dec 29, 2022

Add company Logos and success histories to the Atlantis page runatlantis/atlantis#2626

Open

nitrocode closed this as completed Dec 30, 2022

UtpalJayNadiger mentioned this issue Aug 16, 2023

Atlantis raxod502/intuitive-explanations#26

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Startup slow due to fsGroup with large data volume #14

Startup slow due to fsGroup with large data volume #14

raxod502-plaid commented Nov 23, 2020

gmaghera commented May 26, 2022

raxod502-plaid commented May 26, 2022

yuha0 commented Jun 21, 2022

jamengual commented Jun 21, 2022

gmaghera commented Jun 21, 2022

jamengual commented Jun 21, 2022 via email

yuha0 commented Jun 21, 2022

jamengual commented Jun 21, 2022 •

edited

Loading

gmaghera commented Jun 21, 2022

jamengual commented Jun 21, 2022

gmaghera commented Jun 21, 2022

nitrocode commented Dec 29, 2022

gmaghera commented Dec 29, 2022

nitrocode commented Dec 29, 2022

nitrocode commented Dec 30, 2022

Startup slow due to fsGroup with large data volume #14

Startup slow due to fsGroup with large data volume #14

Comments

raxod502-plaid commented Nov 23, 2020

gmaghera commented May 26, 2022

raxod502-plaid commented May 26, 2022

yuha0 commented Jun 21, 2022

jamengual commented Jun 21, 2022

gmaghera commented Jun 21, 2022

jamengual commented Jun 21, 2022 via email

yuha0 commented Jun 21, 2022

jamengual commented Jun 21, 2022 • edited Loading

gmaghera commented Jun 21, 2022

jamengual commented Jun 21, 2022

gmaghera commented Jun 21, 2022

nitrocode commented Dec 29, 2022

gmaghera commented Dec 29, 2022

nitrocode commented Dec 29, 2022

nitrocode commented Dec 30, 2022

jamengual commented Jun 21, 2022 •

edited

Loading