Error using shared Terraform providers with a large number of parallel project plans #2242
Comments
I can't see where the plugin-cache location is set (no env variable or configuration file) unless it's done by the Atlantis executable, but I do see that the Terraform docs state it's not concurrency safe (this may only apply to parallel |
is this still happening with |
Unsure @jamengual - we are still using the nightly build created by this PR: #2180 |
We will be working on bringing that feature back since it fixes a number of issues, still work in progress |
Hi, any update on this? |
Hi @tl-alex-nicot - we have not been using such a large number of parallel plans since encountering the issue during the upgrade. |
We are using Atlantis For us, it seems to happen as soon as two different PRs plan concurrently, no matter how big they are. We started to see this behavior a couple of weeks ago, it was all good before, so not sure what could be the trigger, to be honest. Is there any way of forcing Atlantis to use different binaries for terraform and the providers and not cache anything? |
I will try with an older version of terraform too like 1.3 and check the changelog because they changed the behavior of the internal cache in 1.4 |
Hi @jamengual |
What's the resolution here? I am having this problem, w/o running any parallel tasks (afaik):
I see the link to #3201, but I'm committing my lock file, so I'm not sure that's relevant in this case. |
Hello. The way to fix this is to commit the lock file to your source control tool if you can. For example, we do this for ours:
    terraform providers lock \
      -platform=linux_arm64 \
      -platform=linux_amd64 \
      -platform=darwin_amd64 \
      -platform=darwin_arm64
If all is good, on the first run you may see log lines like this:
    Initializing provider plugins...
    - Finding hashicorp/vault versions matching "3.14.0"...
    - Finding jfrog/artifactory versions matching "7.5.0"...
    - Finding hashicorp/random versions matching "3.5.1"...
    - Installing hashicorp/vault v3.14.0...
    - Installed hashicorp/vault v3.14.0 (signed by HashiCorp)
    - Installing jfrog/artifactory v7.5.0...
    - Installed jfrog/artifactory v7.5.0 (signed by a HashiCorp partner, key ID 6B219DCCD7639232)
    - Installing hashicorp/random v3.5.1...
    - Installed hashicorp/random v3.5.1 (signed by HashiCorp)
And after that, all the following runs should look like this:
    Initializing provider plugins...
    - Finding hashicorp/vault versions matching "3.14.0"...
    - Finding jfrog/artifactory versions matching "7.5.0"...
    - Finding hashicorp/random versions matching "3.5.1"...
    - Using previously-installed hashicorp/vault v3.14.0
    - Using previously-installed jfrog/artifactory v7.5.0
    - Using previously-installed hashicorp/random v3.5.1
note the More info about the lock file can be found here |
Thanks! I'll give that a try. |
Is there any way to workaround the issue without going through this? |
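Even when committing lock files, this is still an issue for us when setting parallel_plan: true in our atlantis.yaml files for repositories with a large number of configurations to plan. The workaround I've used is to modify the Atlantis workflow with some custom commands to "trick" Terraform into pre-populating the plugin cache directory, by running terraform get and terraform providers mirror .terraform.d/plugins as run steps before init.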
This pretty much breaks plugin caching as atlantis/terraform will download the plugins for every configuration/state that is being planned. For us it is a worthwhile trade off to deal with added bandwidth costs over constantly re-planning individual configurations and dealing with lost developer productivity that the re-planning causes. |
What if you deploy your own registry internally and point TF to that URL?
In that case the providers will be downloaded from there instead.
See https://github.com/outsideris/citizen, for example.
…On Wed, Aug 30, 2023 at 2:02 PM Jason Reslock ***@***.***> wrote:
Even when committing lock files this is still an issue for us when setting parallel_plan:
true in our atlantis.yaml files for repositories with a large number of
configurations to plan. The workaround I've used is to modify your atlantis
workflow with some custom commands to "trick" terraform into pre-populating
the plugin cache directory.
    workflows:
      default:
        plan:
          steps:
            - run: terraform get # pre-fetch the modules or the command below will fail
            - run: terraform providers mirror .terraform.d/plugins # get all of the providers and put them in the current working dir
            - init
            - plan
        apply:
          steps:
            - apply
This pretty much breaks plugin caching as atlantis/terraform will download
the plugins for every configuration/state that is being planned. For us it
is a worthwhile trade off to deal with added bandwidth costs over
constantly re-planning individual configurations and dealing with lost
developer productivity that the re-planning causes.
|
This isn't necessarily a downloading problem. It looks like a concurrent filesystem access problem. Not pulling from hashicorp's registry may speed up the downloads but it does not stop N terraform procs from trying to read the same file from disk all at once. From reading terraform docs it is clearly stated that the plugin cache dir is not concurrency safe. That is the real root of the issue here. It isn't an atlantis bug...however since atlantis can run many |
You can see here that even CDKTF had this problem and they changed their implementation to run init serially to work around it: All the way at the bottom of the stack we see this issue and I believe that is where all of this starts/started: |
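The serial-init idea mentioned above can be sketched in Python. This is a minimal sketch, not something Atlantis or Terraform provides: the helper names `exclusive_lock` and `run_serialized` and the lock path `/tmp/terraform-init.lock` are hypothetical, and it assumes a Unix host where every init process goes through the same wrapper. It takes an advisory `flock`, so only one wrapped command runs against the shared plugin cache at a time.

```python
import fcntl
import subprocess
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    # Append mode creates the lock file if needed without truncating it.
    with open(lock_path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_serialized(command, cwd, lock_path="/tmp/terraform-init.lock"):
    """Run command in cwd while holding the lock; return its exit code.

    The default lock path is a hypothetical choice for illustration.
    """
    with exclusive_lock(lock_path):
        return subprocess.run(command, cwd=cwd, check=False).returncode
```

For example, `run_serialized(["terraform", "init", "-input=false"], cwd=project_dir)` would make concurrent inits queue up behind each other, while the plans themselves could still run in parallel afterwards. Whether serializing init alone is enough to avoid the error in a given setup is not verified here.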
A workaround for those who do not wish to use lock files is to set the
    - env:
        name: TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE
        value: "true"
This allows For me, this at least solved the case where the issue was |
This worked for me. Thanks @matthiasr |
Community Note
Overview of the Issue
This is an edge case error.
When planning a large number of projects in parallel (in our case 108), the following error appears in several plan outputs:
Error: fork/exec .terraform/providers/registry.terraform.io/hashicorp/aws/3.75.1/linux_amd64/terraform-provider-aws_v3.75.1_x5: text file busy
Looking into it further, it seems that a single provider is shared across all plans running in parallel, linked in each directory.
I believe (but don't know) that this is because all of the plans are attempting to use the same AWS provider and it can possibly only be accessed by so many at once?
This is obviously sensible to avoid re-downloading the same provider n times per pull request.
This was only noticed once the lock formula was updated to include a hash of the path under workspace names to allow for parallel plans for different states using the same workspace name #2180
The workaround is to split the change so that each PR plans a smaller number of projects (it worked with 90).
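To check the observation above that one provider is linked into each project directory, a short Python sketch can list the link targets that several projects share. It assumes the `.terraform/providers/<host>/<namespace>/<type>/<version>/<platform>` layout seen in the error message and that the shared copy is linked in via symlinks; the function name `shared_provider_targets` is made up for illustration.

```python
import os
from collections import defaultdict
from pathlib import Path


def shared_provider_targets(project_dirs):
    """Return {resolved_target: [projects]} for provider symlinks used by 2+ projects."""
    users = defaultdict(list)
    for project in project_dirs:
        providers = Path(project) / ".terraform" / "providers"
        if not providers.is_dir():
            continue  # project has not been initialized
        # Assumed layout: <host>/<namespace>/<type>/<version>/<platform>
        for entry in providers.glob("*/*/*/*/*"):
            if entry.is_symlink():
                users[os.path.realpath(entry)].append(str(project))
    # Keep only targets that more than one project links to.
    return {target: sorted(ps) for target, ps in users.items() if len(ps) > 1}
```

Calling it with the list of planned project directories would show whether all of them resolve to the same provider directory, which is the situation described in this issue.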
Reproduction Steps
Additional Context
Occurred during the migration of Atlantis from ECS to EKS, when all of the states needed to be updated.