
Error using shared Terraform providers with a large number of parallel project plans #2242

Open
snorlaX-sleeps opened this issue May 3, 2022 · 20 comments
Labels: bug, work-in-progress

Comments

@snorlaX-sleeps
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

This is an edge case error.
When planning a large number of projects in parallel (in our case 108), the following error appears in several plan outputs:
Error: fork/exec .terraform/providers/registry.terraform.io/hashicorp/aws/3.75.1/linux_amd64/terraform-provider-aws_v3.75.1_x5: text file busy

Looking into it further, it seems that a single provider binary is shared across all plans running in parallel, linked into each project directory.
I believe (but am not certain) that this happens because all of the plans attempt to use the same AWS provider binary, and it can only be accessed by so many processes at once.
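One way to check whether that is what's happening is to see where the provider binary in a project's .terraform/providers tree actually points; if it resolves into a shared plugin cache directory, the binary is linked rather than copied. A minimal sketch (paths are illustrative, taken from the error above, and assume shell access to the Atlantis host):

ls -l .terraform/providers/registry.terraform.io/hashicorp/aws/3.75.1/linux_amd64
# resolve symlinks/hard links to see whether the binary lives in a shared cache directory
readlink -f .terraform/providers/registry.terraform.io/hashicorp/aws/3.75.1/linux_amd64/terraform-provider-aws_v3.75.1_x5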

Sharing is obviously sensible, since it avoids re-downloading the same provider n times per pull request.
This was only noticed once the lock formula was updated to include a hash of the path alongside the workspace name, allowing parallel plans for different states that use the same workspace name (#2180).

The current workaround is to break the change into a smaller number of projects per PR (90 worked for us).

Reproduction Steps

  • Have an Atlantis repo with a large number of projects
  • Run a parallel plan for all projects at the same time
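The same race can be sketched outside Atlantis by pointing several concurrent terraform init runs at one shared plugin cache (an illustration only: proj-1 … proj-N are hypothetical copies of a Terraform configuration, and the cache path should match whatever your Atlantis host actually uses):

export TF_PLUGIN_CACHE_DIR="$HOME/.atlantis/plugin-cache"    # shared cache directory
for dir in proj-*; do
  (cd "$dir" && terraform init -input=false) &               # many inits racing on the same cache
done
wait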

Additional Context

This occurred during the migration of Atlantis from ECS to EKS, when all of the states needed to be updated.

snorlaX-sleeps added the bug label on May 3, 2022
@snorlaX-sleeps
Contributor Author

I can't see where the plugin-cache location is set (no env variable or configuration file), unless it's done by the Atlantis executable. I do see that the Terraform docs state the plugin cache is not concurrency safe (this may only apply to parallel init rather than parallel plan).
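For reference, the plugin cache can be configured either through the TF_PLUGIN_CACHE_DIR environment variable or through plugin_cache_dir in the Terraform CLI config file, so a quick way to check both on the Atlantis host (a rough sketch assuming shell access to the container) is:

env | grep TF_PLUGIN_CACHE_DIR                      # set in the process environment?
grep plugin_cache_dir ~/.terraformrc 2>/dev/null    # set in the default CLI config file?
echo "${TF_CLI_CONFIG_FILE:-unset}"                 # a custom CLI config file would also be a place to look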

@jamengual
Contributor

is this still happening with v0.19.8?

jamengual added the waiting-on-response label on Aug 26, 2022
@snorlaX-sleeps
Contributor Author

Unsure, @jamengual - we are still using the nightly build created by this PR: #2180, so we haven't picked up any new features.
Was there a change to how parallel plans interact with a single provider?

@jamengual
Contributor

We will be working on bringing that feature back since it fixes a number of issues; it is still a work in progress.

jamengual added the work-in-progress label and removed the waiting-on-response label on Aug 26, 2022
@tl-alex-nicot

Hi, any update on this?

@snorlaX-sleeps
Contributor Author

Hi @tl-alex-nicot - we have not been using such a large number of parallel plans since encountering the issue during the upgrade.
I will soon be doing another upgrade (to v0.22) and will see whether the issue still occurs.

@AlessioCasco

AlessioCasco commented Mar 29, 2023

We are using Atlantis v0.23.2 (commit: 5774453) (build date: 2023-03-03T23:50:16Z) along with terraform v1.4.2 and still see the issue.

For us, it seems to happen as soon as two different PRs plan concurrently, no matter how big they are.
Tomorrow we'll try version v0.23.3 of Atlantis and let you know, even though it seems that the release doesn't tackle any of these issues.

We started to see this behavior a couple of weeks ago; it was all good before, so we're not sure what the trigger could be, to be honest.

Is there any way of forcing Atlantis to use different binaries for terraform and the providers and not cache anything?
We have around 30-40 projects, and only being able to plan one at a time is a very big problem.

@jamengual
Contributor

I will also try with an older version of Terraform, like 1.3, and check the changelog, because they changed the behavior of the internal cache in 1.4.

@jamengual
Contributor

#3201

@AlessioCasco

Hi @jamengual
Interesting, I missed that; I'm quite sure that's the issue.

@morganhein

What's the resolution here? I am having this problem without running any parallel tasks (as far as I know):

running "/usr/local/bin/terraform init -input=false" in "/home/atlantis/.atlantis/repos/telegraphio/infra-as-code/14/default/terraform/accounts/sandbox/initCore": exit status 1

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing modules...
- core_bastion in ../../../modules/core-bastion
Downloading git::ssh://[email protected]/Hapag-Lloyd/terraform-aws-bastion-host-ssm.git?ref=2.4.0 for core_bastion.bastion_host...
- core_bastion.bastion_host in .terraform/modules/core_bastion.bastion_host
Downloading registry.terraform.io/terraform-aws-modules/iam/aws 5.11.2 for core_bastion.bastion_host.instance_profile_role...
- core_bastion.bastion_host.instance_profile_role in .terraform/modules/core_bastion.bastion_host.instance_profile_role/modules/iam-assumable-role
Downloading registry.terraform.io/terraform-aws-modules/iam/aws 5.11.2 for core_bastion.bastion_user...
- core_bastion.bastion_user in .terraform/modules/core_bastion.bastion_user/modules/iam-user
- core_network in ../../../modules/core-network
- core_network.vpc_backbone in ../../../modules/aws-vpc
- core_rds in ../../../modules/core-rds
- core_rds.rds in ../../../modules/aws-rds
- core_s3 in ../../../modules/core

Initializing provider plugins...
- Finding hashicorp/archive versions matching ">= 2.0.0"...
- Reusing previous version of hashicorp/aws from the dependency lock file
- Reusing previous version of cloudflare/cloudflare from the dependency lock file
- Finding hashicorp/random versions matching ">= 3.4.3"...
- Installing hashicorp/archive v2.3.0...
- Installed hashicorp/archive v2.3.0 (signed by HashiCorp)
- Installing hashicorp/aws v4.60.0...
- Installing cloudflare/cloudflare v4.2.0...
- Installed cloudflare/cloudflare v4.2.0 (self-signed, key ID C76001609EE3B136)
- Installing hashicorp/random v3.5.1...
- Installed hashicorp/random v3.5.1 (signed by HashiCorp)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html
╷
│ Error: Failed to install provider
│ 
│ Error while installing hashicorp/aws v4.60.0: open
│ /home/atlantis/.atlantis/plugin-cache/registry.terraform.io/hashicorp/aws/4.60.0/linux_amd64/terraform-provider-aws_v4.60.0_x5:
│ text file busy
╵

I see the link to #3201, but I'm committing my lock file, so I'm not sure if that's relevant in this case?

@AlessioCasco

AlessioCasco commented Apr 17, 2023

Hello

The way to fix this is to commit the lock file in your source control tool if you can.
Since you will probably build the .terraform.lock.hcl from a different machine and not from the one that runs Atlantis, make sure to populate the lock file with all the architectures you use in your infrastructure.

For example, this is what we run for ours:

terraform providers lock \
  -platform=linux_arm64 \
  -platform=linux_amd64 \
  -platform=darwin_amd64 \
  -platform=darwin_arm64

If all is good, on the first run you may see log lines like this:

Initializing provider plugins...
- Finding hashicorp/vault versions matching "3.14.0"...
- Finding jfrog/artifactory versions matching "7.5.0"...
- Finding hashicorp/random versions matching "3.5.1"...
- Installing hashicorp/vault v3.14.0...
- Installed hashicorp/vault v3.14.0 (signed by HashiCorp)
- Installing jfrog/artifactory v7.5.0...
- Installed jfrog/artifactory v7.5.0 (signed by a HashiCorp partner, key ID 6B219DCCD7639232)
- Installing hashicorp/random v3.5.1...
- Installed hashicorp/random v3.5.1 (signed by HashiCorp)

And after that, all the following ones should look like this:

Initializing provider plugins...
- Finding hashicorp/vault versions matching "3.14.0"...
- Finding jfrog/artifactory versions matching "7.5.0"...
- Finding hashicorp/random versions matching "3.5.1"...
- Using previously-installed hashicorp/vault v3.14.0
- Using previously-installed jfrog/artifactory v7.5.0
- Using previously-installed hashicorp/random v3.5.1

Note the "Using previously-installed".

More info about the lock file can be found here

@morganhein

Thanks! I'll give that a try.

@gfoligna-nyshex

(quoting @AlessioCasco's lock-file workaround above)

Is there any way to work around the issue without going through this?

@jreslock

Even when committing lock files this is still an issue for us when setting parallel_plan: true in our atlantis.yaml files for repositories with a large number of configurations to plan. The workaround I've used is to modify your atlantis workflow with some custom commands to "trick" terraform into pre-populating the plugin cache directory.

workflows:
  default:
    plan:
      steps:
        - run: terraform get # pre-fetch the modules or the command below will fail
        - run: terraform providers mirror .terraform.d/plugins # get all of the providers and put them in the current working dir
        - init
        - plan
    apply:
      steps:
        - apply

This pretty much breaks plugin caching, as Atlantis/Terraform will download the plugins for every configuration/state being planned. For us it is a worthwhile trade-off: we accept the added bandwidth cost rather than constantly re-planning individual configurations and dealing with the lost developer productivity that the re-planning causes.

@jamengual
Contributor

jamengual commented Aug 30, 2023 via email

What if you deploy your own registry internally and point TF to that URL? In that case the providers would be downloaded from there instead. https://github.com/outsideris/citizen for example.

@jreslock

This isn't necessarily a downloading problem; it looks like a concurrent filesystem access problem. Not pulling from HashiCorp's registry may speed up the downloads, but it does not stop N Terraform processes from trying to read the same file from disk all at once. The Terraform docs clearly state that the plugin cache dir is not concurrency safe, and that is the real root of the issue here. It isn't an Atlantis bug; however, since Atlantis can run many init operations simultaneously, it is prone to triggering this behavior in Terraform.

@jreslock

You can see here that even CDKTF had this problem and they changed their implementation to run init serially to work around it:
hashicorp/terraform-cdk#2741

All the way at the bottom of the stack we see this issue and I believe that is where all of this starts/started:
hashicorp/terraform#31964
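One way to apply the same serialization idea to Atlantis (a sketch, not a built-in feature) would be to replace the built-in init step in a custom workflow with a run step that takes a host-wide lock, so only one terraform init runs at a time; the lock path below is arbitrary and flock must exist in the Atlantis image:

flock /tmp/atlantis-terraform-init.lock terraform init -input=false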

@matthiasr

A workaround for those who do not wish to use lock files is to set the plugin_cache_may_break_dependency_lock_file option, for example by setting the TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=true environment variable for all of Atlantis, or as a workflow step like

        - env:
            name: TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE
            value: "true"

This allows terraform init to re-use the provider from the cache despite the lack of a lock file, meaning it does not try to re-download and overwrite it.
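If you would rather set it once for the whole server instead of per workflow, a minimal sketch (assuming you control the environment the atlantis server process starts with) is:

export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=true   # inherited by every terraform run Atlantis spawns
atlantis server   # plus your existing flags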

For me, this at least solved the "open …: text file busy" case. I have not encountered the "fork/exec …: text file busy" error, but I suspect it happens when the race goes the other way: one init is writing to the file while another tries to execute it. That may still happen with this setting, but it should only happen once per provider version; after a provider has been cached once, any number of runs can use it in parallel.

@albertollamaso
Contributor

(quoting the custom-workflow workaround above)

This worked for me. Thanks @matthiasr
