Add retries for storing state and releasing locks #18741

brikis98 · 2018-08-26T22:46:59Z

Current Terraform Version

Terraform v0.11.7

Use-cases

If you're running Terraform and you briefly lose Internet connectivity, Terraform will:

Fail to write state to a remote backend (e.g., S3) and instead save a local copy to errored.tfstate.
Fail to release the lock in your remote backend (e.g., DynamoDB).

Attempted Solutions

You can't prevent the connectivity issues themselves, but when they happen, you have to fix things manually:

Find the folder where the issue happened and the errored.tfstate file.
Run terraform state push errored.tfstate.
Run terraform apply to get the error about the lock being unreleased and to get the lock ID.
Run terraform force-unlock <LOCK_ID>.

However, this solution has a number of problems:

It's tedious, confusing, and error-prone.
It's difficult or impossible to do in some cases (e.g., the issue happened on a CI server that cleans up its workspace).

Proposal

I propose adding a simple retry mechanism with exponential back-off. That is, if Terraform fails to write state to a remote backend, it retries after 1 second, 2 seconds, 4 seconds, etc., up to some reasonable (and configurable) max, such as 5 minutes. This way, at least for transient connectivity issues, Terraform can resolve the issue itself.
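
To make the proposal concrete, here is a rough sketch of the retry loop in Python. It is illustrative only and not how Terraform works today: the names backoff_delays and retry_with_backoff are hypothetical, the "max" is interpreted as a cap on the total time spent waiting (300 seconds), and only ConnectionError is treated as a transient failure.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   max_total: float = 300.0) -> Iterator[float]:
    """Yield delays of 1s, 2s, 4s, ... while the cumulative wait fits in max_total.

    Assumption: the configurable max is a budget on total waiting time,
    not a cap on each individual delay.
    """
    delay, total = initial, 0.0
    while total + delay <= max_total:
        yield delay
        total += delay
        delay *= factor


def retry_with_backoff(operation: Callable[[], T], max_total: float = 300.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying with exponential back-off on ConnectionError.

    operation stands in for "write state to the remote backend" or
    "release the lock"; both are hypothetical placeholders here.
    """
    try:
        return operation()
    except ConnectionError as exc:
        last_error = exc
    for delay in backoff_delays(max_total=max_total):
        sleep(delay)  # wait before the next attempt
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
    # The wait budget is exhausted: surface the most recent failure.
    raise last_error
```

With these defaults the waits are 1, 2, 4, ..., 128 seconds (255 seconds in total), after which the last error is raised and the existing errored.tfstate fallback would still apply.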

References

This issue is exacerbated by:

Various timeout, connectivity, and TLS handshake issues that crop up from time to time in Terraform. For example, see: Intermittent net/http: TLS handshake timeout error when downloading providers #16448; Terraform provider downloads fail with TLS handshake timeout #15817; Intermittent remote S3 state failure #10779.
Running apply in multiple modules concurrently using a tool such as Terragrunt.

wendtek · 2018-10-24T20:32:37Z

I think it would also be valuable to add retries for other API calls, including reading state. We use S3 remote state, and our Terraform deploy automation pulls quite a few values from remote states. At least a few times a week I see a job fail because it couldn't read a state from S3, where a retry would have succeeded.

jbardin added the enhancement label Aug 28, 2018

mgood mentioned this issue Nov 28, 2018

Stronger durability of remote state #19488

Open
