Reconciliations blocking unexpectedly #239
@toastwaffle - are you running with
@bobh66 yes we are, but if you look at the graphs above we're not CPU-bound. We currently have
I wonder if the large difference between
There may be a quirk to the way the locking works that is problematic. If I'm reading the code correctly:
This can result in a long-running RLock holder blocking the acquisition of new RLocks if there is also a Lock pending. The Lock request will wait for the long-running RLock holder to finish before it runs, and meanwhile it will block all other RLock requests, so we end up single-threaded until the long-running RLock thread is finished, at which time the Lock process runs and then unblocks all of the other RLock requests.

Practically this means that this happens any time a new Workspace is created, regardless of its content, because it needs to run

This is all conjecture for the moment but I think it sounds plausible. Figuring out how to solve it is the next problem.
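The queueing behaviour conjectured here matches Go's documented sync.RWMutex semantics: once a goroutine is blocked in Lock(), subsequent RLock() calls also block until the writer has run, to prevent writer starvation. A minimal, self-contained sketch (goroutine names and timings are illustrative, not the provider's actual code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demo simulates the scenario above: a long-running reader (a terraform
// apply holding RLock), a writer arriving mid-way (a new Workspace
// needing Lock), and a second reader that is then forced to wait even
// though the lock is only read-held.
func demo() []string {
	var mu sync.RWMutex
	var order []string
	var rec sync.Mutex // protects order
	log := func(s string) { rec.Lock(); order = append(order, s); rec.Unlock() }

	var wg sync.WaitGroup
	wg.Add(3)

	// Long-running reader: holds the RLock for 300ms.
	go func() {
		defer wg.Done()
		mu.RLock()
		log("reader1 acquired")
		time.Sleep(300 * time.Millisecond)
		mu.RUnlock()
	}()

	time.Sleep(50 * time.Millisecond)

	// Writer arrives while reader1 is still running; it blocks, and
	// from this point Go's RWMutex also blocks *new* readers.
	go func() {
		defer wg.Done()
		mu.Lock()
		log("writer acquired")
		mu.Unlock()
	}()

	time.Sleep(50 * time.Millisecond)

	// Second reader: would be fine alongside reader1, but queues
	// behind the pending writer, so concurrency drops to one.
	go func() {
		defer wg.Done()
		mu.RLock()
		log("reader2 acquired")
		mu.RUnlock()
	}()

	wg.Wait()
	return order
}

func main() {
	fmt.Println(demo())
}
```

Running this shows reader2 acquiring only after the writer, even though no writer ever held the lock while reader2 was waiting to start reading.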
Oh, yeah, that makes sense, and explains why in the example above a few things did get processed while the first create was running, until the second create entered the metaphorical "queue" and called Lock().

One potential workaround: can we avoid locking entirely if the plugin cache is disabled? Currently the lock is used regardless of whether the plugin cache is enabled
Just thinking out loud - do we need to be RLock'ing for
That's a possibility too - the only reason for locking is to prevent cache corruption issues, so if the cache is disabled there is no need to lock.
I can make a PR for that. Not locking for a plan operation would also help, but we'd still potentially underutilise concurrent reconciles. Challenging to fix without risking starvation though :/
Yes - new resource creation during long-running applies/deletes will still cause throttling but at least the other reconciliations could still run assuming no changes in the
A better overall solution may be to replace the in-memory storage of terraform workspaces with a persistent volume that can be sized as needed. That would allow sufficient storage space to run with caching disabled without using excessive memory, and it also allows for pod updates to not have to re-run terraform init on all Workspaces since the workspaces are persistent. I'm pretty sure we could do this today using a
It doesn't even need to be mounted on
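For reference, a rough sketch of what mounting a persistent volume over the workspace directory could look like with a Crossplane ControllerConfig. The PVC name, mount path, and whether ControllerConfig is the right mechanism for your Crossplane version are all assumptions to verify against your setup:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-terraform-pv
spec:
  volumes:
    - name: tf-workspaces
      persistentVolumeClaim:
        claimName: provider-terraform-workspaces  # hypothetical PVC name
  volumeMounts:
    - name: tf-workspaces
      mountPath: /tf  # the provider's workspace directory
```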
@bobh66 we intentionally avoid the lock on
Just to report back on this - I just upgraded our production cluster to 0.14.1 successfully, and discovered a small side effect of not locking for plans. As the provider was starting up and recreating all of the workspaces (hopefully for the last time as we added a persistent volume at the same time), we saw the same plugin cache conflict errors that we saw previously:
What I think is happening here is that while we do only have one

The backoff and retries resolved this eventually - on startup it took about 15 minutes to re-initialise our 150 workspaces (with
I saw the same behavior but I'm not sure it merits reinstating the lock for
Ack, the PV has definitely helped us. Looking at our graphs today with lots of Workspaces being created and destroyed, I'm confident that this is now fixed. Thank you Bob for figuring out the root cause and helping get the fix reviewed and released!
Ah sorry for all the issues with the locks. The terraform cache is really really picky. I feel like the execution model for the terraform plugin is just poor in general. A PVC would help. I think sharding would also help greatly if you have a lot of terraform workspaces.

@toastwaffle can you share your grafana graph json by chance? I would like to analyze our pipelines similarly.
We're using some recording rules and custom metrics, but the raw PromQL queries are:
I am facing the issue even with PV enabled:

Warning: Unable to open CLI configuration file

Error: Failed to install provider
Error while installing hashicorp/aws v5.44.0: open

Error: Failed to install provider
Error while installing hashicorp/kubernetes v2.30.0: open
@balu-ce I guess you might need to disable the TF provider cache completely (maybe by using some environment variable, or by putting a configuration value into the /.terraformrc).
After that it makes perfect sense that you don't get any of those "text file busy" errors.
@project-administrator Mounting a PV to /tf is already done. Disabling the plugin cache is fine, but then it downloads the providers every time, which affects the NAT cost. We have around 300+ workspaces. Also, this did not happen on version 0.7.
@balu-ce We're still running the TF version 0.12 because of this issue, and I thought the /tf PV mount and disabling the TF cache could mitigate the issue... But I have not considered the NAT costs and TF downloading the providers for each workspace every hour or so (depending on how many workspaces you have and your concurrency settings).

If you have an issue with the latest TF provider version, you might try asking here: #234

Also, it looks like the provider still does the "terraform init" on every reconciliation: #230

I'm also interested in getting it to work properly, but from your experience, it sounds like it's not working properly yet even with a PV mounted to /tf.
Note that the providers won't be re-downloaded for every reconcile - they're only downloaded when the workspace is created, or if the contents change to require new/different providers. When we refer to 'the plugin cache', that's about sharing providers between workspaces, but without that the providers are still saved locally inside the /tf/ directory (iirc the shared plugin cache works by hard-linking the providers into that directory). The reason I suggest using the PV when not using the plugin cache is so that the per-workspace saved providers are not re-downloaded if the pod restarts.
@toastwaffle I can see that inside the running provider-terraform pod it would regularly delete the workspace directory under the "/tf" directory, and then re-create it empty, populate it from the git repository, and after that run "terraform init". Do you mean in this case it does not download the providers while it's running "terraform init"?
That's very surprising - I've only ever used inline workspaces which don't get recreated, so it must be something about how the git-backed workspaces are managed. Probably worth creating a separate issue for that, as it seems very inefficient to clone the git repository from scratch for every reconcile
The provider does remove the current workspace and recreate it every time for remote repos: https://github.com/upbound/provider-terraform/blob/main/internal/controller/workspace/workspace.go#L233 It is using go-getter, which has issues when retrieving a git repo into an existing directory (hashicorp/go-getter#114), so the workaround is to remove the directory and re-pull the repo. Definitely not very efficient. It might be better to use go-git to clone the remote repo and then use
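The clone-once-then-pull idea could be sketched by shelling out to the git CLI instead of go-getter; `syncCmd` below is a hypothetical helper for illustration, not the provider's API:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// syncCmd returns the git command that would refresh dir from url:
// a fast-forward pull when dir already holds a clone, otherwise a
// fresh shallow clone. Keying off the presence of .git avoids
// deleting and re-fetching the whole repository on every reconcile.
func syncCmd(dir, url string) *exec.Cmd {
	if _, err := os.Stat(filepath.Join(dir, ".git")); err == nil {
		return exec.Command("git", "-C", dir, "pull", "--ff-only")
	}
	return exec.Command("git", "clone", "--depth=1", url, dir)
}

func main() {
	cmd := syncCmd("/tmp/example-ws", "https://example.com/repo.git")
	fmt.Println(cmd.Args)
}
```

A real implementation would also need to handle force-pushed branches and checkout corruption, e.g. by falling back to a fresh clone when the pull fails.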
OK, so if I'm getting it right, currently there is no reliable way to use the "--max-reconcile-rate" greater than "1" with the remote repo? If the plugin cache is enabled, then you get some "text file busy" randomly.
Well it's pretty reliably doing that, but yes - the current implementation of remote repos seems to limit the concurrency to 1. I wonder if we could use TF_DATA_DIR (https://developer.hashicorp.com/terraform/cli/config/environment-variables#tf_data_dir) to store the

I don't know if this is directly related to #230 but it seems like it might be similar. If we move the
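The TF_DATA_DIR idea could look roughly like this: keep the checked-out module source in the directory that gets deleted and re-pulled, but point TF_DATA_DIR at a stable per-workspace location so the .terraform directory (providers, modules) survives. Paths and the helper name are illustrative assumptions, not the provider's actual layout:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// runInit builds (but does not run) a terraform init command whose
// .terraform data lives in dataDir rather than inside workspaceDir,
// so recreating workspaceDir on each reconcile does not discard the
// already-installed providers.
func runInit(workspaceDir, dataDir string) *exec.Cmd {
	cmd := exec.Command("terraform", "init", "-input=false")
	cmd.Dir = workspaceDir
	cmd.Env = append(os.Environ(), "TF_DATA_DIR="+dataDir)
	return cmd
}

func main() {
	ws := filepath.Join(os.TempDir(), "ws-example")        // recreated every reconcile
	data := filepath.Join(os.TempDir(), "ws-example-data") // persists across reconciles
	cmd := runInit(ws, data)
	fmt.Println(cmd.Dir, cmd.Env[len(cmd.Env)-1])
}
```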
What happened?
Fairly regularly, we're seeing reconciliations back up behind long running terraform apply/destroy operations. We have ~180 workspaces of two types: those which provision GKE clusters (which take 10-15 minutes to apply/destroy), and those which provision DNS records (which take 10-20 seconds to apply/destroy). On average there are twice as many DNS workspaces as GKE workspaces.
I wrote a tool to parse the provider debug logs to illustrate this; the results are in this spreadsheet. The results show each individual reconciliation, ordered by the time at which they started/finished. For confidentiality reasons the workspace names have been replaced with hashes (the first 8 bytes of a sha256 hash).
There are multiple examples of the errant behaviour, but to pick one let's look at row 3469 of the 'Ordered by Start Time' sheet. The "did init" field on this row is "unknown" because there is no logging when a Workspace's checksum is blank.

What this looks like in the logs is:
This behaviour can be seen in the metrics - each time we have a GKE cluster which needs creating/deleting, the number of active workers starts to grow and the reconcile rate drops to 0. Once it finishes, we see a big spike in the reconcile rate as all the backed-up reconciles finish at once.
Hypotheses and debugging

RLock, which should allow for multiple concurrent operations

What environment did it happen in?