google_container_cluster & UPGRADE_MASTER operation #1643
Comments
Terraform itself isn't triggering the operation, sadly. I'm not sure why it immediately tries to upgrade the master. However, we can probably add some retries in our code so we don't fail on a failedPrecondition. Do you have debug logs by any chance?
Hi @danawillow, thanks for the speedy reply. I have run Terraform with debug enabled to capture the logs at the moment it fails: You can see it creates a node pool successfully and attempts to create the next node pool, but fails because the cluster has the UPGRADE_MASTER operation running on it. Here are the operations on the cluster from this TF apply:
Worth adding: I can re-run my pipeline to generate and execute a new plan that completes the cluster build once the false master upgrade operation has completed; however, this breaks our pipeline workflow since manual intervention is required. Many thanks,
Just sent #1660. I also asked around, and the answer I got was that it might not be upgrading the master version, but resizing the master instead.
Hi @danawillow - I have built the provider from your branch and have been testing repeatedly; however, I have yet to trigger the scenario in which the upgrade operation occurs halfway through the Terraform apply. In all my attempts to replicate it, the upgrade operation does not trigger until after Terraform has completed its apply. It's a bit odd actually: when I was having the issue, the upgrade master operation would occur seconds after one of the node pools completed, interrupting Terraform halfway through its apply. For example:
However, looking at the timing of the upgrade operation now, there seems to be a ~20 second idle period between the completion of the last operation and the start of the automated cluster operations. For example:
I might be barking up the wrong tree, but I have a feeling there could have been a change on the GKE side that holds off automated operations until there has been an idle period, so that automation like Terraform is not interrupted. I will keep testing this next week and see if I can replicate the issue again. Thanks again for your time helping me with this. Cheers,
Just my luck, after failing to replicate the issue while using the patched provider in #1660 and then posting the last update... I pushed my pipeline to the test environment in a different project and replicated the issue immediately. The Terraform error:
Cluster operations:
Cheers,
This repro was with #1660, right? Do you know how long it takes for the UPGRADE_MASTER operation to finish? I have it retrying for 10 minutes but maybe it needs more time.
Yes, I'm using a build of the provider from #1660. The longest upgrade I have seen took 13 minutes, so it might be worth setting this to at least 15 minutes? Thanks,
Better idea: it now just uses whatever timeout you set as your create timeout (the default is 30 minutes). Give that a try if you have the chance!
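In config, that's the standard timeouts block on the resource. A minimal sketch, assuming the node pool's create timeout is the one that applies here (the names and values are illustrative):

```hcl
resource "google_container_node_pool" "pool_1" {
  name       = "pool-1"                                   # hypothetical name
  zone       = "europe-west1-b"                           # hypothetical zone
  cluster    = "${google_container_cluster.primary.name}"
  node_count = 3

  # Give node pool creation longer to retry through a long-running
  # cluster operation (e.g. UPGRADE_MASTER) before giving up.
  # The default create timeout is 30 minutes.
  timeouts {
    create = "45m"
  }
}
```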
Thank you! I've just built the new version and committed it into my TF codebase. Will be re-launching a number of GKE clusters tomorrow, so I will do a good amount of testing and report back.
Hey, I've been trying to replicate the failure scenario over the last few days, in which two operations run against the cluster at the same time, but I haven't been able to trigger it again.
I don't see how your latest change could fail to fix the multiple-operation scenario; all that may be needed is a higher create timeout set by the user when adding a large number of node pools. I'm currently creating and destroying GKE clusters fairly frequently via the pipeline, so I will update once I've seen it fix the double-operation failure I previously hit. Thanks again for all your help @danawillow 👍 Cheers,
Hypothetically fixes #1643. @thomasriley, are you able to patch this change into your provider to see if it fixes the problem? I haven't been able to get a working repro, so I haven't verified the fix yet.
You're welcome! Hope it works :)
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉, please reach out to my human friends 👉 [email protected]. Thanks!
Hey,
Sorry if this is the wrong place; I'm looking for a bit of help with how I am using GKE with this provider.
I build a container cluster using google_container_cluster, configured to remove the default node pool. I then have 5 different node pools created for this cluster using google_container_node_pool. I have been finding that after it creates the first couple of additional node pools, GKE seems to trigger an UPGRADE_MASTER operation on the cluster, which is really odd considering I am launching the cluster on the latest version of Kubernetes available in the region I am using. The masters are not actually upgraded!
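My configuration looks roughly like this (a hypothetical sketch: the names, zone, version, and pool sizes are made up, and only the first of the five pools is shown):

```hcl
resource "google_container_cluster" "primary" {
  name               = "example-cluster"  # hypothetical name
  zone               = "europe-west1-b"   # hypothetical zone
  min_master_version = "1.10.5-gke.0"     # hypothetical: latest in this zone

  # Spin up with a single-node default pool, then drop it;
  # the real node pools are separate resources below.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "pool_1" {
  name       = "pool-1"
  zone       = "europe-west1-b"
  cluster    = "${google_container_cluster.primary.name}"
  node_count = 3
}

# ...pool_2 through pool_5 follow the same pattern.
```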
You can see all the operations on the cluster here when I replicate this:
This breaks the Terraform run: it seems GKE (or something!?) triggers the UPGRADE_MASTER, and Terraform then fails when trying to create the next node pool with an error like the one below:
Is Terraform triggering the UPGRADE_MASTER operation? Hoping someone else may have run into this issue and can point out a mistake I may be making! This seems to be a similar error to #1451, however it may be unrelated as I have network policy disabled.
Cheers,
Tom