
google_container_cluster & UPGRADE_MASTER operation #1643

Closed
thomasriley opened this issue Jun 12, 2018 · 12 comments · Fixed by #1660

Comments

@thomasriley

Hey,

Sorry if this is the wrong place; I'm looking for a little bit of help with how I am using GKE with this provider.

I build a container cluster using google_container_cluster, which is configured to remove the default node pool. I then create 5 different node pools for this cluster using google_container_node_pool. I have been finding that after it creates the first couple of additional node pools, GKE seems to trigger an UPGRADE_MASTER operation on the cluster, which is really odd considering I am launching the cluster on the latest Kubernetes version available in the region I am using. The masters are not actually upgraded!
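
For context, this is roughly the shape of the configuration (a minimal sketch only; the resource labels, node count, and machine type are assumptions, while the cluster and node pool names are taken from the operations listed below):

resource "google_container_cluster" "elastic" {
  name   = "europe-west2-es-dev-bookinggo-cloud"
  region = "europe-west2"

  # The default node pool is deleted right after cluster creation,
  # matching the DELETE_NODE_POOL operation in the log below.
  remove_default_node_pool = true
  initial_node_count       = 1
}

# One of the five node pools; the others follow the same pattern.
resource "google_container_node_pool" "elastic_node_pool_general" {
  name    = "europe-west2-es-dev-general"
  region  = "europe-west2"
  cluster = "${google_container_cluster.elastic.name}"

  node_count = 3                     # assumed size

  node_config {
    machine_type = "n1-standard-4"   # assumed machine type
  }
}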

You can see all the operations on the cluster here when I replicate this:

NAME                              TYPE              LOCATION      TARGET                               STATUS_MESSAGE  STATUS   START_TIME                      END_TIME
operation-1528806394096-22ff5ae9  CREATE_CLUSTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  DONE    2018-06-12T12:26:34.096317152Z  2018-06-12T12:29:54.231950083Z
operation-1528806604111-b735fd41  DELETE_NODE_POOL  europe-west2  default-pool                                         DONE    2018-06-12T12:30:04.111993485Z  2018-06-12T12:30:14.675866373Z
operation-1528806623787-26617ce0  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-logs-warm                        DONE     2018-06-12T12:30:23.787133889Z  2018-06-12T12:31:32.436480077Z
operation-1528806694379-54a77bab  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-general                          DONE     2018-06-12T12:31:34.379185999Z  2018-06-12T12:32:44.511111228Z
operation-1528806774945-9b1502c5  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-events-warm                      DONE     2018-06-12T12:32:54.945811472Z  2018-06-12T12:34:03.863015406Z
operation-1528806845520-9a74d61f  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-logs-hot                         DONE     2018-06-12T12:34:05.520144747Z  2018-06-12T12:35:26.947060015Z
operation-1528806933790-6c067503  UPGRADE_MASTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  RUNNING  2018-06-12T12:35:33.790505113Z

This breaks the Terraform run: it seems GKE (or something!?!) triggers the UPGRADE_MASTER operation, and Terraform then fails when trying to create the next node pool with an error like the one below:

google_container_node_pool.elastic_node_pool_events_hot: error creating NodePool: googleapi: Error 400: Operation operation-1528806933790-6c067503 is currently upgrading cluster europe-west2-es-dev-bookinggo-cloud. Please wait and try again once it is done., failedPrecondition

Is Terraform triggering the UPGRADE_MASTER operation? Hoping someone else may have run into this issue and can point out a mistake I may be making!

This seems to be a similar error to #1451, however it may be unrelated as I have network policy disabled.

Cheers,
Tom

@danawillow
Contributor

Terraform itself isn't triggering the operation, sadly. I'm not sure why it immediately tries to upgrade the master.

However, we can probably add some retries in our code so we don't fail on a failedPrecondition. Do you have debug logs by any chance?

danawillow added the bug label Jun 12, 2018
@thomasriley
Author

Hi @danawillow, thanks for the speedy reply.

I have run Terraform with debug enabled to capture the logs at the moment it fails:
https://gist.github.com/thomasriley/43a57c480f5a79e7a46228d3a707230f

You can see it creates a node pool successfully, attempts to create the next node pool, but fails as the cluster has the upgrade_master operation running on it.

Here are the operations on the cluster from this TF apply:

NAME                              TYPE              LOCATION      TARGET                               STATUS_MESSAGE  STATUS   START_TIME                      END_TIME
operation-1528881081434-b62d032d  CREATE_CLUSTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  DONE    2018-06-13T09:11:21.434939459Z  2018-06-13T09:14:17.688671908Z
operation-1528881261340-a9ad0668  DELETE_NODE_POOL  europe-west2  default-pool                                         DONE    2018-06-13T09:14:21.340776723Z  2018-06-13T09:14:31.751417503Z
operation-1528881281101-c94b4613  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-general                          DONE    2018-06-13T09:14:41.101446827Z  2018-06-13T09:15:53.915602566Z
operation-1528881362047-414aa9cb  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-events-hot                       DONE    2018-06-13T09:16:02.047912407Z  2018-06-13T09:17:28.994881856Z
operation-1528881452693-614a5e58  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-logs-hot                         DONE     2018-06-13T09:17:32.693069908Z  2018-06-13T09:19:09.254316356Z
operation-1528881553471-f846fc0e  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-events-warm                      DONE     2018-06-13T09:19:13.47146578Z   2018-06-13T09:20:16.167778248Z
operation-1528881619093-5ef6e927  UPGRADE_MASTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  RUNNING  2018-06-13T09:20:19.093697738Z

Worth adding: I can re-run my pipeline to generate and execute a new plan that completes the cluster build once the spurious master upgrade operation has finished; however, this breaks our pipeline workflow since manual intervention is required.

Many thanks,
Tom

@danawillow
Contributor

Just sent #1660. I also asked around, and the answer I got was that it might not be upgrading the master version, but resizing the master instead.

@thomasriley
Author

Hi @danawillow - I have built the provider from your branch and have been testing repeatedly; however, I have yet to be able to trigger the scenario in which the upgrade operation occurs halfway through the Terraform apply. In all my attempts to replicate it, the upgrade operation does not trigger until after Terraform has completed its apply.

It's a bit odd actually: when I was having the issue, the UPGRADE_MASTER operation would occur seconds after one of the node pools completed, hence interrupting Terraform halfway through its apply. For example:

operation-1528795000371-7639cea9  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-logs-warm                        DONE    2018-06-12T09:16:40.371219001Z  2018-06-12T09:17:48.032124256Z
operation-1528795070885-bdd8387c  UPGRADE_MASTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  DONE    2018-06-12T09:17:50.885406723Z  2018-06-12T09:28:49.118082157Z

However, looking at the timing of the upgrade operation now, there seems to be a ~20 second idle period after the completion of the last operation before the automated cluster operation is attempted. For example:

operation-1529164490685-ad951361  CREATE_NODE_POOL  europe-west2  europe-west2-es-dev-logs-warm                        DONE     2018-06-16T15:54:50.685169722Z  2018-06-16T15:56:44.335868902Z
operation-1529164624339-092e5f42  UPGRADE_MASTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  RUNNING  2018-06-16T15:57:04.339398719Z

I might be barking up the wrong tree... but it feels like there could have been a change on the GKE side that ensures automated operations are not kicked off until there is an idle period, so as to prevent automation like Terraform from being interrupted?

I will keep testing this next week and see if I can replicate the issue again. Thanks again for your time helping me with this.

Cheers,
Tom

@thomasriley
Author

Just my luck, after failing to replicate the issue while using the patched provider in #1660 and then posting the last update... I pushed my pipeline to the test environment in a different project and replicated the issue immediately.

The UPGRADE_MASTER operation occurred midway through the Terraform apply; no error was thrown, and Terraform waited for the operation to complete before continuing to add the next node pool. However, before it could create the last node pool, there was a second UPGRADE_MASTER operation on the same cluster, and this time it threw the original failedPrecondition error.

Terraform error:

google_container_node_pool.elastic_node_pool_logs_warm: Creation complete after 19m3s (ID: europe-west2/europe-west2-es-test-bookinggo-cloud/europe-west2-es-test-logs-warm)
Error: Error applying plan:

1 error(s) occurred:

* google_container_node_pool.elastic_node_pool_events_hot: 1 error(s) occurred:

* google_container_node_pool.elastic_node_pool_events_hot: error creating NodePool: googleapi: Error 400: Operation operation-1529169867630-be1e5cf3 is currently upgrading cluster europe-west2-es-test-bookinggo-cloud. Please wait and try again once it is done., failedPrecondition

Cluster operations:

NAME                              TYPE              LOCATION      TARGET                                STATUS_MESSAGE  STATUS   START_TIME                      END_TIME
operation-1529169459886-d2faa255  CREATE_CLUSTER    europe-west2  europe-west2-es-test-bookinggo-cloud                  DONE     2018-06-16T17:17:39.886216427Z  2018-06-16T17:20:29.915632372Z
operation-1529169640001-564c823f  DELETE_NODE_POOL  europe-west2  default-pool                                          DONE     2018-06-16T17:20:40.001531469Z  2018-06-16T17:20:48.094865639Z
operation-1529169649680-5262e2ad  CREATE_NODE_POOL  europe-west2  europe-west2-es-test-general                          DONE     2018-06-16T17:20:49.680262582Z  2018-06-16T17:22:34.922830757Z
operation-1529169760329-5c47d5a1  CREATE_NODE_POOL  europe-west2  europe-west2-es-test-events-warm                      DONE     2018-06-16T17:22:40.329276649Z  2018-06-16T17:24:25.125727716Z
operation-1529169867630-be1e5cf3  UPGRADE_MASTER    europe-west2  europe-west2-es-test-bookinggo-cloud                  DONE     2018-06-16T17:24:27.63028663Z   2018-06-16T17:35:37.933709971Z
operation-1529170541380-28d81391  CREATE_NODE_POOL  europe-west2  europe-west2-es-test-logs-hot                         DONE     2018-06-16T17:35:41.380009614Z  2018-06-16T17:37:44.594928099Z
operation-1529170672231-83a8faf4  CREATE_NODE_POOL  europe-west2  europe-west2-es-test-logs-warm                        DONE     2018-06-16T17:37:52.231789697Z  2018-06-16T17:39:47.647916007Z
operation-1529170801491-68078d7a  UPGRADE_MASTER    europe-west2  europe-west2-es-test-bookinggo-cloud                  RUNNING  2018-06-16T17:40:01.491218457Z

Cheers,
Tom

@danawillow
Contributor

This repro was with #1660, right? Do you know how long it takes for the UPGRADE_MASTER operation to finish? I have it retrying for 10 minutes but maybe it needs more time.

@thomasriley
Author

Yes, I'm using a build of the provider from #1660.

The longest upgrade I have seen is 13 minutes, so it might be worth setting this to at least 15 minutes?

Thanks,
Tom

@danawillow
Contributor

Better idea: it now just uses whatever timeout you set as your create timeout (the default is 30 minutes). Give that a try if you have the chance!
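
For anyone following along, that create timeout can be raised per node pool resource via the standard timeouts block; a minimal sketch (the 45m value is just an illustrative assumption):

resource "google_container_node_pool" "elastic_node_pool_events_hot" {
  # ... existing node pool configuration ...

  # Give the provider longer to wait out an in-progress master
  # operation before giving up (the default create timeout is 30m).
  timeouts {
    create = "45m"
  }
}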

@thomasriley
Author

thomasriley commented Jun 18, 2018

Thank you! I've just built the new version and committed it into my TF codebase. Will be re-launching a number of GKE clusters tomorrow, so I will do a good amount of testing and report back.

@thomasriley
Author

Hey, over the last few days I've been trying to replicate the failure scenario in which two UPGRADE_MASTER operations occur during the Terraform apply. So far I've not been successful, however it has been working great for the scenario in which a single operation occurs:

NAME                              TYPE              LOCATION      TARGET                               STATUS_MESSAGE  STATUS  START_TIME                      END_TIME
operation-1529393424996-ec59717b  CREATE_CLUSTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  DONE    2018-06-19T07:30:24.996609973Z  2018-06-19T07:33:10.520574076Z
operation-1529393594796-94b725d5  DELETE_NODE_POOL  europe-west2  default-pool                                         DONE    2018-06-19T07:33:14.796249235Z  2018-06-19T07:33:25.720047541Z
operation-1529393614568-b649fd53  CREATE_NODE_POOL  europe-west2  es-general-europe-west2-es-dev                       DONE    2018-06-19T07:33:34.568601795Z  2018-06-19T07:35:19.601014643Z
operation-1529393725445-ac4189e2  CREATE_NODE_POOL  europe-west2  es-events-warm-europe-west2-es-dev                   DONE    2018-06-19T07:35:25.445205043Z  2018-06-19T07:42:00.307563612Z
operation-1529394127358-44d13279  CREATE_NODE_POOL  europe-west2  es-logs-hot-europe-west2-es-dev                      DONE    2018-06-19T07:42:07.358588429Z  2018-06-19T07:44:12.245276144Z
operation-1529394255407-0f184226  UPGRADE_MASTER    europe-west2  europe-west2-es-dev-bookinggo-cloud                  DONE    2018-06-19T07:44:15.407444261Z  2018-06-19T07:55:11.802182547Z
operation-1529394918286-7ab77093  CREATE_NODE_POOL  europe-west2  es-events-hot-europe-west2-es-dev                    DONE    2018-06-19T07:55:18.286613521Z  2018-06-19T07:57:25.350317161Z
operation-1529395049171-23d38db0  CREATE_NODE_POOL  europe-west2  es-logs-warm-europe-west2-es-dev                     DONE    2018-06-19T07:57:29.171867062Z  2018-06-19T07:59:25.448887058Z

NAME                              TYPE              LOCATION      TARGET                                STATUS_MESSAGE  STATUS  START_TIME                      END_TIME
operation-1529396288283-19f51295  CREATE_CLUSTER    europe-west2  europe-west2-es-test-bookinggo-cloud                  DONE    2018-06-19T08:18:08.283445025Z  2018-06-19T08:21:02.048569508Z
operation-1529396468631-b7f9fe97  DELETE_NODE_POOL  europe-west2  default-pool                                          DONE    2018-06-19T08:21:08.631128851Z  2018-06-19T08:21:16.624881447Z
operation-1529396478377-14a39592  CREATE_NODE_POOL  europe-west2  es-general-europe-west2-es-test                       DONE    2018-06-19T08:21:18.377017299Z  2018-06-19T08:23:04.19097979Z
operation-1529396589350-f35bad31  CREATE_NODE_POOL  europe-west2  es-logs-warm-europe-west2-es-test                     DONE    2018-06-19T08:23:09.350698977Z  2018-06-19T08:24:59.948954513Z
operation-1529396710377-8fb72268  CREATE_NODE_POOL  europe-west2  es-events-hot-europe-west2-es-test                    DONE    2018-06-19T08:25:10.377548426Z  2018-06-19T08:27:03.113775908Z
operation-1529396831054-9655fe7b  CREATE_NODE_POOL  europe-west2  es-events-warm-europe-west2-es-test                   DONE    2018-06-19T08:27:11.05440879Z   2018-06-19T08:28:58.033879746Z
operation-1529396939630-2adcc5b2  UPGRADE_MASTER    europe-west2  europe-west2-es-test-bookinggo-cloud                  DONE    2018-06-19T08:28:59.630569929Z  2018-06-19T08:39:30.594300531Z
operation-1529397578407-731c5e08  CREATE_NODE_POOL  europe-west2  es-logs-hot-europe-west2-es-test                      DONE    2018-06-19T08:39:38.407205754Z  2018-06-19T08:43:35.410086508Z

I don't see how your latest solution would fail to fix the multiple-operation scenario; the user may just need to set a higher create timeout when adding a large number of node pools.

I'm currently creating & destroying GKE clusters fairly frequently via our pipeline, so I will update once I've seen that it fixes the double-operation failure I previously hit.

Thanks again for all your help @danawillow 👍

Cheers,
Tom

danawillow added a commit that referenced this issue Jun 22, 2018
Hypothetically fixes #1643.

@thomasriley, are you able to patch this change into your provider to see if it fixed the problem? I haven't been able to get a working repro so I haven't verified the fix yet.
@danawillow
Contributor

You're welcome! Hope it works :)

@ghost

ghost commented Nov 17, 2018

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Nov 17, 2018