When K8s cluster update issues occur during terraform plan/apply, need to be able to conduct another plan/apply to resume updates #2379
Comments
Tested this retry fix in 1.23.1 and it worked great -- except unfortunately, today was a day where several cloud users experienced a 404 error when running basic cluster and worker commands. Therefore, when doing an apply to test the retry logic, the apply would inevitably fail too. In some cases the error occurred after having updated a worker, but in some cases even sooner.

On the 4th attempt of the day, we saw that the provider seemingly tried to destroy and re-create the cluster -- which leads to all workers being deleted, and a disruption to the application that had been installed/configured and running! What could have caused the provider to attempt a destroy/recreate? Did the 404s help trigger an execution path where the provider wanted to start fresh as a worst-case recovery mechanism?

Regardless, it seems too harsh to delete a cluster... especially since in this case the re-create failed, leaving a long window of time where it might not be noticed -- not to mention the added time needed to re-create the cluster and re-install the application/services once this is discovered by someone. Harini @hkantare, please refer to the logs I posted in Slack on this. Here is a relevant log snip:
We added more validation along with the 404 status code to confirm whether the cluster really exists or not, to eliminate these kinds of intermittent issues from IKS.
Using our latest maintenance automation, leveraging the latest IBM Terraform provider, we did a bulk cycle on Tuesday across multiple envs and their vpc-gen2 clusters. We had mixed results: some successes, and some failures on the Terraform apply. In most cases, the apply failure was the 2h timeout, like so:

Documenting here as it is the same as documented in the original writeup of this git issue you are reading now (see at the top).

QUESTION: Is there perhaps a known bug (issue) already written for a potentially erroneous timeout? We are suspicious that the worker did update cleanly, and that there might be a bug in the logic that ensures only one worker is updated at a time.

It can be a bit tricky: our prior automation suffered from an interesting problem with vpc-gen2 workers, where our algorithm to loop through them one at a time tried to use the total worker count as an iterator. However, the total number of workers varies by one at the precise time a worker has been deleted. This simple fact caused our loop to spin and eventually time out -- something very similar to what we see with the Terraform provider. Just a guess, but it seems there is something to look at here, as we seek an explanation for why we see so many timeouts while waiting for a single worker to get updated.
Closing the issue since we fixed some of the upgrade issues.
I believe I just ran into this issue using Schematics and the IBM Cloud Terraform provider:

Error: Request failed with status code: 409, ServerErrorResponse: {"incidentID":"47fc3144-fab0-92be-8e4a-12fd3dbc885c,47fc3144-fab0-92be-8e4a-12fd3dbc885c","code":"E0007","description":"A cluster with the same name already exists. Choose another name.","type":"Provisioning"}

INTERNAL GITHUB: https://gist.github.ibm.com/jja/9e38d2fe616b93b053fb38271e71985e
Terraform Version
Terraform 0.13 with the v1.21.2 IBM Cloud provider (running under Schematics)
Affected Resource(s)
ibm_container_vpc_cluster
Details
This issue is a follow-up to #1978. Great progress has been made regarding steady-state patching support with the IBM Cloud provider. The terraform plan/apply now updates one worker at a time based on the desired patch_version of the workers when used with wait_for_worker_update and update_all_workers set to true.

When cloud issues, or issues within terraform/schematics, arise during a patching/update activity (leaving some worker nodes in the cluster not updated), a second attempt at a terraform plan/apply is unable to patch the remaining workers that were not updated prior to the error in terraform.
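For context, here is a minimal sketch of the kind of resource configuration in play; the names, flavor, zone, subnet references, and version strings are illustrative placeholders, not our actual values:

```hcl
resource "ibm_container_vpc_cluster" "patched_cluster" {
  # Hypothetical names and references for illustration only.
  name              = "example-cluster"
  vpc_id            = ibm_is_vpc.example.id
  flavor            = "bx2.4x16"
  worker_count      = 3
  resource_group_id = data.ibm_resource_group.example.id

  kube_version  = "1.21.9"
  patch_version = "1.21.9_1550" # desired worker patch level (illustrative)

  # Update workers one at a time and wait for each update to complete.
  wait_for_worker_update = true
  update_all_workers     = true

  zones {
    subnet_id = ibm_is_subnet.example.id
    name      = "us-south-1"
  }
}
```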
Here are a few examples of issues that we have seen during a terraform plan/apply:
Example 1 (IAM token timeout):
Example 2 (IBM Cloud Containers API is down):
Example 3 (Cluster state not equal to normal - timeout):
For example 3, we added a timeouts block for updates to see whether the issue described above still occurs with that setting in place:
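A minimal sketch of such a timeouts block follows; the duration shown is illustrative, not the exact value we used:

```hcl
resource "ibm_container_vpc_cluster" "patched_cluster" {
  # ... cluster arguments as in the sketch above ...

  # Give worker update activity more time before Terraform reports a timeout.
  timeouts {
    update = "3h" # illustrative value
  }
}
```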
In the three examples described above, if a subsequent terraform plan/apply was able to continue the patching process and update the remaining outstanding worker nodes, we would be able to easily work around any terraform plan/apply issues.
What we observed during subsequent terraform plan/applies is that terraform does not believe there is anything left to do, since the patch_version was already set during the previous plan/apply. If subsequent terraform plan/applies could verify the state of the cluster and its workers, to see whether any workers are outdated and do not match the patch_version (indicating more work still needs to be done), that would give us a great solution for dealing with the unpredictable issues that can occur over time: we could simply reapply the terraform again.