Container Cluster fails to create when the call to create the cluster times out #4024
Comments
And to follow up, this just happened when not tracing:
Hey @ejschoen! It looks like the cluster is taking longer than the default timeout to create. Clusters generally take a long time to provision, so you can try specifying a longer timeout to give the cluster a chance to become healthy before Terraform gives up. Terraform must poll the cluster until it becomes healthy so that it can verify the cluster was created successfully. I would add a longer timeout for the create operation, as described here: https://www.terraform.io/docs/configuration/resources.html#timeouts. It would involve adding a timeouts block to your container cluster config, along the lines of the sketch below.
It looks like you are using a module to create the container cluster, so it may require different steps to increase the timeout. If increasing the timeout doesn't fix the issue, let me know.
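A minimal sketch of the kind of timeouts block being suggested (the snippet from the original comment was not captured on this page, and the durations shown are illustrative, not prescriptive):

```hcl
resource "google_container_cluster" "primary" {
  # ... existing cluster arguments ...

  # Give GKE more time before Terraform gives up on each operation.
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}
```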
My timeouts for create, update, and delete are set to 30 minutes, and the timeout error I see usually occurs within 10 minutes or less. When I have enabled
Ah, so the HTTP request is timing out. I'll do some digging.
Right. Here's the relevant part of the trace, if it helps:
The every-10-second status polling goes to
So there are three requests that seem to time out: the GET on the operation, the GET on the cluster, and the DELETE on the cluster. I believe the timeout on the operation GET causes Terraform to see the cluster as being in an error state, triggering the GET and DELETE on the cluster. I'm not sure why these requests are not receiving a response. There is not a fix we can implement in the Terraform provider to alleviate this issue, though. Possible fixes could include lengthening the timeout on HTTP requests or retrying HTTP requests that time out, but I don't believe either of these would actually fix this issue. It's unsatisfactory, but I don't think there is a fix for this on the Terraform end.
I agree that retrying or lengthening the timeout would be, at best, an empirical approach to solving the problem. However, in my experience, when there is a timeout it always seems to happen around the time the cluster is changing state: becoming healthy after startup, or finishing deleting the default node pool. At least, when I immediately look at the cluster state through the web portal, it's in the state Terraform would want it to be in before moving to the next step in the plan. So I wonder whether the state transition is delaying the response from the API, and whether a longer timeout might actually get past the problem. Alternatively (and this is a really off-the-wall thought), I did a Google search for complaints about the GKE API timing out and didn't really get any hits, other than issue #1135 against the Terraform Google provider from February 2018. So it seems unlikely that the API is really timing out; at least, nobody else has reported it that I can see. (And since the reporter of that issue brought it up: I don't think it's a connectivity issue, as I've got gigabit fiber and a reliable connection to the Internet backbone.) So, is there any possibility that the HTTP GET is failing at such a low level that the connection is just closing without a response, and that the error is being swallowed by lower-level software and not rippling up to the Terraform provider? I'm not a Go programmer, and can't really follow the networking and retry logic in the provider.
I'm digging into this on the GKE API side. I notice that the cluster was deleted sometime after your gist log ends. Was the cluster manually deleted after the failure? For all intents and purposes, I see calls to GetOperation that align with your logs, but the corresponding GetCluster and DeleteCluster calls do not align with the window in the gist.
Correct. I deleted the cluster manually, since the Terraform Google provider can't figure out what to do if I try "terraform apply" again.
I'm not sure what we can do from here. If this reappears, let us know, but at the moment my guess is that there is some packet loss somewhere. It's really hard to diagnose if it doesn't happen on demand.
OK, and I certainly understand the difficulty of diagnosing an essentially non-repeatable issue in a distributed system. But just so there's no misunderstanding, this has been an ongoing issue for months across multiple versions of the provider, affecting around 30% of the deployments we attempt. For packet loss to be the issue, there would have to be a 30-second outage at the IP level or below that could not be corrected by TCP retransmission. That's possible, but highly unlikely unless there is a significant issue in communication between Google's us-central1 region and my (AT&T) provider. We never experience such outages with any other communication between our systems and Google.
Please send other log data. In the case of your gist, we see the failing call to get the operation succeed and return back through the network layer. The subsequent calls to get the cluster and delete the cluster don't hit our systems at all. It would also be helpful to diagnose connectivity on the affected system once the behavior presents itself. I was not aware that the issue was so prevalent, but that should make it somewhat easier to track down.
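A sketch of the kind of connectivity check that could be run from the affected machine while the failure is occurring (assuming standard Unix tools are available; the exact commands are not specified in the thread):

```sh
# Check basic reachability and the network path to the GKE API endpoint
# while the terraform run is hanging.
ping -c 5 container.googleapis.com
traceroute container.googleapis.com

# Verify that a fresh TLS connection to the API can still be established.
curl -sv -o /dev/null https://container.googleapis.com/
```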
Will do, the next time we try. For other log data, the only content I can think of might be Stackdriver logs. I've never looked into those logs for deployment activity; do you have any suggestions for how best to filter them? I suppose I could also look at historical records corresponding to the gist I sent.
Please just send more logs like your gist. The timing and other data might be helpful. I have plenty of logs on the GKE side; we need more data about what's happening on your side.
Sorry for the delay; I was out of the country visiting a customer. I just tried to spin up a GKE cluster and it failed the first time. Gist is here: https://gist.github.com/ejschoen/423bc4083ab91df6f6e5454dd306ba34
Assuming that the timestamps are reasonably in sync (and your times are in Central), our server received the request that timed out at 15:18:43.974850 PDT and sent a response at 15:18:43.984713 PDT. None of the subsequent calls were received on our side. This matches what I saw in the previous gist. Were you able to run any sort of diagnostics at the time of the failure?
I was on the web dashboard at the time and had no problem refreshing the display. I'll watch again the next time I spin up a cluster. I did upgrade to a newer Terraform and was able to spin up the cluster without error, but I don't think the upgrade pulled in a newer Google provider.
Sorry about this; would you please try setting GODEBUG=http2debug=2 so we get debug output from the HTTP library in the trace?
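For example, a sketch of how this might be set alongside Terraform's own trace logging (assuming a Unix-like shell; TF_LOG and TF_LOG_PATH are standard Terraform environment variables):

```sh
# Enable Go's HTTP/2 frame-level debugging together with Terraform's trace log.
export GODEBUG=http2debug=2
export TF_LOG=TRACE
export TF_LOG_PATH=./terraform-trace.log

terraform apply
```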
Caught another cluster provisioning failure under Terraform 0.12.0. A new gist is available here: https://gist.github.com/ejschoen/39af1f955b897486b89cbb0adf22e498
This happens to me constantly when provisioning GKEs and tbh is rapidly putting me off the idea of using Terraform for Google resources. I find it's repeatable with a
Please let me know if there's something else I can provide. It's a huge waste of time, but I can re-run it on a new project.
It's possible that the latest Terraform may have solved the problem. Since switching to ~0.12.6, cluster provisioning has been reliable. I haven't heard back as to whether the latest gist I posted (which was pre-0.12.6) revealed anything.
I am on 0.12.7 and experience this problem consistently.
Just happened again, and the output seems odd... What is happening during the 30 seconds in the middle here where there is no debug output?
@anth0d: During that window, the provider is awaiting a response from the GKE REST API but doesn't receive one. Most individual HTTP calls to the GCP REST API have a 30s timeout in the provider. Out of curiosity, do you have the debug logs for the
I don't believe I've ever seen this error locally, or in our CI. Have you encountered the issue in multiple network environments? It's possible that a firewall between you and the GCP API is consistently filtering this traffic.
I believe the "no route to host" error is caused by a network connectivity problem or a firewall, as @rileykarson suggested. Going back to the latest gist from @ejschoen, it looks like the PING frames stop coming through at around the same time that the last response from the API is received.
It doesn't look like Terraform receives anything after this and the last operation response until it cancels the request at 10:36:00. If this has been resolved since updating to Terraform 0.12.6, that would be great, as I don't have any more insight from the provider side. If this comes up again, please let us know.
I'll keep watching, but I think @anth0d described exactly the situation that I had been seeing, in which the polling status request times out at precisely the time that the cluster transitions to being healthy.
Well, the failures are happening with some repeatability again today (2 out of 3 attempts failed). This was with Terraform 0.12.7 and Google provider 2.14.0: https://gist.github.com/ejschoen/63af4a2f13937cad590a21738a6f10fb
Once again, the failure occurs around the time the initial node pool transitions into the RUNNING state. I do see a couple of interesting lines in the log. In one line (line 23905), the connection is closed because it has been idle. On the next line, there is a transport readFrame error ("use of closed network connection"). After that, no HTTP request succeeds. Is it possible that there is a race condition between connection pools and connection pool users? There are no "no route to host" errors. I also note that if I manually make a request with curl to the endpoint that timed out, using the Bearer authorization I see in the log, I get this:
wherein the status is now
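The actual curl command and its response were not preserved on this page. A hypothetical sketch of that kind of check against the GKE operations endpoint (the project, location, and operation ID are placeholders, and the access token stands in for the Bearer token taken from the log):

```sh
# Hypothetical sketch: substitute the project, location, and operation ID
# reported in the trace log; the exact URL shape depends on the provider version.
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://container.googleapis.com/v1beta1/projects/PROJECT_ID/locations/us-central1-a/operations/OPERATION_ID"
```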
Interesting about the connection being closed due to idleness. I see the same pattern in the previous log (https://gist.github.com/ejschoen/39af1f955b897486b89cbb0adf22e498, at line 42122), with the readFrame error as well, but in that case responses do come through afterwards. In this most recent log, it looks like three separate requests timed out and were closed, between 2019-09-11T11:51:52.016-0500 and 2019-09-11T11:52:52.017-0500, a full minute apart. These are all made through the container cluster v1beta1 API client (https://godoc.org/google.golang.org/api/container/v1beta1) and went to three different endpoints: retrieving an operation, retrieving the cluster, and attempting to delete the cluster. I don't believe the Terraform provider code could be responsible for this, as it uses the API client for all of these calls. I can investigate adding a retry to the container operation in the case of a timeout, but if calls to three separate endpoints fail over a minute apart, it may not be fixed by a retry.
Repeatable failures this morning attempting to create a cluster. Batting 0 for 4. Two gists attached here:
Terraform 0.12.7 / Google provider 2.15.0: https://gist.github.com/ejschoen/d31e2b01c287d10d5c6a9a2915010de6
Terraform 0.12.9 / Google provider 2.15.0:
It also failed twice with Google provider 2.14.0 before I tried the newer provider and then the newer Terraform.
This may be unrelated; I'm not sure. I've switched to using Google's Cloud Shell to run Terraform, and I reliably see failures of a different kind: at the point the original node pool has finished initializing, the master node becomes unavailable for a little while, 30 seconds to a minute. I wonder if this is related to the communication failure/timeout that we see when we try to spin up a cluster via Terraform from outside of GCP: we see a timeout from outside, but an error from inside.
I had the same issue repeatedly with the 2.5.1 provider. After upgrading to 2.20, Terraform created the cluster successfully.
Glad to hear that 2.20 might have cleared it up. Are you still seeing this issue, @ejschoen?
It looks to me as if the timeout errors have gone away. I haven't seen them recently, using Terraform 0.12.20 and Google provider 2.15.0, from either the Google Cloud Shell or from our local infrastructure spinning up a cluster in GKE. I do pretty often see the error that I noted on December 4, in which Terraform fails due to transient issues on the master node (maybe?) at the point that the non-default initial node pool has been created:
I see these both in Cloud Shell and from our local infrastructure. From the error message, it seems clear that it's not a timeout issue.
Well, at least one of the errors is taken care of! I'm not too familiar with Kubernetes, but that error message seems to be coming from Helm rather than Terraform. I'm going to close this issue out now, as the initial bug report is no longer reproducible. Feel free to file another issue if you believe the
Terraform Version
Terraform v0.12.0
Affected Resource(s)
google_container_cluster
Terraform Configuration Files
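The reporter's actual configuration was not captured on this page. Purely as an illustrative sketch, a minimal configuration consistent with the details mentioned in the thread (a google_container_cluster whose default node pool is removed, with 30-minute create/update/delete timeouts; all names and values are hypothetical) might look like:

```hcl
# Hypothetical reconstruction for illustration only; the real configuration
# was not included in the captured issue text.
resource "google_container_cluster" "cluster" {
  name     = "example-cluster"
  location = "us-central1-a"

  # The thread mentions that the default node pool is deleted and a separate
  # node pool is created afterwards.
  remove_default_node_pool = true
  initial_node_count       = 1

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}
```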
Debug Output
https://gist.github.com/ejschoen/f395c1a25b43783ff9f9b2a9fc0dd0c1
Panic Output
Expected Behavior
Cluster should have been created successfully and Terraform should have proceeded to the next step in the plan.
Actual Behavior
Terraform exited due to a timeout in cluster status polling, without visibly retrying (from what I can see in the trace log). Terraform concluded the cluster was created "in an error state" (line 7720 in the gist of the trace log) and attempted to delete it.
Steps to Reproduce
terraform apply
Important Factoids
This is not reliably reproducible. It tends to happen more often, annoyingly, when I am not tracing. I have been trying to catch the error with tracing for weeks, and finally caught an occurrence of it.
After Terraform exits, the cluster appears to be up and healthy, with a default 1-node n1-standard-1 node pool running Kubernetes 1.12.8-gke.10.
References