
Container Cluster fails to create when the call to create the cluster times out #4024

Closed
ejschoen opened this issue Jul 14, 2019 · 34 comments
Labels: bug, service/container, forward/review (In review; remove label to forward)

@ejschoen

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.12.0

  • provider.google v2.9.0
  • provider.helm v0.10.0
  • provider.kubernetes v1.7.0

Affected Resource(s)

google_container_cluster

Terraform Configuration Files

provider "google" {
  project     = var.google_project
  region      = var.google_region
  zone        = element(var.google_zones,0)
}



module "kubernetes-engine" {
  source = "./gke"
  google_zones       = var.google_zones
  google_project     = var.google_project
  google_region      = var.google_region
  cluster_name       = var.cluster_name
  preemptible        = var.preemptible
  disk_size_gb       = var.disk_size_gb
  disk_type          = var.disk_type
  machine_type       = var.machine_type
  cluster_node_count = var.cluster_node_count

}
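The google_container_cluster resource itself is defined inside the ./gke module (gke/main.tf, as the error output later in the thread shows) and was not included in the report. The following is a minimal hypothetical sketch of what that module might contain, inferred from the variables passed to it and from the node pool names that appear later in the thread; the exact arguments are assumptions, not the reporter's actual configuration:

# Hypothetical sketch of gke/main.tf -- not the reporter's actual module.
resource "google_container_cluster" "cluster" {
  name               = var.cluster_name
  location           = element(var.google_zones, 0)
  initial_node_count = 1

  # Implied by the "delete default node pool" errors quoted later in the thread.
  remove_default_node_pool = true
}

resource "google_container_node_pool" "cluster_nodes" {
  name       = "${var.cluster_name}-node-pool"
  cluster    = google_container_cluster.cluster.name
  location   = google_container_cluster.cluster.location
  node_count = var.cluster_node_count

  node_config {
    machine_type = var.machine_type
    preemptible  = var.preemptible
    disk_size_gb = var.disk_size_gb
    disk_type    = var.disk_type
  }
}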

Debug Output

https://gist.github.com/ejschoen/f395c1a25b43783ff9f9b2a9fc0dd0c1

Panic Output

Expected Behavior

Cluster should have been created successfully and Terraform should have proceeded to the next step in the plan.

Actual Behavior

Terraform exited due to a timeout in cluster status polling, without visibly retrying (from what I can see in the trace log). Terraform concluded the cluster was created "in an error state" (line 7720 in the Gist of the trace log) and attempted to delete it.

Steps to Reproduce

  1. terraform apply

Important Factoids

This is not reliably reproducible. It tends to happen more often, annoyingly, when I am not tracing. I have been trying to catch the error with tracing for weeks, and finally caught an occurrence of it.

After Terraform exits, the cluster appears to be up and healthy, with a default n1-standard-1 one-node node pool running Kubernetes 1.12.8-gke-10.

References

@ghost ghost added the bug label Jul 14, 2019
@ejschoen
Author

And to follow up, this just happened when not tracing:

module.kubernetes-engine.google_container_cluster.cluster: Still creating... [2m30s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [2m40s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [2m50s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m0s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m10s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m20s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m30s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m40s elapsed]
module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m50s elapsed]

Error: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/i2kconnect-1038/locations/us-central1-f/operations/operation-1563119523151-76b9eaf2?alt=json&prettyPrint=false: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

  on gke/main.tf line 1, in resource "google_container_cluster" "cluster":
   1: resource "google_container_cluster" "cluster" {

@slevenick slevenick reopened this Jul 15, 2019
@slevenick
Collaborator

Hey @ejschoen

Looks like the cluster is taking longer than the default timeout to create. Terraform must poll the cluster until it becomes healthy so that it can verify the creation succeeded, and unfortunately clusters can take a long time to reach that state. You can try specifying a longer timeout to give the cluster a chance to become healthy before Terraform gives up.

I would add a longer timeout for the create operation as specified here: https://www.terraform.io/docs/configuration/resources.html#timeouts

It would involve adding this block to your container cluster config:

timeouts {
  create = "60m"
  delete = "2h"
}

It looks like you are using a module to create the container cluster, so it may require different steps to increase the timeout.
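Because timeouts is a resource-level block, it has to go on the google_container_cluster resource inside the module rather than on the module call itself; its values could be wired to module variables if needed. A rough sketch, assuming the resource lives in gke/main.tf as the error output indicates:

# Inside the module (gke/main.tf) -- a sketch, not the actual file.
resource "google_container_cluster" "cluster" {
  # ... existing cluster arguments ...

  timeouts {
    create = "60m"
    delete = "2h"
  }
}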

If increasing the timeout doesn't fix the issue, let me know.

@ejschoen
Author

My timeouts for create, update, and delete are set to 30 minutes, and the timeout error I see usually occurs within 10 minutes or less. When I have TF_LOG=trace enabled, I always see the failure after the expected response from the 10-second status polling loop appears to have timed out (i.e., around 30 seconds after the last status request?).

@ghost ghost removed the waiting-response label Jul 15, 2019
@slevenick
Collaborator

Ah, so the HTTP request is timing out. I'll do some digging

@ejschoen
Author

Right. Here's the relevant part of the trace, if it helps:

2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: ---[ REQUEST ]---------------------------------------
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: GET /v1beta1/projects/i2kconnect-1038/locations/us-central1-f/operations/operation-1563115688240-dfd1330d?alt=json&prettyPrint=false HTTP/1.1
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: Host: container.googleapis.com
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: User-Agent: google-api-go-client/0.5 Terraform/0.12.2 (+https://www.terraform.io) terraform-provider-google/2.9.0
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: Accept-Encoding: gzip
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: 
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: 
2019-07-14T09:51:23.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: -----------------------------------------------------

--- elided stuff

module.kubernetes-engine.google_container_cluster.cluster: Still creating... [3m20s elapsed]

--- elided stuff


2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: 2019/07/14 09:51:53 [DEBUG] Google API Request Details:
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: ---[ REQUEST ]---------------------------------------
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: GET /v1beta1/projects/i2kconnect-1038/locations/us-central1-f/clusters/test-gke?alt=json&prettyPrint=false HTTP/1.1
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: Host: container.googleapis.com
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: User-Agent: google-api-go-client/0.5 Terraform/0.12.2 (+https://www.terraform.io) terraform-provider-google/2.9.0
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: Accept-Encoding: gzip
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: 
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: 
2019-07-14T09:51:53.809-0500 [DEBUG] plugin.terraform-provider-google_v2.9.0_x4: -----------------------------------------------------

The every-10-second status polling goes to /v1beta1/projects/i2kconnect-1038/locations/us-central1-f/operations/operation-1563115688240-dfd1330d, but if Google doesn't answer within 30 seconds, the retry goes to a different endpoint: /v1beta1/projects/i2kconnect-1038/locations/us-central1-f/clusters/test-gke. Is that significant? Thirty seconds after the retry there is still no answer, and the provider declares the cluster to have been created in an error state and tries to destroy it.

@slevenick
Collaborator

So there are three requests that seem to time out: the GET on the operation, the GET on the cluster, and the DELETE on the cluster. I believe the timeout on the operation GET causes Terraform to see the cluster as being in an error state, triggering the GET and DELETE on the cluster.

I'm not sure why these requests aren't receiving responses. There isn't a fix we can implement in the Terraform provider to alleviate this, though.

Possible fixes could include lengthening the timeout on HTTP requests or retrying HTTP requests that time out, but I don't believe either of these would actually fix this issue.
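For illustration only, the first of those options (lengthening the per-request HTTP timeout) would look roughly like the sketch below if exposed through provider configuration. This assumes a provider version that supports the request_timeout provider argument; it is not confirmed to be available in the 2.9.0 provider used here.

# Sketch only: assumes a google provider release that exposes request_timeout;
# not confirmed for the 2.9.0 provider discussed in this thread.
provider "google" {
  project         = var.google_project
  region          = var.google_region
  zone            = element(var.google_zones, 0)
  request_timeout = "120s"
}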

It's unsatisfactory, but I don't think there is a fix for this on the terraform end.

@ejschoen
Author

I agree that retrying or lengthening the timeout would be at best an empirical approach to solving the problem.

However, in my experience the timeout always seems to happen around the time the cluster is changing state: becoming healthy after startup, or finishing deleting the default node pool. At least when I immediately look at the cluster through the web console, it's in the state Terraform would want it to be in before moving to the next step in the plan. So I wonder if the state transition is delaying the response from the API, and whether a longer timeout might actually get past the problem.

Alternatively (and this is a really off-the-wall thought): I did a Google search for complaints about the GKE API timing out and didn't really get any hits, other than issue #1135 against the Terraform Google provider from February 2018. So it seems unlikely that the API is really timing out; at least nobody else has reported it that I can see. (And since the reporter of that issue raised the possibility, I don't think it's a connectivity issue either: I've got gigabit fiber and a reliable connection to the Internet backbone.) Is there any possibility that the HTTP GET is failing at such a low level that the connection is simply closing without a response, and that the error is being swallowed by lower-level software and never rippling up to the Terraform provider? I'm not a Go programmer and can't really follow the networking and retry logic in the provider.

@ghost ghost removed the waiting-response label Jul 15, 2019
@kschaab

kschaab commented Jul 16, 2019

I'm digging into this on the GKE API side. I notice that the cluster was deleted sometime after your gist log ends. Was the cluster manually deleted after the failure? For all intents and purposes, I see calls to GetOperation that align with your logs, but the corresponding GetCluster and DeleteCluster do not align with the window in the gist.

@ejschoen
Author

Correct. I deleted the cluster manually, since the Terraform Google provider can't figure out what to do if I try "terraform apply" again.

@slevenick
Collaborator

I'm not sure what we can do from here. If this reappears, let us know, but at the moment my guess is that there is some packet loss somewhere. It's really hard to diagnose if it doesn't happen on demand.

@ejschoen
Author

OK, and I certainly understand the difficulty in diagnosing an essentially non-repeatable issue in a distributed system. But just so there's no misunderstanding, this has been an ongoing issue for months across multiple versions of the provider, affecting around 30% of the deployments we attempt. For packet loss to be the issue, there would have to be a 30-second outage at the IP level or below that could not be corrected by TCP retransmission. It's possible, but highly unlikely unless there is a significant issue in communication between Google's us-central1 and my (AT&T) provider. We never experience such outages with any other communication between our systems and Google.

@kschaab

kschaab commented Jul 23, 2019

Please send other log data. In the case of your gist, we see the failing call to get the operation succeed and return through the network layer. The subsequent calls to get the cluster and delete the cluster don't hit our systems at all. It would also be helpful to diagnose connectivity on the affected system once the behavior presents itself. I was not aware that the issue was so prevalent, but that should make it somewhat easier to track down.

@ejschoen
Author

ejschoen commented Jul 23, 2019

Will do, the next time we try. As for other log data, the only other source I can think of is Stackdriver logs. I've never looked into those logs for deployment activity; do you have any suggestions for how best to filter them? I suppose I could also look at historical records corresponding to the gist I sent.

@kschaab

kschaab commented Jul 26, 2019

Please just send more logs like your gist. The timing and other data might be helpful. I have plenty of logs on the GKE side; what we need is more data about what's happening on your side.

@ejschoen
Author

ejschoen commented Aug 4, 2019

Sorry for the delay; was out of the country visiting a customer. Just tried to spin up a GKE cluster and it failed the first time. Gist is here:

https://gist.github.com/ejschoen/423bc4083ab91df6f6e5454dd306ba34

@kschaab

kschaab commented Aug 6, 2019

Assuming the timestamps are reasonably in sync (and your times are in Central), our server received the request that timed out at 15:18:43.974850 PDT and sent a response at 15:18:43.984713 PDT. None of the subsequent calls were received on our side. This matches what I saw in the previous gist. Were you able to run any sort of diagnostics at the time of the failure?

@ejschoen
Author

ejschoen commented Aug 6, 2019

I was on the web dashboard at the time, and had no problem refreshing the display. I'll watch again the next time I spin up a cluster. I did upgrade to a newer Terraform, and was able to spin up the cluster without error, but I don't think the upgrade imported a newer google provider.

@kschaab

kschaab commented Aug 7, 2019

Sorry about this. Would you please try setting GODEBUG=http2debug=2 so we get debug output from the HTTP library in the trace?

@ejschoen
Author

ejschoen commented Aug 9, 2019

Caught another cluster provisioning failure under Terraform 0.12. New gist available here: https://gist.github.com/ejschoen/39af1f955b897486b89cbb0adf22e498
Generated with TF_LOG=trace GODEBUG=http2debug=2

@anth0d

anth0d commented Aug 30, 2019

This happens to me constantly when provisioning GKEs and tbh is rapidly putting me off the idea of using Terraform for Google resources.

I find it's repeatable with a google_container_cluster having the attribute remove_default_node_pool = true. The issue seems to happen while attempting to delete the default node pool.

google_container_cluster.primary: Still creating... [4m20s elapsed]

Error: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/.../locations/.../operations/operation-...?alt=json&prettyPrint=false: read tcp [2605:a601:aad8:7800:649d:9d2b:5b0c:a0ab]:61414->[2607:f8b0:4000:80e::200a]:443: read: no route to host

  on kube.tf line 2, in resource "google_container_cluster" "primary":
   2: resource "google_container_cluster" "primary" {

Please let me know if there's something else I can provide. It's a huge waste of time but I can re-run it on a new project.
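A minimal configuration matching the description above might look like the following sketch; the resource name and zone come from the error output, while the remaining values are placeholders rather than the actual configuration:

# Minimal sketch of the reported repro case; values other than
# remove_default_node_pool are placeholders.
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1-f"
  initial_node_count = 1

  # Deleting the default node pool is where the timeout is observed.
  remove_default_node_pool = true
}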

@ejschoen
Author

It's possible that the latest Terraform may have solved the problem. Since switching to ~0.12.6, cluster provisioning has been reliable. I haven't heard back as to whether the latest gist I posted (which was pre-0.12.6) revealed anything.

@anth0d

anth0d commented Aug 30, 2019

I am on 0.12.7 and experience this problem consistently.

$ terraform -version
Terraform v0.12.7
+ provider.google v2.14.0
+ provider.google-beta v2.14.0

Just happened again, and the output seems odd... What is happening during the 30 seconds in the middle here where there is no debug output?

2019-08-30T11:54:41.323-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: 2019/08/30 11:54:41 [TRACE] Waiting 10s before next try
google_container_cluster.primary: Still creating... [4m20s elapsed]
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: 2019/08/30 11:54:51 [DEBUG] Waiting for state to become: [success]
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: 2019/08/30 11:54:51 [DEBUG] Google API Request Details:
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: ---[ REQUEST ]---------------------------------------
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: GET /v1beta1/projects/project-id/locations/us-central1-f/operations/operation-1567183996343-3dc3efa5?alt=json&prettyPrint=false HTTP/1.1
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: Host: container.googleapis.com
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: User-Agent: google-api-go-client/0.5 Terraform/0.12.6 (+https://www.terraform.io) terraform-provider-google-beta/2.14.0
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: Accept-Encoding: gzip
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4:
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4:
2019-08-30T11:54:51.327-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: -----------------------------------------------------
google_container_cluster.primary: Still creating... [4m30s elapsed]
google_container_cluster.primary: Still creating... [4m40s elapsed]
google_container_cluster.primary: Still creating... [4m50s elapsed]
2019-08-30T11:55:21.028-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: 2019/08/30 11:55:21 [DEBUG] Unlocking "google-container-cluster/project-id/us-central1-f/primary"
2019-08-30T11:55:21.028-0500 [DEBUG] plugin.terraform-provider-google-beta_v2.14.0_x4: 2019/08/30 11:55:21 [DEBUG] Unlocked "google-container-cluster/project-id/us-central1-f/primary"
2019/08/30 11:55:21 [DEBUG] google_container_cluster.primary: apply errored, but we're indicating that via the Error pointer rather than returning it: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/project-id/locations/us-central1-f/operations/operation-1567183996343-3dc3efa5?alt=json&prettyPrint=false: read tcp [2605:a601:aad8:7800:fd0e:e101:c511:f7ee]:61717->[2607:f8b0:4000:80e::200a]:443: read: no route to host
2019/08/30 11:55:21 [TRACE] <root>: eval: *terraform.EvalMaybeTainted
2019/08/30 11:55:21 [TRACE] EvalMaybeTainted: google_container_cluster.primary encountered an error during creation, so it is now marked as tainted
2019/08/30 11:55:21 [TRACE] <root>: eval: *terraform.EvalWriteState
2019/08/30 11:55:21 [TRACE] EvalWriteState: writing current state object for google_container_cluster.primary
2019/08/30 11:55:21 [ERROR] <root>: eval: *terraform.EvalApplyPost, err: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/project-id/locations/us-central1-f/operations/operation-1567183996343-3dc3efa5?alt=json&prettyPrint=false: read tcp [2605:a601:aad8:7800:fd0e:e101:c511:f7ee]:61717->[2607:f8b0:4000:80e::200a]:443: read: no route to host
2019/08/30 11:55:21 [ERROR] <root>: eval: *terraform.EvalSequence, err: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/project-id/locations/us-central1-f/operations/operation-1567183996343-3dc3efa5?alt=json&prettyPrint=false: read tcp [2605:a601:aad8:7800:fd0e:e101:c511:f7ee]:61717->[2607:f8b0:4000:80e::200a]:443: read: no route to host
2019/08/30 11:55:21 [TRACE] [walkApply] Exiting eval tree: google_container_cluster.primary
2019/08/30 11:55:21 [TRACE] vertex "google_container_cluster.primary": visit complete

Error: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/project-id/locations/us-central1-f/operations/operation-1567183996343-3dc3efa5?alt=json&prettyPrint=false: read tcp [2605:a601:aad8:7800:fd0e:e101:c511:f7ee]:61717->[2607:f8b0:4000:80e::200a]:443: read: no route to host

  on kube.tf line 2, in resource "google_container_cluster" "primary":
   2: resource "google_container_cluster" "primary" {


@rileykarson
Collaborator

@anth0d: During that window, the provider is awaiting a response from the GKE REST API, but doesn't receive any. Most individual HTTP calls to the GCP REST API have a 30s timeout in the provider.

Do you have the debug logs for the DELETE call immediately prior? I don't expect to see anything out of place; this is more out of curiosity.

I don't believe I've ever seen this error locally, or in our CI. Have you encountered the issue in multiple network environments? It's possible that a firewall between you and the GCP API is consistently filtering this traffic.

@slevenick
Collaborator

I believe the "no route to host" error:
2019/08/30 11:55:21 [DEBUG] google_container_cluster.primary: apply errored, but we're indicating that via the Error pointer rather than returning it: Error while waiting to delete default node pool: Error waiting for removing default node pool: error while retrieving operation: Get https://container.googleapis.com/v1beta1/projects/project-id/locations/us-central1-f/operations/operation-1567183996343-3dc3efa5?alt=json&prettyPrint=false: read tcp [2605:a601:aad8:7800:fd0e:e101:c511:f7ee]:61717->[2607:f8b0:4000:80e::200a]:443: read: no route to host

is caused by a network connectivity problem or firewalls as @rileykarson suggested.

Going back to the latest gist from @ejschoen, it looks like the PING frames stop coming through at around the same time the last response from the API is received.

2019-08-09T10:35:20.981-0500 [DEBUG] plugin.terraform-provider-google_v2.12.0_x4: 2019/08/09 10:35:20 http2: Framer 0xc0008fe1c0: read PING len=8 ping="\x00\x00\x00\x00\x00\x00\x008"
2019-08-09T10:35:20.981-0500 [DEBUG] plugin.terraform-provider-google_v2.12.0_x4: 2019/08/09 10:35:20 http2: Transport received PING len=8 ping="\x00\x00\x00\x00\x00\x00\x008"
2019-08-09T10:35:20.981-0500 [DEBUG] plugin.terraform-provider-google_v2.12.0_x4: 2019/08/09 10:35:20 http2: Framer 0xc0008fe1c0: wrote PING flags=ACK len=8 ping="\x00\x00\x00\x00\x00\x00\x008"

It doesn't look like terraform receives anything after this and the last operation response until it cancels the request at 10:36:00.

If this has been solved by updating to Terraform 0.12.6, that would be great, as I don't have any more insight from the provider side. If this comes up again, please let us know.

@ejschoen
Author

I'll keep watching, but I think @anth0d described exactly the situation that I had been seeing, in which the polling status request times out at precisely the time that the cluster transitions to being healthy.

@ejschoen
Author

Well, the failures are happening with some repeatability again today (2 out of 3 attempts failed). This was with Terraform 0.12.7 and Google provider 2.14.0.

https://gist.github.com/ejschoen/63af4a2f13937cad590a21738a6f10fb

Once again, the failure occurs around the time the initial node pool transitions into the RUNNING state.

I do see a couple of interesting lines in the log. In one line (line 23905), the connection is closed because it has been idle. On the next line, there is a transport readFrame error ("use of closed network connection"). After that, no HTTP request succeeds. Is it possible that there is a race condition between the connection pool and its users?

There are no "no route to host" errors.

I also note that if I manually make a request with curl to the endpoint that timed out, using the Bearer authorization token I see in the log, I get this:

{
  "name": "operation-1568220617167-be7ddb9f",
  "zone": "us-central1-f",
  "operationType": "CREATE_CLUSTER",
  "status": "DONE",
  "selfLink": "https://container.googleapis.com/v1beta1/projects/266650211792/zones/us-central1-f/operations/operation-1568220617167-be7ddb9f",
  "targetLink": "https://container.googleapis.com/v1beta1/projects/266650211792/zones/us-central1-f/clusters/test-gke",
  "startTime": "2019-09-11T16:50:17.167734574Z",
  "endTime": "2019-09-11T16:53:01.701970347Z",
  "progress": {
    "metrics": [
      {
        "name": "CLUSTER_HEALTHCHECKING",
        "intValue": "1"
      },
      {
        "name": "CLUSTER_HEALTHCHECKING_TOTAL",
        "intValue": "2"
      }
    ]
  }
}

wherein the status is now DONE, but in the last successful request to that endpoint, the status was RUNNING. I guess this is because healthchecking has finished.

@slevenick
Collaborator

Interesting, regarding the connection being closed due to idleness. I see the same pattern in the previous log (https://gist.github.com/ejschoen/39af1f955b897486b89cbb0adf22e498, line 42122), with the readFrame error as well. But in that case, responses do come through afterwards.

In this most recent log, it looks like three separate requests timed out and were closed. These occurred between 2019-09-11T11:51:52.016-0500 and 2019-09-11T11:52:52.017-0500, a full minute apart. They are all made through the container cluster v1beta1 API client: https://godoc.org/google.golang.org/api/container/v1beta1

These calls were to three different endpoints: retrieving an operation, retrieving the cluster and attempting to delete the cluster.

I don't believe the Terraform provider code could be responsible for this, as it uses the API client for all of these calls. I can investigate adding a retry to the container operation in the case of a timeout, but if calls to three separate endpoints fail more than a minute apart, a retry may not fix it.

@ejschoen
Author

ejschoen commented Oct 7, 2019

Repeatable failures this morning attempting to create a cluster. Batting 0 for 4. Two gists attached here:

Terraform 0.12.7/Google Provider 2.15.0: https://gist.github.com/ejschoen/d31e2b01c287d10d5c6a9a2915010de6

Terraform 0.12.9/Google Provider 2.15.0:
https://gist.github.com/ejschoen/7d4dcc0a0135ed03eddd23960c60e089

Also failed twice with Google Provider 2.14.0 before I tried the newer provider and then the newer Terraform.

@ejschoen
Author

ejschoen commented Dec 5, 2019

This may be unrelated; I'm not sure. I've switched to using Google's Cloud Shell to run Terraform, and I reliably see failures of a different kind: at the point the original node pool finishes initializing, the master node becomes unavailable for a little while, 30 seconds to a minute. I wonder if this is related to the communication failure/timeout we see when we spin up a cluster via Terraform from outside GCP: we see a timeout from outside, but an error from inside.

module.kubernetes-engine.google_container_node_pool.cluster_nodes: Still creating... [1m0s elapsed]
module.helm.helm_release.chart: Still creating... [1m0s elapsed]
module.kubernetes-engine.google_container_node_pool.cluster_nodes: Still creating... [1m10s elapsed]
module.helm.helm_release.chart: Still creating... [1m10s elapsed]
module.kubernetes-engine.google_container_node_pool.cluster_nodes: Creation complete after 1m17s [id=us-central1-f/test-gke/test-gke-node-pool]
module.helm.helm_release.chart: Still creating... [1m20s elapsed]
module.helm.helm_release.chart: Still creating... [1m30s elapsed]

Error: rpc error: code = Unknown desc = Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

  on ../helm/main.tf line 29, in resource "helm_release" "chart":
  29: resource "helm_release" "chart" {

@sagikazarmark

I had the same issue repeatedly with the 2.5.1 provider. After upgrading to 2.20, Terraform created the cluster successfully.

@slevenick
Collaborator

Glad to hear that 2.20 might have cleared it up. Are you still seeing this issue, @ejschoen?

@ejschoen
Author

ejschoen commented Jan 30, 2020

It looks to me as if the timeout errors have gone away. I haven't seen them recently, using Terraform 0.12.20 and Google provider 2.15.0, either from Google Cloud Shell or from our local infrastructure spinning up a cluster in GKE.

I do pretty often see the error that I noted on December 4, in which Terraform fails due to transient issues on the master node (maybe?), at the point that the non-default initial node pool has been created:

module.kubernetes-engine.google_container_node_pool.cluster_nodes: Creation complete after 1m36s [id=us-central1-f/test-gke/test-gke-node-pool]

Error: rpc error: code = Unknown desc = Could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

  on ../helm/main.tf line 47, in resource "helm_release" "chart":
  47: resource "helm_release" "chart" {

I see these both in cloud shell and from our local infrastructure. From the error message, it seems clear that it's not a timeout issue.
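For context, the helm provider in a setup like this is typically pointed at the GKE master using the cluster's endpoint and credentials; the thread's actual ../helm/main.tf is not shown, so the wiring below is an assumption. When the master briefly stops serving requests after a node pool change, any provider configured this way fails its API discovery calls, which matches the error above:

# Assumed wiring between the helm provider and the GKE cluster; not the
# reporter's actual ../helm/main.tf.
data "google_client_config" "default" {}

provider "helm" {
  kubernetes {
    host                   = "https://${google_container_cluster.cluster.endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(google_container_cluster.cluster.master_auth[0].cluster_ca_certificate)
  }
}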

@slevenick
Collaborator

Well, at least one of the errors is taken care of!

I'm not too familiar with Kubernetes, but that error message seems to be coming from helm rather than Terraform.

I'm going to close this issue out now, as the initial bug report is no longer reproducible. Feel free to file another issue if you believe the "Could not get apiVersions" error is due to a bug in the Terraform provider.

@ghost

ghost commented Mar 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 28, 2020
@github-actions github-actions bot added the service/container and forward/review labels Jan 15, 2025