
[gce]: DeleteInstances 409 case #5192

Closed
wants to merge 2 commits

Conversation

Freyert
Contributor

@Freyert Freyert commented Sep 15, 2022

Which component this PR applies to?

cluster-autoscaler gce provider.

What type of PR is this?

/kind bug

What this PR does / why we need it:

If, for some reason, the cluster autoscaler re-enters the DeleteInstances function after an operation to delete nodes has already started, the autoscaler will not correctly wait on that operation.

Instead, the autoscaler loops over this error (from what I've seen), leaving the cluster in a state where it can't be modified until the delete operation finishes.

Which issue(s) this PR fixes:

Special notes for your reviewer:

If this PR is merged, it would close #5213.

I expect #5213 is likely the more correct PR, so I don't imagine this one will be merged.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 15, 2022
@k8s-ci-robot
Contributor

Welcome @Freyert!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 15, 2022
continue on 409s as the operation already exists
@Freyert
Contributor Author

Freyert commented Sep 15, 2022

@x13n we have a P1 GCP support ticket related to this issue. I've asked support to forward the ticket to your team. Hopefully it gets to you, but I'll also try connecting with you on Slack.

@Freyert
Contributor Author

Freyert commented Sep 15, 2022

JSON log: autoscaler delete failure
{
  "insertId": "XXXXXXXXXX",
  "jsonPayload": {
    "reportingComponent": "",
    "reason": "DeleteUnregisteredFailed",
    "kind": "Event",
    "type": "Warning",
    "eventTime": null,
    "apiVersion": "v1",
    "involvedObject": {
      "resourceVersion": "2173434705",
      "uid": "XXXXXXXXXXXX",
      "name": "cluster-autoscaler-status",
      "kind": "ConfigMap",
      "apiVersion": "v1",
      "namespace": "kube-system"
    },
    "reportingInstance": "",
    "source": {
      "component": "cluster-autoscaler"
    },
    "metadata": {
      "resourceVersion": "113800445",
      "managedFields": [
        {
          "time": "XXXXXXXX",
          "operation": "Update",
          "fieldsV1": {
            "f:count": {},
            "f:lastTimestamp": {},
            "f:source": {
              "f:component": {}
            },
            "f:message": {},
            "f:involvedObject": {},
            "f:type": {},
            "f:firstTimestamp": {},
            "f:reason": {}
          },
          "fieldsType": "FieldsV1",
          "apiVersion": "v1",
          "manager": "cluster-autoscaler"
        }
      ],
      "creationTimestamp": "XXXXXXXXXXX",
      "namespace": "kube-system",
      "name": "cluster-autoscaler-status.XXXXXXXXXX",
      "uid": "XXXXXXXXXXXX"
    },
    "message": "Failed to remove node gce://ctp-production-us/us-central1-f/gke-production-XXXXXXXXXX: error while getting operation operation-XXXXX-YYYYYYY-ZZZZZZ-000000 on https://www.googleapis.com/compute/v1/projects/XXXXXX/zones/XXXXXXX/instanceGroupManagers/gke-production-XXXXXXXX: <nil>"
  },
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "cluster_name": "XXXXXXX",
      "project_id": "XXXXXXX",
      "location": "XXXXXXXX"
    }
  },
  "timestamp": "XXXXXXX",
  "severity": "WARNING",
  "logName": "projects/XXXXXX/logs/events",
  "receiveTimestamp": "XXXXXXXXX"
}
Corresponding API Error
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 3,
      "message": "INVALID_USAGE",
      "details": [
        {
          "@type": "type.googleapis.com/google.protobuf.Struct",
          "value": {
            "invalidUsage": {
              "userVisibleReason": "Cannot flag instance https://www.googleapis.com/compute/v1/projects/$PROJECT/zones/$ZONE/instances/$NODE_POOL to be deleted. Instance is already being deleted.",
              "resource": {
                "resourceType": "INSTANCE",
                "resourceName": "$NODE_POOL",
                "project": {
                  "canonicalProjectId": "$PROJECT_ID"
                },
                "scope": {
                  "scopeType": "ZONE",
                  "scopeName": "$ZONE"
                }
              }
            }
          }
        }
      ]
    },
    "authenticationInfo": {
      "principalEmail": "$PRINCIPAL_EMAIL",
      "principalSubject": "XXXXXX:$PRINCIPAL_EMAIL"
    },
    "requestMetadata": {
      "callerIp": "$CLIENT_IP",
      "callerSuppliedUserAgent": "google-api-go-client/0.5 cluster-autoscaler,gzip(gfe)",
      "requestAttributes": {},
      "destinationAttributes": {}
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instanceGroupManagers.deleteInstances",
    "resourceName": "projects/$PROJECT/zones/$ZONE/instanceGroupManagers/$INSTANCE_GROUP_MANAGERS",
    "request": {
      "@type": "type.googleapis.com/compute.instanceGroupManagers.deleteInstances"
    }
  },
  "insertId": "$INSERT_ID",
  "resource": {
    "type": "gce_instance_group_manager",
    "labels": {
      "project_id": "$PROJECT",
      "instance_group_manager_name": "$INSTANCE_GROUP_MANAGERS",
      "location": "$ZONE",
      "instance_group_manager_id": "$INSTANCE_GROUP_MANAGER_ID"
    }
  },
  "timestamp": "$TIMESTAMP.029142Z",
  "severity": "ERROR",
  "logName": "projects/$PROJECT/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "$OPERATION_ID",
    "producer": "compute.googleapis.com",
    "last": true
  },
  "receiveTimestamp": "$TIMESTAMP.651600414Z"
}

@@ -259,7 +259,8 @@ func (client *autoscalingGceClientV1) DeleteInstances(migRef GceRef, instances [
req.Instances = append(req.Instances, GenerateInstanceUrl(i))
}
op, err := client.gceService.InstanceGroupManagers.DeleteInstances(migRef.Project, migRef.Zone, migRef.Name, &req).Do()
if err != nil {
wasConflictErr := op != nil && op.HttpErrorStatusCode == http.StatusConflict

I'd create a new function to evaluate the HTTP status code for cleaner code, e.g.

func isDeleteInstanceConflict(errorCode int64) bool {
  return errorCode == http.StatusConflict
}

if !isDeleteInstanceConflict(op.HttpErrorStatusCode) && err != nil {
....

@@ -259,7 +259,8 @@ func (client *autoscalingGceClientV1) DeleteInstances(migRef GceRef, instances [
req.Instances = append(req.Instances, GenerateInstanceUrl(i))
}
op, err := client.gceService.InstanceGroupManagers.DeleteInstances(migRef.Project, migRef.Zone, migRef.Name, &req).Do()
if err != nil {
wasConflictErr := op != nil && op.HttpErrorStatusCode == http.StatusConflict
if !wasConflictErr && err != nil {
return err
Member

Can you move wasConflictErr var inside the if err != nil condition & log a warning whenever this happens? CA generally shouldn't attempt to delete the same VM twice so it would be good to at least leave a trace that this happened.

Contributor Author

@@ -259,7 +259,8 @@ func (client *autoscalingGceClientV1) DeleteInstances(migRef GceRef, instances [
req.Instances = append(req.Instances, GenerateInstanceUrl(i))
}
op, err := client.gceService.InstanceGroupManagers.DeleteInstances(migRef.Project, migRef.Zone, migRef.Name, &req).Do()
Member

What if there was a conflict deleting one instance, but not the others?

Contributor Author

@Freyert Freyert Sep 26, 2022

Hmmm . . . in my mind a "delete" request is idempotent, so if there is currently a delete in progress then there is no reason to consider it an error.

So delete conflicts now become successes.

If you have N successes and N delete conflicts, you now have 2N successes.

If one of those N requests had an error besides a conflict error, you still return that.

What do you think? Is there something more to be concerned about here?

Contributor Author

I guess at this point it's an issue with the API itself?

If the API only returns a single error, that could be very problematic.

Maybe you were thinking: what if there was a 500 error, but we wait for the op to finish because we think there was only a 409?

Contributor Author

@Freyert Freyert Sep 26, 2022

https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/deleteInstances

skipInstancesOnValidationError
Specifies whether the request should proceed despite the inclusion of instances that are not members of the group or that are already in the process of being deleted or abandoned. If this field is set to false and such an instance is specified in the request, the operation fails. The operation always fails if the request contains a malformed instance URL or a reference to an instance that exists in a zone or region other than the group's zone or region.
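As a rough sketch of how that parameter would be set (the struct below is a local stand-in for compute.InstanceGroupManagersDeleteInstancesRequest from google.golang.org/api/compute/v1, so the example compiles on its own):

```go
package main

import "fmt"

// deleteInstancesRequest is a stand-in for the real
// compute.InstanceGroupManagersDeleteInstancesRequest type.
type deleteInstancesRequest struct {
	Instances                      []string
	SkipInstancesOnValidationError bool
}

func main() {
	// With SkipInstancesOnValidationError set, the API should proceed even
	// when some listed instances are already in the process of being
	// deleted, instead of failing the whole request with a 409.
	req := deleteInstancesRequest{
		Instances:                      []string{"https://www.googleapis.com/compute/v1/projects/p/zones/z/instances/node-1"},
		SkipInstancesOnValidationError: true,
	}
	fmt.Println(req.SkipInstancesOnValidationError)
}
```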

Contributor Author

@x13n I've made an alternative PR #5213 which just sets the skipInstancesOnValidationError parameter. I think that would correct your concern here.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Freyert
Once this PR has been reviewed and has the lgtm label, please assign towca for approval by writing /assign @towca in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.