MCM does not reset `.status.failedMachines` of MachineDeployment #456

rfranzke · 2020-04-29T07:46:06Z

What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of the Machine changes, e.g., from Failed to Processing once the credentials have been fixed. The MachineDeployment status is shown first, followed by the Machine:

  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2020-04-29T06:51:38Z"
      lastUpdateTime: "2020-04-29T06:51:38Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    failedMachines:
    - lastOperation:
        description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
          AuthFailure: AWS was not able to validate the provided access credentials
          status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
        lastUpdateTime: "2020-04-29T06:53:33Z"
        state: Failed
        type: Delete
      name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
      ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
    observedGeneration: 2

  spec:
    class:
      kind: AWSMachineClass
      name: shoot--foo--bar-cpu-worker-z1-ff76e
    nodeTemplate:
      metadata:
        creationTimestamp: null
        labels:
          node.kubernetes.io/role: node
          worker.garden.sapcloud.io/group: cpu-worker
          worker.gardener.cloud/pool: cpu-worker
      spec: {}
    providerID: aws:///eu-west-1/i-05f4737c3ef646f89
  status:
    currentStatus:
      lastUpdateTime: "2020-04-29T07:41:44Z"
      phase: Pending
      timeoutActive: true
    lastOperation:
      description: Creating machine on cloud provider
      lastUpdateTime: "2020-04-29T07:41:44Z"
      state: Processing
      type: Create
    node: ip-10-250-9-55.eu-west-1.compute.internal

(compare the timestamps)

What you expected to happen:
The .status.failedMachines field is updated whenever the .status.lastOperation of a Machine object changes.


hardikdr · 2020-08-14T03:50:45Z

Dupe of #476.
Closing in favor of the other.

hardikdr · 2020-08-14T03:50:53Z

/close

ggaurav10 · 2020-08-14T05:13:52Z

#476 is about the failed machine metric counter, whereas this issue is about resetting the status field. These appear to be different issues, and hence are not duplicates.
Reopening this.

ggaurav10 · 2020-08-14T05:46:59Z

@rfranzke Currently, the .status.failedMachines field in the machine deployment is reset only when all the machines controlled by the machine deployment are healthy (running and have joined the cluster). Until then, it contains the latest set of failed machine operations.
Refer:

machine-controller-manager/pkg/controller/deployment_machineset_util.go

Lines 128 to 135 in e21b931

    
    // Update the FailedMachines field only if we see new failures
    // Clear FailedMachines if ready replicas equals total replicas,
    // which means the machineset doesn't have any machine objects which are in any failed state
    if len(failedMachines) > 0 {
        newStatus.FailedMachines = &failedMachines
    } else if int32(readyReplicasCount) == is.Status.Replicas {
        newStatus.FailedMachines = nil
    }

Otherwise, in some failure scenarios it can be difficult to catch the failed operations and their reasons, because the create and delete operations happen in very quick succession. As a quick experiment, return a random failure here:

machine-controller-manager/pkg/controller/machine_bootstrap_token.go

Line 56 in e21b931

return nil

In the observation reported in the issue, once the machine joins and the machine deployment becomes healthy (all machines join the cluster), the failedMachines field would have been cleared.

Please let me know if the current behaviour is causing any issues. We can take corrective measures accordingly.
Thanks.
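
The reset rule described in the comment above can be replayed in isolation. The following is a minimal sketch, not MCM code: `MachineSummary` and `updateFailedMachines` are hypothetical, simplified stand-ins, and the sketch assumes that the new status starts out as a copy of the previous one, so the old list is kept when neither branch of the quoted snippet applies.

```go
package main

import "fmt"

// MachineSummary is a hypothetical, simplified stand-in for an entry
// in .status.failedMachines.
type MachineSummary struct {
	Name  string
	State string
}

// updateFailedMachines follows the branch structure of the snippet quoted above:
// new failures replace the list, full readiness clears it, and in every other
// case the previous list is carried over unchanged (assumption: the new status
// is initialised from the previous status).
func updateFailedMachines(prev, newFailures []MachineSummary, readyReplicas, replicas int32) []MachineSummary {
	if len(newFailures) > 0 {
		return newFailures
	}
	if readyReplicas == replicas {
		return nil
	}
	return prev
}

func main() {
	prev := []MachineSummary{{Name: "worker-pxzp5", State: "Failed"}}

	// The machine has moved on to Processing: no new failures are seen,
	// but only 2 of 3 replicas are ready, so the stale entry stays.
	got := updateFailedMachines(prev, nil, 2, 3)
	fmt.Println(len(got)) // prints 1

	// Once all replicas are ready, the list is cleared.
	got = updateFailedMachines(got, nil, 3, 3)
	fmt.Println(len(got)) // prints 0
}
```

Under these assumptions, the entry reported in this issue stays visible for as long as at least one replica is not ready, even though no machine is currently in the Failed state.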

rfranzke · 2020-08-14T05:58:49Z

I still think this should be improved, i.e., the status should reflect the actual state of the world (which it currently does not: the machine is in the process of joining the cluster and is no longer failed). It's not so critical that we need it tomorrow, but it would improve user and ops experience/productivity to not show stale failure information in the status.
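
One way to read the request above is that the failed list should be derived from the current lastOperation of each Machine on every reconciliation, instead of being carried over. The following is only a sketch of that idea under simplified, hypothetical types (`Machine`, `LastOperation`, `currentFailedMachines`); it is not the MCM implementation and not a proposed patch.

```go
package main

import "fmt"

// LastOperation is a hypothetical, reduced version of a Machine's
// .status.lastOperation.
type LastOperation struct {
	State string
	Type  string
}

// Machine is a hypothetical, reduced Machine object.
type Machine struct {
	Name          string
	LastOperation LastOperation
}

// currentFailedMachines returns only the machines whose last operation is
// in the Failed state right now. A machine that has moved on to Processing
// drops out of the result automatically.
func currentFailedMachines(machines []Machine) []Machine {
	var failed []Machine
	for _, m := range machines {
		if m.LastOperation.State == "Failed" {
			failed = append(failed, m)
		}
	}
	return failed
}

func main() {
	machines := []Machine{
		{Name: "worker-a", LastOperation: LastOperation{State: "Failed", Type: "Delete"}},
		{Name: "worker-b", LastOperation: LastOperation{State: "Processing", Type: "Create"}},
	}
	for _, m := range currentFailedMachines(machines) {
		fmt.Println(m.Name) // prints worker-a
	}
}
```

The trade-off raised earlier in the thread still applies: with such a rule, a failure that is quickly followed by a delete and re-create would disappear from the status almost immediately, which makes short-lived failures harder to observe.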

rfranzke added the kind/bug Bug label Apr 29, 2020

ggaurav10 mentioned this issue Apr 29, 2020

MCM does not report last operation for machines if credentials are invalid #455

Closed

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 29, 2020

gardener-robot closed this as completed Aug 14, 2020

ggaurav10 reopened this Aug 14, 2020

prashanth26 added area/usability Usability related and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 16, 2020

timuthy mentioned this issue Aug 20, 2020

Report permission issue in MachineDeployment #501

Closed

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020

gardener-robot closed this as completed Nov 6, 2020

hardikdr reopened this Nov 6, 2020

prashanth26 added the priority/4 Priority (lower number equals higher priority) label Jul 21, 2021

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 18, 2022

elankath mentioned this issue Feb 21, 2023

Improve Monitoring/Alerting/Metrics #211

Open

7 tasks

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 25, 2023

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 3, 2024
