MCM does not reset .status.failedMachines of MachineDeployment #456

Open
rfranzke opened this issue Apr 29, 2020 · 5 comments
Labels
area/usability Usability related kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/3 Priority (lower number equals higher priority) size/s Size of pull request is small (see gardener-robot robot/bots/size.py)

Comments

@rfranzke
Member

What happened:
MCM does not update the .status.failedMachines field of the MachineDeployment after the .status.lastOperation of the Machine changes, e.g., from Failed to Processing once the credentials have been fixed:

  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2020-04-29T06:51:38Z"
      lastUpdateTime: "2020-04-29T06:51:38Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    failedMachines:
    - lastOperation:
        description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
          AuthFailure: AWS was not able to validate the provided access credentials
          status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
        lastUpdateTime: "2020-04-29T06:53:33Z"
        state: Failed
        type: Delete
      name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
      ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
    observedGeneration: 2
  spec:
    class:
      kind: AWSMachineClass
      name: shoot--foo--bar-cpu-worker-z1-ff76e
    nodeTemplate:
      metadata:
        creationTimestamp: null
        labels:
          node.kubernetes.io/role: node
          worker.garden.sapcloud.io/group: cpu-worker
          worker.gardener.cloud/pool: cpu-worker
      spec: {}
    providerID: aws:///eu-west-1/i-05f4737c3ef646f89
  status:
    currentStatus:
      lastUpdateTime: "2020-04-29T07:41:44Z"
      phase: Pending
      timeoutActive: true
    lastOperation:
      description: Creating machine on cloud provider
      lastUpdateTime: "2020-04-29T07:41:44Z"
      state: Processing
      type: Create
    node: ip-10-250-9-55.eu-west-1.compute.internal

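(Compare the lastUpdateTime timestamps: the MachineDeployment still lists the Failed operation from 06:53:33Z, while the Machine's own lastOperation has been Processing since 07:41:44Z.)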

What you expected to happen:
The .status.failedMachines field is updated whenever the .status.lastOperation of a Machine object changes.

@rfranzke rfranzke added the kind/bug Bug label Apr 29, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 29, 2020
@hardikdr
Member

Dupe of #476 .
Closing in favor of the other.

@hardikdr
Member

/close

@ggaurav10
Contributor

#476 is about the failed machine metric counter, whereas this issue is about resetting the status field. These appear to be different issues and hence are not duplicates.
Reopening this.

@ggaurav10 ggaurav10 reopened this Aug 14, 2020
@ggaurav10
Contributor

@rfranzke currently the .status.failedMachines field in the machine deployment is reset only when all the machines controlled by the machine deployment are healthy (running and have joined the cluster). Until then it contains the latest set of failed machine operations.
Refer:

// Update the FailedMachines field only if we see new failures
// Clear FailedMachines if ready replicas equals total replicas,
// which means the machineset doesn't have any machine objects which are in any failed state
if len(failedMachines) > 0 {
	newStatus.FailedMachines = &failedMachines
} else if int32(readyReplicasCount) == is.Status.Replicas {
	newStatus.FailedMachines = nil
}

Otherwise, in some failure scenarios it can be difficult to catch the failed operations and their reasons, because the create and delete operations follow each other very quickly. As a quick experiment, return a random failure here:

In the observation reported in the issue, once the machine joins and the machine deployment becomes healthy (all machines join the cluster), the failedMachines field would have been cleared.
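
To make the rule quoted above concrete, here is a minimal, self-contained sketch of it. The types `MachineSummary` and `Status` and the function `nextFailedMachines` are simplified stand-ins invented for this illustration, not the actual MCM types, and the sketch assumes that the new status starts out as a copy of the previous one, so an untouched field keeps its old value.

```go
package main

import "fmt"

// MachineSummary is a hypothetical, simplified stand-in for an entry in
// .status.failedMachines (illustration only, not the MCM type).
type MachineSummary struct {
	Name  string
	State string
}

// Status is a hypothetical, simplified stand-in for the machine set status.
type Status struct {
	Replicas       int32
	FailedMachines *[]MachineSummary
}

// nextFailedMachines mirrors the quoted rule: overwrite the field when new
// failures are seen, clear it only when all replicas are ready, and otherwise
// keep the previous value (assumption: the new status is a copy of the old one).
func nextFailedMachines(prev Status, failed []MachineSummary, readyReplicas int32) *[]MachineSummary {
	if len(failed) > 0 {
		return &failed
	}
	if readyReplicas == prev.Replicas {
		return nil
	}
	return prev.FailedMachines
}

func main() {
	old := []MachineSummary{{Name: "machine-a", State: "Failed"}}
	prev := Status{Replicas: 2, FailedMachines: &old}

	// machine-a has moved on to Processing: no failures are observed now,
	// but only 1 of 2 replicas is ready, so the old entry is retained.
	stale := nextFailedMachines(prev, nil, 1)
	fmt.Println("entry retained while not all ready:", stale != nil)

	// Once both replicas are ready, the field is cleared.
	cleared := nextFailedMachines(prev, nil, 2)
	fmt.Println("entry cleared when all ready:", cleared == nil)
}
```

Under these assumptions, the first call reproduces the observation from the issue: the Machine is no longer failed, yet its entry stays in the list until every replica is ready.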

Please let me know if the current behaviour causes any concrete problem. We can take corrective measures accordingly.
Thanks.

@rfranzke
Member Author

I still think this should be improved, i.e., the status should reflect the actual state of the world (which it currently does not: the machine is in the process of joining the cluster and is no longer failed). It's not too critical/important in the sense that we need it tomorrow, but it would improve user and ops experience/productivity to not show false negative status information.
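
One way to read this request is to derive the list from the machines' current lastOperation on every reconciliation, so an entry disappears as soon as its machine leaves the Failed state. The sketch below only illustrates that idea; the `Machine` type and the `currentlyFailed` function are hypothetical names, not part of MCM.

```go
package main

import "fmt"

// Machine is a hypothetical, simplified view of a Machine object that keeps
// only its name and the state of its last operation.
type Machine struct {
	Name        string
	LastOpState string // e.g. "Failed", "Processing", "Successful"
}

// currentlyFailed returns the names of the machines whose last operation is
// in the Failed state right now. Nothing is carried over from earlier
// reconciliations, so recovered machines drop out immediately.
func currentlyFailed(machines []Machine) []string {
	var failed []string
	for _, m := range machines {
		if m.LastOpState == "Failed" {
			failed = append(failed, m.Name)
		}
	}
	return failed
}

func main() {
	machines := []Machine{
		{Name: "machine-a", LastOpState: "Processing"}, // recovered after the credentials fix
		{Name: "machine-b", LastOpState: "Failed"},
	}
	fmt.Println(currentlyFailed(machines))
}
```

The trade-off is the one raised earlier in the thread: with rapid create and delete cycles, a purely current view can lose the failure reason before anyone sees it, so the two approaches favour different needs.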

@prashanth26 prashanth26 added area/usability Usability related and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@hardikdr hardikdr reopened this Nov 6, 2020
@prashanth26 prashanth26 added the priority/4 Priority (lower number equals higher priority) label Jul 21, 2021
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 18, 2022
@himanshu-kun himanshu-kun added priority/2 Priority (lower number equals higher priority) priority/3 Priority (lower number equals higher priority) size/s Size of pull request is small (see gardener-robot robot/bots/size.py) needs/planning Needs (more) planning with other MCM maintainers and removed priority/4 Priority (lower number equals higher priority) priority/2 Priority (lower number equals higher priority) lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 14, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 25, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 3, 2024