MachinePool remains in WaitingForReplicasReady because CAPA does not reconcile node references after instance refresh #4618

Open
AndiDog opened this issue Nov 6, 2023 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@AndiDog
Contributor

AndiDog commented Nov 6, 2023

/kind bug

What steps did you take and what happened:

Related to kubernetes-sigs/cluster-api#8858, #4071

CAPA's AWSMachinePool reconciler unconditionally returns `ctrl.Result{}, r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope)`, i.e. it never schedules a reconciliation at regular intervals that would sync the ASG's EC2 instances into .Status.Instances.

I made a change that causes CAPA to trigger an instance refresh (e.g. a change of AMI ID), which rolls out new EC2 instances. The parent MachinePool object remained in a non-ready state with reason WaitingForReplicasReady, and CAPI continuously logged NodeRefs != ReadyReplicas messages. Only the next reconciliation of my AWSMachinePool object, which happens at an arbitrary time, resolves this by checking which instances exist in the ASG.

What did you expect to happen:

CAPA should reconcile regularly in order to check the ASG for a changed set of instances, particularly when a change is expected because CAPA itself triggered an instance refresh.

Environment:

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 6, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cnmcavoy
Contributor

cnmcavoy commented Nov 9, 2023

In your opinion, is this resolved by the MachinePools Machines implementation in CAPA? If not, what is missing that would need to be added?

#4527

@AndiDog
Contributor Author

AndiDog commented Nov 13, 2023

@cnmcavoy I think your PR is separate from this issue. CAPI reconciles based on AWSMachinePool.Spec.ProviderIDList (unless you tell me that will change once infra providers create <Infra>Machine objects for machine pools?). That field is already correctly updated in awsmachinepool_controller.go, but CAPA does not regularly update it when the ASG (or an explicit instance refresh) creates/rolls instances.
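
To illustrate why a stale AWSMachinePool.Spec.ProviderIDList keeps the pool waiting, here is a minimal, self-contained sketch in Go. It is a simplified model and not CAPI's or CAPA's actual code: the function `matchNodes` and the sample provider IDs are hypothetical, and it assumes that nodes are matched to the pool by comparing each node's provider ID against the list last written by the infrastructure provider.

```go
package main

import "fmt"

// matchNodes counts how many node provider IDs appear in providerIDList.
// It is a simplified stand-in for a node-reference lookup, not the real
// CAPI implementation.
func matchNodes(providerIDList, nodeProviderIDs []string) int {
	known := make(map[string]struct{}, len(providerIDList))
	for _, id := range providerIDList {
		known[id] = struct{}{}
	}
	matched := 0
	for _, id := range nodeProviderIDs {
		if _, ok := known[id]; ok {
			matched++
		}
	}
	return matched
}

func main() {
	// Hypothetical provider IDs recorded before the instance refresh.
	stale := []string{"aws:///eu-central-1a/i-old1", "aws:///eu-central-1b/i-old2"}
	// Hypothetical nodes that joined after the refresh replaced every instance.
	nodes := []string{"aws:///eu-central-1a/i-new1", "aws:///eu-central-1b/i-new2"}
	replicas := 2

	fmt.Printf("stale list: matched %d of %d replicas\n", matchNodes(stale, nodes), replicas)
	// Once a reconcile rewrites the list from the ASG, the nodes match again.
	fmt.Printf("refreshed list: matched %d of %d replicas\n", matchNodes(nodes, nodes), replicas)
}
```

In this model the pool can only become ready once something rewrites the list, which is why the timing of the next AWSMachinePool reconciliation matters.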

@cnmcavoy
Contributor

Correct... I agree that this isn't solved by #4527.

My understanding is that the solution requires a way to detect any change in the status of an ASG's instances and trigger a new reconcile of the AWSMachinePool. One approach would be to implement this on top of the work in #4527 and have each AWSMachine enqueue its AWSMachinePool when its status changes.
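
A rough sketch of that mapping idea in Go, under stated assumptions: it assumes each AWSMachine carries an owner reference of kind AWSMachinePool, which this thread does not confirm. The types `ownerRef`, `object`, and `request` and the function `poolRequestsFor` are hypothetical local stand-ins so the example compiles on its own. In a real controller, a function of this shape would typically be wired up through controller-runtime's `handler.EnqueueRequestsFromMapFunc`.

```go
package main

import "fmt"

// ownerRef and object are minimal stand-ins for Kubernetes object metadata.
type ownerRef struct {
	Kind string
	Name string
}

type object struct {
	Namespace string
	Name      string
	Owners    []ownerRef
}

// request identifies an object to reconcile (stand-in for a reconcile request).
type request struct {
	Namespace string
	Name      string
}

// poolRequestsFor maps a changed machine to reconcile requests for every
// AWSMachinePool that owns it. Machines without such an owner map to nothing.
func poolRequestsFor(machine object) []request {
	var reqs []request
	for _, o := range machine.Owners {
		if o.Kind == "AWSMachinePool" {
			reqs = append(reqs, request{Namespace: machine.Namespace, Name: o.Name})
		}
	}
	return reqs
}

func main() {
	// Hypothetical machine owned by a pool named "workers".
	m := object{
		Namespace: "default",
		Name:      "workers-i-new1",
		Owners:    []ownerRef{{Kind: "AWSMachinePool", Name: "workers"}},
	}
	for _, r := range poolRequestsFor(m) {
		fmt.Printf("enqueue AWSMachinePool %s/%s\n", r.Namespace, r.Name)
	}
}
```

This only covers changes that are already visible on a Kubernetes object; something still has to notice the ASG change and update the machine in the first place.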

Another approach would be to use AWS events and set up the resources needed to receive them. I believe there is a way to have AWS send a notification when the ASG changes.

@AndiDog
Copy link
Contributor Author

AndiDog commented Nov 15, 2023

A bulletproof solution would be to reconcile each AWSMachinePool every 1-5 minutes (configurable?!). This holds whether or not events are used, since events may not arrive reliably if the controller or network is misconfigured (assuming such a feature were implemented).

There's Amazon EventBridge, but it can mainly perform actions in other AWS services, so I'm not sure if it could trigger a call to a controller webhook in order for it to reconcile.
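
If events were delivered to the controller by some path (for example a queue that it polls, which is an assumption and not something this thread establishes), handling one could look roughly like the Go sketch below. The field names `source`, `detail-type`, `AutoScalingGroupName`, and `EC2InstanceId` reflect my understanding of the Auto Scaling event format and should be checked against the AWS documentation; `asgEvent` and `asgToReconcile` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// asgEvent holds only the fields this sketch needs from an event payload.
type asgEvent struct {
	Source     string `json:"source"`
	DetailType string `json:"detail-type"`
	Detail     struct {
		AutoScalingGroupName string `json:"AutoScalingGroupName"`
		EC2InstanceID        string `json:"EC2InstanceId"`
	} `json:"detail"`
}

// asgToReconcile returns the ASG name whose pool should be reconciled.
// ok is false when the payload is valid JSON but not an Auto Scaling event.
func asgToReconcile(payload []byte) (string, bool, error) {
	var ev asgEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return "", false, err
	}
	if ev.Source != "aws.autoscaling" || ev.Detail.AutoScalingGroupName == "" {
		return "", false, nil
	}
	return ev.Detail.AutoScalingGroupName, true, nil
}

func main() {
	// Hypothetical, heavily trimmed payload.
	payload := []byte(`{
		"source": "aws.autoscaling",
		"detail-type": "EC2 Instance Launch Successful",
		"detail": {"AutoScalingGroupName": "workers", "EC2InstanceId": "i-new1"}
	}`)
	name, ok, err := asgToReconcile(payload)
	fmt.Println(name, ok, err)
}
```

The sketch deliberately ignores how the ASG name is mapped back to an AWSMachinePool object and how missed events are recovered.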

I like the idea of observing the AWSMachine state and bubbling that up to AWSMachinePool. An ASG instance refresh would include node termination after some minutes, so there's an event on Node (which we don't watch or finalize, I assume?), or a Kubernetes Event (which we don't watch). An ASG scale-up (instance added) might – in the success case – have a "node added" event in Kubernetes. Did you have something in mind for how the event observation could technically work?

If we don't have a clear idea, should we first fix the low-hanging fruit and use a regular reconciliation interval (RequeueAfter)?
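
A minimal sketch of the regular-interval idea in Go, assuming a configurable interval clamped to the 1-5 minute range mentioned above. `Result` is a local stand-in that mirrors only the `RequeueAfter` field of controller-runtime's `ctrl.Result`, so the example compiles without dependencies; `clampInterval`, `reconcileWithRequeue`, and `errASGUnavailable` are hypothetical names, not CAPA code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result mirrors the RequeueAfter field of controller-runtime's ctrl.Result.
type Result struct {
	RequeueAfter time.Duration
}

const (
	minInterval = 1 * time.Minute
	maxInterval = 5 * time.Minute
)

// errASGUnavailable is a hypothetical error used for the demonstration.
var errASGUnavailable = errors.New("asg unavailable")

// clampInterval keeps a configured interval within [minInterval, maxInterval].
func clampInterval(d time.Duration) time.Duration {
	if d < minInterval {
		return minInterval
	}
	if d > maxInterval {
		return maxInterval
	}
	return d
}

// reconcileWithRequeue runs reconcile and, on success, asks to be called
// again after the clamped interval. Errors are returned unchanged so the
// caller's normal retry handling applies.
func reconcileWithRequeue(reconcile func() error, interval time.Duration) (Result, error) {
	if err := reconcile(); err != nil {
		return Result{}, err
	}
	return Result{RequeueAfter: clampInterval(interval)}, nil
}

func main() {
	succeed := func() error { return nil }
	fail := func() error { return errASGUnavailable }

	res, err := reconcileWithRequeue(succeed, 2*time.Minute)
	fmt.Println(res.RequeueAfter, err)

	res, err = reconcileWithRequeue(fail, 2*time.Minute)
	fmt.Println(res.RequeueAfter, err)
}
```

The trade-off is extra AWS API calls per pool and interval, which is one reason to make the interval configurable.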

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 14, 2024
@AndiDog
Contributor Author

AndiDog commented Apr 10, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 10, 2024
@AndiDog
Contributor Author

AndiDog commented Jul 17, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 14, 2024
@AndiDog
Contributor Author

AndiDog commented Nov 19, 2024

#5174 may fix this, given that it ensures the instances/nodes list is updated regularly.

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 19, 2024