Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collector for machine-controller-manager #3

Closed
4 of 9 tasks
hardikdr opened this issue Jan 12, 2018 · 1 comment
Closed
4 of 9 tasks

Garbage collector for machine-controller-manager #3

hardikdr opened this issue Jan 12, 2018 · 1 comment
Assignees
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@hardikdr
Copy link
Member

hardikdr commented Jan 12, 2018

Story

  • As provider I want no "unwanted" resources to burn money, so that I don't have to pay for them.

Motivation

The machine controller manager should make sure not to leave any orphan machines (or other resources) in the IaaS (a risk stemming from our ability to create resources in the IaaS in a headless way), otherwise we might have to deal with potentially unbound and enormous costs (!). We should therefore implement our controllers in such a way as to keep track of our resources and delete orphans consequently.

Acceptance Criteria

  • No orphan VMs, i.e. we know exactly which machines belong to which cluster and pool, even if we lose the ETCD (IaaS holds information, e.g. in tags, so that we can potentially clean-up manually)
  • If the IaaS is unresponsive to deletions, we still must respect the max + surge settings for a cluster and pool, i.e. whatever issues we see, we may never exceed the sum of max + surge
  • If we can't check actual vs. max + surge (due to a bug in our code or a problem in the IaaS), our code paths should make sure not to create any new machines anymore
  • Protection against melt-down, i.e. scenarios in which machines are recreated again and again, i.e. the limits are never violated, but machines gets recreated endlessly.

Implementation Proposal

Vedran Lerenc

In any case (and I mean really independent of the concrete circumstances/bug/whatever led to the issue), we should have two simple checks:
1.) Never ever have more than max+surge machine resources/objects (for a particular machine deployment)
2.) Never ever have more than max+surge actual VMs in the infrastructure (for a particular machine deployment)
That should be a general safety net, so that we do not have to think anymore about specific error scenarios. Just two very basic and general checks that always must be true, otherwise no additional resources/objects respectively VMs can be created.
Does that make sense/can this be done?

Hardik Dodiya

I think its doable. We can have a separate controller safety-controller, running separately.
Its main job would be to check number of existing VMs and existing MachineCRDs. At any point if number exceeds max+surge[or predefined number] , it can put "freeze" label on corresponding MachineSet/MachineDeployment. It should then instruct the mcm to not take any actions on those MachineDeployment/MachineSet objects with freeze label.
Just quick thought , sounds viable ?
Events from those freezed objects would then be rejected by the controller's syncHandler, it resume once label is removed.
This situation would be most like to occur when:
1. Cloud provider create() call fails[or timesout] for client-sdk but actually creates the VM on cloud-provider. 
2. User provides machineclass which passed all the validations from mcm, but actual machine could not connect to the cluster. May be buggy cloud-config or so.
* by separate controller, I meant controller inside the mcm, and runs just the way machineset/deployment controllers are running.

Vedran Lerenc

Thank you very much! That sounds perfect (from where I am standing, not being an expert on your code). It ticks my checkboxes:
- independent controller (yes, part of the controller manager, but independent in the code path), exactly what I was hoping for
- can pull the line/stops the world (provisioning/ops on machines) to limit any uncontrolled operations on IaaS components that are subject to a charge
As said, it is you who define the “how”, but I hope we can achieve the “what” together, i.e. controlling/limiting the costs in case of potential bugs.

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?
@hardikdr hardikdr added the kind/discussion Discussion (enaging others in deciding about multiple options) label Feb 5, 2018
@vlerenc
Copy link
Member

vlerenc commented Feb 11, 2018

@hardikdr and @prashanth26 I changed the description, because it's actually no option to leave orphans around, i.e. it's no "should"/"could", but a "must". We create (potentially very expensive) resources for our customers, so we "must" make sure not to leave orphans.

@vlerenc vlerenc removed the kind/discussion Discussion (enaging others in deciding about multiple options) label Feb 11, 2018
@hardikdr hardikdr changed the title Garbage collector for node-controller-manager Garbage collector for machine-controller-manager Feb 15, 2018
@prashanth26 prashanth26 self-assigned this Feb 26, 2018
@prashanth26 prashanth26 added the kind/enhancement Enhancement, improvement, extension label Mar 14, 2018
@ccwienk ccwienk added the lifecycle/rotten Nobody worked on this for 12 months (final aging stage) label Aug 16, 2018
@vlerenc vlerenc removed the lifecycle/rotten Nobody worked on this for 12 months (final aging stage) label Aug 16, 2018
prashanth26 pushed a commit that referenced this issue Sep 7, 2018
External driver implementation for AWS.
AxiomSamarth pushed a commit to AxiomSamarth/machine-controller-manager that referenced this issue Jul 9, 2021
aylei pushed a commit to aylei/machine-controller-manager that referenced this issue Aug 9, 2021
Create compute cli with default service account
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Enhancement, improvement, extension
Projects
None yet
Development

No branches or pull requests

4 participants