Garbage collector for machine-controller-manager #3

hardikdr · 2018-01-12T18:59:31Z

Story

As provider I want no "unwanted" resources to burn money, so that I don't have to pay for them.

Motivation

The machine controller manager should make sure not to leave any orphan machines (or other resources) in the IaaS (a risk stemming from our ability to create resources in the IaaS in a headless way), otherwise we might have to deal with potentially unbound and enormous costs (!). We should therefore implement our controllers in such a way as to keep track of our resources and delete orphans consequently.

Acceptance Criteria

No orphan VMs, i.e. we know exactly which machines belong to which cluster and pool, even if we lose the ETCD (IaaS holds information, e.g. in tags, so that we can potentially clean-up manually)
If the IaaS is unresponsive to deletions, we still must respect the max + surge settings for a cluster and pool, i.e. whatever issues we see, we may never exceed the sum of max + surge
If we can't check actual vs. max + surge (due to a bug in our code or a problem in the IaaS), our code paths should make sure not to create any new machines anymore
Protection against melt-down, i.e. scenarios in which machines are recreated again and again, i.e. the limits are never violated, but machines gets recreated endlessly.

Implementation Proposal

Vedran Lerenc

In any case (and I mean really independent of the concrete circumstances/bug/whatever led to the issue), we should have two simple checks:
1.) Never ever have more than max+surge machine resources/objects (for a particular machine deployment)
2.) Never ever have more than max+surge actual VMs in the infrastructure (for a particular machine deployment)
That should be a general safety net, so that we do not have to think anymore about specific error scenarios. Just two very basic and general checks that always must be true, otherwise no additional resources/objects respectively VMs can be created.
Does that make sense/can this be done?

Hardik Dodiya

I think its doable. We can have a separate controller safety-controller, running separately.
Its main job would be to check number of existing VMs and existing MachineCRDs. At any point if number exceeds max+surge[or predefined number] , it can put "freeze" label on corresponding MachineSet/MachineDeployment. It should then instruct the mcm to not take any actions on those MachineDeployment/MachineSet objects with freeze label.
Just quick thought , sounds viable ?
Events from those freezed objects would then be rejected by the controller's syncHandler, it resume once label is removed.
This situation would be most like to occur when:
1. Cloud provider create() call fails[or timesout] for client-sdk but actually creates the VM on cloud-provider. 
2. User provides machineclass which passed all the validations from mcm, but actual machine could not connect to the cluster. May be buggy cloud-config or so.
* by separate controller, I meant controller inside the mcm, and runs just the way machineset/deployment controllers are running.

Vedran Lerenc

Thank you very much! That sounds perfect (from where I am standing, not being an expert on your code). It ticks my checkboxes:
- independent controller (yes, part of the controller manager, but independent in the code path), exactly what I was hoping for
- can pull the line/stops the world (provisioning/ops on machines) to limit any uncontrolled operations on IaaS components that are subject to a charge
As said, it is you who define the “how”, but I hope we can achieve the “what” together, i.e. controlling/limiting the costs in case of potential bugs.

Definition of Done

Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
Integration Tests are provided: Have you written automated integration tests?
Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
Operations guide: Have you updated the operations guide?

The text was updated successfully, but these errors were encountered:

vlerenc · 2018-02-11T09:49:43Z

@hardikdr and @prashanth26 I changed the description, because it's actually no option to leave orphans around, i.e. it's no "should"/"could", but a "must". We create (potentially very expensive) resources for our customers, so we "must" make sure not to leave orphans.

External driver implementation for AWS.

Update comments

Create compute cli with default service account

hardikdr added the kind/discussion Discussion (enaging others in deciding about multiple options) label Feb 5, 2018

vlerenc removed the kind/discussion Discussion (enaging others in deciding about multiple options) label Feb 11, 2018

hardikdr changed the title ~~Garbage collector for node-controller-manager~~ Garbage collector for machine-controller-manager Feb 15, 2018

prashanth26 self-assigned this Feb 26, 2018

prashanth26 mentioned this issue Mar 14, 2018

Integration of safety controller #53

Merged

prashanth26 added the kind/enhancement Enhancement, improvement, extension label Mar 14, 2018

prashanth26 mentioned this issue Mar 19, 2018

Added GCP support for safety controller #57

Merged

prashanth26 closed this as completed Mar 20, 2018

ccwienk added the lifecycle/rotten Nobody worked on this for 12 months (final aging stage) label Aug 16, 2018

vlerenc removed the lifecycle/rotten Nobody worked on this for 12 months (final aging stage) label Aug 16, 2018

prashanth26 pushed a commit that referenced this issue Sep 7, 2018

Merge pull request #3 from amshuman-kr/grpc2

bd9f6e1

External driver implementation for AWS.

AxiomSamarth pushed a commit to AxiomSamarth/machine-controller-manager that referenced this issue Jul 9, 2021

Merge pull request gardener#3 from toniajuliejackson/patch-1

cca82c5

Update comments

aylei pushed a commit to aylei/machine-controller-manager that referenced this issue Aug 9, 2021

Merge pull request gardener#3 from pingcap/gcp-remove-credentials

e6baac4

Create compute cli with default service account

vlerenc mentioned this issue Jun 30, 2022

No orphan collection during machine Controller freeze #722

Closed

elankath mentioned this issue Feb 21, 2023

Improve overshooting logic #346

Open

unmarshall mentioned this issue Nov 17, 2023

Proposal: Post Create Initialization #868

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Garbage collector for machine-controller-manager #3

Garbage collector for machine-controller-manager #3

hardikdr commented Jan 12, 2018 •

edited by prashanth26

Loading

vlerenc commented Feb 11, 2018 •

edited

Loading

Garbage collector for machine-controller-manager #3

Garbage collector for machine-controller-manager #3

Comments

hardikdr commented Jan 12, 2018 • edited by prashanth26 Loading

Story

Motivation

Acceptance Criteria

Implementation Proposal

Definition of Done

vlerenc commented Feb 11, 2018 • edited Loading

hardikdr commented Jan 12, 2018 •

edited by prashanth26

Loading

vlerenc commented Feb 11, 2018 •

edited

Loading