Improve logic of VM-Deletion and Safety-controller for Azure #200

Closed
hardikdr opened this issue Dec 10, 2018 · 2 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug kind/question Question (asking for help, advice, or technical detail) needs/review Needs review platform/azure Microsoft Azure platform/infrastructure priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@hardikdr
Member

We need to consider the possibility of the cloud provider taking too long to delete a VM. At the moment, we issue a Delete() call for the VM, and if there is no immediate error response, we try to replace the VM in parallel. We have seen incidents (on Azure) where the deletion of the VM was stuck on the cloud-provider side, which led MCM to create more machines than expected.
As an improvement, we could enhance both the deletion logic and the safety-controller in the following ways:

  • Deletion Logic:

    • [Azure-specific]: Stop the VM before deleting it. We understood from our Microsoft contact that stopping the VM removes it from the quota calculation. This should also save us some cost, as a VM is not billed after it is stopped. Another supporting point is that deletion on Azure mostly gets stuck due to detaching/attaching of the VM, which is avoided if the VM is stopped first. The deletion itself will in any case depend on Azure, which we cannot influence.
  • Safety-controller:

    • Currently, the freeze logic in safety-controller only checks the number of Machine objects and decides to freeze if they exceed the intended number.
    • We need to think about enabling similar logic for resources at the cloud provider. Essentially, the safety-controller should also treat VMs that are stuck in Deletion, or Failed after deletion, as active resources and freeze if they exceed the intended number. We could expose a new configuration knob for this number.
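
A minimal sketch of what the extended freeze check could look like, in Go. This is only an illustration of the idea above, not the MCM implementation: the names `VMState`, `countProviderResources`, `shouldFreeze`, and the `maxOvershoot` knob are all hypothetical, and the provider-side VM states are assumed to have been listed already.

```go
package main

import "fmt"

// VMState is a hypothetical, simplified provider-side VM state.
type VMState string

const (
	VMRunning  VMState = "Running"
	VMDeleting VMState = "Deleting" // deletion requested but not yet finished
	VMFailed   VMState = "Failed"   // deletion attempted and failed
)

// countProviderResources counts VMs that still exist at the cloud provider.
// VMs stuck in Deleting or Failed are counted as active resources.
func countProviderResources(vms []VMState) int {
	n := 0
	for _, s := range vms {
		switch s {
		case VMRunning, VMDeleting, VMFailed:
			n++
		}
	}
	return n
}

// shouldFreeze reports whether machine creation should be frozen.
// desired is the intended number of machines, machineObjects is the number
// of Machine objects, and maxOvershoot is an assumed configuration knob for
// how many resources above desired are tolerated.
func shouldFreeze(desired, machineObjects int, vms []VMState, maxOvershoot int) bool {
	limit := desired + maxOvershoot
	// Existing check: too many Machine objects.
	if machineObjects > limit {
		return true
	}
	// Proposed check: too many resources at the cloud provider.
	return countProviderResources(vms) > limit
}

func main() {
	// Three healthy VMs plus two that are stuck at the provider.
	vms := []VMState{VMRunning, VMRunning, VMRunning, VMDeleting, VMFailed}
	fmt.Println("freeze:", shouldFreeze(3, 3, vms, 1)) // prints "freeze: true"
}
```

In this sketch the Machine-object count alone (3) would not trigger a freeze, but the two stuck VMs push the provider-side count to 5, above the limit of 4.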
@prashanth26 prashanth26 added kind/bug Bug priority/critical Needs to be resolved soon, because it impacts users negatively kind/question Question (asking for help, advice, or technical detail) platform/az needs/review Needs review status/closed Issue is closed (either delivered or triaged) area/quality Output qualification (tests, checks, scans, automation in general, etc.) related labels Dec 26, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 25, 2019
@chgeuer
Contributor

chgeuer commented Feb 28, 2019

Hi @hardikdr, one quick piece of feedback on the deletion logic: I think you're operating under a wrong assumption. Here's the situation on Azure: when a VM is in status=Stopped (deallocated), it actually still counts against the vCPU core quota. In my subscription, I have a single (stopped/de-allocated) 2-core VM, and my quota usage is 2/100...

The stopped/deallocated VM does not incur by-the-minute CPU costs (because it doesn't block cores on a physical host). Let's discuss that next Tuesday as well.

@prashanth26
Contributor

Closing this issue in favor of #242

@ghost ghost added the platform/azure Microsoft Azure platform/infrastructure label Mar 7, 2020
@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021