Improve logic of VM-Deletion and Safety-controller for Azure #200

hardikdr · 2018-12-10T23:32:42Z

We need to consider the possibility of the cloud provider taking too long to delete a VM. At the moment, we issue a Delete() call for the VM, and if there is no immediate error response we start replacing the VM in parallel. We have seen incidents (on Azure) where deletion of the VM got stuck on the cloud-provider side, which led MCM to create more machines than expected.
As an improvement, we could enhance both the deletion logic and the safety-controller in the following ways:

Deletion Logic:
- [Azure-specific]: Stop the VM before deleting it. We understood from our Microsoft contact that stopping the VM removes it from the quota calculation. This should also save us some cost, as a stopped VM is not billed. Another supporting point is that deletion on Azure mostly gets stuck on detaching/attaching of the VM, which will be avoided if the VM is stopped first. The deletion itself will still depend on Azure, and there is nothing we can do about that.
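
A minimal sketch of the stop-then-delete ordering proposed above. This is not MCM's actual driver code: `VMClient`, `RecordingClient`, and `delete_vm` are hypothetical names used only for illustration, and the real Azure calls are deliberately left out.

```python
from typing import List, Protocol


class VMClient(Protocol):
    """Hypothetical provider client; not the real Azure SDK surface."""

    def power_off(self, name: str) -> None: ...

    def delete(self, name: str) -> None: ...


class RecordingClient:
    """In-memory stand-in that only records the order of calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def power_off(self, name: str) -> None:
        self.calls.append(f"power_off:{name}")

    def delete(self, name: str) -> None:
        self.calls.append(f"delete:{name}")


def delete_vm(client: VMClient, name: str) -> None:
    """Stop the VM first, then issue the delete.

    If power_off raises, the exception propagates and delete is not
    attempted here; retry behaviour is left to the caller (assumption).
    """
    client.power_off(name)
    client.delete(name)
```

With the recording stand-in, `delete_vm(client, "vm-0")` leaves `client.calls` as the power-off entry followed by the delete entry, which is the only property this sketch is meant to show.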
Safety-controller:
- Currently, the freeze logic in the safety-controller only checks the number of Machine objects and decides to freeze if they exceed the intended number.
- We need to think about enabling similar logic for resources on the cloud provider. Essentially, the safety-controller should also treat VMs stuck in Deletion, or Failed after deletion, as active resources and freeze if they exceed the intended number. We could expose a new configuration knob for this number.
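
The extended freeze check could look roughly like the sketch below. It is an illustration under assumptions, not the safety-controller's implementation: the state labels, `FreezeConfig`, the `vm_overshoot` knob, and `should_freeze` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical provider-side state labels. VMs stuck in deletion or
# failed after deletion are counted as active, as proposed above.
ACTIVE_STATES = frozenset({"Running", "Deleting", "Failed"})


@dataclass(frozen=True)
class FreezeConfig:
    desired_replicas: int
    machine_overshoot: int = 0  # extra Machine objects tolerated
    vm_overshoot: int = 0       # hypothetical new knob for provider VMs


def should_freeze(machine_objects: int,
                  vm_states: Iterable[str],
                  cfg: FreezeConfig) -> bool:
    """Return True if either count exceeds its allowed limit."""
    # Existing check: number of Machine objects versus the intended number.
    if machine_objects > cfg.desired_replicas + cfg.machine_overshoot:
        return True
    # Proposed check: provider VMs, including stuck ones, versus the limit.
    active_vms = sum(1 for state in vm_states if state in ACTIVE_STATES)
    return active_vms > cfg.desired_replicas + cfg.vm_overshoot
```

For example, with `desired_replicas=3` and `vm_overshoot=1`, three running VMs plus one stuck in deletion stay within the limit, while a second stuck VM would trigger a freeze.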

chgeuer · 2019-02-28T15:06:27Z

Hi @hardikdr, one quick piece of feedback on the deletion logic: I think you're operating under a wrong assumption. Here's the situation on Azure: when a VM is in status=Stopped (deallocated), it still counts against the vCPU core quota. In my sub, I have a single (stopped/de-allocated) 2-core VM, and my quota usage is 2/100...

The stopped/deallocated VM does not incur by-the-minute CPU costs (because it doesn't block cores on a physical host). Let's discuss that next Tuesday as well.

prashanth26 · 2019-04-11T11:28:55Z

Closing this issue in favor of #242

prashanth26 mentioned this issue Dec 11, 2018: Azure VM deletion takes too long #201 (Closed)

prashanth26 mentioned this issue Dec 24, 2018: Azure: Poweroff VM before deletion #206 (Merged)

PadmaB mentioned this issue Jan 24, 2019: Improve Monitoring/Alerting/Metrics #211 (Open, 7 tasks)

gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 25, 2019

prashanth26 closed this as completed Apr 11, 2019

ghost added the platform/azure Microsoft Azure platform/infrastructure label Mar 7, 2020

gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021
