Improve logic of VM-Deletion and Safety-controller for Azure #200
Labels
area/quality
Output qualification (tests, checks, scans, automation in general, etc.) related
kind/bug
Bug
kind/question
Question (asking for help, advice, or technical detail)
needs/review
Needs review
platform/azure
Microsoft Azure platform/infrastructure
priority/2
Priority (lower number equals higher priority)
status/closed
Issue is closed (either delivered or triaged)
We need to consider the possibility of the cloud provider taking too long to delete a VM. At the moment, we issue a Delete() call for the VM, and if there is no immediate error response, we start replacing the VM in parallel. We have seen incidents (in Azure) where deletion of the VM was stuck on the cloud-provider side, which led MCM to create more machines than expected.
As an improvement, we could enhance both the deletion logic and the safety-controller in the following ways:
Deletion Logic:
- `Stop` the VM before deleting it. We understood from our Microsoft contact that stopping the VM removes it from the quota calculation. This should also save us some cost, as the VM is not billed after `stopping`. Another supporting point is that deletion on Azure mostly gets stuck due to detaching/attaching of the VM, which is avoided if the VM is stopped first. The deletion itself will still depend on Azure, which we cannot do anything about.
Safety-controller:
- Currently, the safety-controller considers only `Machine` objects and decides to freeze if they exceed the intended number.
- It could also count machines in `Deletion`, or `Failed` after deletion, as active resources and freeze if they exceed the intended number. We could expose a new configuration knob for this number.