Bugfix/azure wait for disk detach #248

prashanth26 · 2019-03-25T12:17:45Z

What this PR does / why we need it:
This PR attempts to fix the Azure issue of VMs stuck in deletion state.

Pods are now evicted within the drain timeout period; if eviction fails, they are forcefully deleted by setting the grace period to 0. The data disks are then detached and the VM is deleted.
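A minimal sketch of the evict-then-force-delete flow described above. It does not use the real Kubernetes client: `podClient`, `evictOrForceDelete`, and `fakeClient` are hypothetical names introduced only for illustration, and the retry interval is an assumption that the PR description does not specify.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// podClient is a hypothetical stand-in for the client calls used during drain.
type podClient interface {
	Evict(ctx context.Context, pod string) error
	Delete(ctx context.Context, pod string, gracePeriodSeconds int64) error
}

// evictOrForceDelete retries eviction until the timeout elapses. If the pod
// could not be evicted by then, it is deleted with a grace period of 0.
// The returned bool reports whether the forceful path was taken.
func evictOrForceDelete(c podClient, pod string, timeout, retryInterval time.Duration) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	for {
		if evictErr := c.Evict(ctx, pod); evictErr == nil {
			return false, nil
		}
		select {
		case <-ctx.Done():
			// Timeout reached: fall back to forceful deletion.
			return true, c.Delete(context.Background(), pod, 0)
		case <-time.After(retryInterval):
			// Try the eviction again.
		}
	}
}

// fakeClient rejects the first failEvictions eviction attempts and records
// the grace period of every delete call it receives.
type fakeClient struct {
	failEvictions    int
	evictCalls       int
	deletedWithGrace []int64
}

func (f *fakeClient) Evict(ctx context.Context, pod string) error {
	f.evictCalls++
	if f.evictCalls <= f.failEvictions {
		return errors.New("eviction blocked")
	}
	return nil
}

func (f *fakeClient) Delete(ctx context.Context, pod string, gracePeriodSeconds int64) error {
	f.deletedWithGrace = append(f.deletedWithGrace, gracePeriodSeconds)
	return nil
}

func main() {
	c := &fakeClient{failEvictions: 1 << 30} // eviction never succeeds
	forced, err := evictOrForceDelete(c, "default/web-0", 50*time.Millisecond, 10*time.Millisecond)
	fmt.Println(forced, err, c.deletedWithGrace)
}
```

In this sketch the forceful delete uses a fresh context, so it is not cancelled by the drain timeout that triggered it.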

Which issue(s) this PR fixes:
Fixes #242

Special notes for your reviewer:

Release note:

Drain is always invoked, even in the case of forceful deletion

Drain now tries to evict pods and if eviction fails, it forcefully deletes the pods

Azure explicitly detaches data disks before VM deletion

- Disks are now detached before deletion on Azure
- Drain pod maximum grace period is aligned with drain timeout
- Azure no longer powers off/shuts down the VM before deletion

prashanth26 · 2019-03-28T11:23:27Z

@hardikdr /needs-review

hardikdr · 2019-04-01T06:14:10Z

pkg/controller/machine.go

@@ -470,7 +468,7 @@ func (c *controller) machineDelete(machine *v1alpha1.Machine, driver driver.Driv
 					c.targetCoreClient,
 					timeOutDuration, // TODO: Will need to configure timeout
 					nodeName,
-					-1,
+					int(timeOutDuration.Seconds()),


What is this for?

I thought that capping the graceful termination period of every pod at the drain timeout would help the overall drain operation complete within the drain timeout. But I don't know if it works as expected.

I am hoping that if a pod is set to a graceful termination period of 2 hours and our drain timeout is 5 minutes, we would give it at most 5 minutes?

I have reverted this change here - 7510f9c#diff-d8287fe74b5273163c9f6b6c635ad912R475.
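The capping behaviour discussed in this thread can be sketched as below. This only illustrates the intent (a pod never gets more than the drain timeout); it is not taken from the drain code, and whether passing the timeout as the grace-period argument caps or overrides a pod's own setting depends on that code, which is not shown here. `cappedGracePeriod` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// cappedGracePeriod returns the grace period, in seconds, that a pod would
// get if its own termination grace period were capped at the drain timeout.
func cappedGracePeriod(podGraceSeconds int64, drainTimeout time.Duration) int64 {
	limit := int64(drainTimeout.Seconds())
	if podGraceSeconds > limit {
		return limit
	}
	return podGraceSeconds
}

func main() {
	drainTimeout := 5 * time.Minute

	// A pod asking for 2 hours is limited to the 5 minute drain timeout.
	fmt.Println(cappedGracePeriod(2*60*60, drainTimeout)) // 300

	// A pod asking for less than the drain timeout keeps its own value.
	fmt.Println(cappedGracePeriod(30, drainTimeout)) // 30
}
```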

hardikdr · 2019-04-01T06:25:55Z

pkg/driver/driver_azure.go

+		// There are disks attached hence need to detach them
+		vm.StorageProfile.DataDisks = &[]compute.DataDisk{}
+
+		_, errChan := vmClient.CreateOrUpdate(d.AzureMachineClass.Spec.ResourceGroup, machineID, vm, cancel)


Can you please explain how this works?
Where are we making explicit detach calls to the VM? Or does the empty disk array instruct the API to delete all disks?

Yes, Hardik. Issuing a VM update call with an empty list of data disks is how data disks are detached in Azure. The SDK lacks a dedicated call for it. Refer to Azure/azure-sdk-for-go#1638 (comment)
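A self-contained sketch of the detach-by-update pattern described in this reply. It deliberately avoids the Azure SDK: `virtualMachine`, `storageProfile`, `dataDisk`, `vmUpdater`, and `recordingUpdater` are hypothetical stand-ins that only mirror the shape of the fields used in the diff above, and the real `CreateOrUpdate` call has a different signature.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the SDK types touched by the diff.
type dataDisk struct{ Name string }

type storageProfile struct{ DataDisks *[]dataDisk }

type virtualMachine struct {
	Name           string
	StorageProfile *storageProfile
}

// vmUpdater is a hypothetical stand-in for the VM client's update call.
type vmUpdater interface {
	CreateOrUpdate(resourceGroup, name string, vm virtualMachine) error
}

// detachDataDisks replaces the VM's data disk list with an empty one and
// sends the updated VM, which is how detachment is requested. It reports
// whether an update was sent; no update is sent if nothing is attached.
func detachDataDisks(u vmUpdater, resourceGroup string, vm *virtualMachine) (bool, error) {
	if vm == nil || vm.StorageProfile == nil {
		return false, errors.New("vm has no storage profile")
	}
	disks := vm.StorageProfile.DataDisks
	if disks == nil || len(*disks) == 0 {
		return false, nil
	}

	// There are disks attached, hence they need to be detached.
	vm.StorageProfile.DataDisks = &[]dataDisk{}
	if err := u.CreateOrUpdate(resourceGroup, vm.Name, *vm); err != nil {
		return false, fmt.Errorf("detaching data disks from %q: %w", vm.Name, err)
	}
	return true, nil
}

// recordingUpdater records every VM it is asked to update.
type recordingUpdater struct{ updates []virtualMachine }

func (r *recordingUpdater) CreateOrUpdate(resourceGroup, name string, vm virtualMachine) error {
	r.updates = append(r.updates, vm)
	return nil
}

func main() {
	vm := &virtualMachine{
		Name:           "worker-1",
		StorageProfile: &storageProfile{DataDisks: &[]dataDisk{{Name: "pvc-disk-0"}}},
	}
	u := &recordingUpdater{}
	sent, err := detachDataDisks(u, "my-resource-group", vm)
	fmt.Println(sent, err, len(*vm.StorageProfile.DataDisks))
}
```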

prashanth26 requested review from ggaurav10 and a team as code owners March 25, 2019 12:17

prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch 3 times, most recently from 364ab2c to 6aa7754, March 28, 2019 11:04

prashanth26 added 3 commits March 28, 2019 16:36

Azure machine deletion now waits for disk detachment

9580bf6

- Disks are now detached before deletion on Azure
- Drain pod maximum grace period is aligned with drain timeout
- Azure no longer powers off/shuts down the VM before deletion

Improved log messages to emphasis on machine id

1594500

Simple tests for machine deletion

77072d2

prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch from 6aa7754 to 77072d2, March 28, 2019 11:07

hardikdr reviewed Apr 1, 2019

prashanth26 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 2, 2019

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test labels Apr 2, 2019

prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch 4 times, most recently from adca4ce to 85feb71, April 9, 2019 11:53

prashanth26 added reviewed/ok-to-test and removed needs/ok-to-test labels Apr 9, 2019

gardener-robot-ci-1 removed the reviewed/ok-to-test label Apr 9, 2019

gardener-robot-ci-1 added the needs/ok-to-test label Apr 9, 2019

prashanth26 added 2 commits April 10, 2019 11:43

Drain now tries to evict and if fails forcefully deletes the pods

5d492e6

- Eviction of every pod is now attempted for the drain timeout period
- If eviction fails, the pod is deleted forcefully by setting the grace period to 0s

Drain is always invoked now

7510f9c

- Drain is now invoked even in the case of forceful deletion (and)
- After drain timeout duration of deletion call

prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch from 85feb71 to 7510f9c, April 10, 2019 06:15

prashanth26 added reviewed/ok-to-test and removed needs/ok-to-test labels Apr 11, 2019

gardener-robot-ci-1 added needs/ok-to-test and removed reviewed/ok-to-test labels Apr 11, 2019

hardikdr approved these changes Apr 15, 2019

prashanth26 merged commit 9706076 into gardener:master Apr 16, 2019

prashanth26 mentioned this pull request Apr 22, 2019

Mitigate Azure VM Deletion Race Condition #255

Closed

prashanth26 deleted the bugfix/azure-wait-for-disk-detach branch July 17, 2019 08:00

ghost added component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) platform/azure Microsoft Azure platform/infrastructure labels Mar 7, 2020

gardener-robot added the area/ops-productivity Operator productivity related (how to improve operations) label Jun 18, 2020

himanshu-kun mentioned this pull request Apr 25, 2023

update vm delete to not wait for datadisk detachment gardener/machine-controller-manager-provider-azure#95

Closed
