Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/azure wait for disk detach #248

Merged

Conversation

prashanth26
Copy link
Contributor

@prashanth26 prashanth26 commented Mar 25, 2019

What this PR does / why we need it:
This PR attempts to fix the Azure issue of VMs stuck in deletion state.

Now pods are tried to be evicted in the timeout period, and if it fails the pods are forcefully deleted by setting the grace period to 0. The data disks are then detached and the VM is deleted.

Which issue(s) this PR fixes:
Fixes #242

Special notes for your reviewer:

Release note:

The drain is always invoked even the case of forceful deletion
Drain now tries to evict pods and if eviction fails, it forcefully deletes the pods
Azure explicitly detaches data disks before VM deletion

@prashanth26 prashanth26 added kind/bug Bug priority/critical Needs to be resolved soon, because it impacts users negatively platform/az needs/review Needs review status/new Issue is new and unprocessed size/s Size of pull request is small (see gardener-robot robot/bots/size.py) topology/seed Affects Seed clusters labels Mar 25, 2019
@prashanth26 prashanth26 requested review from ggaurav10 and a team as code owners March 25, 2019 12:17
@prashanth26 prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch 3 times, most recently from 364ab2c to 6aa7754 Compare March 28, 2019 11:04
- Disks are now detached before deletion on Azure
- Drain pod maximum grace period is aligned with drain timeout
- Azure now longer poweroffs/shutdown VM before deletion
@prashanth26 prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch from 6aa7754 to 77072d2 Compare March 28, 2019 11:07
@prashanth26
Copy link
Contributor Author

@hardikdr /needs-review

@@ -470,7 +468,7 @@ func (c *controller) machineDelete(machine *v1alpha1.Machine, driver driver.Driv
c.targetCoreClient,
timeOutDuration, // TODO: Will need to configure timeout
nodeName,
-1,
int(timeOutDuration.Seconds()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it for ?

Copy link
Contributor Author

@prashanth26 prashanth26 Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought to cap the graceful termination of any pod by the drain timeout should help in overall drain option going through in the drain timeout. But i dunno if it works as expected.

I am hoping that if a pod is set to a graceful termination of 2Hours, and our drain timeout is 5mins we max would only give 5mins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted this change here - 7510f9c#diff-d8287fe74b5273163c9f6b6c635ad912R475.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

// There are disks attached hence need to detach them
vm.StorageProfile.DataDisks = &[]compute.DataDisk{}

_, errChan := vmClient.CreateOrUpdate(d.AzureMachineClass.Spec.ResourceGroup, machineID, vm, cancel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain how does it work?
Where are we making explicit detach calls to VM? , or does the empty-disk array instructs API to delete all disk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes hardik. VM with an update call with empty data disks is how detachment of data disks is done in Azure. They lack a proper SDK call for it. Refer here - Azure/azure-sdk-for-go#1638 (comment)

@prashanth26 prashanth26 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 2, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 2, 2019
@prashanth26 prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch 4 times, most recently from adca4ce to 85feb71 Compare April 9, 2019 11:53
@prashanth26 prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 9, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 9, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added the needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 9, 2019
- Every pod termination now tries to be evicted for drain timeout period
- If it fails to be evicted, it is deleted forcefully by setting the graceful period to 0s leading to a forceful deletion
- Drain is now invoked even in the case of forceful deletion (and)
- After drain timeout duration of deletion call
@prashanth26 prashanth26 force-pushed the bugfix/azure-wait-for-disk-detach branch from 85feb71 to 7510f9c Compare April 10, 2019 06:15
@prashanth26 prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 11, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 11, 2019
@prashanth26 prashanth26 merged commit 9706076 into gardener:master Apr 16, 2019
@prashanth26 prashanth26 deleted the bugfix/azure-wait-for-disk-detach branch July 17, 2019 08:00
@ghost ghost added component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) platform/azure Microsoft Azure platform/infrastructure labels Mar 7, 2020
@gardener-robot gardener-robot added the area/ops-productivity Operator productivity related (how to improve operations) label Jun 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ops-productivity Operator productivity related (how to improve operations) component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) kind/bug Bug needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review platform/azure Microsoft Azure platform/infrastructure priority/critical Needs to be resolved soon, because it impacts users negatively size/s Size of pull request is small (see gardener-robot robot/bots/size.py) status/new Issue is new and unprocessed topology/seed Affects Seed clusters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Azure machines being stuck on deletion
4 participants