Cancel rolling upgrades before deleting VMSS & extensions #18973

Closed
wants to merge 24 commits into from

Conversation

Contributor

@harshavmb harshavmb commented Oct 25, 2022

This is a redo of PR #15634. The old PR had conflicts and expected an additional argument in order to cancel rolling upgrades before deleting the VMSS & extensions.

Since a destroy call is a clear intention to delete the VMSS & associated resources such as extensions, any in-progress rolling upgrades can simply be cancelled - they no longer matter once the scale set is going away.

As of now we write destroy-time provisioners to cancel rolling upgrades, which is error prone. We hope this change will let the VMSS & its extensions be deleted cleanly.
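
To illustrate the intent, the change boils down to issuing a cancel call against the scale set's rolling upgrade endpoint before the delete. Below is a minimal sketch of that sequence against the Azure SDK for Go (illustrative only - the helper name is made up and the SDK package version is an assumption, this is not the provider's actual code):

package example

import (
    "context"
    "fmt"
    "log"

    "github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2021-07-01/compute"
    "github.com/Azure/go-autorest/autorest"
)

// cancelThenDelete cancels any in-flight rolling upgrade on the scale set so that
// the subsequent delete of the VMSS (and its extensions) is not blocked by it.
// Illustrative sketch only; client construction is reduced to the essentials.
func cancelThenDelete(ctx context.Context, authorizer autorest.Authorizer, subscriptionID, resourceGroup, vmssName string) error {
    rollingClient := compute.NewVirtualMachineScaleSetRollingUpgradesClient(subscriptionID)
    rollingClient.Authorizer = authorizer

    // Ask the API to cancel the ongoing rolling upgrade; when none is running
    // the API answers with a 409 conflict, which a caller may choose to tolerate.
    future, err := rollingClient.Cancel(ctx, resourceGroup, vmssName)
    if err != nil {
        return fmt.Errorf("cancelling rolling upgrade for %q: %+v", vmssName, err)
    }
    if err := future.WaitForCompletionRef(ctx, rollingClient.Client); err != nil {
        return fmt.Errorf("waiting for rolling upgrade cancellation for %q: %+v", vmssName, err)
    }
    log.Printf("[DEBUG] cancelled rolling upgrades for %q", vmssName)

    // ...then proceed with the usual VMSS and extension delete calls.
    return nil
}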

Our problem statement was explained to Gauthier Donikian of HashiCorp a few months ago. We would welcome feedback from the HashiCorp team.

Thanks,
Harsha

…adeus.net/iac/terraform-provider-azurerm into feature/SACP-47873-add-internal-ci
Merge in IAC/terraform-provider-azurerm from feature/SACP-47873-add-internal-ci to master

* commit '6d1953a834749477c5693b3fef7a441b1ef0869a':
  SACP-47573: Fallback to Amadeus versioning
  SACP-47573: Adding internal CI
  Changes
  Changes
  with parallelism=2
  SACP:47873: changes
  SACP:47873: changes
  SACP:47873: changes
  SACP:47873: changes
  SACP:47873: changes
  SACP:47873: changes
  SACP:47873: Removing stage from Jenkinsfile
  Add internal CI
…g VMSS

Merge in IAC/terraform-provider-azurerm from feature/SACP-47877-redo-vmss-pr to master

* commit '8ef20ab8515d3ccb45557aea51dfab5e31e6be3f':
  SACP-47877: Rolling Upgrade cancelled before deleting VMSS
Collaborator

katbyte commented Oct 25, 2022

thanks for re-opening this @harshavmb - it's on our list to review soon - as this supersedes #15634 I'm going to close that one out

Contributor Author

harshavmb commented Nov 1, 2022

Hi @katbyte @tombuildsstuff,

I've got another idea today: take rolling_upgrade_policy out of azurerm_linux_virtual_machine_scale_set & make it a new resource. That way, there is no need to write explicit cancel-rolling-upgrade calls to ARM.

Are you okay with this approach?

@tombuildsstuff tombuildsstuff self-assigned this Nov 7, 2022
@harshavmb
Contributor Author

Hi @katbyte / @tombuildsstuff ,

Any update on this PR? We'd appreciate some feedback before we end up with code conflicts. The last PR (already closed) was also left unreviewed for a long time.

@evaroquaux

Hello,
Any update on this contribution?
It has been highly anticipated for a long time...
Thank you

@harshavmb
Contributor Author

Hi @stephybun ,

Your feedback would be highly appreciated. We have been managing all this time with ugly destroy provisioners. If you think this PR helps, please share your feedback, as it has been in your review queue for many months.

Member

@stephybun stephybun left a comment

Thanks for this PR @harshavmb and apologies for being unable to get to this sooner.

I think this behaviour makes sense and is something we can incorporate into the resource. I've left suggestions which will bring this functionality more in line with the rest of the provider. These will need to be made before we can consider merging this.

We also need to add a test for all three resources. This is a bit tricky since we need to simulate, in an automated fashion, the behaviour of deleting the VMSS while rolling upgrades are happening. It isn't obvious how to do that unless you're very familiar with our testing framework, which is why I've gone ahead and written one; unfortunately I'm unable to push it to your branch since I lack permission.

I have the test and all of the suggested changes on my branch, which in the interest of time I could open as a separate PR to supersede this one. Alternatively I can provide the test and test config to you in another way. Let me know which you prefer.

Comment on lines +20 to +22
if err != nil && !(future.Response().StatusCode == http.StatusConflict && strings.Contains(err.Error(), "There is no ongoing Rolling Upgrade to cancel.")) {
return fmt.Errorf("error cancelling rolling upgrades of Virtual Machine Scale Set %q (Resource Group %q): %+v", vmScaleSetName, resourceGroupName, err)
}
Member

Thanks for the clarification. This actually resulted in a crash for me when trying to reproduce the case described in the comments above. In addition, matching on error messages tends to be brittle since that is something that can change very easily in the API.

A more robust way to check whether to call the cancel function for rolling upgrades would be to query whether there are any ongoing rolling upgrades first. I've left comments in-line with what that would look like.

"strings"
)

func (c *Client) CancelRollingUpgradesBeforeDeletion(ctx context.Context, resourceGroupName string, vmScaleSetName string) error {
Member

Let's just pass in the VMSS resource ID here since that contains all the info we need

Suggested change
func (c *Client) CancelRollingUpgradesBeforeDeletion(ctx context.Context, resourceGroupName string, vmScaleSetName string) error {
func (c *Client) CancelRollingUpgradesBeforeDeletion(ctx context.Context, id parse.VirtualMachineScaleSetId) error {

)

func (c *Client) CancelRollingUpgradesBeforeDeletion(ctx context.Context, resourceGroupName string, vmScaleSetName string) error {
future, err := c.VMScaleSetRollingUpgradesClient.Cancel(ctx, resourceGroupName, vmScaleSetName)
Member

There is an API call to retrieve the latest rolling upgrade. I think we should do that here and return nil if there is no rolling upgrade, i.e. there is no need to cancel and trigger the 409 error described below; if there is an existing rolling upgrade then we proceed to cancel it - all other errors should be caught and surfaced.

Suggested change
future, err := c.VMScaleSetRollingUpgradesClient.Cancel(ctx, resourceGroupName, vmScaleSetName)
resp, err := c.VMScaleSetRollingUpgradesClient.GetLatest(ctx, id.ResourceGroup, id.Name)
if err != nil {
    if utils.ResponseWasNotFound(resp.Response) {
        return nil
    }
    return fmt.Errorf("retrieving rolling updates for %s: %+v", id, err)
}
future, err := c.VMScaleSetRollingUpgradesClient.Cancel(ctx, id.ResourceGroup, id.Name)

Comment on lines +14 to +22
// If rolling upgrades haven't been run (when VMSS are just provisioned with rolling upgrades but no extensions, auto-scaling are run )
// we can not cancel rolling upgrades
// API call :: GET https://management.azure.com/subscriptions/{subId}/resourceGroups/{rgName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmSSName}/rollingUpgrades/latest?api-version=2021-07-01
// Azure API throws 409 conflict error saying "The entity was not found in this Azure location."
// If the above error message matches, we identify and move forward to delete the VMSS
// in all other cases, it just cancels the rolling upgrades and move ahead to delete the VMSS
if err != nil && !(future.Response().StatusCode == http.StatusConflict && strings.Contains(err.Error(), "There is no ongoing Rolling Upgrade to cancel.")) {
return fmt.Errorf("error cancelling rolling upgrades of Virtual Machine Scale Set %q (Resource Group %q): %+v", vmScaleSetName, resourceGroupName, err)
}
Member

This means none of this is necessary, since we aren't hitting this behaviour anymore.

Suggested change
// If rolling upgrades haven't been run (when VMSS are just provisioned with rolling upgrades but no extensions, auto-scaling are run )
// we can not cancel rolling upgrades
// API call :: GET https://management.azure.com/subscriptions/{subId}/resourceGroups/{rgName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmSSName}/rollingUpgrades/latest?api-version=2021-07-01
// Azure API throws 409 conflict error saying "The entity was not found in this Azure location."
// If the above error message matches, we identify and move forward to delete the VMSS
// in all other cases, it just cancels the rolling upgrades and move ahead to delete the VMSS
if err != nil && !(future.Response().StatusCode == http.StatusConflict && strings.Contains(err.Error(), "There is no ongoing Rolling Upgrade to cancel.")) {
return fmt.Errorf("error cancelling rolling upgrades of Virtual Machine Scale Set %q (Resource Group %q): %+v", vmScaleSetName, resourceGroupName, err)
}

Comment on lines +24 to +28
if err := future.WaitForCompletionRef(ctx, c.VMScaleSetExtensionsClient.Client); err != nil && !(future.Response().StatusCode == http.StatusConflict && strings.Contains(err.Error(), "There is no ongoing Rolling Upgrade to cancel.")) {
return fmt.Errorf("waiting for cancelling rolling upgrades of Virtual Machine Scale Set %q (Resource Group %q): %+v", vmScaleSetName, resourceGroupName, err)
}
log.Printf("[DEBUG] cancelled Virtual Machine Scale Set Rolling Upgrades %q (Resource Group %q).", vmScaleSetName, resourceGroupName)
return nil
Member

And we can simplify the condition here since it should be prevented by the check retrieving any rolling upgrades above

Suggested change
if err := future.WaitForCompletionRef(ctx, c.VMScaleSetExtensionsClient.Client); err != nil && !(future.Response().StatusCode == http.StatusConflict && strings.Contains(err.Error(), "There is no ongoing Rolling Upgrade to cancel.")) {
return fmt.Errorf("waiting for cancelling rolling upgrades of Virtual Machine Scale Set %q (Resource Group %q): %+v", vmScaleSetName, resourceGroupName, err)
}
log.Printf("[DEBUG] cancelled Virtual Machine Scale Set Rolling Upgrades %q (Resource Group %q).", vmScaleSetName, resourceGroupName)
return nil
if err := future.WaitForCompletionRef(ctx, c.VMScaleSetExtensionsClient.Client); err != nil {
    return fmt.Errorf("waiting for cancelling of rolling upgrades for %s: %+v", id, err)
}
log.Printf("[DEBUG] cancelled Virtual Machine Scale Set Rolling Upgrades for %s.", id)
return nil
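
Taken together, the suggestions on this file amount to a helper roughly like the following. This is a sketch assembled from the suggested changes above (an explicit error check after the Cancel call is added for completeness; the parse/utils import paths and package name are assumptions, and the Client struct with the VMScaleSetRollingUpgradesClient and VMScaleSetExtensionsClient fields is defined elsewhere in the provider's compute client package):

package client

import (
    "context"
    "fmt"
    "log"

    "github.com/hashicorp/terraform-provider-azurerm/internal/services/compute/parse"
    "github.com/hashicorp/terraform-provider-azurerm/utils"
)

func (c *Client) CancelRollingUpgradesBeforeDeletion(ctx context.Context, id parse.VirtualMachineScaleSetId) error {
    // Check whether a rolling upgrade exists at all; a 404 means there is
    // nothing to cancel and deletion can proceed straight away.
    resp, err := c.VMScaleSetRollingUpgradesClient.GetLatest(ctx, id.ResourceGroup, id.Name)
    if err != nil {
        if utils.ResponseWasNotFound(resp.Response) {
            return nil
        }
        return fmt.Errorf("retrieving rolling updates for %s: %+v", id, err)
    }

    // A rolling upgrade exists, so cancel it and wait for the cancellation
    // to finish before the caller deletes the scale set.
    future, err := c.VMScaleSetRollingUpgradesClient.Cancel(ctx, id.ResourceGroup, id.Name)
    if err != nil {
        return fmt.Errorf("cancelling rolling upgrades for %s: %+v", id, err)
    }
    if err := future.WaitForCompletionRef(ctx, c.VMScaleSetExtensionsClient.Client); err != nil {
        return fmt.Errorf("waiting for cancelling of rolling upgrades for %s: %+v", id, err)
    }

    log.Printf("[DEBUG] cancelled Virtual Machine Scale Set Rolling Upgrades for %s.", id)
    return nil
}

The resource delete functions then call this helper with the parsed VMSS ID before issuing the delete, as in the call-site suggestions below.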

Comment on lines +1083 to +1086
err = meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, id.ResourceGroup, id.Name)
if err != nil {
return fmt.Errorf("error while cancelling rolling upgrade during destroy phase in Linux Virtual Machine Scale Set %q (Resource Group %q) : %+v", id.Name, id.ResourceGroup, err)
}
Member

Let's simplify this to

Suggested change
err = meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, id.ResourceGroup, id.Name)
if err != nil {
return fmt.Errorf("error while cancelling rolling upgrade during destroy phase in Linux Virtual Machine Scale Set %q (Resource Group %q) : %+v", id.Name, id.ResourceGroup, err)
}
if err := meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, *id); err != nil {
    return fmt.Errorf("cancelling rolling upgrades for %s: %+v", *id, err)
}

Comment on lines +366 to +369
err = meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, id.ResourceGroup, id.VirtualMachineScaleSetName)
if err != nil {
return fmt.Errorf("error while cancelling rolling upgrade during destroy phase in Linux Virtual Machine Scale Set %q (Resource Group %q) : %+v", id.VirtualMachineScaleSetName, id.ResourceGroup, err)
}
Member

Same here

Suggested change
err = meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, id.ResourceGroup, id.VirtualMachineScaleSetName)
if err != nil {
return fmt.Errorf("error while cancelling rolling upgrade during destroy phase in Linux Virtual Machine Scale Set %q (Resource Group %q) : %+v", id.VirtualMachineScaleSetName, id.ResourceGroup, err)
}
if err := meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, *id); err != nil {
    return fmt.Errorf("cancelling rolling upgrades for %s: %+v", *id, err)
}

Comment on lines +1113 to +1116
err = meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, id.ResourceGroup, id.Name)
if err != nil {
return fmt.Errorf("error while cancelling rolling upgrade during destroy phase in Linux Virtual Machine Scale Set %q (Resource Group %q) : %+v", id.Name, id.ResourceGroup, err)
}
Member

And here

Suggested change
err = meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, id.ResourceGroup, id.Name)
if err != nil {
return fmt.Errorf("error while cancelling rolling upgrade during destroy phase in Linux Virtual Machine Scale Set %q (Resource Group %q) : %+v", id.Name, id.ResourceGroup, err)
}
if err := meta.(*clients.Client).Compute.CancelRollingUpgradesBeforeDeletion(ctx, *id); err != nil {
    return fmt.Errorf("cancelling rolling upgrades for %s: %+v", *id, err)
}

@harshavmb
Contributor Author

Hi @stephybun ,

Many thanks for your review. If you have all the changes ready to push from your branch, I'm fine with that. This PR can be closed once you create the new one.

Also, I must admit I can't write the three test cases you mentioned.

Thanks again for your help. Very much appreciated.

Thanks,
Harsha

@stephybun
Member

@harshavmb I'll get a PR open in the morning. I think it may be cutting it a bit close to make it into this week's release, since I need to do some additional testing around the extensions resource, but it should make it in the week after.

Apologies again for the long delay on this one.

@stephybun
Member

Closing in favour of #22991

@stephybun stephybun closed this Aug 17, 2023

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 17, 2024
@harshavmb harshavmb deleted the redo-vm-ss-pr branch May 17, 2024 05:31