Add Linux VM extension bootstrap script and conditions #1232

Merged

@CecileRobertMichon CecileRobertMichon commented Mar 12, 2021

What type of PR is this?

What this PR does / why we need it: Implements bootstrap failure detection using conditions and VM extensions as proposed in #1076.

This is what a user will see when an AzureMachine has been successfully provisioned but has not yet finished running the bootstrap script:

$ k get azuremachine default-template-md-0-w78jt
NAME                                       READY   STATE
default-template-md-0-w78jt                false   Updating
status:
  conditions:
  - lastTransitionTime: "2021-03-17T00:02:14Z"
    reason: VMUpdating
    severity: Info
    status: "False"
    type: Ready
  - lastTransitionTime: "2021-03-17T00:02:14Z"
    reason: BootstrapInProgress
    severity: Info
    status: "False"
    type: BoostrapSucceeded
  - lastTransitionTime: "2021-03-17T00:02:14Z"
    reason: VMUpdating
    severity: Info
    status: "False"
    type: VMRunning

After the bootstrap script has executed successfully, the AzureMachine status shows:

$ k get azuremachine default-template-md-0-w78jt
NAME                                       READY   STATE
default-template-md-0-w78jt                true    Succeeded
status:
  conditions:
  - lastTransitionTime: "2021-03-16T23:56:17Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-03-16T23:56:17Z"
    status: "True"
    type: BoostrapSucceeded
  - lastTransitionTime: "2021-03-16T23:56:17Z"
    status: "True"
    type: VMRunning

If the bootstrap script fails to execute, the status will show:

$ k get azuremachines default-template-md-0-ppjmh
NAME                                       READY   STATE
default-template-md-0-ppjmh                false   Failed
status:
  conditions:
  - lastTransitionTime: "2021-03-17T00:22:39Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-03-17T00:22:39Z"
    reason: BootstrapFailed
    severity: Error
    status: "False"
    type: BoostrapSucceeded
  - lastTransitionTime: "2021-03-17T00:22:39Z"
    status: "True"
    type: VMRunning
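
The three outputs above suggest a simple mapping from the extension's provisioning state to the bootstrap condition. The following is a minimal sketch of that mapping, not the PR's actual code: the `Condition` struct, the `bootstrapCondition` helper, and the exact set of state constants are assumptions made for illustration. The condition type string is spelled exactly as it appears in the output above.

```go
package main

import "fmt"

// ProvisioningState mirrors the type renamed in this PR. The constant set
// below is an assumption for illustration.
type ProvisioningState string

const (
	Succeeded ProvisioningState = "Succeeded"
	Failed    ProvisioningState = "Failed"
	Updating  ProvisioningState = "Updating"
)

// Condition is a simplified, hypothetical stand-in for a Cluster API condition.
type Condition struct {
	Type     string
	Status   string
	Severity string
	Reason   string
}

// bootstrapCondition maps an extension provisioning state to the condition
// values shown in the status output above.
func bootstrapCondition(state ProvisioningState) Condition {
	// The type string is copied verbatim from the output above.
	c := Condition{Type: "BoostrapSucceeded"}
	switch state {
	case Succeeded:
		c.Status = "True"
	case Failed:
		c.Status, c.Severity, c.Reason = "False", "Error", "BootstrapFailed"
	default:
		// Any other state is treated as "bootstrap still running".
		c.Status, c.Severity, c.Reason = "False", "Info", "BootstrapInProgress"
	}
	return c
}

func main() {
	for _, s := range []ProvisioningState{Updating, Succeeded, Failed} {
		fmt.Printf("%-10s -> %+v\n", s, bootstrapCondition(s))
	}
}
```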

Also renames type "VMState" to "ProvisioningState" to be more generic and match Azure naming. This is a breaking change for anyone importing CAPZ types, but should have no impact on user templates and existing CRDs (conversion from v1alpha3 to v1alpha4 is auto-generated).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #603

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Add Linux VM extension bootstrap script and conditions
Renames type "VMState" to "ProvisioningState"

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, and cncf-cla: yes labels Mar 12, 2021
@k8s-ci-robot k8s-ci-robot added the area/provider/azure, size/L, and sig/cluster-lifecycle labels Mar 12, 2021
@CecileRobertMichon CecileRobertMichon changed the title from "[WIP] Add VM extension bootstrap script and" to "[WIP] Add Linux VM extension bootstrap script and conditions" Mar 12, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Mar 15, 2021
azure/scope/machine.go (outdated, resolved)
azure/services/vmextensions/client.go (outdated, resolved)
azure/services/vmextensions/vmextensions.go (outdated, resolved)
azure/services/vmssextensions/vmssextensions.go (outdated, resolved)
azure/defaults.go (outdated, resolved)

shysank commented Mar 16, 2021

Just realized it's a WIP PR; please feel free to ignore any review comments that are not relevant.

return err
}
_, err = future.Result(ac.vmextensions)
_, err := ac.vmextensions.CreateOrUpdate(ctx, resourceGroupName, vmName, name, parameters)
Contributor Author

@devigned thoughts on this approach of doing "async" reconciliation? Is it significantly less efficient than storing the future, or, since we only care about eventually getting the extension status, do you think it's appropriate in this situation?

Contributor Author

I'm thinking that in the future we can move this over to VM CreateOrUpdate once that's async. For now, the extension could be a long-running operation (the timeout is 20 minutes), so we don't want it to be part of a blocking operation.

Contributor

I think this should be fine. Though, it would be easy to call this multiple times if one is not tracking an ongoing operation via a secret. With the AMP future, we check to see if the operation is done prior to proceeding, which the Azure SDK for Go can help with. In this case, you'd be responsible for managing that state.

@CecileRobertMichon CecileRobertMichon force-pushed the extension-script branch 2 times, most recently from 436e638 to e700487 March 16, 2021 20:02
@k8s-ci-robot k8s-ci-robot added the size/XL label and removed the size/L label Mar 16, 2021
switch infrav1.ProvisioningState(provisioningState) {
case infrav1.Succeeded:
	m.V(4).Info("extension provisioning state is succeeded", "vm extension", extensionName, "scale set", m.Name())
	conditions.MarkTrue(m.AzureMachinePool, infrav1.BootstrapSucceededCondition)
Contributor Author

It'd be nice to set conditions per VMSS VM; hopefully that gets easier with #819.

Contributor Author

Right now it's at the AzureMachinePool.Status level:

status:
    conditions:
    - lastTransitionTime: "2021-03-16T23:53:39Z"
      status: "True"
      type: BoostrapSucceeded
    - lastTransitionTime: "2021-03-16T23:49:37Z"
      status: "True"
      type: ScaleSetRunning

@k8s-ci-robot k8s-ci-robot added the release-note label and removed the do-not-merge/release-note-label-needed label Mar 16, 2021
@CecileRobertMichon
Contributor Author

Thanks for the initial review @shysank! I've addressed most of your comments. I still have docs and unit tests as TODOs.

api/v1alpha4/types.go Outdated Show resolved Hide resolved
@CecileRobertMichon CecileRobertMichon force-pushed the extension-script branch 2 times, most recently from dea6219 to 669b1c0 March 18, 2021 23:20
@CecileRobertMichon CecileRobertMichon changed the title from "[WIP] Add Linux VM extension bootstrap script and conditions" to "Add Linux VM extension bootstrap script and conditions" Mar 18, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Mar 18, 2021
@k8s-ci-robot
Contributor

k8s-ci-robot commented Mar 19, 2021

@CecileRobertMichon: The following test failed, say /retest to rerun all failed tests:

Test name                                 Commit    Details   Rerun command
pull-cluster-api-provider-azure-apidiff   638c14f   link      /test pull-cluster-api-provider-azure-apidiff

@CecileRobertMichon
Contributor Author

This is ready for review now

@nader-ziada
Contributor

nader-ziada commented Mar 22, 2021

when I try running it, I don't see the Updating phase

NAME                           READY   STATE
nz-test1-control-plane-wk926   true    Succeeded
nz-test1-md-0-gc5gc

then goes directly to

NAME                           READY   STATE
nz-test1-control-plane-wk926   true    Succeeded
nz-test1-md-0-gc5gc            true    Succeeded

not sure why yet, will dig into it a bit more

update:

I see this error in my logs

[manager] E0322 19:19:04.361187       1 azuremachine_controller.go:310] controllers/AzureMachine "msg"="transient failure to reconcile AzureMachine, retrying" "error"="unable to create vm extension: reconcile error occurred that can be recovered. Object will be requeued after 30s The actual error is: extension still provisioning" "AzureCluster"="nz-test1" "azureMachine"="nz-test1-md-0-gc5gc" "cluster"="nz-test1" "machine"="nz-test1-md-0-67888c96b8-h94st" "namespace"="default" "name"="nz-test1-md-0-gc5gc"

Contributor

@devigned devigned left a comment

Looks great. Only had a minor comment about async APIs.

@CecileRobertMichon
Contributor Author

when I try running it, I don't see the Updating phase

@nader-ziada I wonder if you're doing something differently. This is what I see when scaling up:

Every 2.0s: kubectl get azuremachines    Ceciles-MacBook-Pro.local: Tue Mar 23 18:42:51 2021

NAME                                   READY   STATE
default-template-control-plane-5zq49   true    Succeeded
default-template-md-0-b9cpb            true    Succeeded
default-template-md-0-bnq44
default-template-md-0-vkd4b            true    Succeeded

Every 2.0s: kubectl get azuremachines    Ceciles-MacBook-Pro.local: Tue Mar 23 18:43:08 2021

NAME                                   READY   STATE
default-template-control-plane-5zq49   true    Succeeded
default-template-md-0-b9cpb            true    Succeeded
default-template-md-0-bnq44            false   Updating
default-template-md-0-vkd4b            true    Succeeded

Every 2.0s: kubectl get azuremachines    Ceciles-MacBook-Pro.local: Tue Mar 23 18:43:38 2021

NAME                                   READY   STATE
default-template-control-plane-5zq49   true    Succeeded
default-template-md-0-b9cpb            true    Succeeded
default-template-md-0-bnq44            true    Succeeded
default-template-md-0-vkd4b            true    Succeeded

I'm hoping we can change that first step to show "default-template-md-0-bnq44   false   Creating" as part of #1067 (by returning and setting the status before the VM is done creating).

@nader-ziada
Contributor

I tried it again and it seems to work as expected:

NAME                                READY   STATE
nz-test1-control-plane-pq8lt        true    Succeeded
nz-test1-md-0-qs2nf                 false   Updating
nz-test1-md-0-qs2nf                 false   Updating
nz-test1-md-0-qs2nf                 true    Succeeded

@nader-ziada
Contributor

/lgtm

@devigned any comments?

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 25, 2021
Contributor

@devigned devigned left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: devigned

@k8s-ci-robot k8s-ci-robot added the approved label Mar 25, 2021
@k8s-ci-robot k8s-ci-robot merged commit 4f3f98b into kubernetes-sigs:master Mar 25, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.5.0 milestone Mar 25, 2021
@CecileRobertMichon CecileRobertMichon deleted the extension-script branch February 17, 2023 22:59
Successfully merging this pull request may close these issues.

Bootstrap failure detection
6 participants