Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for FailureDomains to AzureMachinePool #667

Conversation

fiunchinho
Copy link
Contributor

@fiunchinho fiunchinho commented Jun 2, 2020

What this PR does / why we need it:
It was not possible to choose the FailureDomains when creating a MachinePool because it was using a custom AzureMachineTemplateSpec. This PR changes the code so that MachinePool uses the same AzureMachineTemplateSpec than other parts of the code.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #663

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

Added `FailureDomains` field to `AzureMachinePoolSpec`

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 2, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fiunchinho
To complete the pull request process, please assign justaugustus
You can assign the PR to them by writing /assign @justaugustus in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Welcome @fiunchinho!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Jun 2, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @fiunchinho. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 2, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from 4380596 to cb9a19f Compare June 2, 2020 07:40
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 2, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from cb9a19f to 238f3cd Compare June 2, 2020 08:00
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 2, 2020
@nader-ziada
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2020
@fiunchinho
Copy link
Contributor Author

Build is failing with

  Incompatible changes:
  - AzureMachinePoolSpec.Template: changed from AzureMachineTemplate to sigs.k8s.io/cluster-api-provider-azure/api/v1alpha3.AzureMachineSpec
  - AzureMachineTemplate: removed

Not sure how to solve it or what does it mean. I need directions, please.

@nader-ziada
Copy link
Contributor

@fiunchinho I believe this is informational to bring attention to the fact that there are breaking changes to the api

@CecileRobertMichon
Copy link
Contributor

/hold

@devigned @juan-lee I remember we had a discussion about this during machine pool implementation, what was the reasoning for using a different spec?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 2, 2020
@devigned
Copy link
Contributor

devigned commented Jun 2, 2020

I believe this was the choice since AzureMachineSpec is likely to have more settings in the VM specific type. One example is in AzureMachinePool, FailureDomains (AZs) would likely be set on the AMP rather than on the AzureMachineSpec since they are provider controlled.

Another problem would be OSDisk. The name of the OSDisk is not honored in VirtualMachineScaleSets. If you specify a name, the PUT to VMSS fails.

I believe there were other concerns where the template would differ.

Should there be some base structure that are shared, perhaps. Though, I would prefer a little copy paste of data structure rather than more complex composition to reduce lines of code. My 2¢.

@fiunchinho
Copy link
Contributor Author

fiunchinho commented Jun 3, 2020

So instead of removing AzureMachineTemplate from azuremachinepools_types.go, should I add a FailureDomain field to it? or rather to AzureMachinePoolSpec?

@devigned
Copy link
Contributor

devigned commented Jun 3, 2020

I think FailureDomains should be []string clusterv1.FailureDomains on the AzureMachinePoolSpec, rather than the AzureMachineTemplate since the underlying provider will decide how to balance machines across FDs.

FailureDomains on the AzureMachinePoolSpec would map to Zones: []string in the VMSS REST API.

Does this make sense to folks?

@CecileRobertMichon
Copy link
Contributor

I think FailureDomains should be []string on the AzureMachinePoolSpec, rather than the AzureMachineTemplate since the underlying provider will decide how to balance machines across FDs.
FailureDomains on the AzureMachinePoolSpec would map to Zones: []string in the VMSS REST API.
Does this make sense to folks?

We definitely want to leverage Scale Set AZ placement rather than defining our own logic for placing individual machines in zones, that was one of the motivations for implementing MachinePools / VMSS in the first place. Should it be of type clusterv1.FailureDomains to match the existing failure domain type for AzureCluster though?

I think the default behavior needs to try to spread instances on all the available failure domains (from the AzureCluster status), and only use the zones defined in the AzureMachinePoolSpec for explicit placement, as described in https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/docs/topics/failure-domains.md#L17.

We also had a discussion in the CAPZ office hours with @richardcase about changing the existing logic and not needing the field in AzureMachine, we need to follow up on that.

@devigned
Copy link
Contributor

devigned commented Jun 3, 2020

@CecileRobertMichon thank you for the correction and further explanation!

@k8s-ci-robot k8s-ci-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 4, 2020
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 8, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from ade8919 to d66a625 Compare June 8, 2020 15:41
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 8, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from d66a625 to 11dba83 Compare June 8, 2020 15:47
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2020
@fiunchinho fiunchinho force-pushed the machinepool-machinetemplate branch from 11dba83 to bd20079 Compare June 8, 2020 16:04
@fiunchinho
Copy link
Contributor Author

Submitted kubernetes-sigs/cluster-api#3157

@fiunchinho
Copy link
Contributor Author

Given that kubernetes-sigs/cluster-api#3157 gets merged, any feedback about the implementation in this PR?

@devigned
Copy link
Contributor

failureDomain (string): the string identifier of the failure domain the instance is running in for the purposes of backwards compatibility and migrating to the v1alpha3 FailureDomain support (where FailureDomain is specified in Machine.Spec.FailureDomain). This field is meant to be temporary to aid in migration of data that was previously defined on the provider type and providers will be expected to remove the field in the next version that provides breaking API changes, favoring the value defined on Machine.Spec.FailureDomain instead. If supporting conversions from previous types, the provider will need to support a conversion from the provider-specific field that was previously used to the failureDomain field to support the automated migration path.

via Machine Infra Provider Spec

With the above guidance in mind, I would expect the following pending kubernetes-sigs/cluster-api#3157.

FailureDomains []string on MachinePool would be mapped into a Zones []string on the VMSSSpec, which would then be used to set compute.VirtualMachineScaleSet.Zones.

The compute.VirtualMachineScaleSet.Zones property is immutable after creation in Azure. If a change occurs to MachinePool.Zones after the AzureMachinePool has been created, then the AzureMachinePool and the MachinePool should go into a failed state with an error message indicating the reason for the failure.

@CecileRobertMichon, what do you think about having Zones on the status for AzureMachinePool since zones can only be specified on create?

Anyone have any other thoughts or fill in any blanks I missed?

@CecileRobertMichon
Copy link
Contributor

CecileRobertMichon commented Jun 15, 2020

The compute.VirtualMachineScaleSet.Zones property is immutable after creation in Azure. If a change occurs to MachinePool.Zones after the AzureMachinePool has been created, then the AzureMachinePool and the MachinePool should go into a failed state with an error message indicating the reason for the failure.

There should be a webhook validation on Update() that doesn't allow changes to FailureDomains if this applies to all providers (do we know of any cases where failure domains would be mutable?) that way it the machine pool never goes into failed state.

@CecileRobertMichon, what do you think about having Zones on the status for AzureMachinePool since zones can only be specified on create?

That makes sense to me but we don't do this for Machines currently, we should be consistent and do it for both? Maybe as a follow up? Keep in mind that AZs aren't supported in every region so that field won't always be set.

@devigned
Copy link
Contributor

That makes sense to me but we don't do this for Machines currently, we should be consistent and do it for both? Maybe as a follow up? Keep in mind that AZs aren't supported in every region so that field won't always be set.

The more I think about this, the less value I think it provides. If the MachinePoolSpec says it is in those zones and the infrastructure is in a succeeded state, then I think we are right to assume the zones in the spec are reconciled to the infrastructure.

There should be a webhook validation on Update() that doesn't allow changes to FailureDomains if this applies to all providers (do we know of any cases where failure domains would be mutable?) that way it the machine pool never goes into failed state.

This would be great if we could say FailureDomains are immutable for all, but I don't think we can. For example, AWS AutoScale groups can add zones.

@CecileRobertMichon
Copy link
Contributor

@fiunchinho would you like to move forward with this PR now that capi v0.3.7 is in? It looks like it's very close

@k8s-ci-robot
Copy link
Contributor

@fiunchinho: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2020
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@k8s-ci-robot
Copy link
Contributor

@fiunchinho: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-provider-azure-build bd20079 link /test pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-test bd20079 link /test pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-e2e bd20079 link /test pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-windows bd20079 link /test pull-cluster-api-provider-azure-e2e-windows

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 6, 2021
@fiunchinho
Copy link
Contributor Author

Superseeded by #1180

@fiunchinho fiunchinho closed this Feb 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support FailureDomains for AzureMachinePools
7 participants