
FailureDomain selection order possibly incorrect #679

Closed
richardcase opened this issue Jun 4, 2020 · 16 comments
Labels
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/documentation: Categorizes issue or PR as related to documentation.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@richardcase
Member

/kind bug

What steps did you take and what happened:
With #439 we added support for reporting failure domains (i.e. Azure AZs) back to CAPI so that it could then try to spread machines across failure domains (by populating Machine.Spec.FailureDomain). Logic was added to the machine scope to get the failure domain for use during machine reconciliation, based on the following precedence:

  1. Machine.Spec.FailureDomain
  2. AzureMachine.Spec.FailureDomain
  3. AzureMachine.Spec.AvailabilityZone.ID (This is DEPRECATED)
  4. No AZ

If a region has availability zones, these will be reported back to CAPI, which will then choose one of the AZs and set Machine.Spec.FailureDomain (if that field isn't already set). This means that if AZs are explicitly set in AzureMachine.Spec.FailureDomain or AzureMachine.Spec.AvailabilityZone.ID, they will be ignored.

This behavior doesn't feel correct.
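
To make the current order concrete, here is a minimal runnable sketch of the selection logic. The types and names are stand-ins for illustration only, not the real clusterv1.Machine / infrav1.AzureMachine types or the actual machine scope code:

package main

import "fmt"

// Stand-in types; the real CAPI/CAPZ objects carry many more fields.
type machine struct{ FailureDomain *string }

type availabilityZone struct{ ID *string }

type azureMachine struct {
	FailureDomain    *string
	AvailabilityZone availabilityZone // deprecated
}

// pickFailureDomain mirrors the precedence listed above.
func pickFailureDomain(m machine, am azureMachine) string {
	if m.FailureDomain != nil { // 1. Machine.Spec.FailureDomain
		return *m.FailureDomain
	}
	if am.FailureDomain != nil { // 2. AzureMachine.Spec.FailureDomain
		return *am.FailureDomain
	}
	if am.AvailabilityZone.ID != nil { // 3. AzureMachine.Spec.AvailabilityZone.ID (deprecated)
		return *am.AvailabilityZone.ID
	}
	return "" // 4. no AZ
}

func main() {
	capiChoice, userChoice := "1", "2"
	// CAPI has already populated Machine.Spec.FailureDomain, so the AZ the
	// user set explicitly on the AzureMachine is ignored: this prints "1".
	fmt.Println(pickFailureDomain(
		machine{FailureDomain: &capiChoice},
		azureMachine{AvailabilityZone: availabilityZone{ID: &userChoice}},
	))
}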

What did you expect to happen:
I would expect the user to be able to explicitly set an AZ for a machine and for that to be honored.

We could change the order to:

  1. AzureMachine.Spec.AvailabilityZone.ID (This is DEPRECATED)
  2. AzureMachine.Spec.FailureDomain
  3. Machine.Spec.FailureDomain

This would allow us to support old definitions and also take advantage of automatic AZ selection if no AZ is explicitly set.

I also think that AzureMachine.Spec.FailureDomain feels a little redundant and we could change the precedence to:

  1. AzureMachine.Spec.AvailabilityZone.ID (This is DEPRECATED)
  2. Machine.Spec.FailureDomain

So if a user wants to explicitly set the AZ for a machine, they would use Machine.Spec.FailureDomain.

Environment:

  • cluster-api-provider-azure version: 0.43
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 4, 2020
@richardcase
Member Author

/assign

@richardcase
Member Author

@CecileRobertMichon @devigned @detiber - be good to get your thoughts on this.

@detiber
Member

detiber commented Jun 4, 2020

@richardcase both AzureMachine.Spec.AvailabilityZone.ID and AzureMachine.Spec.FailureDomain should be considered deprecated in favor of Machine.Spec.FailureDomain.

The addition of AzureMachine.Spec.FailureDomain was required to allow for migrating existing values that were previously set in AzureMachine.Spec.AvailabilityZone.ID to Machine.Spec.FailureDomain automatically for existing users.

The Machine controller will pull the value of AzureMachine.Spec.FailureDomain to populate Machine.Spec.FailureDomain (if it is not already set).

The reason for this was to make failure domains first class across the common Cluster API resources. Previously they were only available as infrastructure-provider-specific fields, with no real commonality across providers to standardize on.
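
A minimal sketch of that CAPI-side step, using stand-in types (the real logic lives in the upstream Machine controller and differs in detail):

package main

import "fmt"

// Illustration only: how the Machine controller conceptually adopts the
// failure domain reported via the infra machine.
type spec struct{ FailureDomain *string }

func adoptFailureDomain(machine, azureMachine *spec) {
	// Only populate Machine.Spec.FailureDomain if it is not already set.
	if machine.FailureDomain == nil {
		machine.FailureDomain = azureMachine.FailureDomain
	}
}

func main() {
	fd := "3"
	m, am := &spec{}, &spec{FailureDomain: &fd}
	adoptFailureDomain(m, am)
	fmt.Println(*m.FailureDomain) // "3": Machine.Spec.FailureDomain now carries the value
}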

@richardcase
Member Author

Ah yes, thanks @detiber. I forgot we discussed and added this to the machine controller:

if machineScope.AzureMachine.Spec.FailureDomain == nil {
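
My reading of what that check leads into, as a hedged sketch with stand-in types (the actual controller body may differ): if the newer FailureDomain field is unset, the deprecated AvailabilityZone.ID is carried forward so the CAPI Machine controller can then adopt it.

package main

import "fmt"

// Stand-ins for the relevant AzureMachine spec fields.
type availabilityZone struct{ ID *string }

type azureMachineSpec struct {
	FailureDomain    *string
	AvailabilityZone availabilityZone // deprecated
}

// migrateFailureDomain carries the deprecated AvailabilityZone.ID forward
// into FailureDomain when the latter is unset.
func migrateFailureDomain(s *azureMachineSpec) {
	if s.FailureDomain == nil {
		s.FailureDomain = s.AvailabilityZone.ID
	}
}

func main() {
	id := "1"
	s := &azureMachineSpec{AvailabilityZone: availabilityZone{ID: &id}}
	migrateFailureDomain(s)
	fmt.Println(*s.FailureDomain) // "1"
}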

@CecileRobertMichon
Contributor

@richardcase does that mean there is nothing to do here for now? Maybe just make this clearer in the docs? And then eventually remove the AzureMachine fields when we move to v1alpha4?

@richardcase
Member Author

@CecileRobertMichon - yes, I think a small docs update for now and then removal of the field in v1alpha4 would work.

I'll do the docs update and reference this issue. Should I then raise an issue for removal in v1alpha4, or will it be covered by #618?

@CecileRobertMichon
Contributor

I commented on #618 (comment); I think that should be enough. Thanks!

@alexeldeib
Contributor

/kind documentation
/kind cleanup
/remove-kind bug

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 22, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2020
@CecileRobertMichon
Contributor

/remove-lifecycle stale

@richardcase did you still want to update the docs for this one?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020
@richardcase
Member Author

Sorry @CecileRobertMichon, I completely forgot about this. Yes, I will update the docs for this.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2021
@richardcase
Member Author

I've still not got around to this, sorry. With v1alpha4 not far off, is there any benefit in doing this?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2021
@CecileRobertMichon
Contributor

/close

availabilityZones were removed in #1233

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

/close

availabilityZones were removed in #1233

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon CecileRobertMichon removed this from the next milestone May 4, 2023