Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrate natgateways to use aso #4059

Merged
merged 1 commit into from
Oct 13, 2023

Conversation

nawazkh
Copy link
Member

@nawazkh nawazkh commented Sep 30, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:

  • Migrate NAT Gateways to ASO framework

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4040

Special notes for your reviewer:

  • Commit d76bf75 updates API. Users will be able to configure NAT Gateway's namespace.
    • I need to check if exposing NAT Gateway's namespace is allowed and possible.
    • Will push for this(BYO Natgateway) once other Azure services have been migrated to ASO.
  • Commit 7d67bd0 generally updates NAT Gateways service to ASO.

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NAT Gateway service will use the ASO framework

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 30, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 30, 2023
@nawazkh nawazkh force-pushed the natgateways_to_aso branch 2 times, most recently from 395fa88 to da61e42 Compare September 30, 2023 05:03
@codecov
Copy link

codecov bot commented Sep 30, 2023

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (2e8a1ea) 57.63% compared to head (53da702) 57.84%.
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4059      +/-   ##
==========================================
+ Coverage   57.63%   57.84%   +0.21%     
==========================================
  Files         188      187       -1     
  Lines       19202    19069     -133     
==========================================
- Hits        11067    11031      -36     
+ Misses       7505     7414      -91     
+ Partials      630      624       -6     
Files Coverage Δ
azure/scope/cluster.go 60.07% <100.00%> (+1.98%) ⬆️
controllers/azurecluster_controller.go 37.07% <ø> (ø)
controllers/azurecluster_reconciler.go 59.28% <100.00%> (+0.95%) ⬆️
azure/services/natgateways/natgateways.go 40.00% <41.66%> (-36.00%) ⬇️
azure/services/natgateways/spec.go 76.92% <72.72%> (-14.39%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nawazkh nawazkh force-pushed the natgateways_to_aso branch 2 times, most recently from 592a613 to cee8b8d Compare October 1, 2023 05:24
@nawazkh nawazkh self-assigned this Oct 2, 2023
@nawazkh nawazkh force-pushed the natgateways_to_aso branch 3 times, most recently from bc1ef69 to d76bf75 Compare October 3, 2023 03:41
@nawazkh nawazkh marked this pull request as ready for review October 3, 2023 04:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2023
api/v1beta1/types.go Outdated Show resolved Hide resolved
azure/scope/cluster.go Outdated Show resolved Hide resolved
azure/scope/cluster_test.go Outdated Show resolved Hide resolved
azure/services/aso/aso.go Outdated Show resolved Hide resolved
azure/services/natgateways/natgateways_test.go Outdated Show resolved Hide resolved
azure/services/natgateways/spec.go Outdated Show resolved Hide resolved
@nawazkh nawazkh force-pushed the natgateways_to_aso branch from 7131b5f to 1591fe8 Compare October 3, 2023 22:07
@nawazkh
Copy link
Member Author

nawazkh commented Oct 3, 2023

Squashing all and running a test locally.

@nawazkh nawazkh force-pushed the natgateways_to_aso branch from 1591fe8 to 9540f48 Compare October 3, 2023 22:08
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2023
@nawazkh nawazkh force-pushed the natgateways_to_aso branch from d3bad65 to 5eb3a9a Compare October 3, 2023 23:08
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2023
@nawazkh nawazkh force-pushed the natgateways_to_aso branch 2 times, most recently from 26f1f0d to 176ba9f Compare October 4, 2023 04:34
@nawazkh
Copy link
Member Author

nawazkh commented Oct 9, 2023

I think the users will soon reach out to us asking to expose other fields on NAT Gateways than the one we allow them to specify in CAPZ. That is because any value a user adds/edits of a CAPZ-maintained ASO resource gets overwritten with CAPZ's default value. For example, the user cannot add idleTimeoutInMinutes field to the CAPZ managed ASO NAT Gateway(which makes sense).
We should either add all the exposed fields of ASO's NAT Gateway CRD to CAPZ or allow users to edit the editable fields.

I think there's an argument to be made that if users need to customize resources beyond what CAPZ allows, then they should be going the BYO route for those resources.

@CecileRobertMichon @jackfrancis Do you think it's a bug or a feature that fields never set by CAPZ on resources otherwise managed by CAPZ can be modified durably by users out-of-band with CAPZ?

Adding my two cents.
I can not classify it either, but in IMO, at least from NAT Gateways' point of view, I should be able to add the idleTimeoutInMinutes field to the ASO NAT Gateway resource created and managed by CAPZ.
As a user, I know that CAPZ does not expose the idleTimeoutInMinutes field and I am not trying to edit an existing field, I am trying to add a new one that was not exposed by CAPZ.

But allowing to add an unexposed field brings a new question, what if such an addition creates unwanted behavior and mutation within a service..?
Adding idleTimeoutInMinutes in NAT Gateway does not create any mutation per se, but it is a little early to comment about other services.

@jackfrancis
Copy link
Contributor

I think the users will soon reach out to us asking to expose other fields on NAT Gateways than one we allow them to specify in CAPZ. That is because any value a user adds/edits of a CAPZ-maintained ASO resource gets overwritten with CAPZ's default value. For example, the user cannot add idleTimeoutInMinutes field to the CAPZ managed ASO NAT Gateway(which makes sense).
We should either add all the exposed fields of ASO's NAT Gateway CRD to CAPZ or allow users to edit the editable fields.

I think there's an argument to be made that if users need to customize resources beyond what CAPZ allows, then they should be going the BYO route for those resources.

@CecileRobertMichon @jackfrancis Do you think it's a bug or a feature that fields never set by CAPZ on resources otherwise managed by CAPZ can be modified durably by users out-of-band with CAPZ?

I think it should read "can be modified durably by users out-of-band with kubectl", right? That's neither a feature nor a bug, but a practical architectural decision until we protect those ASO resources behind something like a serviceaccount.

@nawazkh
Copy link
Member Author

nawazkh commented Oct 10, 2023

Do you think it's a bug or a feature that fields never set by CAPZ on resources otherwise managed by CAPZ can be modified durably by users out-of-band with CAPZ?

In addition to what Jack said and Cross-posting the summary from a discussion we had async, below are some more observations made:

  • It's not a bug per se that a user is able to set a field out-of-band but not a "supported" feature.
  • Current SDK allows setting Azure resource fields out of band already.
  • We should not make any user-facing behavior changes as part of migration to ASO.
  • We should wait for actual user demand before exposing new features.

@nawazkh
Copy link
Member Author

nawazkh commented Oct 10, 2023

Need to make changes to allow a user to set fields out-of-band.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 10, 2023
@nawazkh nawazkh force-pushed the natgateways_to_aso branch from 76ac62d to 53da702 Compare October 10, 2023 18:36
@nawazkh
Copy link
Member Author

nawazkh commented Oct 10, 2023

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 10, 2023
@nojnhuh
Copy link
Contributor

nojnhuh commented Oct 11, 2023

@jackfrancis

I think it should read "can be modified durably by users out-of-band with kubectl", right? That's neither a feature nor a bug, but a practical architectural decision until we protect those ASO resources behind something like a serviceaccount.

I mean even today by modifying the Azure resources with az or in the portal or whatever out-of-band with CAPZ and without ASO in the picture at all. This becomes more of an issue when we start using ASO though since ASO's continuous reconciliation will start to overwrite changes made by users outside of CAPZ/ASO that currently do not get overwritten.

I guess my main question here is in @nawazkh's scenario of a NAT gateway managed by CAPZ without idleTimeoutInMinutes set, if a user edits the ASO NatGateway to define that themselves, should CAPZ overwrite that change? What if the user defined the field from the portal?

@CecileRobertMichon
Copy link
Contributor

This becomes more of an issue when we start using ASO though since ASO's continuous reconciliation will start to overwrite changes made by users outside of CAPZ/ASO that currently do not get overwritten.

Could you please expand on this part specifically? Why is ASO overwriting changes made outside of CAPZ/Azure more than the CAPZ reconciler was before?

if a user edits the ASO NatGateway to define that themselves, should CAPZ overwrite that change

Is CAPZ defaulting the field? Is it because the field required?

What if the user defined the field from the portal?

IMO behavior should be the same (if CAPZ is hardcoding and overwriting a field already it would overwrite portal changes today, ASO should not change that behavior). If CAPZ is not overwriting changes from the portal, then it should not overwrite changes from ASO either.

@nojnhuh
Copy link
Contributor

nojnhuh commented Oct 11, 2023

This becomes more of an issue when we start using ASO though since ASO's continuous reconciliation will start to overwrite changes made by users outside of CAPZ/ASO that currently do not get overwritten.

Could you please expand on this part specifically? Why is ASO overwriting changes made outside of CAPZ/Azure more than the CAPZ reconciler was before?

Several resources types created by CAPZ are never modified by CAPZ after they are created, e.g. bastionhosts:

if existing != nil {
if _, ok := existing.(armnetwork.BastionHost); !ok {
return nil, errors.Errorf("%T is not an armnetwork.BastionHost", existing)
}
// bastion host already exists
return nil, nil
}

Currently, users are free to change any parameters with az or the portal after a bastion host is created and CAPZ will never overwrite those changes. If ASO were reconciling the bastion, then those changes would be overwritten if they overlap with CAPZ's initial opinion of what the resource should look like.

if a user edits the ASO NatGateway to define that themselves, should CAPZ overwrite that change

Is CAPZ defaulting the field? Is it because the field required?

I think CAPZ essentially has no knowledge of the field in this case. If CAPZ reconstructs the entire spec of the ASO NatGateway in each reconciliation loop though, implicitly zeroing out the value set by the user could overwrite the user's value.

What if the user defined the field from the portal?

IMO behavior should be the same (if CAPZ is hardcoding and overwriting a field already it would overwrite portal changes today, ASO should not change that behavior). If CAPZ is not overwriting changes from the portal, then it should not overwrite changes from ASO either.

That makes sense to me as long as we accept that changes from the portal that do not get overwritten today may then eventually get overwritten by ASO. I don't think there's a good way around that.

@CecileRobertMichon
Copy link
Contributor

I think CAPZ essentially has no knowledge of the field in this case. If CAPZ reconstructs the entire spec of the ASO NatGateway in each reconciliation loop though, implicitly zeroing out the value set by the user could overwrite the user's value.

I thought the patch helper would handle this and only patch the changed values? CAPI patch helper uses the MergeFrom merge option (https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/client/patch.go#L150) and CAPZ passes the existing ASO object to the patch helper (here https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/services/aso/aso.go#L215).

@nojnhuh
Copy link
Contributor

nojnhuh commented Oct 11, 2023

I thought the patch helper would handle this and only patch the changed values? CAPI patch helper uses the MergeFrom merge option (https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/client/patch.go#L150) and CAPZ passes the existing ASO object to the patch helper (here https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/services/aso/aso.go#L215).

It looks like it does not based on a quick test I ran.

@nawazkh
Copy link
Member Author

nawazkh commented Oct 11, 2023

If CAPZ reconstructs the entire spec of the ASO NatGateway in each reconciliation loop though, implicitly zeroing out the value set by the user could overwrite the user's value.

To add more clarity, there are two types of scenarios in discussion here.

  1. The spec fields which CAPZ has an opinion about (like asonetworkv1.NatGateway.Spec.Location).
    • Ideally, we want the user to "try" editing the editable CAPZ opinionated fields by editing CAPZ CRD.
    • And any changes made via Azure portal or az cli would (and should) be overwritten by the CAPZ controller since CAPZ has a pre-defined opinion about them. (I am only talking about the spec fields here and not the annotations, labels)
  2. The spec fields where CAPZ has no idea that they exist (like asonetworkv1.NatGateway.Spec.IdleTimeoutInMinutes).
    • Since CAPZ does not consider it to be important enough to have an opinion about (read as having defaults or fields in the CRD), the user should be allowed to edit them, and CAPZ should reflect those changes in its CRD.

The current implementation that I have put in for NAT Gateway spec supports scenario 2. Refer: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/4059/files?diff=split#diff-8592ed50d123a48d0e32068cf51a94a4b6d57588b0526d982c1c4f2723be085cR56-R59

Copy link
Contributor

@willie-yao willie-yao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 12, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 9cdc983e76b352eb418e098e13e20afd0ae42ac7

@@ -158,7 +158,7 @@ WEBHOOK_ROOT ?= $(MANIFEST_ROOT)/webhook
RBAC_ROOT ?= $(MANIFEST_ROOT)/rbac
ASO_CRDS_PATH := $(MANIFEST_ROOT)/aso/crds.yaml
ASO_VERSION := v2.3.0
ASO_CRDS := resourcegroups.resources.azure.com
ASO_CRDS := resourcegroups.resources.azure.com natgateways.network.azure.com
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nojnhuh is this list going to keep growing everytime we add a new CRD? would it make sense to track it in some other way once the list ends up being more than a few?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will eventually get unwieldy here. Didn't get around to finding a better mechanism yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking for this PR, we can circle back once the list gets big enough

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

great to see all that deleted code and added tests!

Copy link
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2023
@k8s-ci-robot k8s-ci-robot merged commit 83fe434 into kubernetes-sigs:main Oct 13, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Oct 13, 2023
@nawazkh nawazkh deleted the natgateways_to_aso branch October 13, 2023 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

ASO service: natgateways
6 participants