Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: upstream most of Azure managed CAS changes in cloudprovider/azure for 1.29 #6991

Conversation

comtalyst
Copy link
Contributor

@comtalyst comtalyst commented Jun 28, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

"Refactor" as a part of fork-upstream (managed-selfhosted) realignment. Should not have any breaking changes.
This codebases realignment will simplify the logistics between the two, cutting a significant portion of maintenance cost.

There will be a separate effort focusing on improve code quality, rather than realigning codebase.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

All are in cloudprovider/azure except go.mod, go.sum, and vendor.
See comments (as annotations) for explanation of each changes.

But, overall:

  • Bump cloud-provider-azure library to v1.29
  • Trivial code improvements (e.g., using constants for reused string literals, method interfaces, etc.)
  • Support cloud provider AAD certificate authentication
  • Add a new GPU label, with the old one being deprecated/switch out soon
  • Temporarily adding new configuration options; primarily for managed offering, not being supported for self-hosted yet so don't rely on them until documented: EnableForceDelete (already in master), EnableDetailedCSEMessage, GetVmssSizeRefreshPeriod
  • Add retry for creatingAzureManager in case of throttled requests (already in master)
  • Add support for edge zones (already in master)

Does this PR introduce a user-facing change?

- Azure: update cloud-provider-azure library version to v1.29.0
- Azure: support cloud provider AAD certificate authentication
- Azure: add support for edge zones

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @comtalyst. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 28, 2024
@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Jun 28, 2024
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer June 28, 2024 23:07
@comtalyst comtalyst force-pushed the comtalyst/azure-changes-from-managed-1.29 branch 3 times, most recently from b38947b to 6c53a9d Compare June 29, 2024 19:56
Copy link
Contributor Author

@comtalyst comtalyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(some annotations)

k8s.io/utils v0.0.0-20230726121419-3b25d923346b
sigs.k8s.io/cloud-provider-azure v1.28.0
k8s.io/utils v0.0.0-20231127182322-b307cd553661
sigs.k8s.io/cloud-provider-azure v1.29.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated cloud-provider-azure version. This have resulted in other module changes as well as the switch to go.uber.org/mock/gomock.

@@ -163,7 +163,7 @@ func (as *AgentPool) GetVMIndexes() ([]int, map[int]string, error) {
}

indexes = append(indexes, index)
resourceID, err := convertResourceGroupNameToLower("azure://" + *instance.ID)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only trivial changes in this file.

@@ -185,7 +185,8 @@ func TestGetVMsFromCache(t *testing.T) {
mockVMClient := mockvmclient.NewMockInterface(ctrl)
testAS.manager.azClient.virtualMachinesClient = mockVMClient
mockVMClient.EXPECT().List(gomock.Any(), testAS.manager.config.ResourceGroup).Return(expectedVMs, nil)
ac, err := newAzureCache(testAS.manager.azClient, refreshInterval, testAS.manager.config.ResourceGroup, vmTypeStandard, false, "")
testAS.manager.config.VMType = vmTypeStandard
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only trivial changes in this file.

@@ -35,8 +35,8 @@ import (
"github.com/Azure/azure-sdk-for-go/services/storage/mgmt/2021-09-01/storage"
"github.com/Azure/go-autorest/autorest/date"
"github.com/Azure/go-autorest/autorest/to"
"github.com/golang/mock/gomock"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resulted from cloud-provider-azure bump.

@@ -18,8 +18,9 @@ package azure

import (
"fmt"
"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only trivial changes in this file.

@@ -47,8 +52,19 @@ type AzureManager struct {
azClient *azClient
env azure.Environment

azureCache *azureCache
lastRefresh time.Time
// azureCache is used for caching Azure resources.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of instance cache refactor.

@@ -106,8 +122,22 @@ func createAzureManagerInternal(configReader io.Reader, discoveryOpts cloudprovi
return nil, err
}

retryBackoff := wait.Backoff{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"added retry for creatingAzureManager in case of throttled requests"

Copy link
Contributor Author

@comtalyst comtalyst Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requiring this in master due to #6550.

Thus, this change in this PR will be equivalent to cherry-pick.

@@ -0,0 +1,260 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of instance cache refactor.


lastSizeRefresh time.Time
// Current Size (Number of VMs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of instance cache refactor.

@@ -347,7 +379,9 @@ func (m *AzureManager) getFilteredScaleSets(filter []labelAutoDiscoveryConfig) (
curSize = *scaleSet.Sku.Capacity
}

vmss, err := NewScaleSet(spec, m, curSize)
dedicatedHost := scaleSet.VirtualMachineScaleSetProperties != nil && scaleSet.VirtualMachineScaleSetProperties.HostGroup != nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of a managed feature.

@comtalyst comtalyst changed the title refactor: upstream most of Azure managed CAS changes in cloudprovider/azure refactor: upstream most of Azure managed CAS changes in cloudprovider/azure for 1.29 Jul 2, 2024
@comtalyst comtalyst marked this pull request as ready for review July 2, 2024 00:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2024
@k8s-ci-robot k8s-ci-robot requested review from tallaxes and x13n July 2, 2024 00:45
@comtalyst comtalyst marked this pull request as draft July 2, 2024 23:53
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2024
"strings"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requiring this in master due to #6863

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a reference to the fact that you'll have separate PRs for 1.28, 1.29, 1.30 (not yet), and master?
image

and another PR has already added this line for master so it will be missing from that particular PR? (aka #7003)

if that's the correct understanding, let's link #7003 here as well + update the comment to "this was already added in master by #6863. will not include this change in #7003"

// (DEPRECATED, DO NOT USE) EnableDetailedCSEMessage defines whether to emit error messages in the CSE error body info
EnableDetailedCSEMessage bool `json:"enableDetailedCSEMessage,omitempty" yaml:"enableDetailedCSEMessage,omitempty"`
// (DEPRECATED, DO NOT USE) EnableForceDelete defines whether to enable force deletion on the APIs
EnableForceDelete bool `json:"enableForceDelete,omitempty" yaml:"enableForceDelete,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for EnableForceDelete) Not requiring this in master due to #6447

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still mark as deprecation due to unknown future, might be removed later.

@@ -26,13 +26,18 @@ import (
"time"

"github.com/Azure/go-autorest/autorest/azure"
"k8s.io/apimachinery/pkg/util/wait"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requiring this in master due to #6550

instancesRefreshJitter: az.config.VmssVmsCacheJitter,
},

enableForceDelete: az.config.EnableForceDelete,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requiring this in master due to #6447

@@ -181,30 +247,25 @@ func (scaleSet *ScaleSet) getCurSize() (int64, error) {
return scaleSet.curSize, nil
}

// GetScaleSetSize gets Scale Set size.
func (scaleSet *ScaleSet) GetScaleSetSize() (int64, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required in master due to 1.30 fork stop cherry-picking the difference (??)

defer scaleSet.sizeMutex.Unlock()

// setScaleSetSize sets ScaleSet size.
func (scaleSet *ScaleSet) setScaleSetSize(size int64, delta int) error {
Copy link
Contributor Author

@comtalyst comtalyst Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required in master due to 1.30 fork stop cherry-picking the difference (??). But just method visibility, should not have a problem.

@@ -490,7 +601,18 @@ func (scaleSet *ScaleSet) DeleteNodes(nodes []*apiv1.Node) error {
ref := &azureRef{
Name: node.Spec.ProviderID,
}
refs = append(refs, ref)

if node.Annotations[cloudprovider.FakeNodeReasonAnnotation] == cloudprovider.FakeNodeUnregistered {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required in master due to 1.30 fork stop cherry-picking the difference (??). This one might worth looking into---later.

node := apiv1.Node{}
nodeName := fmt.Sprintf("%s-asg-%d", scaleSetName, rand.Int63())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial changes early in this function

@@ -122,17 +138,14 @@ func buildNodeFromTemplate(scaleSetName string, template compute.VirtualMachineS
node.Status.Capacity[apiv1.ResourceCPU] = *resource.NewQuantity(vcpu, resource.DecimalSI)
// isNPSeries returns if a SKU is an NP-series SKU
// SKU API reports GPUs for NP-series but it's actually FPGAs
if !isNPSeries(*template.Sku.Name) {
if isNPSeries(*template.Sku.Name) {
Copy link
Contributor Author

@comtalyst comtalyst Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to add/fix support for FPGA?

Part of "Realign azure template labels with AKS changes" (so as the rest of this file)

)

func buildInstanceOS(template compute.VirtualMachineScaleSet) string {
Copy link
Contributor Author

@comtalyst comtalyst Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really changed, just moved to line 228 133.

@rakechill
Copy link
Contributor

rakechill commented Jul 20, 2024

before I approve:

  • can we publish this PR? I'm fairly certain some checks won't run until then (haven't verified yet)
  • I'm gonna deploy this version locally and make sure things seem reasonable
  • can you create an issue to track things we want to return to for code quality/clarity. don't have to work on it now, but I want it to be easy to refer to once we get initial code diffs merged

@comtalyst comtalyst force-pushed the comtalyst/azure-changes-from-managed-1.29 branch from f67a5d7 to d0cfa6c Compare July 21, 2024 18:29
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 21, 2024
@comtalyst comtalyst marked this pull request as ready for review July 21, 2024 18:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 21, 2024
@k8s-ci-robot k8s-ci-robot requested a review from jackfrancis July 21, 2024 18:46
@comtalyst
Copy link
Contributor Author

comtalyst commented Jul 21, 2024

before I approve:

  • can we publish this PR? I'm fairly certain some checks won't run until then (haven't verified yet)
  • I'm gonna deploy this version locally and make sure things seem reasonable
  • can you create an issue to track things we want to return to for code quality/clarity. don't have to work on it now, but I want it to be easy to refer to once we get initial code diffs merged

Don't think I can run some tests until ok-to-test. Waiting for others for this one.
And I don't think you have permissions to approve or mark that ok-to-test?

@tallaxes
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 24, 2024
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 26, 2024
@k8s-ci-robot
Copy link
Contributor

@comtalyst: The /retest command does not accept any targets.
The following commands are available to trigger optional jobs:

  • /test pull-cluster-autoscaler-e2e-azure
  • /test pull-kubernetes-e2e-autoscaling-vpa-full

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-autoscaler-e2e-azure

In response to this:

/retest pull-cluster-autoscaler-e2e-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

@comtalyst: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-autoscaler-e2e-azure 34e510f link false /test pull-cluster-autoscaler-e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@feiskyer
Copy link
Member

@comtalyst would you open a separate PR for master branch?

@comtalyst
Copy link
Contributor Author

@comtalyst would you open a separate PR for master branch?

Yes. In fact is one on each version, including master: #7003, #7076, #7067, #7075.

@tallaxes
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 31, 2024
Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: comtalyst, feiskyer, tallaxes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2024
@feiskyer
Copy link
Member

Could you fix the test failures of pull-cluster-autoscaler-e2e-azure in a separate PR?

@k8s-ci-robot k8s-ci-robot merged commit 7f4b7fa into kubernetes:cluster-autoscaler-release-1.29 Jul 31, 2024
6 of 7 checks passed
@comtalyst
Copy link
Contributor Author

pull-cluster-autoscaler-e2e-azure

That test doesn't support release branches yet. Running it here was a mistake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants