VMSS Flex support for MachinePools #2813

Merged: 1 commit, Jan 10, 2023


mboersma
Contributor

@mboersma mboersma commented Nov 15, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements VMSS Flex mode support for MachinePools.

There is still work to do here; I'm just getting a WIP branch out to help with collaboration. For example:

  • Flex mode still on by default (Fixed)
  • MP ReadyReplicas not always syncing with actual VM count (Fixed)
  • faultDomainCount still hard-coded at 3 because cache code isn't working (Fixed)

Which issue(s) this PR fixes:

Fixes #999
Fixes #2987

Special notes for your reviewer:

This work was started and mostly completed by @devigned and @jackfrancis. Thanks team!

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

VMSS Flex support for MachinePools

@k8s-ci-robot added the release-note, do-not-merge/work-in-progress, kind/feature, cncf-cla: yes, and size/XXL labels Nov 15, 2022
@mboersma added this to the v1.7 milestone Nov 17, 2022
@mboersma force-pushed the vmss-flex branch 2 times, most recently from d873a5c to 1defca4 Nov 22, 2022 16:50
@k8s-ci-robot added the needs-rebase label Nov 30, 2022
@k8s-ci-robot removed the needs-rebase label Nov 30, 2022
@mboersma force-pushed the vmss-flex branch 2 times, most recently from 7c21d95 to 97e818f Nov 30, 2022 20:47
Review thread on Tiltfile (outdated, resolved)
Review thread on azure/converters/vmss.go (resolved)
Review thread on azure/scope/machinepool.go (outdated, resolved)
Review thread on azure/scope/machinepool.go (outdated, resolved)
Review thread on azure/services/scalesets/scalesets.go (outdated, resolved)
Review thread on azure/services/scalesets/scalesets.go (outdated, resolved)
Review thread on azure/services/scalesetvms/scalesetvms.go (outdated, resolved)
Review thread on azure/services/virtualmachines/client.go (resolved)
Review thread on azure/types.go (outdated, resolved)
@k8s-ci-robot added the needs-rebase label Dec 9, 2022
@k8s-ci-robot removed the needs-rebase label Dec 9, 2022
@k8s-ci-robot added the needs-rebase label Dec 14, 2022
@k8s-ci-robot removed the needs-rebase label Dec 15, 2022
@mboersma force-pushed the vmss-flex branch 2 times, most recently from 13c4dbe to f1b9a94 Dec 21, 2022 02:19
@mboersma
Contributor Author

mboersma commented Jan 9, 2023

/test pull-cluster-api-provider-azure-e2e

e2e failed to create an HA cluster and e2e-optional failed on the clusterclass spec. Neither appears to be related to these changes, but that's a worrisome number of flakes. ❄️

@mboersma
Contributor Author

mboersma commented Jan 9, 2023

/test pull-cluster-api-provider-azure-e2e-optional

@mboersma
Contributor Author

mboersma commented Jan 9, 2023

I updated the azure.json-related code not to add enableVmssFlexNodes if the EXP_MACHINE_POOLS feature flag is off (and tested this locally).
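
A minimal sketch of that gating, for illustration only. The `cloudProviderConfig` struct, the `buildCloudProviderConfig` helper, and the `machinePoolEnabled` parameter are hypothetical stand-ins (the actual CAPZ types and feature-gate lookup are not shown in this thread), and the `cloud` and `vmType` values are placeholders. It assumes the key should be absent from the rendered JSON, rather than set to false, when the feature is off.

```go
package main

import "encoding/json"

// cloudProviderConfig is a hypothetical, trimmed-down stand-in for the
// struct that gets serialized into azure.json.
type cloudProviderConfig struct {
	Cloud               string `json:"cloud"`
	VMType              string `json:"vmType"`
	EnableVmssFlexNodes bool   `json:"enableVmssFlexNodes,omitempty"`
}

// buildCloudProviderConfig sets enableVmssFlexNodes only when the
// MachinePool feature is enabled. Because the field is tagged omitempty,
// a false value means the key is left out of the JSON entirely.
func buildCloudProviderConfig(machinePoolEnabled bool) ([]byte, error) {
	cfg := cloudProviderConfig{
		Cloud:  "AzurePublicCloud", // placeholder value
		VMType: "vmss",             // placeholder value
	}
	if machinePoolEnabled {
		cfg.EnableVmssFlexNodes = true
	}
	return json.Marshal(cfg)
}
```

In this sketch the caller would pass the state of the EXP_MACHINE_POOLS feature flag as `machinePoolEnabled`; how that flag is read is left out on purpose.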

@mboersma
Contributor Author

mboersma commented Jan 9, 2023

/test pull-cluster-api-provider-azure-e2e-optional

@CecileRobertMichon
Contributor

/lgtm

@k8s-ci-robot added the lgtm label Jan 9, 2023
Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/assign @jackfrancis

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e

Failed to provision HA cluster.

@mboersma
Contributor Author

This has passed the -optional test several times now without failure since we added the flag to azure.json, and all the failures we've seen since have been known flakes. (The apidiff error appears because we changed the AzureMachinePool webhook; not blocking, IMHO.) So I have confidence this is working and not breaking anything else.

Contributor

@jackfrancis jackfrancis left a comment


/lgtm

@jackfrancis
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Jan 10, 2023
@CecileRobertMichon
Contributor

Error from server (NotFound): configmaps "kubeadm-config" not found

seems like a race unrelated to this change, will investigate

/retest

@mboersma
Contributor Author

mboersma commented Jan 10, 2023

/retest

I haven't seen this before, yay. I don't think this PR caused it, so retesting.

INFO: "With ipv6 worker node" started at Tue, 10 Jan 2023 20:36:35 UTC on Ginkgo node 6 of 10 and junit test report to file /logs/artifacts/test_e2e_junit.e2e_suite.1.xml
  << Timeline
  [FAILED] Timed out after 1800.000s.
  Expected success, but got an error:
      <*errors.fundamental | 0xc003fa89c0>: {
          msg: "cannot re-use a name that is still in use",
          stack: [0x34dfe66, 0x34dda45, 0x352c418, 0x14ef085, 0x14ee57c, 0x190a67a, 0x190b582, 0x1908d2d, 0x352c09b, 0x351e558, 0x3522130, 0x2f9caf0, 0x35342c8, 0x18e639b, 0x18f9e98, 0x147c741],
      }
      cannot re-use a name that is still in use
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:879 @ 01/10/23 20:30:17.977
  Full Stack Trace
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.InstallHelmChart({_, _}, {{0x4223910, 0xc000428190}, {{0xc000d03e30, 0x22}, {0xc000531510, 0x31}, {0xc000533cf3, 0x17}, ...}, ...}, ...)
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:879 +0x5db
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.InstallAzureDiskCSIDriverHelmChart({_, _}, {{0x4223910, 0xc000428190}, {{0xc000d03e30, 0x22}, {0xc000531510, 0x31}, {0xc000533cf3, 0x17}, ...}, ...}, ...)
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/cloud-provider-azure.go:79 +0x1f8
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.EnsureControlPlaneInitialized({_, _}, {{0x4223910, 0xc000428190}, {{0xc000d03e30, 0x22}, {0xc000531510, 0x31}, {0xc000533cf3, 0x17}, ...}, ...}, ...)
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:268 +0xb90
    sigs.k8s.io/cluster-api/test/framework/clusterctl.ApplyClusterTemplateAndWait({_, _}, {{0x4223910, 0xc000428190}, {{0xc000d03e30, 0x22}, {0xc000531510, 0x31}, {0xc000533cf3, 0x17}, ...}, ...}, ...)
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.3.1/framework/clusterctl/clusterctl_helpers.go:334 +0xd30
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func1.5.1()
    	/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:303 +0x488
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite] 
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/e2e_suite_test.go:116

@kubernetes-sigs deleted a comment from k8s-ci-robot Jan 10, 2023
@jackfrancis
Contributor

@mboersma fwiw I saw that last week (on another PR)

@CecileRobertMichon
Contributor

that shouldn't be happening anymore since I made the helm install idempotent in #2915, I'll look into it
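
For context, one common way to make a chart install idempotent is "install if absent, upgrade otherwise". The sketch below illustrates that general pattern only; it is not the change made in #2915. The `releaseClient` interface, `ensureRelease`, and `fakeClient` are hypothetical names rather than the Helm SDK, and the fake simply reproduces the "cannot re-use a name that is still in use" error from the log above when a plain install is repeated.

```go
package main

import "errors"

// errNameInUse mirrors the message in the failure log above. The types
// below are hypothetical stand-ins, not the Helm SDK.
var errNameInUse = errors.New("cannot re-use a name that is still in use")

// releaseClient is a hypothetical abstraction over a Helm-like client.
type releaseClient interface {
	Exists(name string) (bool, error)
	Install(name, chart string) error
	Upgrade(name, chart string) error
}

// ensureRelease installs the chart if no release with that name exists and
// upgrades it otherwise, so sequential repeated calls do not fail.
func ensureRelease(c releaseClient, name, chart string) error {
	exists, err := c.Exists(name)
	if err != nil {
		return err
	}
	if exists {
		return c.Upgrade(name, chart)
	}
	return c.Install(name, chart)
}

// fakeClient is an in-memory client used only to exercise ensureRelease.
type fakeClient struct {
	releases map[string]string // release name -> chart
}

func (f *fakeClient) Exists(name string) (bool, error) {
	_, ok := f.releases[name]
	return ok, nil
}

func (f *fakeClient) Install(name, chart string) error {
	if _, ok := f.releases[name]; ok {
		return errNameInUse
	}
	f.releases[name] = chart
	return nil
}

func (f *fakeClient) Upgrade(name, chart string) error {
	if _, ok := f.releases[name]; !ok {
		return errors.New("release not found")
	}
	f.releases[name] = chart
	return nil
}
```

A check-then-install sequence like this can still race if two callers run at the same time, or if an earlier install is still in progress when the check runs, which may be worth ruling out when this error shows up despite an idempotent install path.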

@jackfrancis
Contributor

/retest

@jackfrancis
Contributor

👁️ 🩸

@kubernetes-sigs deleted a comment from k8s-ci-robot Jan 10, 2023
@kubernetes-sigs deleted a comment from k8s-ci-robot Jan 10, 2023
@k8s-ci-robot merged commit 6d4750c into kubernetes-sigs:main Jan 10, 2023
@mboersma deleted the vmss-flex branch January 11, 2023 00:01
Labels
approved, cncf-cla: yes, kind/feature, lgtm, release-note, size/XXL
Development

Successfully merging this pull request may close these issues.

VMSS Flex mode should be validated by webhook
Support VMSS Flex
6 participants