Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix delete for VMSS flex #3256

Merged
merged 1 commit into from
Mar 16, 2023
Merged

Conversation

mboersma
Copy link
Contributor

@mboersma mboersma commented Mar 10, 2023

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

Fixes a bug parsing a VMSS Flex resourceID that led to problems scaling in a MachinePool. Also adds unit test cases that will fail if the code reverts and enables scaling down in the relevant e2e test.

Which issue(s) this PR fixes:

Fixes #3077

Special notes for your reviewer:

All credit goes to @CecileRobertMichon for tracking down the root cause of this issue. 🥇

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Fix delete for VMSS flex

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 10, 2023
@CecileRobertMichon
Copy link
Contributor

/test pull-cluster-api-provider-azure-e2e-optional

@CecileRobertMichon
Copy link
Contributor

/cherry-pick release-1.8
/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@CecileRobertMichon: once the present PR merges, I will cherry-pick it on top of release-1.8 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.8
/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon
Copy link
Contributor

/retest

the flex MP failed to provision

@CecileRobertMichon
Copy link
Contributor

/retest

e2e should be fixed now

@mboersma
Copy link
Contributor Author

e2e should be fixed now

Sigh.

@CecileRobertMichon
Copy link
Contributor

/retest

this time for real

@jackfrancis
Copy link
Contributor

flex job failed (hasn't reported), but you can start investigating now @mboersma to see if this indeed didn't fix the issue:

@mboersma
Copy link
Contributor Author

Timed out waiting for 0 ready replicas

Um wut? I don't think that's part of the intended test, will fix.

@CecileRobertMichon
Copy link
Contributor

it's weird because the flex e2e test passed on my WIP PR for the same fix #3249

So either something changed since then or the autorest parse function isn't working as we expected

@mboersma
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@mboersma
Copy link
Contributor Author

mboersma commented Mar 15, 2023

I've run this flex e2e spec locally, and tested scale in to 0 on Windows clusters manually through Tilt several times, for both Uniform and Flexible, and I don't see a problem.

I think the issue was this PR referenced k8s v1.26.2 which has some Windows cloud-init issues in our reference image that we're working to fix (see kubernetes-sigs/image-builder#1109). I moved it back to v1.26.1.

@mboersma
Copy link
Contributor Author

mboersma commented Mar 15, 2023

The -optional test failed on clusterclass and private cluster :-( but did pass the spec this PR is concerned with:

  checking AzureMachinePool capz-e2e-f78qhy-flex-mp-0 in Flexible orchestration mode @ 03/15/23 20:33:25.945
  checking AzureMachinePool capz-e2e-f78qhy-flex-mp-win in Flexible orchestration mode @ 03/15/23 20:33:25.981
  Scaling machine pool capz-e2e-f78qhy-flex-mp-0 out from 1 to 2 @ 03/15/23 20:33:26.018
  Scaling machine pool capz-e2e-f78qhy-flex-mp-win out from 0 to 1 @ 03/15/23 20:33:26.018
  INFO: Patching the replica count in Machine Pool capz-e2e-f78qhy/capz-e2e-f78qhy-flex-mp-win
  INFO: Patching the replica count in Machine Pool capz-e2e-f78qhy/capz-e2e-f78qhy-flex-mp-0
  Waiting for the machine pool workload nodes @ 03/15/23 20:33:26.089
  Waiting for the machine pool workload nodes @ 03/15/23 20:33:26.093
  Scaling machine pool capz-e2e-f78qhy-flex-mp-0 in from 2 to 1 @ 03/15/23 20:37:46.385
  Scaling machine pool capz-e2e-f78qhy-flex-mp-win in from 1 to 0 @ 03/15/23 20:37:46.386

@mboersma
Copy link
Contributor Author

/retest

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@mboersma
Copy link
Contributor Author

The -optional job is still failing on the clusterclass spec; I'll dig in and see if there's a fix.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 7282a20cf4c44aad31cb9a3711d68c387df6d30c

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 16, 2023
@CecileRobertMichon
Copy link
Contributor

/retest

@CecileRobertMichon
Copy link
Contributor

/retest

do we have an open flaky test issue for ClusterClass?

@mboersma
Copy link
Contributor Author

mboersma commented Mar 16, 2023

do we have an open flaky test issue for ClusterClass?

I don't think so, but I'm repro'ing and investigating now and I'll add one.

Edit: See #3313

@k8s-ci-robot
Copy link
Contributor

@mboersma: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-e2e-optional ae52156 link false /test pull-cluster-api-provider-azure-e2e-optional
pull-cluster-api-provider-azure-e2e ae52156 link unknown /test pull-cluster-api-provider-azure-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@CecileRobertMichon
Copy link
Contributor

@mboersma I think I have a pretty decent idea of root cause (see comments in #3313) and there are exact repros of the same error in main (https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-main/1636248867903115264/artifacts/clusters/bootstrap/resources/capz-e2e-851wjq/KubeadmControlPlane/capz-e2e-851wjq-cc-rtlzw.yaml) so I'm tempted to override the optional test if the next run fails again

/retest

@k8s-ci-robot k8s-ci-robot merged commit 739a7c2 into kubernetes-sigs:main Mar 16, 2023
@k8s-infra-cherrypick-robot

@CecileRobertMichon: new pull request created: #3314

In response to this:

/cherry-pick release-1.8
/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@CecileRobertMichon: new pull request created: #3315

In response to this:

/cherry-pick release-1.8
/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mboersma mboersma deleted the fix-scale-in branch March 16, 2023 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CAPZ fails to scale in K8s VMSS Flex cluster
5 participants