Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAPI test cannot get correct number of machines when upgrading from v1.5.0 to v1.5.1 #9411

Closed
Sunnatillo opened this issue Sep 13, 2023 · 12 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Sunnatillo
Copy link
Contributor

Sunnatillo commented Sep 13, 2023

What steps did you take and what happened?

We utilize CAPI test framework for upgrading providers. When running upgrade test from CAPI v1.5.0 to v1.5.1, following code cannot get correct number of ready machines.

		By("Waiting for the machines to exist")
		Eventually(func() (int64, error) {
			var n int64
			machineList := &clusterv1alpha3.MachineList{}
			if err := managementClusterProxy.GetClient().List(ctx, machineList, client.InNamespace(testNamespace.Name), client.MatchingLabels{clusterv1.ClusterNameLabel: workLoadClusterName}); err == nil {
				for _, machine := range machineList.Items {
					if machine.Status.NodeRef != nil {
						n++
					}
				}
			}
			return n, nil
		}, input.E2EConfig.GetIntervals(specName, "wait-worker-nodes")...).Should(Equal(*controlPlaneMachineCount+*workerMachineCount), "Timed out waiting for all machines to be exist")

We checked the machines are ready. But the following error occured:

  [FAILED] Timed out after 1800.000s.
  Timed out waiting for all machines to be exist
  Expected
      <int64>: 0
  to equal
      <int64>: 2
  In [It] at: /home/****/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/e2e/clusterctl_upgrade.go:402 @ 09/13/23 02:59:32.879

What did you expect to happen?

Provider upgrade tests to pass.

Cluster API version

CAPI v1.5.0

Kubernetes version

v1.28.1

Anything else you would like to add?

Link to the logs of the test:
https://jenkins.nordix.org/job/metal3_capm3_from-release-1-5_main_e2e_clusterctl_upgrade_test_ubuntu/2/consoleFull

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 13, 2023
@Sunnatillo Sunnatillo changed the title CAPI test cannot get correct number of machines when upgrading CAPI v1.5.0 to v1.5.1 CAPI test cannot get correct number of machines when upgrading from v1.5.0 to v1.5.1 Sep 13, 2023
@killianmuldoon
Copy link
Contributor

killianmuldoon commented Sep 13, 2023

Can you try to run the test with a more up to date version of the Cluster API go types? As of v1.5 v1alpha3 is no longer served by CAPI.

i.e. change
machineList := &clusterv1alpha3.MachineList{} to machineList := &clusterv1beta1.MachineList{}

@mdbooth
Copy link
Contributor

mdbooth commented Sep 13, 2023

I believe this may be the same issue I hit, discussed in slack here: https://kubernetes.slack.com/archives/C8TSNPY4T/p1694103458989449

The TL;DR is that the test explicitly uses the v1alpha3 API, which is no longer served by 1.5.x. This requires the upgrade to be from prior to 1.5.x. We decided to pin to 1.4.x in CAPO, but the issue is fixed in 1.6, and @sbueringer mentioned he may be able to backport the fix.

@mdbooth
Copy link
Contributor

mdbooth commented Sep 13, 2023

Looks like it was specifically fixed in 5c4ce8b, which is unfortunately a mega-commit and can't be cherry-picked. Should be able to pull out the relevant parts, though.

@killianmuldoon
Copy link
Contributor

I think it would be a good idea to backport that specific fix to the release-1.5 branch.

/help
/triage accepted

@k8s-ci-robot
Copy link
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

I think it would be a good idea to backport that specific fix to the release-1.5 branch.

/help
/triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 13, 2023
@killianmuldoon
Copy link
Contributor

/assign @mdbooth

@furkatgofurov7
Copy link
Member

Thanks for reporting, we have not thought about this when working in #9097 and did not hit in core CAPI so it was not backported.

@Sunnatillo
Copy link
Contributor Author

Thank you for quick reaction @killianmuldoon, @furkatgofurov7, @mdbooth.

@furkatgofurov7
Copy link
Member

@Sunnatillo the fix #9412 is merged now and should be available in the coming v1.5.2 patch release after <~2 weeks.

@Sunnatillo
Copy link
Contributor Author

Thank you for update @furkatgofurov7.
I will keep this open until the release happens for clarity if other face the issue.

@sbueringer
Copy link
Member

Would be good if you can confirm in the meantime if the fix works for you

@Sunnatillo
Copy link
Contributor Author

/close by #9412
Issue has been solved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants