Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable #10838

Open
Sunnatillo opened this issue Jul 8, 2024 · 18 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Sunnatillo
Contributor

Sunnatillo commented Jul 8, 2024

Which jobs are flaking?

capi-e2e-main

Which tests are flaking?

When upgrading a workload cluster using ClusterClass with RuntimeSDK [ClusterClass] [It] Should create, upgrade and delete a workload cluster
/home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade_runtimesdk.go:155

Testgrid link

Edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

  [FAILED] Failed after 63.517s.
  Resource versions didn't stay stable
  The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:53 with:
  Expected object to be comparable, diff:   map[string]string{
    	... // 11 identical entries
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-r5yi9k":                                              "38350",
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38404",
  - 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "39165",
  + 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "38721",
    	"DockerMachinePoolTemplate/k8s-upgrade-with-runtimesdk-05ptjc/quick-start-default-worker-machinepooltemplate": "29519",
    	"DockerMachineTemplate/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-0-pgr5r":      "30876",
    	... // 16 identical entries
    	"Machine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38573",
    	"MachineDeployment/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9": "38854",
  - 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "39168",
  + 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "38728",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-h754k":  "38853",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-tg2vx":  "38777",
    	... // 9 identical entries
    }
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:54 @ 06/27/24 04:19:26.795

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 8, 2024
@adilGhaffarDev
Contributor

@Sunnatillo the link points to a different failure.

@Sunnatillo
Contributor Author

@fabriziopandini fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 17, 2024
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024
@fabriziopandini
Member

/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 17, 2024
@Sunnatillo Sunnatillo changed the title Upgrading a workload cluster using ClusterClass with RuntimeSDK test flaking quite a lot with error: Resource versions didn't stay stable Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable Jul 22, 2024
@willie-yao
Contributor

/assign

@willie-yao
Contributor

willie-yao commented Jul 24, 2024

I noticed that nodeVolumeDetachTimeout and minReadySeconds weren't added to the machineDeployment spec for runtimesdk in #9393, so I'm going to update that and see if the flake still happens.

@sbueringer
Member

Fine to add, though I don't think it will affect the results.

@chrischdi
Member

Query to find the latest failures

@sbueringer
Member

Improvement to make CAPD DockerMachinePools more deterministic: #10998

(I wouldn't expect it to solve the whole flake though)

@sbueringer
Member

sbueringer commented Aug 13, 2024

The CAPD flake seems to be gone now.

We only have a relatively rare flake with KCP left: https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.*e2e.*main&xjob=.*-provider-.*

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1822127420073840640
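The triage link above packs its filters into percent-encoded query parameters. A small sketch that decodes them with Go's standard `net/url` package makes the query easier to read and adapt. The helper name `triageQuery` is hypothetical, and reading `job` as an include pattern and `xjob` as an exclude pattern is an assumption based on the parameter names, not something stated in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// triageQuery is a hypothetical helper that parses a URL and returns its
// decoded query parameters.
func triageQuery(raw string) (url.Values, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return u.Query(), nil
}

func main() {
	// URL copied verbatim from the comment above.
	const raw = "https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.*e2e.*main&xjob=.*-provider-.*"

	q, err := triageQuery(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Print the parameters in the order they appear in the link.
	for _, key := range []string{"text", "job", "xjob"} {
		fmt.Printf("%s = %q\n", key, q.Get(key))
	}
}
```

Changing the `text` value to another failure message, such as the one in the issue title, should give an equivalent search for a different flake.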

@willie-yao
Contributor

The CAPD flake seems to be gone now.

I'll unassign myself for now, but if this flake persists, I can take another look when I have time.

/unassign

@sbueringer
Member

sbueringer commented Sep 12, 2024

The MachinePool flake (#11162) is a lot more frequent/problematic.

@sivchari
Member

I'll investigate it.
/assign

@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 11, 2024
@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 11, 2024
@sivchari
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 11, 2024
@cprivitere
Member

Just a note that the last occurrence of this seems to have been on 11/14/2024: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-latestk8s-main/1857118214752833536

@sivchari
Member

I don't currently have time to work on this.
/unassign

@chrischdi
Member
