Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable #10838

Sunnatillo · 2024-07-08T08:36:12Z

Which jobs are flaking?

capi-e2e-main

Which tests are flaking?

When upgrading a workload cluster using ClusterClass with RuntimeSDK [ClusterClass] [It] Should create, upgrade and delete a workload cluster
/home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade_runtimesdk.go:155

Testgrid link

Edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

  [FAILED] Failed after 63.517s.
  Resource versions didn't stay stable
  The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:53 with:
  Expected object to be comparable, diff:   map[string]string{
    	... // 11 identical entries
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-r5yi9k":                                              "38350",
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38404",
  - 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "39165",
  + 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "38721",
    	"DockerMachinePoolTemplate/k8s-upgrade-with-runtimesdk-05ptjc/quick-start-default-worker-machinepooltemplate": "29519",
    	"DockerMachineTemplate/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-0-pgr5r":      "30876",
    	... // 16 identical entries
    	"Machine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38573",
    	"MachineDeployment/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9": "38854",
  - 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "39168",
  + 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "38728",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-h754k":  "38853",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-tg2vx":  "38777",
    	... // 9 identical entries
    }
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:54 @ 06/27/24 04:19:26.795
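
For readers unfamiliar with this kind of failure: the diff above is a map of "Kind/namespace/name" keys to resourceVersion strings, and the check fails when any value differs between two snapshots. The following is a minimal Go sketch of that comparison, not the code in resourceversion_helpers.go. The function name `changedResourceVersions` and the shortened keys are hypothetical; the values are borrowed from the log for illustration only, and which snapshot came first is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// changedResourceVersions is a hypothetical helper. It returns, in sorted
// order, every key whose resourceVersion differs between the two snapshots,
// plus every key that exists in only one of them.
func changedResourceVersions(before, after map[string]string) []string {
	var changed []string
	for key, oldRV := range before {
		if newRV, ok := after[key]; !ok || newRV != oldRV {
			changed = append(changed, key)
		}
	}
	for key := range after {
		if _, ok := before[key]; !ok {
			changed = append(changed, key)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	// Shortened, illustrative keys; the ordering of the two snapshots is assumed.
	before := map[string]string{
		"DockerMachinePool/ns/mp-0": "38721",
		"MachinePool/ns/mp-0":       "38728",
		"Machine/ns/worker":         "38573",
	}
	after := map[string]string{
		"DockerMachinePool/ns/mp-0": "39165",
		"MachinePool/ns/mp-0":       "39168",
		"Machine/ns/worker":         "38573",
	}
	for _, key := range changedResourceVersions(before, after) {
		fmt.Printf("%s: %s -> %s\n", key, before[key], after[key])
	}
}
```

In this sketch only the two MachinePool-related keys are reported, which mirrors the two changed entries in the diff above.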

adilGhaffarDev · 2024-07-09T07:18:51Z

@Sunnatillo the link is pointing to a different failure.

Sunnatillo · 2024-07-09T09:37:54Z

I updated it with the correct link.
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568

fabriziopandini · 2024-07-17T12:52:50Z

/help

k8s-ci-robot · 2024-07-17T12:52:52Z

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

willie-yao · 2024-07-24T20:31:00Z

/assign

willie-yao · 2024-07-24T20:54:09Z

I noticed that nodeVolumeDetachTimeout and minReadySeconds weren't added to the MachineDeployment spec for runtimesdk in #9393, so I'm gonna update that and see if the flake still happens.

sbueringer · 2024-07-25T07:15:22Z

Fine to add, I don't think it will affect the results though

chrischdi · 2024-07-31T17:35:02Z

Query to find the latest failures

sbueringer · 2024-08-02T13:30:07Z

Improvement to make CAPD DockerMachinePools more deterministic: #10998

(I wouldn't expect it to solve the whole flake though)
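
A minimal Go sketch of the general idea behind a deterministic ID list, assuming the list is built from a map: Go randomizes map iteration order, so an unsorted list can differ between reconciles even when the contents are identical, and writing it back would then bump the object's resourceVersion. This is not the change made in #10998; the function name `sortedProviderIDs` and the `example://` IDs are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// sortedProviderIDs is a hypothetical helper. It collects the provider IDs
// from a machine-name -> provider-ID map and sorts them, so the result does
// not depend on Go's randomized map iteration order.
func sortedProviderIDs(machines map[string]string) []string {
	ids := make([]string, 0, len(machines))
	for _, id := range machines {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Illustrative data only; real provider IDs look different.
	machines := map[string]string{
		"worker-b": "example://worker-b",
		"worker-a": "example://worker-a",
		"worker-c": "example://worker-c",
	}
	first := sortedProviderIDs(machines)
	second := sortedProviderIDs(machines)
	// Both calls yield the same slice, so comparing the old and new list
	// would show no change and no update would be needed.
	fmt.Println(reflect.DeepEqual(first, second)) // prints true
}
```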

sbueringer · 2024-08-13T13:15:44Z

The CAPD flake seems to be gone now.

We only have a relatively rare flake with KCP left: https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.*e2e.*main&xjob=.*-provider-.*

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1822127420073840640

willie-yao · 2024-09-11T18:24:11Z

The CAPD flake seems to be gone now.

Will unassign myself for now but if this flake is persistent, I can take another look when I have time.

/unassign

sbueringer · 2024-09-12T05:18:00Z

The MachinePool flake (#11162) is a lot more frequent/problematic

sivchari · 2024-09-12T06:13:47Z

I'll investigate it.
/assign

k8s-triage-robot · 2024-12-11T09:34:01Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

sivchari · 2024-12-11T10:47:56Z

/triage accepted

cprivitere · 2025-01-14T23:09:19Z

Just a note that the last occurrence of this seems to have been on 11/14/2024: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-latestk8s-main/1857118214752833536

sivchari · 2025-01-15T01:55:28Z

Currently, I don't have time to work on this.
/unassign

chrischdi · 2025-01-15T08:51:50Z

I think this still seems to happen (although the message changed):

https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn%27t%20stay%20stable&job=.*-cluster-api-.*&test=When%20upgrading%20a%20workload%20cluster%20using%20ClusterClass%20with%20RuntimeSDK&xjob=.*-provider-.*%7C.*-cluster-api-operator-.*

k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 8, 2024

fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 17, 2024

k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 17, 2024

fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 17, 2024

k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 17, 2024

k8s-ci-robot assigned willie-yao Jul 24, 2024

willie-yao mentioned this issue Jul 24, 2024

🌱 Add nodeVolumeDetachTimeout & minReadySeconds for MD to RuntimeSDK e2e test template #10933

Merged

sbueringer mentioned this issue Aug 2, 2024

🐛 Ensure DockerMachinePool providerIDList is deterministic #10998

Merged

Sunnatillo added this to CAPI v1.9 release improvement tasks Aug 28, 2024

k8s-ci-robot unassigned willie-yao Sep 11, 2024

k8s-ci-robot assigned sivchari Sep 12, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 11, 2024

k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 11, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 11, 2024

k8s-ci-robot unassigned sivchari Jan 15, 2025
