Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Use a different object name for each ClusterResultSet test case #4081

Conversation

fabriziopandini
Copy link
Member

@fabriziopandini fabriziopandini commented Jan 15, 2021

What this PR does / why we need it:
Fix ClusterResultSet test flakes by ensuring that each test case uses a different CRS object name (instead of reusing always the same name).

Which issue(s) this PR fixes:
Fixes #4075

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 15, 2021
@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

2 similar comments
@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 15, 2021
@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

4 similar comments
@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini fabriziopandini force-pushed the investigate-clusterresultset-flakes branch from 72a5865 to 4ae0733 Compare January 15, 2021 16:55
@fabriziopandini fabriziopandini changed the title [WIP] 🐛 investigate ClusterResultSet test flakes 🐛 investigate ClusterResultSet test flakes Jan 15, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2021
@fabriziopandini fabriziopandini changed the title 🐛 investigate ClusterResultSet test flakes 🐛 Use different object names for each ClusterResultSet test cases Jan 15, 2021
@fabriziopandini fabriziopandini changed the title 🐛 Use different object names for each ClusterResultSet test cases 🐛 Use a different object name for each ClusterResultSet test case Jan 15, 2021
@fabriziopandini
Copy link
Member Author

@vincepri @jsturtevant @furkatgofurov7 PTAL

With this change, the ClusterResultSet test seems to be consistently passing

@jsturtevant
Copy link
Contributor

jsturtevant commented Jan 15, 2021

I pulled your change into my branch and ran it locally via in goland and still see errors:

Summarizing 1 Failure:

[Fail] ClusterResourceSet Reconciler [It] Should reconcile a ClusterResourceSet when a resource is created that is part of ClusterResourceSet resources 
/home/jstur/projects/cluster-api/exp/addons/controllers/clusterresourceset_controller_test.go:312

Ran 5 of 5 Specs in 16.096 seconds
FAIL! -- 4 Passed | 1 Failed | 0 Pending | 0 Skipped

From my investigation what seems to be happening is the test client doesn't see the configMap right away (my custom logging):

I0115 09:11:02.883895 1738907 clusterresourceset_controller.go:292] controller-runtime/manager/controller/clusterresourceset "msg"="############## can't find the object!" "cluster"="cluster-84jeuv" "name"="test-clusterresourceset" "namespace"="default" "reconciler group"="addons.cluster.x-k8s.io" "reconciler kind"="ClusterResourceSet" "ns"="default" "resource"={"name":"test-configmap3","kind":"ConfigMap"}

and it quietly returns ctrl.Result{} so the controller doesn't reconcile again. As a test, I updated to return and re-run when the ConfigMap isn't found and it passes consistently:

// Continue without adding the error to the aggregate if we can't find the resource.
if apierrors.IsNotFound(err) {
continue
}

to

if apierrors.IsNotFound(err) {
					log.Info("############## can't find the object!", "resource", resource, "ns", cluster.GetNamespace())
					return err
					//continue
				}

Seems like there are two things happening... The test controller doesn't get the value correctly (initially) and second if the the resource isn't there (maybe because of delay or caching) the reconciliation might not happen.

@jsturtevant
Copy link
Contributor

looks like the test failed in prow as well :-(

I saw this comment: // It applies resources best effort and continue on scenarios like: unsupported resource types, failure during creation, missing resources.

Currently if the resource is not found it doesn't re-trigger reconciliation unless something else re-triggers it later. Should we consider adding this as an error and retrying reconciliation with a back-off in cases where the resource is missing?

@fabriziopandini
Copy link
Member Author

@jsturtevant the prow failure is in the machinepool test, and it is tracked in #4068
I will investigate your note above, thanks for the detailed repo

@jsturtevant
Copy link
Contributor

@jsturtevant the prow failure is in the machinepool test, and it is tracked in #4068

I see that now. looked too quick.

These are the rough changes with the logging for debugging and retry: https://github.com/kubernetes-sigs/cluster-api/compare/master...jsturtevant:crs-objects-not-found-retry?expand=1. Seems that this is mainly caused by something in the test client caching (since retries fix it) but it might make the CRS more resilient in the cases that something wasn't on the cluster when the reconciliation happened for what ever reason.

I also found that in this case where the ConfigMap object isn't found the conditions.MarkTrue(clusterResourceSet, addonsv1.ResourcesAppliedCondition) is applied to the cluster which is a bit strange to me.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 18, 2021
@fabriziopandini
Copy link
Member Author

/retest

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

1 similar comment
@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-test-main

@fabriziopandini fabriziopandini force-pushed the investigate-clusterresultset-flakes branch from aa7a57d to c69f3e8 Compare January 18, 2021 12:31
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 18, 2021
@fabriziopandini
Copy link
Member Author

@jsturtevant thanks!
I have incorporated also the wait for the config map object to be created, and now this flake does not happen anymore both in prow and locally for me. If everyone agrees, I suggest to merge these changes and then make another round of optimization after keeping an eye on flakes for a few days

@fabriziopandini
Copy link
Member Author

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Jan 19, 2021
Comment on lines -102 to +106
Name: "test-clusterresourceset",
Name: clusterResourceSetName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use generate name instead?

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/milestone v0.4.0

@fabriziopandini fabriziopandini force-pushed the investigate-clusterresultset-flakes branch from c69f3e8 to 90dfe19 Compare January 19, 2021 15:59
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2021
@k8s-ci-robot k8s-ci-robot merged commit e851972 into kubernetes-sigs:master Jan 22, 2021
@fabriziopandini fabriziopandini deleted the investigate-clusterresultset-flakes branch February 10, 2021 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flakes in the ClusterResourceSet unit tests
4 participants