bugfix: resource binding is not created when creating resources and propagation policies at the same time #1199

Merged — 1 commit merged into karmada-io:master on Jan 24, 2022

Conversation

@dddddai (Member) commented on Dec 31, 2021

What type of PR is this?
/kind bug
/kind flake

What this PR does / why we need it:
The root cause is the following process:

[sequence diagram omitted]

We must add the resource to the waiting list before looking up the matched propagation policy, so that the matched resource can always be found when reconciling propagation policies.
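For illustration, the ordering this fix relies on can be sketched as below; the handler name and exact signatures are illustrative, while AddWaiting, RemoveWaiting, LookForMatchedPolicy and ApplyPolicy are the detector operations discussed in this thread.

// Illustrative sketch only; not the actual diff of this PR.
func (d *ResourceDetector) propagateResourceSketch(object *unstructured.Unstructured, objectKey keys.ClusterWideKey) error {
	// Put the key into the waiting list BEFORE looking up a policy, so that a
	// PropagationPolicy being reconciled concurrently can always find this resource.
	d.AddWaiting(objectKey)

	policy, err := d.LookForMatchedPolicy(object, objectKey)
	if err != nil {
		return err
	}
	if policy != nil {
		// A policy matched directly, so the key no longer needs to wait.
		d.RemoveWaiting(objectKey)
		return d.ApplyPolicy(object, objectKey, policy)
	}

	// No policy yet: the key stays in the waiting list and will be picked up
	// when a matching policy is created and reconciled.
	return nil
}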

Which issue(s) this PR fixes:
Fixes #1195

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 31, 2021
@karmada-bot karmada-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 31, 2021
@dddddai dddddai changed the title from "bugfix: resource binding is not created" to "bugfix: resource binding is not created when creating resources and propagation policies at the same time" on Dec 31, 2021
pkg/detector/detector.go (outdated review thread, resolved)
@@ -955,7 +958,6 @@ func (d *ResourceDetector) HandlePropagationPolicyCreation(policy *policyv1alpha
}

for _, key := range matchedKeys {
d.RemoveWaiting(key)
Member:

I didn't understand why this line was deleted.

@dddddai (Member, Author) replied on Jan 4, 2022:

Because the waiting object would be removed when reconciling, as shown above.
Is there any good reason to keep it here?

Member:

Reconciling will add the key again; keeping it here feels fine and is easier to understand.

@dddddai (Member, Author) replied on Jan 4, 2022:

> Reconciling will add the key again; keeping it here feels fine and is easier to understand.

waitingObjects map[keys.ClusterWideKey]struct{}

waitingObjects is a map, so it doesn't matter if an object is added multiple times.

If you mean keeping it makes the flow easier to understand, though, I will revert this part.
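For context, here is a minimal sketch of an idempotent AddWaiting; the waitingLock field and locking details are assumptions for illustration, and the log line is the one quoted later in this thread.

func (d *ResourceDetector) AddWaiting(objectKey keys.ClusterWideKey) {
	d.waitingLock.Lock()
	defer d.waitingLock.Unlock()

	// waitingObjects is a set keyed by ClusterWideKey, so inserting the same key
	// twice simply overwrites the existing empty struct{} value.
	d.waitingObjects[objectKey] = struct{}{}
	klog.V(1).Infof("Add object(%s) to waiting list, length of list is: %d", objectKey.String(), len(d.waitingObjects))
}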

@XiShanYongYe-Chang (Member) commented on Jan 4, 2022:

[diagram omitted]

The current processing logic can cover this scenario.

@dddddai (Member, Author) commented on Jan 4, 2022:

@XiShanYongYe-Chang Sorry... I didn't get the meaning of the picture.

@XiShanYongYe-Chang (Member) replied:

> @XiShanYongYe-Chang Sorry... I didn't get the meaning of the picture.

I didn't draw it well. I'm just trying to explain that the current approach can overwrite the previous error.

@XiShanYongYe-Chang (Member) commented:

/lgtm

/cc @RainbowMango

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2022
@RainbowMango (Member) commented:

Good job @dddddai.
This patch can definitely solve the corner case. The side effect I can see is that the resource template might go in and out of the waiting list many times.

Can you run a test and observe this log line:

klog.V(1).Infof("Add object(%s) to waiting list, length of list is: %d", objectKey.String(), len(d.waitingObjects))

Let's see how many times a resource template is put into the waiting list during a propagation process.

@dddddai (Member, Author) commented on Jan 5, 2022:

Thanks @RainbowMango, here are the logs.
If the resource is created before the propagation policy:

I0105 10:29:38.611012       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.133486       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.177437       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.208280       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.618257       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.633569       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.733210       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:48.428089       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160

The resource was added to the waiting list 8 times.

If the propagation policy is created before the resource:

I0105 10:32:25.143445       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.165382       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.202162       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.576123       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.816949       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.930986       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:26.822290       1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160

The resource was added to the waiting list 7 times.

> The side effect I can see is that the resource template might go in and out of the waiting list many times.

Yes, but this is the simplest way I could find to fix the bug. I'd appreciate any better idea.

@RainbowMango (Member) commented:

I have no idea yet. Since it's a corner case, let's take it slowly and get it right.

@iawia002 any comments?

@iawia002 (Member) commented on Jan 5, 2022:

> I have no idea yet. Since it's a corner case, let's take it slowly and get it right.
>
> @iawia002 any comments?

Actually I have no better idea, but maybe we can add a check before adding keys to the waiting list:

+	if _, ok := d.waitingObjects[objectKey]; ok {
+		return
+	}
	d.waitingObjects[objectKey] = struct{}{}
	klog.V(1).Infof("Add object(%s) to waiting list, length of list is: %d", objectKey.String(), len(d.waitingObjects))

@dddddai (Member, Author) commented on Jan 5, 2022:

@iawia002 Thanks for the suggestion, but in most cases the object is unlikely to already be in the waiting list before adding, and the extra check would make AddWaiting slower.

@XiShanYongYe-Chang (Member) left a review comment:

What's going on with this pull request?

@RainbowMango (Member) commented:

Looking forward to a better solution.
/priority important-soon

@karmada-bot karmada-bot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 10, 2022
@dddddai (Member, Author) commented on Jan 13, 2022:

To be honest I don't see any better fix; maybe we can merge this as a quick fix (since we are about to release 1.1) and update it once there's a better solution.

If the logs are too noisy, we can change them from info to debug.

@RainbowMango RainbowMango added this to the v1.1 milestone Jan 13, 2022
@RainbowMango (Member) commented:

I'm wondering whether we can merge the two queues into one, but I haven't had time to look at it yet.

It's not an urgent bug, so we can hold it for a while. I think it will be included in v1.1.

@RainbowMango (Member) commented:

The problem is simply that two routines (the resource template reconciler and the PP/CPP reconciler) depend on the same mutex-protected waiting list.

So, how about merging the PP/CPP reconciler into the resource template reconciler? Then the waiting list would no longer be shared.
I mean the following reconcilers could be removed:

// policyReconcileWorker maintains a rate limited queue which used to store PropagationPolicy's key and
// a reconcile function to consume the items in queue.
policyReconcileWorker util.AsyncWorker
propagationPolicyLister cache.GenericLister
// clusterPolicyReconcileWorker maintains a rate limited queue which used to store ClusterPropagationPolicy's key and
// a reconcile function to consume the items in queue.
clusterPolicyReconcileWorker util.AsyncWorker
clusterPropagationPolicyLister cache.GenericLister

When reconciling, there will then be two routing paths:

  • For PP/CPP, route to ReconcilePropagationPolicy or ReconcileClusterPropagationPolicy.
  • For resource templates, route to Reconcile.

Handling the EventFilter will be a little more complex.

@dddddai (Member, Author) commented on Jan 21, 2022:

> So, how about merging the PP/CPP reconciler into the resource template reconciler? Then the waiting list would no longer be shared.

Do you mean reconciling both PP/CPPs and resource templates in one goroutine?

@XiShanYongYe-Chang (Member) commented:

> Do you mean reconciling both PP/CPPs and resource templates in one goroutine?

@dddddai I think so. Can you try to modify it? Two test cases failed on the master branch today for this reason.

@RainbowMango (Member) commented:

> Do you mean reconciling both PP/CPPs and resource templates in one goroutine?

Yes, that's the idea.

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 21, 2022
@dddddai dddddai force-pushed the waiting branch 2 times, most recently from e718e2c to 9f83682, on January 22, 2022 01:08
pkg/detector/detector.go — two outdated review threads (resolved)
@dddddai dddddai force-pushed the waiting branch 2 times, most recently from 19694c3 to c11fb12, on January 22, 2022 03:00
@RainbowMango (Member) left a review:

Pretty good now.

@@ -206,6 +191,19 @@ func (d *ResourceDetector) Reconcile(key util.QueueKey) error {
}
klog.Infof("Reconciling object: %s", clusterWideKey)
Member:

Can we move this log line down? I don't want ignored resources to appear in the logs by default.

Comment on lines 194 to 204
if clusterWideKey.Group == policyv1alpha1.GroupName {
switch clusterWideKey.Kind {
case "PropagationPolicy":
return d.ReconcilePropagationPolicy(key)
case "ClusterPropagationPolicy":
return d.ReconcileClusterPropagationPolicy(key)
}
}

if !d.EventFilter(clusterWideKey) {
return nil
}

Member:

The if clusterWideKey.Group == policyv1alpha1.GroupName block would be better moved into the if !d.EventFilter(clusterWideKey) branch, like:

	if !d.EventFilter(clusterWideKey) {
		if clusterWideKey.Group == policyv1alpha1.GroupName {
			switch clusterWideKey.Kind {
			case "PropagationPolicy":
				return d.ReconcilePropagationPolicy(key)
			case "ClusterPropagationPolicy":
				return d.ReconcileClusterPropagationPolicy(key)
			}
		}
		klog.V(4).Infof("Ignored object(%s) from reconciling.", clusterWideKey.String())
		return nil
	}
	klog.Infof("Reconciling object: %s", clusterWideKey)

return false
}

func (d *ResourceDetector) EventFilter(clusterWideKey keys.ClusterWideKey) bool {
Member:

I guess EventFilter should be renamed now, since it no longer handles events but rather checks whether the specified object should be prevented from propagating. How about SkippedFromPropagating?

Remember to update the comments and take care of the function body.

return false
}

func (d *ResourceDetector) SkippedFromPropagating(clusterWideKey keys.ClusterWideKey) bool {
Member:

After renaming the function, the implementation should be updated as well: when a resource is skipped, it should now return true.
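For illustration, with the rename and the inverted return value applied, the routing block suggested above would read roughly like this (a sketch, not the exact merged code):

	if d.SkippedFromPropagating(clusterWideKey) {
		// PP/CPP objects are excluded from propagation themselves, but still
		// need to be routed to their own reconcile functions.
		if clusterWideKey.Group == policyv1alpha1.GroupName {
			switch clusterWideKey.Kind {
			case "PropagationPolicy":
				return d.ReconcilePropagationPolicy(key)
			case "ClusterPropagationPolicy":
				return d.ReconcileClusterPropagationPolicy(key)
			}
		}
		klog.V(4).Infof("Ignored object(%s) from reconciling.", clusterWideKey.String())
		return nil
	}
	klog.Infof("Reconciling object: %s", clusterWideKey)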

Member (Author):

Fixed, thanks for elaborating!

@dddddai dddddai force-pushed the waiting branch 3 times, most recently from 76a23ce to c770cc1, on January 24, 2022 02:13
reconcile PP/CPPs and resource templates in one goroutine

Signed-off-by: dddddai <[email protected]>
@RainbowMango (Member) left a review:

/lgtm
/approve

Right now we just start one worker; this risk still exists if there is more than one worker.

No better solution yet, and this issue has happened several times in our CI, so let's merge this patch first.

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2022
@karmada-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2022
@karmada-bot karmada-bot merged commit e0e7a60 into karmada-io:master Jan 24, 2022
@dddddai (Member, Author) commented on Jan 24, 2022:

> this risk still exists if there is more than one worker

That's exactly what I'm worried about, so honestly I'd lean towards the original solution.

@RainbowMango (Member) commented:

> so honestly I'd lean towards the original solution

...

The original solution (throwing the key into the waiting list) does not solve the problem 100%, but it reduces the probability significantly.

@dddddai (Member, Author) commented on Jan 24, 2022:

Why? Could you please give an example?

Anyway, I'm fine with either solution.

@pigletfly (Contributor) commented:

> so honestly I'd lean towards the original solution
>
> ...
>
> The original solution (throwing the key into the waiting list) does not solve the problem 100%, but it reduces the probability significantly.

We should start a separate goroutine to consume the waiting list: if a resource template and a PP are created at the same time and miss the match (the resource template has been created and gets no further events), that goroutine would keep trying to find the match.

@RainbowMango (Member) commented:

> We should start a separate goroutine to consume the waiting list: if a resource template and a PP are created at the same time and miss the match (the resource template has been created and gets no further events), that goroutine would keep trying to find the match.

I'm also thinking of this approach.
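A minimal sketch of such a consumer goroutine; the waitingLock field, the 30-second interval, and the use of d.Processor as the detector's reconcile queue are assumptions for illustration, not code from this PR.

// retryWaitingResources periodically re-enqueues every key in the waiting list,
// so a resource template that once missed its policy match keeps being retried
// until a matching PP/CPP shows up. Hypothetical helper, for illustration only.
func (d *ResourceDetector) retryWaitingResources(stopCh <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second) // illustrative interval
	defer ticker.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			d.waitingLock.RLock()
			pending := make([]keys.ClusterWideKey, 0, len(d.waitingObjects))
			for key := range d.waitingObjects {
				pending = append(pending, key)
			}
			d.waitingLock.RUnlock()

			// Re-enqueue each waiting key; the reconcile loop looks up policies
			// again and removes the key once a match is applied.
			for _, key := range pending {
				d.Processor.Add(key)
			}
		}
	}
}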

Labels
approved · kind/bug · kind/flake · lgtm · priority/important-soon · size/L
Development

Successfully merging this pull request may close these issues.

Resourcebinding NotFound.
6 participants