bugfix: resource binding is not created when creating resources and propagation policies at the same time #1199
pkg/detector/detector.go
Outdated
@@ -955,7 +958,6 @@ func (d *ResourceDetector) HandlePropagationPolicyCreation(policy *policyv1alpha
	}

	for _, key := range matchedKeys {
		d.RemoveWaiting(key)
I didn't understand why this line was deleted.
Because the waiting object is already removed during reconciliation, as shown above.
Is there any good reason to keep it here?
Reconciling will add the key again, keeping it here feels fine and easier to understand.
Reconciling will add the key again, keeping it here feels fine and easier to understand.
karmada/pkg/detector/detector.go
Line 74 in 7224234
waitingObjects map[keys.ClusterWideKey]struct{}
The waitingObjects is a map, so it won't matter if an object is added multiple times.
If you mean it's for better understanding, I will revert it though.
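To illustrate the point about the map, here is a minimal sketch of a map used as a set. `ClusterWideKey` below is a simplified, hypothetical stand-in for `keys.ClusterWideKey`, and `waitingSet` and `Add` are names invented for this example, not part of the Karmada code.

```go
package main

import "fmt"

// ClusterWideKey is a simplified, hypothetical stand-in for keys.ClusterWideKey.
type ClusterWideKey struct {
	Group, Version, Kind, Namespace, Name string
}

// waitingSet has the same shape as the waitingObjects field: a map used as a set.
type waitingSet map[ClusterWideKey]struct{}

// Add inserts key into the set. Inserting a key that is already present
// overwrites the empty value and leaves the set unchanged.
func (w waitingSet) Add(key ClusterWideKey) {
	w[key] = struct{}{}
}

func main() {
	w := waitingSet{}
	key := ClusterWideKey{Group: "apps", Version: "v1", Kind: "Deployment", Namespace: "default", Name: "nginx"}
	for i := 0; i < 8; i++ {
		w.Add(key)
	}
	fmt.Println(len(w)) // prints 1
}
```

Adding the same key repeatedly costs a map write each time but never grows the set.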
@XiShanYongYe-Chang Sorry... I didn't get the meaning of the picture
I didn't draw it well. I'm just trying to explain that the current way can overwrite the previous error.
/lgtm
/cc @RainbowMango
Good job @dddddai. Can you run a test to observe the logs:
karmada/pkg/detector/detector.go Line 707 in 8b3aefd
Let's see how many times a resource template would be put in
Thanks @RainbowMango, here's the log:
I0105 10:29:38.611012 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.133486 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.177437 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.208280 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.618257 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.633569 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:46.733210 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:29:48.428089 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
The resource was added to the waiting list 8 times.
If creating the propagation policy before the resource:
I0105 10:32:25.143445 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.165382 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.202162 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.576123 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.816949 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:25.930986 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
I0105 10:32:26.822290 1 detector.go:710] Add object(apps/v1, kind=Deployment, default/nginx) to waiting list, length of list is: 160
The resource was added to the waiting list 7 times.
Yes, but this is the simplest way I could figure out to fix the bug. I'd appreciate it if there is any better idea.
I have no idea yet. Since it's a corner case, let's do it slowly and correctly. @iawia002, any comments?
Actually I have no better idea. Maybe we can add a check before adding keys to the waiting list:
+	if _, ok := d.waitingObjects[objectKey]; ok {
+		return
+	}
	d.waitingObjects[objectKey] = struct{}{}
	klog.V(1).Infof("Add object(%s) to waiting list, length of list is: %d", objectKey.String(), len(d.waitingObjects))
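As a rough, standalone sketch of the suggested check: the version below returns whether the key was newly added, so a caller could log only on the first insertion. The `detector` type, the `objectKey` string type, and the boolean return value are assumptions made for this example and are not taken from the Karmada code.

```go
package main

import (
	"fmt"
	"sync"
)

// objectKey is a hypothetical simplified key (the real code uses keys.ClusterWideKey).
type objectKey string

// detector is a hypothetical, stripped-down holder of the waiting list.
type detector struct {
	mu             sync.Mutex
	waitingObjects map[objectKey]struct{}
}

// AddWaiting adds key to the waiting list and reports whether it was newly added.
// If the key is already waiting, it returns early without touching the map.
func (d *detector) AddWaiting(key objectKey) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.waitingObjects[key]; ok {
		return false
	}
	d.waitingObjects[key] = struct{}{}
	return true
}

func main() {
	d := &detector{waitingObjects: map[objectKey]struct{}{}}
	logged := 0
	for i := 0; i < 8; i++ {
		if d.AddWaiting("default/nginx") {
			logged++ // the "Add object ... to waiting list" log would go here
		}
	}
	fmt.Println(logged, len(d.waitingObjects)) // prints 1 1
}
```

The check only reduces repeated log lines when the key is still in the list at the time of the next add; if the key is removed between attempts, each add is a fresh insertion.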
@iawia002, thanks for your suggestion, but in most cases objects are unlikely to already be in the waiting list before adding, and the judgment can make
What's going on with this pull request?
Looking forward to a better solution.
To be honest, I don't see any better fixes. Maybe we can merge it as a quick fix (when we are going to release 1.1) and update this once there's a better solution. If the logs are too noisy, we may change them from
I'm wondering whether we can merge the two queues into one, but I haven't had time to look at it yet. It's not an urgent bug, so we can hold for a while. I think it will be included in v1.1.
This problem is simply that two routines(
So, how about merging
karmada/pkg/detector/detector.go Lines 57 to 65 in e3c8d39
When reconciling, there will be two routing paths:
It will be a little bit complex when dealing with the EventFilter.
Do you mean reconciling both PP/CPPs and resource templates in one goroutine?
@dddddai I think it should be. Can you try to modify it? Two cases failed on the master branch today for this reason.
Yes, that's the idea.
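The idea of reconciling policies and resource templates in one goroutine can be sketched roughly as below. This is an illustration only: `queueKey`, `route`, and the `ReconcileResourceTemplate` label are hypothetical, a buffered channel stands in for the real work queue, and `policyGroup` is assumed to hold the value of `policyv1alpha1.GroupName`.

```go
package main

import "fmt"

// policyGroup is assumed to be the value of policyv1alpha1.GroupName.
const policyGroup = "policy.karmada.io"

// queueKey is a hypothetical simplified key placed on the shared queue.
type queueKey struct {
	Group, Kind, Namespace, Name string
}

// route decides which reconcile path a key takes once it is dequeued.
func route(k queueKey) string {
	if k.Group == policyGroup {
		switch k.Kind {
		case "PropagationPolicy":
			return "ReconcilePropagationPolicy"
		case "ClusterPropagationPolicy":
			return "ReconcileClusterPropagationPolicy"
		}
	}
	// Hypothetical label for the resource template path.
	return "ReconcileResourceTemplate"
}

func main() {
	// A buffered channel stands in for the single shared work queue.
	queue := make(chan queueKey, 2)
	queue <- queueKey{Group: "apps", Kind: "Deployment", Namespace: "default", Name: "nginx"}
	queue <- queueKey{Group: policyGroup, Kind: "PropagationPolicy", Namespace: "default", Name: "nginx-pp"}
	close(queue)

	// A single consumer drains the queue, so the two paths never run concurrently.
	for k := range queue {
		fmt.Printf("%s/%s -> %s\n", k.Namespace, k.Name, route(k))
	}
}
```

With exactly one consumer the two paths are serialized; with several workers on the same queue they could interleave again.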
e718e2c to 9f83682
19694c3 to c11fb12
Pretty good now.
pkg/detector/detector.go
Outdated
@@ -206,6 +191,19 @@ func (d *ResourceDetector) Reconcile(key util.QueueKey) error {
	}
	klog.Infof("Reconciling object: %s", clusterWideKey)
Can we move the log down? I don't want the ignored resources present in the logs by default.
pkg/detector/detector.go
Outdated
if clusterWideKey.Group == policyv1alpha1.GroupName {
	switch clusterWideKey.Kind {
	case "PropagationPolicy":
		return d.ReconcilePropagationPolicy(key)
	case "ClusterPropagationPolicy":
		return d.ReconcileClusterPropagationPolicy(key)
	}
}

if !d.EventFilter(clusterWideKey) {
	return nil
}
The if clusterWideKey.Group == policyv1alpha1.GroupName block would be better moved into the if !d.EventFilter(clusterWideKey) block, like:
if !d.EventFilter(clusterWideKey) {
	if clusterWideKey.Group == policyv1alpha1.GroupName {
		switch clusterWideKey.Kind {
		case "PropagationPolicy":
			return d.ReconcilePropagationPolicy(key)
		case "ClusterPropagationPolicy":
			return d.ReconcileClusterPropagationPolicy(key)
		}
	}
	klog.V(4).Infof("Ignored object(%s) from reconciling.", clusterWideKey.String())
	return nil
}
klog.Infof("Reconciling object: %s", clusterWideKey)
pkg/detector/detector.go
Outdated
	return false
}

func (d *ResourceDetector) EventFilter(clusterWideKey keys.ClusterWideKey) bool {
I guess the EventFilter should be renamed now, as it is not handling the event anymore, but checks whether the specified object should be prevented from propagating. So, how about SkippedFromPropagating?
Remember to update the comments and take care of the function body.
	return false
}

func (d *ResourceDetector) SkippedFromPropagating(clusterWideKey keys.ClusterWideKey) bool {
After renaming the function, the implementation should be updated as well: when a resource is skipped, it should return true now.
Fixed, thanks for elaborating!
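The rename discussed above flips the meaning of the return value, which is easy to get wrong. The sketch below contrasts the two conventions. The `kube-` namespace rule and the lowercase function names are hypothetical and only serve the illustration; they are not the rules the real function applies.

```go
package main

import (
	"fmt"
	"strings"
)

// key is a hypothetical simplified object key.
type key struct {
	Namespace, Name string
}

// eventFilter follows the old convention: true means "handle this object".
func eventFilter(k key) bool {
	// Hypothetical rule for illustration: ignore objects in kube-* namespaces.
	return !strings.HasPrefix(k.Namespace, "kube-")
}

// skippedFromPropagating follows the new convention: true means "skip this object".
func skippedFromPropagating(k key) bool {
	return strings.HasPrefix(k.Namespace, "kube-")
}

func main() {
	objects := []key{
		{Namespace: "default", Name: "nginx"},
		{Namespace: "kube-system", Name: "coredns"},
	}
	for _, k := range objects {
		// For every object the two predicates must disagree.
		fmt.Println(k.Name, eventFilter(k), skippedFromPropagating(k))
	}
	// prints:
	// nginx true false
	// coredns false true
}
```

Call sites change accordingly, from `if !d.EventFilter(key)` to `if d.SkippedFromPropagating(key)`.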
76a23ce to c770cc1
reconcile PP/CPPs and resource templates in one goroutine
Signed-off-by: dddddai <[email protected]>
/lgtm
/approve
We currently start just one worker; the risk still exists if there is more than one worker.
There is no better solution yet, and this issue has happened several times in our CI, so I'm merging this patch first.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
That's exactly what I'm worried about, so honestly I'd lean towards the original solution.
... The original solution (throw into waiting list) does not 100% solve the problem but reduces the probability significantly.
Why? Could you please give an example? Anyway, I'm fine with both solutions.
We should start a separate goroutine to consume the waiting list. If a resource template and a PP are created at the same time and miss the match (the resource template is created and has no more events), the goroutine keeps looking for the match.
I'm also thinking of this approach.
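A separate goroutine that keeps retrying the waiting list could look roughly like the sketch below. Everything here is hypothetical (`waitingList`, `retryOnce`, `run`, the string keys, and the `match` callback); it only shows the shape of a periodic consumer and is not the change merged in this PR.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waitingList is a hypothetical, mutex-guarded set of keys waiting for a policy.
type waitingList struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

// Add parks a key in the waiting list.
func (w *waitingList) Add(key string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.keys[key] = struct{}{}
}

// Len reports how many keys are still waiting.
func (w *waitingList) Len() int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return len(w.keys)
}

// retryOnce makes one pass over the list, removes every key for which match
// returns true, and reports how many keys were matched. The lock is held
// while match runs, which keeps the sketch simple.
func (w *waitingList) retryOnce(match func(string) bool) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	matched := 0
	for key := range w.keys {
		if match(key) {
			delete(w.keys, key) // deleting during range is allowed in Go
			matched++
		}
	}
	return matched
}

// run calls retryOnce on every tick until ctx is cancelled.
func (w *waitingList) run(ctx context.Context, interval time.Duration, match func(string) bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			w.retryOnce(match)
		}
	}
}

func main() {
	w := &waitingList{keys: map[string]struct{}{}}
	w.Add("default/nginx")

	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	// Hypothetical matcher that pretends a policy now exists for every key.
	go w.run(ctx, 10*time.Millisecond, func(string) bool { return true })

	<-ctx.Done()
	fmt.Println("still waiting:", w.Len())
}
```

The trade-off is a polling interval: a resource that missed its match is picked up on a later tick rather than depending on a further event.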
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
The root cause is the following process:
We must add the resource to the waiting list before looking up the matched propagation policy, so that the matched resource can always be found when reconciling propagation policies.
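The ordering argument can be checked with a small single-threaded model. The sketch below replays one interleaving that is consistent with the description above; the `store` type and its methods are hypothetical and greatly simplified compared with the real detector.

```go
package main

import "fmt"

// store is a hypothetical, single-threaded model of the detector's state.
type store struct {
	policies map[string]bool     // resource name -> a matching policy exists
	waiting  map[string]struct{} // resources waiting for a policy
	bound    map[string]bool     // resources that received a binding
}

func newStore() *store {
	return &store{
		policies: map[string]bool{},
		waiting:  map[string]struct{}{},
		bound:    map[string]bool{},
	}
}

// addWaiting parks the resource in the waiting list.
func (s *store) addWaiting(name string) { s.waiting[name] = struct{}{} }

// lookup is the resource-side step: bind if a matching policy already exists.
func (s *store) lookup(name string) {
	if s.policies[name] {
		delete(s.waiting, name)
		s.bound[name] = true
	}
}

// createPolicy is the policy-side step: record the policy, then bind the
// resource if it is currently in the waiting list.
func (s *store) createPolicy(name string) {
	s.policies[name] = true
	if _, ok := s.waiting[name]; ok {
		delete(s.waiting, name)
		s.bound[name] = true
	}
}

// lookupFirst replays the problematic order: the policy is created between
// the resource's lookup and its insertion into the waiting list.
func lookupFirst() bool {
	s := newStore()
	s.lookup("nginx")       // no policy yet, nothing happens
	s.createPolicy("nginx") // waiting list is still empty, nothing to bind
	s.addWaiting("nginx")   // parked too late, no further event arrives
	return s.bound["nginx"]
}

// addFirst replays the same interleaving with the order described above.
func addFirst() bool {
	s := newStore()
	s.addWaiting("nginx")   // parked before the lookup
	s.createPolicy("nginx") // finds the resource in the waiting list
	s.lookup("nginx")       // the policy exists as well, still bound
	return s.bound["nginx"]
}

func main() {
	fmt.Println(lookupFirst(), addFirst()) // prints false true
}
```

In this model, adding first means that whichever side runs second finds the other: either the policy sees the waiting resource, or the resource's lookup sees the policy.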
Which issue(s) this PR fixes:
Fixes #1195
Special notes for your reviewer:
Does this PR introduce a user-facing change?: