Don't send endpoint profile updates from Server updates when opaqueness doesn't change #12013
Conversation
Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
This is a good change. I am a fan of the "pruning" approach you have here. Left a question about resyncs. Also, I think it would be good to add a test (if it is not too hard to wire up) that demonstrates that updates we do not need to process do not end up in the queue.
if oldServer.ResourceVersion == newServer.ResourceVersion {
	return
}
As far as I understand, you are already minimizing the amount of work dispatched by checking whether there is a change in the "opaqueness" in updateServers. Why go further and ignore resyncs too? If I remember correctly, we discussed ignoring resyncs for the endpoint slices controller and reached the conclusion that it is fine to skip them only if we have another retry mechanism (a retry queue), which we do not have here. Some follow-up questions:
- If we are skipping resyncs here, why not skip them for all update callbacks (e.g. pods)?
- Do we have any idea how much work needlessly ends up in the update queue because we were not checking for an opaqueness state change, as opposed to not ignoring resyncs? In other words, if you let resyncs go through, would the rest of the changes still solve the problem we were seeing?
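For readers following the resync discussion, here is a minimal, hypothetical sketch (not the PR's actual code) of what skipping resyncs in an informer update callback looks like: a periodic resync re-delivers the cached object with an unchanged ResourceVersion, so comparing the old and new versions filters those events out, which is what the quoted check above does for Server objects.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/tools/cache"
)

// newUpdateHandler wraps a processing function in an UpdateFunc that skips
// resync deliveries, identified by an unchanged ResourceVersion.
func newUpdateHandler(process func(obj interface{})) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldMeta, err := meta.Accessor(oldObj)
			if err != nil {
				return
			}
			newMeta, err := meta.Accessor(newObj)
			if err != nil {
				return
			}
			// A resync re-delivers the same cached object, so the
			// ResourceVersion is identical and there is nothing new to do.
			if oldMeta.GetResourceVersion() == newMeta.GetResourceVersion() {
				return
			}
			process(newObj)
		},
	}
}

func main() {
	h := newUpdateHandler(func(obj interface{}) { fmt.Println("processing real update") })
	_ = h // in real code this would be registered via informer.AddEventHandler(h)
}
```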
// Fill in ownership.
if wp.addr.ExternalWorkload != nil && len(wp.addr.ExternalWorkload.GetOwnerReferences()) == 1 {
	wp.addr.OwnerKind = wp.addr.ExternalWorkload.GetOwnerReferences()[0].Kind
	wp.addr.OwnerName = wp.addr.ExternalWorkload.GetOwnerReferences()[0].Name
}
TIOLI: you could abstract this ownership filling into a separate function to make things more readable, here and in the other place where you do it.
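As a hedged illustration of that suggestion, the ownership filling could be pulled into a small helper along these lines; the Address struct and the ownedBy name are stand-ins for the watcher's real types, not the repo's actual API.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Address is a stand-in for the watcher's address struct.
type Address struct {
	OwnerKind string
	OwnerName string
}

// ownedBy returns the kind and name of the sole owner reference, if exactly one exists.
func ownedBy(refs []metav1.OwnerReference) (kind, name string, ok bool) {
	if len(refs) != 1 {
		return "", "", false
	}
	return refs[0].Kind, refs[0].Name, true
}

func main() {
	addr := Address{}
	// In the watcher this slice would come from wp.addr.ExternalWorkload.GetOwnerReferences().
	refs := []metav1.OwnerReference{{Kind: "Deployment", Name: "example"}}
	if kind, name, ok := ownedBy(refs); ok {
		addr.OwnerKind = kind
		addr.OwnerName = name
	}
	fmt.Println(addr.OwnerKind, addr.OwnerName)
}
```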
These are extremely nice simplifications 🤟
_, opaqueProtocol := opaquePorts[address.Port]
profile := &pb.DestinationProfile{
	RetryBudget:    defaultRetryBudget(),
	Endpoint:       endpoint,
-	OpaqueProtocol: address.OpaqueProtocol,
+	OpaqueProtocol: opaqueProtocol || address.OpaqueProtocol,
Would it be possible to just rely on address.OpaqueProtocol here, and refactor ept.createEndpoint() so it also relies solely on address.OpaqueProtocol?
This would be a good change, but after giving it a try, I think it's a larger refactor that would be better to separate from this change. Basically, we are not consistent with where we consume opaque protocol annotations: we read them both in the translator and in the watcher. Centralizing this so that opaque protocol annotations are consumed in one place would be a good follow-up refactor.
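As a rough illustration of that deferred refactor, the idea would be to resolve opaqueness in one place (the watcher) so translators can rely solely on address.OpaqueProtocol. This is a hypothetical sketch with illustrative names, not the repo's actual API.

```go
package main

import "fmt"

// Address is a stand-in for the watcher's address type.
type Address struct {
	Port           uint32
	OpaqueProtocol bool
}

// resolveOpaqueness is the single place that consults the opaque-ports set;
// everything downstream reads Address.OpaqueProtocol and never re-checks.
func resolveOpaqueness(addr *Address, opaquePorts map[uint32]struct{}) {
	if _, ok := opaquePorts[addr.Port]; ok {
		addr.OpaqueProtocol = true
	}
}

func main() {
	addr := &Address{Port: 4143}
	resolveOpaqueness(addr, map[uint32]struct{}{4143: {}})
	fmt.Println(addr.OpaqueProtocol) // true; the translator would not consult opaquePorts again
}
```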
Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
LGTM
Signed-off-by: Alex Leong <[email protected]>
When the destination controller receives an update for a Server resource, we recompute the opaqueness of ports for all pods. This results in a large number of updates to all endpoint profile watches, even if the opaqueness doesn't change. In cases where there are many Server resources, this can result in a large number of updates being sent to the endpoint profile translator and overflowing the endpoint profile translator update queue. This is especially likely to happen during an informer resync, since this results in an informer callback for every Server in the cluster.
We refactor the workload watcher to not send these updates if the opaqueness has not changed.
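The core idea, shown here as a minimal hypothetical sketch rather than the actual watcher code, is to compare the newly computed opaqueness with the last value sent and only dispatch to listeners when it differs:

```go
package main

import "fmt"

// workload is a stand-in for the watcher's per-endpoint state.
type workload struct {
	opaque    bool
	listeners []func(opaque bool)
}

// updateOpaqueness recomputes opaqueness and dispatches to listeners only
// when the value differs from what was last sent.
func (w *workload) updateOpaqueness(newOpaque bool) {
	if w.opaque == newOpaque {
		return // nothing changed; don't flood the update queue
	}
	w.opaque = newOpaque
	for _, l := range w.listeners {
		l(newOpaque)
	}
}

func main() {
	w := &workload{listeners: []func(bool){func(o bool) { fmt.Println("update:", o) }}}
	w.updateOpaqueness(true)  // dispatched
	w.updateOpaqueness(true)  // pruned
	w.updateOpaqueness(false) // dispatched
}
```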
This seemingly simple change in behavior requires a large code change because:
- a createAddress func to dynamically create an Address from the state it stores (a rough sketch of this idea follows below)
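Purely as an illustration of that last point, a createAddress-style helper might build an Address on demand from the state the publisher stores, instead of caching a pre-built Address. All field and type names below are assumptions, not the repo's actual code.

```go
package main

import "fmt"

// Address is a stand-in for the watcher's address type.
type Address struct {
	IP             string
	Port           uint32
	OpaqueProtocol bool
}

// workloadPublisher is a stand-in for the per-workload state the watcher keeps.
type workloadPublisher struct {
	ip          string
	port        uint32
	opaquePorts map[uint32]struct{}
}

// createAddress assembles an Address from the stored state at the moment it is
// needed, so opaqueness is always derived from the current opaque ports.
func (wp *workloadPublisher) createAddress() Address {
	_, opaque := wp.opaquePorts[wp.port]
	return Address{IP: wp.ip, Port: wp.port, OpaqueProtocol: opaque}
}

func main() {
	wp := &workloadPublisher{ip: "10.0.0.1", port: 4143, opaquePorts: map[uint32]struct{}{4143: {}}}
	fmt.Printf("%+v\n", wp.createAddress())
}
```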