
Don't send endpoint profile updates from Server updates when opaqueness doesn't change #12013

Merged
merged 5 commits into main from alex/server-madness on Mar 19, 2024

Conversation

@adleong (Member) commented on Jan 30, 2024

When the destination controller receives an update for a Server resource, we recompute opaque ports for all pods. This results in a large number of updates to all endpoint profile watches, even if the opaqueness doesn't change. When there are many Server resources, this can send a large number of updates to the endpoint profile translator and overflow its update queue. This is especially likely to happen during an informer resync, which triggers an informer callback for every Server in the cluster.

We refactor the workload watcher to not send these updates if the opaqueness has not changed.

This seemingly simple change in behavior requires a large code change because:

  • the current opaqueness state is not stored on workload publishers and must be added so that we can determine if the opaqueness has changed
  • storing the opaqueness in addition to the other state we're storing (pod, ip, port, etc.) means that we are now storing all of the data represented by the Address struct
  • workload watcher uses a createAddress func to dynamically create an Address from the state it stores
  • now that we are storing the Address as state, creating Addresses dynamically is no longer necessary and we can operate on the Address state directly
    • this makes the workload watcher more similar to the other watchers, following a common pattern
    • it also fixes some minor correctness issues:
      • pods that did not have the ready status condition were being considered when they should not have been
      • updates to ExternalWorkload labels were not being considered
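
To illustrate the intended behavior, here is a minimal, self-contained Go sketch of the update path described above. The type and method names (workloadPublisher, updateServer, Listener, etc.) are illustrative stand-ins, not the actual linkerd2 identifiers.

package main

import "fmt"

// Address is a trimmed-down stand-in for the watcher's Address struct.
type Address struct {
	IP             string
	Port           uint32
	OpaqueProtocol bool
}

// Listener receives address updates (for example, the endpoint profile translator).
type Listener interface {
	Update(addr *Address)
}

// workloadPublisher stores the Address it last published so a Server
// change can be compared against the previously computed opaqueness.
type workloadPublisher struct {
	addr      Address
	listeners []Listener
}

// updateServer recomputes opaqueness from the Server-derived opaque port
// set and only notifies listeners when the value actually changes.
func (wp *workloadPublisher) updateServer(opaquePorts map[uint32]struct{}) {
	_, opaque := opaquePorts[wp.addr.Port]
	if opaque == wp.addr.OpaqueProtocol {
		// No change: skip the update so informer resyncs don't flood
		// the endpoint profile translator's queue.
		return
	}
	wp.addr.OpaqueProtocol = opaque
	for _, l := range wp.listeners {
		l.Update(&wp.addr)
	}
}

type printListener struct{}

func (printListener) Update(addr *Address) { fmt.Printf("update: %+v\n", *addr) }

func main() {
	wp := &workloadPublisher{
		addr:      Address{IP: "10.0.0.1", Port: 8080},
		listeners: []Listener{printListener{}},
	}
	wp.updateServer(map[uint32]struct{}{8080: {}}) // port becomes opaque: update sent
	wp.updateServer(map[uint32]struct{}{8080: {}}) // unchanged: no update sent
}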

@adleong requested a review from a team as a code owner on January 30, 2024 01:09
@zaharidichev (Member) left a comment

This is a good change. I am a fan of the "pruning" approach you have here. Left a question about resyncs. Also, I think it would be good to add a test (if it is not too hard to wire up) that demonstrates that updates we do not need to process do not end up in the queue.

Comment on lines 301 to 303
if oldServer.ResourceVersion == newServer.ResourceVersion {
	return
}
@zaharidichev (Member) commented:

As far as I understand, you are already minimizing the amount of work dispatched by checking whether there is a change in the "opaqueness" in updateServers. Why go further and ignore resyncs too? If I remember correctly, we discussed ignoring resyncs when it came to the endpoint slices controller, and reached the conclusion that it is fine to skip them only if we have another retry mechanism (a retry queue), which we do not have here. Here are some follow-up questions:

  1. If we are skipping resyncs here, why not skip them for all update callbacks (e.g. pods)?
  2. Do we have any idea how much work needlessly ends up in the update queue because we were not checking for opaqueness state changes, as opposed to not ignoring resyncs? In other words, if you let resyncs go through, would the rest of the changes still solve the problem that we were seeing?
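
For context on the pattern being discussed, here is a minimal self-contained sketch (names are illustrative, not the actual linkerd2 handler) of where a ResourceVersion comparison like the one above typically sits: a shared informer's UpdateFunc also fires on periodic resyncs, where the old and new objects are identical, so comparing resource versions filters those callbacks out before any work is dispatched.

package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/tools/cache"
)

// addServerHandler registers an update callback that ignores resyncs by
// comparing resource versions. The function and parameter names are
// hypothetical; only the resync-skip pattern is the point here.
func addServerHandler(informer cache.SharedIndexInformer, onUpdate func(obj interface{})) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldMeta, err := meta.Accessor(oldObj)
			if err != nil {
				return
			}
			newMeta, err := meta.Accessor(newObj)
			if err != nil {
				return
			}
			if oldMeta.GetResourceVersion() == newMeta.GetResourceVersion() {
				// A resync delivered an unchanged object: nothing to do.
				return
			}
			onUpdate(newObj)
		},
	})
}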

Comment on lines +656 to +659
// Fill in ownership.
if wp.addr.ExternalWorkload != nil && len(wp.addr.ExternalWorkload.GetOwnerReferences()) == 1 {
	wp.addr.OwnerKind = wp.addr.ExternalWorkload.GetOwnerReferences()[0].Kind
	wp.addr.OwnerName = wp.addr.ExternalWorkload.GetOwnerReferences()[0].Name
@zaharidichev (Member) commented:

TIOLI (take it or leave it): you could abstract this ownership filling into a separate function to make things more readable, here and in the other place where you are doing it.
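
A minimal sketch of what that extraction might look like; the helper name (setOwnership) and the trimmed-down Address type are hypothetical and not part of the PR.

package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Address is a trimmed-down stand-in for the watcher's Address struct,
// keeping only the fields the ownership logic touches. metav1.Object is
// enough here because only GetOwnerReferences is needed.
type Address struct {
	ExternalWorkload metav1.Object
	OwnerKind        string
	OwnerName        string
}

// setOwnership pulls the inline ownership-filling block into one place
// so both call sites can share it.
func setOwnership(addr *Address) {
	if addr.ExternalWorkload == nil {
		return
	}
	owners := addr.ExternalWorkload.GetOwnerReferences()
	if len(owners) != 1 {
		return
	}
	addr.OwnerKind = owners[0].Kind
	addr.OwnerName = owners[0].Name
}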

@alpeb (Member) left a comment

These are extremely nice simplifications 🤟

Comment on lines +127 to +131
	_, opaqueProtocol := opaquePorts[address.Port]
	profile := &pb.DestinationProfile{
		RetryBudget:    defaultRetryBudget(),
		Endpoint:       endpoint,
-		OpaqueProtocol: address.OpaqueProtocol,
+		OpaqueProtocol: opaqueProtocol || address.OpaqueProtocol,
@alpeb (Member) commented:

Would it be possible to just rely on address.OpaqueProtocol here, and refactor ept.createEndpoint() so it also relies solely on address.OpaqueProtocol?

@adleong (Member, Author) replied:

This would be a good change, but after giving it a try, I think it's a larger refactor that is better separated from this change. Basically, we're not consistent about where we consume opaque protocol annotations: we read them both in the translator and in the watcher. Centralizing this so that opaque protocol annotations are consumed in one place would be a good follow-up refactor.
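
For context, that follow-up refactor would amount to something like the following self-contained sketch. It is hypothetical and not code from this PR: the type names (Address, Profile) and functions (resolveOpaqueness, buildProfile) are stand-ins showing opaqueness being computed in one place and consumers reading only Address.OpaqueProtocol.

package example

// Address is a trimmed-down stand-in for the watcher's Address struct.
type Address struct {
	Port           uint32
	OpaqueProtocol bool
}

// resolveOpaqueness is the single hypothetical place where opaqueness is
// computed, from the Server-derived opaque port set and any
// annotation-derived decision.
func resolveOpaqueness(addr *Address, opaquePorts map[uint32]struct{}, annotatedOpaque bool) {
	_, portOpaque := opaquePorts[addr.Port]
	addr.OpaqueProtocol = portOpaque || annotatedOpaque
}

// Profile is a stand-in for the relevant part of pb.DestinationProfile.
type Profile struct {
	OpaqueProtocol bool
}

// buildProfile shows the translator side relying solely on the Address,
// instead of re-checking opaquePorts itself.
func buildProfile(addr *Address) Profile {
	return Profile{OpaqueProtocol: addr.OpaqueProtocol}
}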

@zaharidichev (Member) left a comment

LGTM

Signed-off-by: Alex Leong <[email protected]>
@adleong merged commit 5915ef5 into main on Mar 19, 2024
35 checks passed
@adleong deleted the alex/server-madness branch on March 19, 2024 17:24