fix(envoy): create a CDS cluster per model #5916

Merged
merged 1 commit into from
Sep 18, 2024

@driev driev commented Sep 17, 2024

What this PR does / why we need it:

When scaling a model replica count up, existing clusters were removed and new ones were added in the delta response, as the cluster name changed. This resulted in the downstream receiving 503s due to no cluster being found.

Instead of changing the cluster name every time the number of replicas changes, keep the cluster name static so that clusters are updated in place.
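
To illustrate the difference, here is a minimal sketch (not the code from this PR). It assumes the old cluster name was derived from the replica set and the new one only from the model name and version. The functions `replicaDependentName`, `staticName` and `delta`, the name formats, and the model name `iris` are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// replicaDependentName is a hypothetical stand-in for the old behaviour:
// the replica set is folded into the name, so scaling changes the name.
func replicaDependentName(model string, version int, replicas []int) string {
	ids := make([]string, len(replicas))
	for i, r := range replicas {
		ids[i] = fmt.Sprint(r)
	}
	sort.Strings(ids)
	return fmt.Sprintf("%s_%d_%s", model, version, strings.Join(ids, "-"))
}

// staticName is a hypothetical stand-in for the new behaviour:
// the name depends only on model and version, never on the replicas.
func staticName(model string, version int, _ []int) string {
	return fmt.Sprintf("%s_%d", model, version)
}

// delta reports which resource names disappear and which appear between
// two snapshots, roughly what a delta xDS response would carry.
func delta(before, after []string) (removed, added []string) {
	prev := map[string]bool{}
	for _, n := range before {
		prev[n] = true
	}
	next := map[string]bool{}
	for _, n := range after {
		next[n] = true
	}
	for _, n := range before {
		if !next[n] {
			removed = append(removed, n)
		}
	}
	for _, n := range after {
		if !prev[n] {
			added = append(added, n)
		}
	}
	return removed, added
}

func main() {
	before, after := []int{0}, []int{0, 1} // scale from one replica to two

	schemes := []struct {
		label string
		name  func(string, int, []int) string
	}{
		{"replica-dependent", replicaDependentName},
		{"static", staticName},
	}
	for _, s := range schemes {
		removed, added := delta(
			[]string{s.name("iris", 1, before)},
			[]string{s.name("iris", 1, after)},
		)
		fmt.Printf("%s: removed=%v added=%v\n", s.label, removed, added)
	}
}
```

With the replica-dependent name, the scale-up shows up as one removed cluster plus one added cluster, which is the window in which routes can point at a cluster that no longer exists. With the static name, the set of cluster names is unchanged, so only the cluster's contents need updating.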

Which issue(s) this PR fixes:

Fixes #
INFRA-1150

Special notes for your reviewer:

@driev driev requested review from sakoush and lc525 as code owners September 17, 2024 15:52
@driev driev added the v2 label Sep 17, 2024
Member

@sakoush sakoush left a comment

In principle this looks like it will solve the reported 503 issue. However, I am not sure how this will scale with the number of models. Any thoughts on how Envoy handles clusters in the 100s-1000s?

Supporting this scale might not be of immediate concern though (yet).

Author

driev commented Sep 18, 2024

> In principle this looks like it will solve the reported 503 issue. However, I am not sure how this will scale with the number of models. Any thoughts on how Envoy handles clusters in the 100s-1000s?
>
> Supporting this scale might not be of immediate concern though (yet).

There's no hard limit on the number of clusters Envoy can handle, provided resources are allocated appropriately. Since these clusters don't have active health checking, the overhead of having one per model shouldn't be too expensive.

Author

driev commented Sep 18, 2024

And for reference, performance will likely be impacted by the number of stats generated by the clusters.

envoyproxy/envoy#19946

Member

@sakoush sakoush left a comment

LGTM! Should we also remove the computeHashKeyForList util, as it is no longer used (we can always recover it from history if required)?

Author

driev commented Sep 18, 2024

> LGTM! Should we also remove the computeHashKeyForList util, as it is no longer used (we can always recover it from history if required)?

It's gone - check the first change in the list.

@driev driev merged commit 36f6189 into SeldonIO:v2 Sep 18, 2024
4 checks passed
@driev driev deleted the INFRA-1150/envoy-cluster-per-model-version branch September 18, 2024 13:07
Member

sakoush commented Sep 18, 2024

> > LGTM! Should we also remove the computeHashKeyForList util, as it is no longer used (we can always recover it from history if required)?
>
> It's gone - check the first change in the list.

Thanks, for some reason I missed it. I should have looked more carefully.

2 participants