Envoy Gateway performance at scale #1365

Closed
arkodg opened this issue Apr 25, 2023 · 27 comments · Fixed by #3599
Assignees
Labels
area/ci CI and build related issues area/performance documentation Improvements or additions to documentation kind/enhancement New feature or request no stalebot road-to-ga
Milestone

Comments

@arkodg
Contributor

arkodg commented Apr 25, 2023

Description:
This issue tracks the performance (throughput, latency) of the Envoy Gateway control plane and data plane at scale (Services, xRoutes, Gateways, client connections).

  • The output of this issue should be a document that helps users understand the default performance of Envoy Gateway and how it can be tuned as scale increases.
  • Other issues might need to be created (based on testing) and linked to this issue.
  • A testing framework/script that can run in CI should be introduced to make the performance measurements reproducible.

Relevant Links:
Emissary: https://www.getambassador.io/docs/emissary/latest/topics/running/scaling
Contour: https://github.com/projectcontour/contour-perf / https://projectcontour.io/guides/resource-limits/
Istio: https://istio.io/v1.16/docs/ops/deployment/performance-and-scalability/

@arkodg arkodg added kind/enhancement New feature or request documentation Improvements or additions to documentation area/ci CI and build related issues labels Apr 25, 2023
@arkodg arkodg added this to the 0.5.0-rc1 milestone Apr 25, 2023
@arkodg
Contributor Author

arkodg commented Apr 25, 2023

cc @AliceProxy @haq204

@arkodg
Contributor Author

arkodg commented Apr 25, 2023

@kflynn
Contributor

kflynn commented Apr 25, 2023

Good first target: 1GB RAM usage at 1000 HTTPRoutes. (We'd probably be OK at 2GB, but let's go for 1GB.)

@Xunzhuo
Member

Xunzhuo commented Apr 26, 2023

Do we run performance tests in GitHub CI? I do not know whether GitHub CI provides enough resources to run large-scale EG tests.

@arkodg
Contributor Author

arkodg commented Apr 26, 2023

> Do we run performance tests in GitHub CI? I do not know whether GitHub CI provides enough resources to run large-scale EG tests.

It provides 7GB RAM (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources), so we might be able to do without a self-hosted runner, since we are shooting for a usage of 1GB.

@haq204
Contributor

haq204 commented Apr 26, 2023

Current statistics

Notes:

  • This is for a vanilla install of Envoy Gateway, with no extensions
  • HTTPRoutes do not use any filters
  • HTTPRoutes all route to the same Service, just with different paths (e.g. /get<n>)
  • It is against a single Gateway & GatewayClass

| CPU | Memory | # simultaneously applied HTTPRoutes before OOMing |
|-------|------|---------|
| 1000m | 1 GB | 400-500 |
| 1000m | 2 GB | 700 |
| 1000m | 4 GB | 1000 |
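
A setup like the one in the notes above could be reproduced with a small generator script. The following is only a sketch, not the script used for these numbers: the route name prefix `perf-route-`, the Gateway name `eg`, the Service name `backend`, and port `3000` are all hypothetical, and the `gateway.networking.k8s.io/v1` API version is an assumption (older Gateway API releases serve HTTPRoute as `v1beta1`).

```python
# Sketch: render N HTTPRoutes that all target one Service, differing only
# in path (/get<n>), attached to a single Gateway. All names are hypothetical.


def http_route_manifest(index: int, gateway: str = "eg",
                        service: str = "backend", port: int = 3000) -> str:
    """Return the YAML for one HTTPRoute matching the path prefix /get<index>."""
    return f"""apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: perf-route-{index}
spec:
  parentRefs:
    - name: {gateway}
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /get{index}
      backendRefs:
        - name: {service}
          port: {port}
"""


def render_routes(count: int) -> str:
    """Join `count` manifests into one multi-document YAML stream."""
    return "---\n".join(http_route_manifest(i) for i in range(count))


if __name__ == "__main__":
    # Write the stream to stdout so it can be piped to `kubectl apply -f -`.
    print(render_routes(1000), end="")
```

Piping the whole stream into a single `kubectl apply -f -` applies the routes in one batch, which is closest to the "simultaneously applied" condition in the table.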

@arkodg
Contributor Author

arkodg commented Apr 27, 2023

It looks like Envoy proxy is building a framework / test suite for perf testing,
https://github.com/envoyproxy/envoy-perf/tree/main/salvo, which is based on the existing Envoy Nighthawk load generator.
Although the primary goal of this issue is control plane performance, it would be good to leverage existing Envoy projects to quantify data plane performance and to see whether these frameworks can also be used to scale and test control plane performance.

arkodg added a commit to arkodg/gateway that referenced this issue Apr 27, 2023
* Moves envoyproxy#24 into v0.5.0 since it carries over from v0.4.0
* Adds envoyproxy#1365 since it tracks the work items of the Scale theme
* Removed other items not tied directly to the roadmap theme
* Added a placeholder roadmap theme for v0.6.0

Signed-off-by: Arko Dasgupta <[email protected]>
zirain pushed a commit that referenced this issue May 4, 2023
* Update roadmap for v0.5.0

* Moves #24 into v0.5.0 since it carries over from v0.4.0
* Adds #1365 since it tracks the work items of the Scale theme
* Removed other items not tied directly to the roadmap theme
* Added a placeholder roadmap theme for v0.6.0

Signed-off-by: Arko Dasgupta <[email protected]>

* rm unused link

Signed-off-by: Arko Dasgupta <[email protected]>

* fix roadmap

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
@Xunzhuo
Member

Xunzhuo commented May 13, 2023

Hey @arkodg, I am not sure this one can be done in v0.5.0, because I have been short on bandwidth recently. I will focus more on observability of the Envoy Gateway control plane. If I still have bandwidth after resolving EG observability, I can work on this one, but I cannot commit to v0.5.0.

So if any other maintainer wants to take this one over from me, feel free. Thanks.

@arkodg
Contributor Author

arkodg commented May 13, 2023

np @Xunzhuo, thanks for the heads up, please unassign yourself, hoping someone from the community will pick this one up

@Xunzhuo Xunzhuo removed their assignment May 13, 2023
@Xunzhuo Xunzhuo added the help wanted Extra attention is needed label May 15, 2023
@arkodg arkodg modified the milestones: 0.5.0-rc1, 0.6.0-rc1 Jul 6, 2023
@arkodg
Contributor Author

arkodg commented Jul 6, 2023

moving this to the 0.6.0-rc1 milestone since it is still unassigned and unlikely to be finished within the v0.5.0 timeline

@arkodg
Contributor Author

arkodg commented Sep 19, 2023

@qicz checking in to see if you have any cycles to help with this one

@soulxu
Member

soulxu commented Jan 25, 2024

cc @gyohuangxin

@arkodg
Contributor Author

arkodg commented Jan 25, 2024

thanks @soulxu & @gyohuangxin for picking this up!

@arkodg arkodg removed the help wanted Extra attention is needed label Jan 25, 2024
@gyohuangxin
Member

@arkodg @Xunzhuo Here is the proposal, "Propose to add Performance Benchmarking at Scale in EnvoyGateway CI Pipeline", which outlines some plans and options based on my personal thoughts. I am looking for your feedback. If my ideas are incorrect or do not meet the original intention of this issue, please correct me. Thanks in advance! cc @soulxu

@Xunzhuo
Member

Xunzhuo commented Feb 4, 2024

Thanks @gyohuangxin. I have looked through the doc, and I think most of it covers data plane performance tests.

I would like to see more control plane perf tests, such as observing CP status at different scales of Gateway/xRoute/xPolicy/Service/Endpoint/EndpointSlice counts.

@gyohuangxin
Member

@Xunzhuo Thanks for your comments. My thought is to send load requests to the data plane at different scales, and then use Prometheus to collect metrics from both the control plane and the data plane. What do you mean by "observing CP status"? Is it "observing CPU status"? Yes, we can monitor the control plane's CPU status to see how many EndpointSlices a single EG instance can support. What do you think about it?

@soulxu
Member

soulxu commented Feb 4, 2024

> Thanks @gyohuangxin. I have looked through the doc, and I think most of it covers data plane performance tests.
>
> I would like to see more control plane perf tests, such as observing CP status at different scales of Gateway/xRoute/xPolicy/Service/Endpoint/EndpointSlice counts.

The data-plane and control-plane performance tests can be separate things. Do you mean that control-plane performance is more important for now? We can adjust the priorities.

@zirain
Member

zirain commented Feb 4, 2024

I think we should test control-plane first.

@gyohuangxin
Member

@zirain Thanks for your comments, we will consider control-plane first. But I think testing frameworks can be universal.

@arkodg
Contributor Author

arkodg commented Feb 6, 2024

the doc looks good @gyohuangxin, left some comments! agree with everyone here, we should focus on control plane first

something to also keep in mind while designing this @gyohuangxin, since you're also active in gateway api, would be great if the framework can be reused for other gateway api implementation in the future, implementation perf comparisons would really benefit the end user

from the document, looks like it can be EG agnostic

@gyohuangxin
Member

@arkodg Thanks for your helpful comments.

> something to also keep in mind while designing this @gyohuangxin, since you're also active in gateway api, would be great if the framework can be reused for other gateway api implementation in the future, implementation perf comparisons would really benefit the end user
>
> from the document, looks like it can be EG agnostic

It’s a great idea to use this framework in other gateway API implementations and compare their performance. You’re correct that this framework is inherently universal, and we should always take its versatility into account.

@arkodg
Contributor Author

arkodg commented Feb 6, 2024

@Xunzhuo is it possible to calculate the time to program the data plane via CP metrics today? that would be handy in perf benchmarking

@Xunzhuo
Member

Xunzhuo commented Feb 7, 2024

@arkodg by exposing some xds metrics?

@arkodg
Contributor Author

arkodg commented Feb 7, 2024

@Xunzhuo we can calculate it in the CP as the difference between the provider reconcile time and the xDS server push time (but that may not be entirely accurate)

@EltonzHu

> @Xunzhuo we can calculate it in the CP as the difference between the provider reconcile time and the xDS server push time (but that may not be entirely accurate)

Basically, we want to see the time consumed across all abstraction layers inside Envoy Gateway by measuring the interval from the provider reconcile stage to the xDS push stage. However, the moment at which the Envoy proxy actually applies the xDS update is outside the control of the control plane, which is Envoy Gateway in our case.
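
As a rough illustration of that measurement, the interval could be computed per configuration change and then summarized as percentiles. This is only a sketch under assumptions: the `Sample` type and its field names are hypothetical and do not correspond to any existing Envoy Gateway metric, and, as noted above, the result excludes the time the proxy needs to apply the pushed config.

```python
# Sketch: control-plane latency as (xDS push time - provider reconcile time).
# The Sample type and field names are hypothetical, not Envoy Gateway metrics.
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    reconcile_start: float  # seconds, when the provider began reconciling
    xds_push_done: float    # seconds, when the xDS server pushed the snapshot


def cp_latencies(samples: list[Sample]) -> list[float]:
    """Per-change control-plane latency in seconds."""
    return [s.xds_push_done - s.reconcile_start for s in samples]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = [Sample(0.0, 0.5), Sample(1.0, 1.25), Sample(2.0, 3.0)]
    latencies = cp_latencies(samples)
    print("p50:", percentile(latencies, 50), "p99:", percentile(latencies, 99))
```

Both timestamps need to come from the same clock (for example, the control plane process itself) for the difference to be meaningful.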

@arkodg arkodg modified the milestones: v1.0.0-rc1, v1.0.0 Mar 2, 2024
@arkodg arkodg modified the milestones: v1.0.0, v1.1.0-rc1 Mar 28, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

9 participants