
[scalability testing] large number of replicas (100) #830

Closed
jlewi opened this issue Sep 20, 2018 · 9 comments

Comments

@jlewi
Contributor

jlewi commented Sep 20, 2018

As a step towards reaching v1, I think we should consider running a scalability test to see whether we can correctly handle a large number (100-1000) of jobs running concurrently.

I think what we want to look at is:

  1. CPU/MEM usage of the operator
  2. The latency with which the operator responds to events,
    e.g. when a pod dies, does the operator still process it in a timely fashion?

For the latter part we might need to add instrumentation to TFOperator to report metrics about event processing to Prometheus. It would be good to sync with folks in the K8s community to see what they do.
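To make the latency measurement concrete, here is a minimal sketch of what such instrumentation could record. It is not TFOperator code: the operator would more likely use a Prometheus client library, whereas this sketch uses only the Python standard library to show the idea. The metric name `tfjob_event_latency_seconds`, the `observed_at` field, and the `process_event` helper are all hypothetical names chosen for illustration. The output follows the cumulative-bucket layout of the Prometheus text exposition format.

```python
import time
from bisect import bisect_left


class LatencyHistogram:
    """Histogram with cumulative buckets, rendered in Prometheus text style."""

    def __init__(self, name, buckets):
        self.name = name                      # hypothetical metric name
        self.buckets = sorted(buckets)        # bucket upper bounds, in seconds
        # One counter per bucket, plus a final slot for values above all bounds.
        self.counts = [0] * (len(self.buckets) + 1)
        self.total = 0.0
        self.n = 0

    def observe(self, seconds):
        # bisect_left finds the first bucket whose upper bound is >= seconds.
        self.counts[bisect_left(self.buckets, seconds)] += 1
        self.total += seconds
        self.n += 1

    def render(self):
        lines = [f"# TYPE {self.name} histogram"]
        running = 0
        for bound, count in zip(self.buckets, self.counts):
            running += count                  # buckets are cumulative
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {self.n}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.n}")
        return "\n".join(lines)


def process_event(event, handler, histogram, clock=time.monotonic):
    """Run the handler, then record time elapsed since the event was observed.

    `event["observed_at"]` is an assumed field holding the clock reading taken
    when the event (e.g. a pod deletion) was first seen by the controller.
    """
    handler(event)
    histogram.observe(clock() - event["observed_at"])
```

Under load, a growing share of observations landing in the higher buckets would indicate that the operator is falling behind on events such as pod deletions.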

See also:
#829 scalability testing for large number of concurrent jobs.

/cc @richardsliu
/cc @johnugeorge

@johnugeorge
Member

@gaocegege how should we plan this?

@richardsliu
Contributor

@johnugeorge I think we should first try to enhance the test code so it becomes easier to add an E2E test. Then we can extend the infrastructure to support stress testing and collect metrics.

@johnugeorge
Member

Are you planning to add this as an E2E test for every PR presubmit? I don't know if that is feasible. Does Kubernetes do it the same way?

@richardsliu
Contributor

Not on presubmits or postsubmits, but we can run this periodically on release (stable) branches.

@carmine added this to the 0.4.0 milestone Nov 6, 2018
@jlewi
Contributor Author

jlewi commented Nov 7, 2018

Downgrading to P1
/priority p1
/cancel-priority p0

@xyhuang
Member

xyhuang commented Feb 27, 2019

/assign

@richardsliu
Contributor

Anything else left to do here? Otherwise we can close this.

@richardsliu
Contributor

Results can be found at https://bit.ly/2CtWFn3. Closing this for now.
/close

@k8s-ci-robot

@richardsliu: Closing this issue.


