[scalability testing] large number of replicas (100) #830
Comments
@gaocegege how should we plan this?
@johnugeorge I think we should first try to enhance the test code so it becomes easier to add an E2E test. Then we can extend the infrastructure to support stress testing and collect metrics.
Are you planning to add this as an E2E test for every PR presubmit? I am not sure that is feasible. Does Kubernetes do it the same way?
Not on presubmits or postsubmits, but we can do this periodically on release (stable) branches.
Downgrading to P1.
/assign
Anything else left to do here? Otherwise we can close this.
Results can be found at https://bit.ly/2CtWFn3. Closing this for now.
@richardsliu: Closing this issue.
As a step towards reaching v1, I think we should consider doing a scalability test to see if we can correctly handle a large number (100-1000) of jobs running concurrently.
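For illustration, here is a minimal sketch of how such a load might be generated, assuming the kubeflow.org/v1 TFJob CRD and client-go's dynamic client; the job names, group/version, and empty spec are illustrative only, not a working job definition:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; an in-cluster config would also work.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Resource coordinates for the TFJob CRD; group/version may differ by release.
	gvr := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1", Resource: "tfjobs"}

	// Create 100 minimal TFJobs to put sustained load on the operator.
	for i := 0; i < 100; i++ {
		job := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "kubeflow.org/v1",
			"kind":       "TFJob",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("scale-test-%d", i)},
			// A real test would fill in spec.tfReplicaSpecs; elided here.
			"spec": map[string]interface{}{},
		}}
		if _, err := client.Resource(gvr).Namespace("default").Create(
			context.TODO(), job, metav1.CreateOptions{}); err != nil {
			fmt.Printf("failed to create TFJob %d: %v\n", i, err)
		}
	}
}
```

In practice the spec would need valid replica specs for the jobs to actually run, and spreading jobs across namespaces would better exercise the API server.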
I think what we want to look at is:
e.g. when a pod dies, does the operator still process it in a timely fashion?
For the latter part we might need to add instrumentation to TFOperator to report metrics about event processing to Prometheus. It would be good to sync with folks in the K8s community to see what they do.
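As a rough illustration of the kind of instrumentation meant here, a sketch using the Prometheus Go client; the metric name, label, and processEvent wrapper are hypothetical, not existing TFOperator code:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Histogram tracking how long the operator takes to handle one
// work-queue event. The metric name and label are illustrative.
var eventProcessingDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "tf_operator_event_processing_duration_seconds",
		Help:    "Time taken to process a single work-queue event.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"event_type"},
)

func init() {
	// Register with the default registry; the metrics would then be
	// exposed via promhttp on the operator's metrics endpoint.
	prometheus.MustRegister(eventProcessingDuration)
}

// processEvent is a hypothetical wrapper that times an event handler.
func processEvent(eventType string, handle func() error) error {
	start := time.Now()
	err := handle()
	eventProcessingDuration.WithLabelValues(eventType).Observe(time.Since(start).Seconds())
	return err
}
```

A histogram (rather than a gauge) lets us query tail latencies, which is what matters when checking whether event processing stays timely under load.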
See also:
#829: scalability testing for a large number of concurrent jobs.
/cc @richardsliu
/cc @johnugeorge