[scalability testing] large number of jobs (100?) running concurrently? #829

jlewi · 2018-09-20T01:23:58Z

As a step towards reaching v1, I think we should run a scalability test to see whether we can correctly handle a large number of jobs (100-1000) running concurrently.

I think what we want to look at is:

- CPU/MEM usage of the operator
- The latency with which the operator responds to events
  - e.g. when a pod dies, does the operator still process it in a timely fashion?

For the latter, we might need to add instrumentation to TFOperator so that it reports event-processing metrics to Prometheus. It would be good to sync with folks in the K8s community to see what they do.
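To make the event-latency idea concrete, here is a minimal sketch of the kind of instrumentation that could be added. It is not TFOperator code: the class, the bucket boundaries, and the metric name `tfjob_event_processing_latency_seconds` are all hypothetical. It only assumes that the operator can call one hook when an event is first seen and another when its handling finishes. The output follows the Prometheus text exposition layout for a histogram (cumulative `le` buckets, `_sum`, `_count`).

```python
import bisect
import time
from typing import Callable, Dict, Hashable, List, Optional, Sequence


class EventLatencyHistogram:
    """Hypothetical helper: tracks the delay between seeing an event and processing it."""

    def __init__(
        self,
        buckets: Sequence[float] = (0.1, 0.5, 1.0, 5.0, 30.0),  # seconds, assumed values
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.buckets: List[float] = sorted(buckets)
        # One counter per finite bucket, plus a trailing counter for +Inf.
        self.counts: List[int] = [0] * (len(self.buckets) + 1)
        self.total: float = 0.0
        self.pending: Dict[Hashable, float] = {}
        self.clock = clock

    def event_received(self, key: Hashable) -> None:
        # Keep the earliest timestamp if the same key is seen again before processing.
        self.pending.setdefault(key, self.clock())

    def event_processed(self, key: Hashable) -> Optional[float]:
        start = self.pending.pop(key, None)
        if start is None:
            return None  # never saw this event arrive; nothing to record
        latency = self.clock() - start
        # bisect_left puts a latency equal to a bound into that bound's bucket ("le").
        self.counts[bisect.bisect_left(self.buckets, latency)] += 1
        self.total += latency
        return latency

    def render(self, name: str = "tfjob_event_processing_latency_seconds") -> str:
        lines = [f"# TYPE {name} histogram"]
        cumulative = 0
        for bound, count in zip(self.buckets, self.counts):
            cumulative += count
            lines.append(f'{name}_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.counts[-1]
        lines.append(f'{name}_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"{name}_sum {self.total}")
        lines.append(f"{name}_count {cumulative}")
        return "\n".join(lines)
```

In a scalability run, one would call `event_received` when a pod event (e.g. a pod dying) is first observed and `event_processed` once the operator has handled it, then compare the bucket distribution at 100 vs. 1000 concurrent jobs. In the operator itself, a Prometheus client library would presumably replace the hand-rolled rendering.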

/cc @richardsliu
/cc @johnugeorge

chrisheecho · 2018-10-25T17:27:18Z

/priority p1

chrisheecho · 2018-10-29T15:59:05Z

/priority p2

As far as I'm aware, we don't have large users hitting this.

jbottum · 2018-11-03T21:13:53Z

/remove-priority p1

xyhuang · 2019-02-27T01:35:58Z

/assign

richardsliu · 2019-03-15T23:39:23Z

Anything else left to do here? Otherwise we can close this.

richardsliu · 2019-03-20T17:16:40Z

Results can be found at https://bit.ly/2CtWFn3. Closing this for now.
/close

k8s-ci-robot · 2019-03-20T17:16:42Z

@richardsliu: Closing this issue.

jlewi added area/tfjob area/0.4.0 area/api/v1beta1 labels Sep 20, 2018

jlewi mentioned this issue Sep 20, 2018

[scalability testing] large number of replicas (100) #830

Closed

richardsliu added the area/testing label Oct 19, 2018

k8s-ci-robot added the priority/p1 label Oct 25, 2018

k8s-ci-robot added the priority/p2 label Oct 29, 2018

k8s-ci-robot removed the priority/p1 label Nov 3, 2018

carmine added this to the 0.4.0 milestone Nov 6, 2018

richardsliu added area/0.5.0 and removed area/0.4.0 labels Jan 7, 2019

richardsliu mentioned this issue Jan 8, 2019

Support scalability tests kubeflow/kubebench#126

Open

k8s-ci-robot assigned xyhuang Feb 27, 2019

k8s-ci-robot closed this as completed Mar 20, 2019

jlewi mentioned this issue Aug 28, 2019

Notebook controller scalability testing - # concurrent notebooks kubeflow/kubeflow#4019

Closed

johnugeorge mentioned this issue Aug 30, 2019

Performance issue when there is a lot of completed jobs #965

Closed

gaocegege mentioned this issue Sep 11, 2019

Performance problem about pod informer #1079

Closed
