[scalability testing] large number of replicas (100) #830
Comments
@gaocegege how should we plan this?
@johnugeorge I think we should first try to enhance the test code so it becomes easier to add an E2E test. Then we can extend the infrastructure to support stress testing and collect metrics.
Are you planning to add this as an E2E test for every PR presubmit? I am not sure that is feasible. Does Kubernetes do it the same way?
Not on presubmits or postsubmits, but we can do this periodically on release (stable) branches.
Downgrading to P1.
/assign
Anything else left to do here? Otherwise we can close this.
Results can be found at https://bit.ly/2CtWFn3. Closing this for now.
@richardsliu: Closing this issue.
As a step towards reaching v1, I think we should consider doing a scalability test to see if we can correctly handle a large number (100-1000) of jobs running concurrently.
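For illustration, here is a minimal sketch of how such a load might be generated, assuming the kubeflow.org/v1 TFJob CRD and client-go's dynamic client; the job names, group/version, and empty spec are illustrative only, not a working job definition:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; an in-cluster config would also work.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Resource coordinates for the TFJob CRD; group/version may differ by release.
	gvr := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1", Resource: "tfjobs"}

	// Create 100 minimal TFJobs to put sustained load on the operator.
	for i := 0; i < 100; i++ {
		job := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "kubeflow.org/v1",
			"kind":       "TFJob",
			"metadata":   map[string]interface{}{"name": fmt.Sprintf("scale-test-%d", i)},
			// A real test would fill in spec.tfReplicaSpecs; elided here.
			"spec": map[string]interface{}{},
		}}
		if _, err := client.Resource(gvr).Namespace("default").Create(
			context.TODO(), job, metav1.CreateOptions{}); err != nil {
			fmt.Printf("failed to create TFJob %d: %v\n", i, err)
		}
	}
}
```

In practice the spec would need valid replica specs for the jobs to actually run, and spreading jobs across namespaces would better exercise the API server.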
I think what we want to look at is:
e.g. when a pod dies, does the operator still process it in a timely fashion?
For the latter part we might need to add instrumentation to TFOperator to report metrics about event processing to Prometheus. It would be good to sync with folks in the K8s community to see what they do.
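As a rough illustration of the kind of instrumentation meant here, a sketch using the Prometheus Go client; the metric name, label, and processEvent wrapper are hypothetical, not existing TFOperator code:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Histogram tracking how long the operator takes to handle one
// work-queue event. The metric name and label are illustrative.
var eventProcessingDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "tf_operator_event_processing_duration_seconds",
		Help:    "Time taken to process a single work-queue event.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"event_type"},
)

func init() {
	// Register with the default registry; the metrics would then be
	// exposed via promhttp on the operator's metrics endpoint.
	prometheus.MustRegister(eventProcessingDuration)
}

// processEvent is a hypothetical wrapper that times an event handler.
func processEvent(eventType string, handle func() error) error {
	start := time.Now()
	err := handle()
	eventProcessingDuration.WithLabelValues(eventType).Observe(time.Since(start).Seconds())
	return err
}
```

A histogram (rather than a gauge) lets us query tail latencies, which is what matters when checking whether event processing stays timely under load.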
See also:
#829: scalability testing for a large number of concurrent jobs.
/cc @richardsliu
/cc @johnugeorge