Performance issue when there are a lot of completed jobs #965
@zionwu How many completed jobs do you have in the system currently?
Resync period is currently 30s (https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator.v1beta2/app/server.go#L53).
Around 250.
Yes, I am using v0.3 and it is also 30s.
/cc @zabbasi
I changed the code by adding a new condition called "ResourceCleaned". The new condition is added to the job after its resources are cleaned up, and syncTFJob returns immediately for jobs that already have this condition. This fix improves the performance; now it only takes a few seconds.
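A minimal sketch of that early-return idea, using simplified stand-in types (the real TFJob status types live in the tf-operator API package, and "ResourceCleaned" is the condition proposed in this thread, not an existing one):

```go
package main

import "fmt"

// Simplified stand-ins for the TFJob status types defined in tf-operator's API package.
type JobConditionType string

type JobCondition struct {
	Type   JobConditionType
	Status string // "True" / "False" / "Unknown"
}

type JobStatus struct {
	Conditions []JobCondition
}

// Hypothetical new condition proposed in this thread.
const ResourceCleaned JobConditionType = "ResourceCleaned"

// hasCondition reports whether the status already carries condType with Status "True".
func hasCondition(status JobStatus, condType JobConditionType) bool {
	for _, c := range status.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	status := JobStatus{Conditions: []JobCondition{{Type: ResourceCleaned, Status: "True"}}}
	if hasCondition(status, ResourceCleaned) {
		// In syncTFJob this would be an early return: nothing left to reconcile.
		fmt.Println("skip reconcile: resources already cleaned")
	}
}
```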
It is slightly misleading to have a JobCondition named ResourceCleaned. How about using this condition (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/controller.go#L382) instead?
When is this condition set?
@johnugeorge could you please help answer the above question?
I had another look at the code. /cc @richardsliu
@johnugeorge in this case we have to introduce a new condition, right?
Is the new status dependent on the CleanPodPolicy?
The new status will be set no matter what the CleanPodPolicy is. We can set the new status at the end of this block https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/controller.go#L436, where the job is terminated.
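Roughly, the placement being discussed could look like the sketch below; `markResourceCleaned` is a hypothetical helper and the types are simplified stand-ins for the real TFJob status types:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the job status types (the real ones live in tf-operator's API package).
type Condition struct {
	Type           string
	Status         string
	LastUpdateTime time.Time
}

type JobStatus struct {
	Conditions []Condition
}

// markResourceCleaned is a hypothetical helper: it records that the terminated job's
// pods/services were deleted, so later syncs can skip the job entirely.
func markResourceCleaned(status *JobStatus) {
	status.Conditions = append(status.Conditions, Condition{
		Type:           "ResourceCleaned",
		Status:         "True",
		LastUpdateTime: time.Now(),
	})
}

func main() {
	// After the terminated-job branch deletes pods and services per CleanPodPolicy:
	var status JobStatus
	markResourceCleaned(&status)
	fmt.Printf("%+v\n", status.Conditions)
	// The controller would then persist the updated status via the TFJob client.
}
```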
"Resources cleaned up" means the pods and services are deleted; I think the completed status means all pods exited with code 0 but have not been deleted yet.
@johnugeorge @richardsliu Any update on this? I believe this is the only remaining issue blocking 1.0?
Last week we decided that this issue is not a blocker for 1.0, so we removed the label.
@zionwu Is this issue still valid? See the Scalability evaluation of operator doc.
Related: #829
Discussed with @johnugeorge, will lower this in priority based on the scalability evaluation.
@johnugeorge @zionwu Is there any update? I think we should have such a condition. If you do not have time, I can help implement it.
@johnugeorge The scalability evaluation only tested two cases: big TFJobs and a large number of concurrent TFJobs, but it did not test the case where there are a lot of completed jobs.
@gaocegege I have implemented it in my own environment. I can submit a PR if you agree with the fix I described in the previous comments.
@zionwu I think we need it and the implementation LGTM. @johnugeorge WDYT?
@gaocegege Great, I will submit a PR tomorrow.
@zionwu
Sorry for the late reply. My worry is that this is a change that will affect other operators too. The current terminal conditions are Succeeded and Failed, and I am not sure whether assumptions about them are made outside the project. So the best-case solution would be to solve this problem without adding a new one. Is it really necessary?
I can confirm the problem locally. I can work around it by setting reconcilePeriod to 12h or 24h, but I think it would be better if we had a condition to avoid useless reconcile loops.
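For anyone wanting to try that workaround, here is a sketch of wiring a longer resync period into a client-go shared informer factory; the actual tf-operator wiring in server.go differs, and the 24h value is just an example:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Build the informer factory with a much longer resync period so completed
	// jobs are not re-listed and re-queued every 30s.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Println("not running in a cluster:", err)
		return
	}
	kubeClient := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(kubeClient, 24*time.Hour)
	_ = factory // the controller would register its event handlers on factory informers
}
```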
AFAIU, the completed jobs still go through the reconcile loop on every resync.
Ref kubeflow/common#61: currently, if a job is finished (succeeded/failed), it still performs the cleanup tasks (e.g. deleting pods and services according to the CleanPodPolicy) and then returns.
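A simplified sketch of that flow (not the actual kubeflow/common code): for a finished job the reconciler only runs cleanup according to CleanPodPolicy and then returns, without reconciling replicas again.

```go
package main

import "fmt"

// reconcileFinished sketches what happens once a job is succeeded/failed:
// clean up pods/services per CleanPodPolicy, then stop; no further work is needed.
func reconcileFinished(cleanPodPolicy string, deleteResources func() error) error {
	if cleanPodPolicy != "None" {
		if err := deleteResources(); err != nil {
			return err
		}
	}
	return nil // finished jobs need no replica reconciliation
}

func main() {
	err := reconcileFinished("All", func() error {
		fmt.Println("deleting pods and services of the finished job")
		return nil
	})
	fmt.Println("reconcile returned:", err)
}
```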
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Issue Label Bot is not confident enough to auto-label this issue.
I have run tf-operator for a month, and recently I found it takes several minutes for tf-operator to start creating pods after I submit a new job, which is much slower than a month ago (a few seconds).
I think the cause is that the resyncPeriod is 30s and there are a lot of completed jobs that I don't want to delete, in order to keep the job history. So the work queue is always full of completed jobs, and new jobs do not get processed in time. I tried to increase the threadiness from 1 to 20, but it did not mitigate the issue much.
I would like to increase the resyncPeriod, but it is hard-coded.
To increase the performance, I changed syncTFJob: if the status is "completed", it does nothing and returns.
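As a side note on the threadiness knob mentioned above, the operator follows the usual client-go worker pattern; the sketch below (with an illustrative worker body) shows why adding workers alone does not help much when the shared workqueue keeps refilling with completed jobs on every resync:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runWorker stands in for the loop that pops TFJob keys off the workqueue
// and calls syncTFJob for each of them.
func runWorker() {
	fmt.Println("processing next work item")
}

func main() {
	// "Threadiness" is simply the number of goroutines draining the same queue.
	// If every resync re-queues hundreds of completed jobs, new jobs still wait
	// behind them regardless of how many workers run.
	threadiness := 20
	stopCh := make(chan struct{})
	for i := 0; i < threadiness; i++ {
		go wait.Until(runWorker, time.Second, stopCh)
	}
	time.Sleep(2 * time.Second)
	close(stopCh)
}
```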