
Performance issue when there is a lot of completed jobs #965

Closed

zionwu opened this issue Mar 25, 2019 · 36 comments

Comments

@zionwu
Contributor

zionwu commented Mar 25, 2019

I have been running tf-operator for a month, and recently I noticed that it takes several minutes for tf-operator to start creating pods after I submit a new job, whereas a month ago it only took a few seconds.

I think the cause is that the resyncPeriod is 30s, and there are a lot of completed jobs that I don't want to delete because I want to keep the job history. As a result the work queue is always full of completed jobs, and new jobs do not get processed in time.

I tried to increase the threadiness from 1 to 20, but it did not mitigate the issue much.
I would like to increase the resyncPeriod, but it is hard-coded.

To improve the performance:

  1. How about making resyncPeriod a command-line flag? (A sketch is included after this list.)
  2. How about adding a new status like "Completed"? The job is updated to "Completed" after the resources of the job are cleaned up. In syncTFJob, if the status is "Completed", do nothing and return.
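
For (1), this is roughly what I have in mind; the flag name and wiring below are just an illustration, not the actual tf-operator code:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// resyncPeriod replaces the hard-coded 30s constant; the flag name is hypothetical.
var resyncPeriod = flag.Duration("resync-period", 30*time.Second,
	"How often the informers resync the local cache with the apiserver.")

func main() {
	flag.Parse()
	fmt.Println("using resync period:", *resyncPeriod)
	// The parsed value would then be passed to the shared informer factory
	// instead of the hard-coded constant, e.g.:
	//   informerFactory := informers.NewSharedInformerFactory(kubeClient, *resyncPeriod)
}
```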
zionwu changed the title from "Performance issue when there is a lot of completed job" to "Performance issue when there is a lot of completed jobs" on Mar 25, 2019
@gaocegege
Member

/cc @richardsliu @johnugeorge

@johnugeorge
Member

@zionwu How many completed jobs do you have in the system currently?

@johnugeorge
Member

Resync period is currently 30s

@zionwu
Contributor Author

zionwu commented Mar 25, 2019

> How many completed jobs do you have in the system currently?

Around 250.

> Resync period is currently 30s

Yes, I am using v0.3 and it is also 30s.

@richardsliu
Contributor

/cc @zabbasi

@zionwu
Contributor Author

zionwu commented Mar 26, 2019

I increased resyncPeriod from 30s to 300s/900s, but it did not improve the performance much.

I changed the code by adding a new condition called "ResourceCleaned". The new condition is added to the job after its resources are cleaned up in reconcileTFJobs, and at the beginning of syncTFJob, if the new condition is found, we just return.

This fix improves the performance: creating pods for a new job now takes only a few seconds.
I will submit a PR if you agree with this approach.
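
Roughly, the change looks like this. The types below are simplified stand-ins for the real TFJob API, and the condition/helper names are the ones I picked, so treat it as a sketch rather than the final code:

```go
package main

// Simplified stand-ins for the real TFJob API types; illustrative only.
type JobCondition struct {
	Type   string
	Status string
}

type TFJobStatus struct {
	Conditions []JobCondition
}

type TFJob struct {
	Status TFJobStatus
}

const TFJobResourceCleaned = "ResourceCleaned"

// hasCondition reports whether the job already carries the given condition set to True.
func hasCondition(status TFJobStatus, condType string) bool {
	for _, c := range status.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

// syncTFJob sketch: return early for jobs whose resources were already cleaned
// up, so completed jobs no longer go through the full reconcile on every resync.
func syncTFJob(tfjob *TFJob) error {
	if hasCondition(tfjob.Status, TFJobResourceCleaned) {
		return nil
	}
	// ... normal reconciliation (reconcileTFJobs) would continue here ...
	return nil
}

func main() {
	done := &TFJob{Status: TFJobStatus{
		Conditions: []JobCondition{{Type: TFJobResourceCleaned, Status: "True"}},
	}}
	_ = syncTFJob(done) // returns immediately for the already-cleaned job
}
```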

@johnugeorge
Member

It is slightly misleading to have a JobCondition named ResourceCleaned. How about using this condition instead (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/controller.go#L382)?

@zionwu
Contributor Author

zionwu commented Mar 26, 2019

When the condition tfjob.Status.ReplicaStatuses[rtype].Active == 0 is true, the job must have failed or succeeded and the resources are already cleaned up, right? If so, I think we can use this condition.
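
For reference, a minimal sketch of the check I mean, using simplified stand-in types rather than the real TFJob API:

```go
package main

import "fmt"

// Simplified stand-in for the real replica status type; illustrative only.
type ReplicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

type TFJobStatus struct {
	ReplicaStatuses map[string]*ReplicaStatus
}

// noActiveReplicas reports whether ReplicaStatuses[rtype].Active == 0 holds
// for every replica type of the job.
func noActiveReplicas(status TFJobStatus) bool {
	for _, rs := range status.ReplicaStatuses {
		if rs != nil && rs.Active > 0 {
			return false
		}
	}
	return true
}

func main() {
	status := TFJobStatus{ReplicaStatuses: map[string]*ReplicaStatus{
		"Worker": {Active: 0, Succeeded: 2},
		"PS":     {Active: 0, Succeeded: 1},
	}}
	fmt.Println(noActiveReplicas(status)) // true once no pods are active
}
```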

@zionwu
Contributor Author

zionwu commented Mar 28, 2019

> When the condition tfjob.Status.ReplicaStatuses[rtype].Active == 0 is true, the job must have failed or succeeded and the resources are already cleaned up, right?

@johnugeorge could you please help answer the above question?

@johnugeorge
Member

johnugeorge commented Mar 28, 2019

When I took another look at the code, Active is updated only when the job has succeeded. If the job has failed, there will be both Active and Failed pods, so this solution won't work in that case. See #897 (comment)

/cc @richardsliu

@zionwu
Contributor Author

zionwu commented Mar 28, 2019

@johnugeorge in this case we have to introduce a new condition, right?

johnugeorge mentioned this issue on Mar 28, 2019
@johnugeorge
Member

Is the new Completed status set only when resources are cleaned up? What would the behavior be if the CleanPodPolicy of the job is set to None or Running?

@zionwu
Contributor Author

zionwu commented Mar 28, 2019

The new status will be set no matter what CleanPodPolicy is.
It is used to mark that the required steps (if any) are done after the job terminated, so the next time we see a job with the new status we can do nothing and return. This avoids re-processing terminated jobs.

We can set the new status at the end of this block https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/controller.go#L436, where the job is terminated.
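
Concretely, something along these lines at the end of that block; the types and helper name are simplified stand-ins, and the real change would go through whatever condition-update helper the controller already uses:

```go
package main

import "time"

// Simplified stand-ins for the real TFJob API types; illustrative only.
type JobCondition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

type TFJobStatus struct {
	Conditions []JobCondition
}

const TFJobResourceCleaned = "ResourceCleaned"

// markResourceCleaned records that post-termination handling is done. It is
// set regardless of CleanPodPolicy, because the condition means "nothing left
// to do for this job", not "the pods were deleted".
func markResourceCleaned(status *TFJobStatus) {
	status.Conditions = append(status.Conditions, JobCondition{
		Type:               TFJobResourceCleaned,
		Status:             "True",
		Reason:             "TFJobTerminated",
		LastTransitionTime: time.Now(),
	})
}

func main() {
	var status TFJobStatus
	markResourceCleaned(&status) // called once the job has reached a terminal state
}
```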

@ScorpioCPH
Member

A Completed status is needed for the user experience. resyncPeriod gives the informer a second chance to make sure that the local cache matches the remote apiserver, but we should depend on events, which are more reliable; we set resyncPeriod to 24 hours in our environment.
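
For example, something like the following when wiring up the informers; the client construction is illustrative, not our exact setup:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig; the path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kubeClient, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A long resync period: rely on watch events for normal updates and only
	// re-deliver the cached objects to the event handlers once a day.
	informerFactory := informers.NewSharedInformerFactory(kubeClient, 24*time.Hour)
	_ = informerFactory
}
```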

@ScorpioCPH
Member

> Is the new Completed status set only when resources are cleaned up?

"Resources cleaned up" means the pods and services are deleted; I think a Completed status means all pods exited with code 0 but have not been deleted yet.

@jlewi
Contributor

jlewi commented Jun 10, 2019

@johnugeorge @richardsliu Any update on this? I believe this is the only remaining issue blocking 1.0?

@richardsliu
Contributor

Last week we decided that this issue is not a blocker for 1.0, so we removed the label.

@johnugeorge
Member

@zionwu Is this issue still valid? See the "Scalability evaluation of operator" doc.

@johnugeorge
Member

Related: #829

@richardsliu
Contributor

Discussed with @johnugeorge; we will lower the priority of this issue based on the scalability evaluation.

@gaocegege
Member

@johnugeorge @zionwu Is there any update? I think we should have such a condition. If you do not have time, I can help implement it.

@zionwu
Contributor Author

zionwu commented Oct 29, 2019

@johnugeorge The scalability evaluation only tested two cases: big TFJobs and a large number of concurrent TFJobs. It did not test the case where there are a lot of completed jobs.

@zionwu
Contributor Author

zionwu commented Oct 29, 2019

@gaocegege I have implemented it in my own environment. I can submit a PR if you guys agree with the fix I described in previous comments.

@gaocegege
Member

@zionwu I think we need it, and the implementation LGTM. @johnugeorge WDYT?

@zionwu
Contributor Author

zionwu commented Oct 30, 2019

@gaocegege Great, I will submit a PR tomorrow.

@gaocegege
Member

@zionwu
Thanks. Maybe we could request @johnugeorge's review.

@johnugeorge
Member

Sorry for the late reply. My worry is that this change will affect other operators too. The current terminal conditions are Succeeded and Failed, and I am not sure whether assumptions are made about them outside the project. The best-case solution would be one that solves this problem without adding a new condition. Is it really necessary?

@gaocegege
Member

I can confirm the problem locally, though I can work around it by setting reconcilePeriod to 12h or 24h.

I think it would be better if we had a condition to avoid useless reconcile loops.

@zhujl1991
Member

AFAIU, the resync can help in cases where the informer misses notifications. Do we have any idea why, and how frequently, such misses happen? Judging by kubernetes/kubernetes#75622, it looks like the resync has been removed in the StatefulSet controller. Is it possible to remove it for tf-operator as well?
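
If we did try that, the change would essentially be passing a zero resync period to the shared informer factory, since a resync period of 0 disables the periodic resync. A sketch (the function name is mine, just for illustration):

```go
package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newInformerFactoryWithoutResync returns a factory whose informers never
// periodically resync: a defaultResync of 0 disables the resync timer, so the
// controller is driven purely by watch events.
func newInformerFactoryWithoutResync(kubeClient kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactory(kubeClient, 0)
}

func main() {}
```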

@ChanYiLin
Member

ChanYiLin commented Apr 10, 2020

Ref kubeflow/common#61
@zionwu @gaocegege any feedback or update on this?

Currently, if a job is finished (succeeded/failed), it will do the following tasks:

  1. delete related pod/resources
  2. cleanup tfjob based on TTL
  3. delete podgroup if enable gang-scheduling
  4. update the replica status if the job has Succeeded

then it will return.
I wonder which part still introduces the delay.

@stale

stale bot commented Jul 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

