
Performance issue when there is a lot of completed jobs #965

Closed

zionwu opened this issue Mar 25, 2019 · 36 comments

Comments

@zionwu
Contributor

zionwu commented Mar 25, 2019

I have been running tf-operator for a month, and recently I noticed that it takes several minutes for tf-operator to start creating pods after I submit a new job, whereas a month ago it only took a few seconds.

I think the cause is that the resyncPeriod is 30s, and there are a lot of completed jobs that I don't want to delete because I want to keep the job history. As a result the work queue is always full of completed jobs, and new jobs do not get processed in time.

I tried to increase the threadiness from 1 to 20, but it did not mitigate the issue much.
I would like to increase the resyncPeriod, but it is hard-coded.

To improve the performance:

  1. How about making resyncPeriod a command-line flag? (A sketch is included after this list.)
  2. How about adding a new status like "Completed"? The job is updated to "Completed" after the resources of the job are cleaned up. In syncTFJob, if the status is "Completed", do nothing and return.
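
For (1), this is roughly what I have in mind; the flag name and wiring below are just an illustration, not the actual tf-operator code:

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// resyncPeriod replaces the hard-coded 30s constant; the flag name is hypothetical.
var resyncPeriod = flag.Duration("resync-period", 30*time.Second,
	"How often the informers resync the local cache with the apiserver.")

func main() {
	flag.Parse()
	fmt.Println("using resync period:", *resyncPeriod)
	// The parsed value would then be passed to the shared informer factory
	// instead of the hard-coded constant, e.g.:
	//   informerFactory := informers.NewSharedInformerFactory(kubeClient, *resyncPeriod)
}
```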
zionwu changed the title from "Performance issue when there is a lot of completed job" to "Performance issue when there is a lot of completed jobs" on Mar 25, 2019
@gaocegege
Member

/cc @richardsliu @johnugeorge

@johnugeorge
Member

@zionwu How many completed jobs do you have in the system currently?

@johnugeorge
Member

Resync period is currently 30s

@zionwu
Contributor Author

zionwu commented Mar 25, 2019

> How many completed jobs do you have in the system currently?

Around 250.

> Resync period is currently 30s

Yes, I am using v0.3 and it is also 30s.

@richardsliu
Contributor

/cc @zabbasi

@zionwu
Contributor Author

zionwu commented Mar 26, 2019

I increased resyncPeriod from 30s to 300s/900s, but it did not improve the performance much.

I changed the code by adding a new condition called "ResourceCleaned". The new condition is added to the job after its resources are cleaned up in reconcileTFJobs, and at the beginning of syncTFJob, if the new condition is found, we just return.

This fix improves the performance: creating pods for a new job now takes only a few seconds.
I will submit a PR if you agree with this approach.
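
Roughly, the change looks like this. The types below are simplified stand-ins for the real TFJob API, and the condition/helper names are the ones I picked, so treat it as a sketch rather than the final code:

```go
package main

// Simplified stand-ins for the real TFJob API types; illustrative only.
type JobCondition struct {
	Type   string
	Status string
}

type TFJobStatus struct {
	Conditions []JobCondition
}

type TFJob struct {
	Status TFJobStatus
}

const TFJobResourceCleaned = "ResourceCleaned"

// hasCondition reports whether the job already carries the given condition set to True.
func hasCondition(status TFJobStatus, condType string) bool {
	for _, c := range status.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

// syncTFJob sketch: return early for jobs whose resources were already cleaned
// up, so completed jobs no longer go through the full reconcile on every resync.
func syncTFJob(tfjob *TFJob) error {
	if hasCondition(tfjob.Status, TFJobResourceCleaned) {
		return nil
	}
	// ... normal reconciliation (reconcileTFJobs) would continue here ...
	return nil
}

func main() {
	done := &TFJob{Status: TFJobStatus{
		Conditions: []JobCondition{{Type: TFJobResourceCleaned, Status: "True"}},
	}}
	_ = syncTFJob(done) // returns immediately for the already-cleaned job
}
```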

@johnugeorge
Member

It is slightly misleading to have a JobCondition named ResourceCleaned. How about using this condition instead (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/controller.go#L382)?

@zionwu
Contributor Author

zionwu commented Mar 26, 2019

When the condition tfjob.Status.ReplicaStatuses[rtype].Active == 0 is true, the job must have failed or succeeded and the resources are already cleaned up, right? If so, I think we can use this condition.
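
For reference, a minimal sketch of the check I mean, using simplified stand-in types rather than the real TFJob API:

```go
package main

import "fmt"

// Simplified stand-in for the real replica status type; illustrative only.
type ReplicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

type TFJobStatus struct {
	ReplicaStatuses map[string]*ReplicaStatus
}

// noActiveReplicas reports whether ReplicaStatuses[rtype].Active == 0 holds
// for every replica type of the job.
func noActiveReplicas(status TFJobStatus) bool {
	for _, rs := range status.ReplicaStatuses {
		if rs != nil && rs.Active > 0 {
			return false
		}
	}
	return true
}

func main() {
	status := TFJobStatus{ReplicaStatuses: map[string]*ReplicaStatus{
		"Worker": {Active: 0, Succeeded: 2},
		"PS":     {Active: 0, Succeeded: 1},
	}}
	fmt.Println(noActiveReplicas(status)) // true once no pods are active
}
```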

@zionwu
Contributor Author

zionwu commented Mar 28, 2019

> When the condition tfjob.Status.ReplicaStatuses[rtype].Active == 0 is true, the job must have failed or succeeded and the resources are already cleaned up, right?

@johnugeorge could you please help answer the above question?

@johnugeorge
Member

johnugeorge commented Mar 28, 2019

When I took another look at the code, Active is updated only when the job has succeeded. If the job has failed, there will be both Active and Failed pods, so this solution won't work in that case. See #897 (comment)

/cc @richardsliu

@zionwu
Contributor Author

zionwu commented Mar 28, 2019

@johnugeorge in this case we have to introduce a new condition, right?

johnugeorge mentioned this issue on Mar 28, 2019
@johnugeorge
Member

Is the new Completed status set only when resources are cleaned up? What would the behavior be if the CleanPodPolicy of the job is set to None or Running?

@zionwu
Contributor Author

zionwu commented Mar 28, 2019

The new status will be set no matter what CleanPodPolicy is.
It is used to mark that the required steps (if any) are done after the job terminated, so the next time we see a job with the new status we can do nothing and return. This avoids re-processing terminated jobs.

We can set the new status at the end of this block https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/controller.go#L436, where the job is terminated.
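
Concretely, something along these lines at the end of that block; the types and helper name are simplified stand-ins, and the real change would go through whatever condition-update helper the controller already uses:

```go
package main

import "time"

// Simplified stand-ins for the real TFJob API types; illustrative only.
type JobCondition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

type TFJobStatus struct {
	Conditions []JobCondition
}

const TFJobResourceCleaned = "ResourceCleaned"

// markResourceCleaned records that post-termination handling is done. It is
// set regardless of CleanPodPolicy, because the condition means "nothing left
// to do for this job", not "the pods were deleted".
func markResourceCleaned(status *TFJobStatus) {
	status.Conditions = append(status.Conditions, JobCondition{
		Type:               TFJobResourceCleaned,
		Status:             "True",
		Reason:             "TFJobTerminated",
		LastTransitionTime: time.Now(),
	})
}

func main() {
	var status TFJobStatus
	markResourceCleaned(&status) // called once the job has reached a terminal state
}
```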

@ScorpioCPH
Member

A Completed status is needed for the user experience. resyncPeriod gives the informer a second chance to make sure that the local cache matches the remote apiserver, but we should depend on events, which are more reliable; we set resyncPeriod to 24 hours in our environment.
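
For example, something like the following when wiring up the informers; the client construction is illustrative, not our exact setup:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig; the path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kubeClient, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A long resync period: rely on watch events for normal updates and only
	// re-deliver the cached objects to the event handlers once a day.
	informerFactory := informers.NewSharedInformerFactory(kubeClient, 24*time.Hour)
	_ = informerFactory
}
```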

@ScorpioCPH
Member

> Is the new Completed status set only when resources are cleaned up?

"Resources cleaned up" means the pods and services are deleted; I think a Completed status means all pods exited with code 0 but have not been deleted yet.

@jlewi
Contributor

jlewi commented Jun 10, 2019

@johnugeorge @richardsliu Any update on this? I believe this is the only remaining issue blocking 1.0?

@richardsliu
Contributor

Last week we decided that this issue is not a blocker for 1.0, so we removed the label.

@johnugeorge
Member

@zionwu Is this issue still valid? See the "Scalability evaluation of operator" doc.

@johnugeorge
Member

Related: #829

@richardsliu
Contributor

Discussed with @johnugeorge; we will lower the priority of this issue based on the scalability evaluation.

@gaocegege
Member

@johnugeorge @zionwu Is there any update? I think we should have such a condition. If you do not have time, I can help implement it.

@zionwu
Contributor Author

zionwu commented Oct 29, 2019

@johnugeorge The scalability evaluation only tested two cases: big TFJobs and a large number of concurrent TFJobs. It did not test the case where there are a lot of completed jobs.

@zionwu
Contributor Author

zionwu commented Oct 29, 2019

@gaocegege I have implemented it in my own environment. I can submit a PR if you guys agree with the fix I described in previous comments.

@gaocegege
Member

@zionwu I think we need it, and the implementation LGTM. @johnugeorge WDYT?

@zionwu
Contributor Author

zionwu commented Oct 30, 2019

@gaocegege Great, I will submit a PR tomorrow.

@gaocegege
Member

@zionwu
Thanks. Maybe we could request @johnugeorge's review.

@johnugeorge
Member

Sorry for the late reply. My worry is that this change will affect other operators too. The current terminal conditions are Succeeded and Failed, and I am not sure whether assumptions are made about them outside the project. The best-case solution would be one that solves this problem without adding a new condition. Is it really necessary?

@gaocegege
Member

I can confirm the problem locally, though I can work around it by setting reconcilePeriod to 12h or 24h.

I think it would be better if we had a condition to avoid useless reconcile loops.

@zhujl1991
Member

AFAIU, the resync can help in cases where the informer misses notifications. Do we have any idea why, and how frequently, such misses happen? Judging by kubernetes/kubernetes#75622, it looks like the resync has been removed in the StatefulSet controller. Is it possible to remove it for tf-operator as well?
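
If we did try that, the change would essentially be passing a zero resync period to the shared informer factory, since a resync period of 0 disables the periodic resync. A sketch (the function name is mine, just for illustration):

```go
package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newInformerFactoryWithoutResync returns a factory whose informers never
// periodically resync: a defaultResync of 0 disables the resync timer, so the
// controller is driven purely by watch events.
func newInformerFactoryWithoutResync(kubeClient kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactory(kubeClient, 0)
}

func main() {}
```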

@ChanYiLin
Member

ChanYiLin commented Apr 10, 2020

Ref kubeflow/common#61
@zionwu @gaocegege any feedback or update on this?

Currently, if a job is finished (succeeded/failed), it will do the following tasks:

  1. delete related pod/resources
  2. cleanup tfjob based on TTL
  3. delete podgroup if enable gang-scheduling
  4. update the replica status if the job has Succeeded

then it will return.
I wonder which part still introduces the delay.

@stale

stale bot commented Jul 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

