fix the reconcile flow #1111
Conversation
/assign @gaocegege @richardsliu
Travis tests have failed. Hey @ChanYiLin, 1st Build: hack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 69e8a340-1a6e-11ea-8ac4-55ffd8dc6fb9
Hey @ChanYiLin, TravisBuddy Request Identifier: 18b88040-1a72-11ea-8ac4-55ffd8dc6fb9
/retest
if tc.Config.EnableGangScheduling {
    minAvailableReplicas := getTotalReplicas(tfjob)
    _, err := tc.SyncPodGroup(tfjob, minAvailableReplicas)

err := updateTFJobConditions(tfjob, common.JobFailed, tfJobFailedReason, failureMessage)
if err != nil {
Suggested change (replacing "if err != nil {"):

if err := updateTFJobConditions(
    tfjob, common.JobFailed, tfJobFailedReason, failureMessage); err != nil {
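The suggestion above uses Go's statement-scoped error idiom: declaring err inside the if keeps it from shadowing or leaking into the surrounding function. A minimal sketch of the pattern, where updateConditions is a hypothetical stand-in for updateTFJobConditions (not the operator's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// updateConditions is an illustrative stand-in for updateTFJobConditions.
// It fails when the message is empty, just to exercise the error path.
func updateConditions(reason, message string) error {
	if message == "" {
		return errors.New("empty failure message")
	}
	return nil
}

func main() {
	// err is scoped to the if statement, as the reviewer suggested.
	if err := updateConditions("TFJobFailed", ""); err != nil {
		fmt.Println("got error:", err)
	}
	if err := updateConditions("TFJobFailed", "exceeded backoff limit"); err != nil {
		fmt.Println("unexpected:", err)
	} else {
		fmt.Println("conditions updated")
	}
}
```

Besides being more compact, this form makes it impossible to accidentally check a stale err from an earlier call.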
Thanks, Done!
Could you please update pytorch-operator, too?
Sure, no problem!
Hey @ChanYiLin, TravisBuddy Request Identifier: 52f49570-1af2-11ea-b9ce-6f43500ed087
…e and backofflimit
Hey @ChanYiLin, TravisBuddy Request Identifier: d26bab50-1af6-11ea-b9ce-6f43500ed087
/lgtm
/assign @johnugeorge @richardsliu
Since there is no more reconcile after completion, does it also solve #965?
I think so.
Yes. Completed jobs will only do the cleanup process (if cleanup is set) and then return nil (if there is no status update).
Does it improve the performance issue in #965? After a relook, it looks like the return conditions are exactly the same as before. The cleanup check happens on every reconcile call (both before and after the changes), and if the job status hasn't changed, control returns instead of reconciling further (both before and after the changes). I can merge this unless you have some thoughts on it.
@johnugeorge Yes, you can merge this PR first; I think this is enough for this PR. I can also create the same PR for pytorch-operator then. Thanks! For the performance issue, we can discuss it in #965. However, this PR does prevent checking backoffLimit and activeDeadline for completed jobs, which might help somewhat.
Great. Thanks @ChanYiLin
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: johnugeorge. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This is originally from kubeflow/training-operator#1111. Signed-off-by: Jiaxin Shan <[email protected]>
* Skip check of activeDeadline or backoffLimit if job terminated. This is originally from kubeflow/training-operator#1111. Signed-off-by: Jiaxin Shan <[email protected]>
* Add PodGroup reconcile logic. This is missing in kubeflow/common. We need this to make sure minAvailableReplicas is correct in the PodGroup for each training job. Signed-off-by: Jiaxin Shan <[email protected]>
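The PodGroup commit above computes minAvailableReplicas from the job's total replica count, which the earlier diff passes to SyncPodGroup for gang scheduling. A rough sketch of that computation, using a simplified ReplicaSpec type (the real operator uses the common API types, not these):

```go
package main

import "fmt"

// ReplicaSpec is a simplified stand-in for the operator's replica spec type.
type ReplicaSpec struct {
	Replicas int32
}

// getTotalReplicas sums replicas across all roles (e.g. PS, Worker),
// mirroring how minAvailable is derived for the PodGroup: with gang
// scheduling, all of a job's pods should be schedulable together.
func getTotalReplicas(specs map[string]ReplicaSpec) int32 {
	var total int32
	for _, s := range specs {
		total += s.Replicas
	}
	return total
}

func main() {
	specs := map[string]ReplicaSpec{
		"PS":     {Replicas: 2},
		"Worker": {Replicas: 4},
	}
	// The resulting value would be passed as minAvailable when syncing
	// the PodGroup, so the scheduler waits until 6 pods can start.
	fmt.Println(getTotalReplicas(specs)) // 6
}
```

If minAvailable were smaller than the total, the gang scheduler could start a partial job that then deadlocks waiting for missing replicas, which is why keeping this value correct per job matters.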
If the TFJob has already terminated, we don't need to check activeDeadline and backoffLimit.
Originally, even after the job had terminated, the controller still checked activeDeadline and appended the resulting event to the job. As a result, an event saying the job failed could appear after it had already succeeded, and the tf-operator log would keep showing the failure message about the past activeDeadline.
In this PR, I reorder the checks so that if the job has terminated (Succeeded or Failed), reconcile returns early instead of continuing.
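The reordering described above can be sketched as follows. This is a minimal illustration of the control flow only; the status names and the reconcile signature are simplified stand-ins for the operator's real types:

```go
package main

import "fmt"

// JobStatus is a simplified stand-in for the job's condition type.
type JobStatus string

const (
	JobRunning   JobStatus = "Running"
	JobSucceeded JobStatus = "Succeeded"
	JobFailed    JobStatus = "Failed"
)

func isTerminated(s JobStatus) bool {
	return s == JobSucceeded || s == JobFailed
}

// reconcile sketches the fixed ordering: terminated jobs only run
// cleanup (if configured) and return, so the activeDeadline and
// backoffLimit checks are never reached and no spurious failure
// events are appended after the job has already succeeded.
func reconcile(status JobStatus, cleanup bool) string {
	if isTerminated(status) {
		if cleanup {
			return "cleanup"
		}
		return "no-op"
	}
	// Only live jobs fall through to the deadline/backoff checks
	// (pastActiveDeadline / exceedsBackoffLimit would run here).
	return "checked activeDeadline and backoffLimit"
}

func main() {
	fmt.Println(reconcile(JobSucceeded, true))  // cleanup
	fmt.Println(reconcile(JobFailed, false))    // no-op
	fmt.Println(reconcile(JobRunning, false))   // checked activeDeadline and backoffLimit
}
```

Before the fix, the deadline check ran unconditionally, which is what produced "job failed" events and log messages for jobs that had already completed.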