[v1alpha2] Add distributed state management #625
Conversation
Travis tests have failed. Hey @yph152,
2nd Build: gometalinter --config=linter_config.json --vendor ./...
3rd Build: gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/controller.go
Outdated
if rtype == tfv1alpha2.TFReplicaTypeWorker {
	workerReplicas = int(*spec.Replicas)
}
err = tc.reconcilePods(tfjob, pods, rtype, spec, rstatus)
I would prefer to return the map instead of passing it as an argument, since we do not use it anywhere else.
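Roughly what I have in mind, as a simplified, self-contained sketch (the names and types are illustrative, not the real controller code):

```go
package main

import "fmt"

// reconcile builds the replica-status map itself and returns it,
// instead of mutating a map passed in by the caller.
func reconcile(podPhases []string) map[string]int {
	rstatus := make(map[string]int)
	for _, phase := range podPhases {
		rstatus[phase]++
	}
	return rstatus
}

func main() {
	// The caller receives the map only where it is actually needed.
	rstatus := reconcile([]string{"Running", "Running", "Pending"})
	fmt.Println(rstatus)
}
```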
Also, could you please explain the name rstatus? Do you mean replicaStatus?
OK, I will do it.
pkg/controller.v2/controller.go
Outdated
}
if rtype == tfv1alpha2.TFReplicaTypeWorker {
	workerReplicas = int(*spec.Replicas)
}
I think we could get the replica count inside the function updateStatusNew, because we do not use it here.
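For illustration, a minimal sketch of what I mean, with stand-in types instead of the real TFJob spec:

```go
package main

import "fmt"

// replicaSpecs is a simplified stand-in for the replica specs keyed by
// replica type; the real code would read the counts from the TFJob spec.
type replicaSpecs map[string]int

// updateStatus derives the worker and PS replica counts from the specs
// itself, so the caller does not need to pass them as extra arguments.
func updateStatus(specs replicaSpecs) error {
	workerReplicas := specs["Worker"]
	psReplicas := specs["PS"]
	fmt.Printf("workers=%d, ps=%d\n", workerReplicas, psReplicas)
	return nil
}

func main() {
	_ = updateStatus(replicaSpecs{"Worker": 2, "PS": 1})
}
```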
	}
}
if status[psRunningReason] == preplicas && status[workerPendingReason] == 1 && tfjob.Status.Conditions[len(tfjob.Status.Conditions)-1].Type == tfv1alpha2.TFJobFailed {
Where do we restart the TFJob? I do not see the logic for it.
Restart is controlled by the restart policy.
		c := status.Conditions[i]
		if c.Type == condType {
			return &c
		}
	}*/
	if len(status.Conditions) > 0 {
		return &status.Conditions[len(status.Conditions)-1]
Why do we return the last condition instead of returning the condition with the provided type? We are using this function in setCondition, and the change will break it, IMO.
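For reference, a minimal self-contained sketch of the lookup-by-type behavior I mean (simplified types, not the real API):

```go
package main

import "fmt"

// condition is a simplified stand-in for a TFJob condition.
type condition struct {
	Type   string
	Status string
}

// getCondition returns the condition with the given type, or nil if it is
// absent. This is the lookup-by-type behavior of the commented-out loop
// above; returning the last element instead makes the result depend purely
// on append order.
func getCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

func main() {
	conds := []condition{{"Created", "True"}, {"Running", "True"}}
	fmt.Println(getCondition(conds, "Running")) // &{Running True}
	fmt.Println(getCondition(conds, "Failed"))  // <nil>
}
```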
I think the state should go created -> running -> failed, and the final state should be determined by the order of the array. With the previous method, for example, the current state is running, but the array is still created -> running -> failed. Personally, I don't think that is very good.
/ok-to-test Please fix the lint issues.
Can you please clarify what this PR is doing? It's not clear from the PR description or #562 what is actually changing about conditions.
@jlewi Yeah, this mainly addresses state management for distributed TensorFlow training. The previous solution only used the statuses of all replicas to determine the current state of the training. However, when an important role fails, for example a PS, the training task has actually failed, but the user cannot see the actual state through the conditions provided by our TFJob. Of course, since many factors determine the success or failure of distributed training, there is currently no way to satisfy every need. Therefore, we assume that the user's distributed training code follows the officially recommended best practices. Based on this, we proposed #562 to handle distributed state management, and this PR is an implementation of it.
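To make that concrete, here is a minimal sketch of the kind of rule this implies (simplified types and names; not the actual controller code):

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for the Kubernetes pod phase.
type PodPhase string

const (
	Running   PodPhase = "Running"
	Failed    PodPhase = "Failed"
	Succeeded PodPhase = "Succeeded"
)

// jobState derives an overall job state from the PS phases and the phase of
// worker 0: a failed PS marks the whole job failed even if the workers look
// healthy, a succeeded worker 0 with all PS still running marks the job
// succeeded, and everything else is treated as still running.
func jobState(psPhases []PodPhase, worker0 PodPhase) string {
	allPSRunning := true
	for _, p := range psPhases {
		if p == Failed {
			return "Failed"
		}
		if p != Running {
			allPSRunning = false
		}
	}
	if worker0 == Succeeded && allPSRunning {
		return "Succeeded"
	}
	return "Running"
}

func main() {
	fmt.Println(jobState([]PodPhase{Running, Failed}, Running))    // Failed
	fmt.Println(jobState([]PodPhase{Running, Running}, Succeeded)) // Succeeded
}
```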
I think this PR is to implement the logic for distributed status management. It marks the TFJob succeeded if worker-0 has succeeded and all PS are running. Then we could run cleanup logic based on the status, for example, delete all pods and services after the TFJob has succeeded. @yph152 Do I misunderstand the purpose of the PR?
@gaocegege Yes, you have not misunderstood anything. This is just the first step: we first finish determining the state of the model training, and then we can use that determination to manage the resources.
Why do we need the PS to be running to mark the job as successful? If the chief exited with success, isn't that sufficient to mark the job as successful?
@@ -84,6 +107,103 @@ func (tc *TFJobController) updateStatus(tfjob *tfv1alpha2.TFJob, rtype tfv1alpha
	return nil
}

func (tc *TFJobController) updateStatusNew(tfjob *tfv1alpha2.TFJob, rstatus map[string]v1.PodPhase, wreplicas int, preplicas int) error {
Please add a comment explaining what this function is doing. In particular, if it's updating the status, please describe the logic for determining the status.
@jlewi I will add it as soon as possible.
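Something along these lines could work as a starting point (wording is only a suggestion, based on how the logic has been described in this thread):

```go
// updateStatusNew updates tfjob.Status based on the observed pod phases in
// rstatus. Roughly: the job is marked Failed when an important replica such
// as a PS has failed, Succeeded when worker 0 has completed while the PS
// replicas are still running, and Running otherwise. wreplicas and preplicas
// are the expected worker and PS replica counts.
```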
It looks to me like termination policies haven't been implemented yet for v1alpha2. How does that relate to this PR? Should we implement support for the termination policies first?
Review status: 0 of 3 files reviewed, 5 unresolved discussions (waiting on @gaocegege, @jlewi, @DjangoPeng, and @ddysher)
pkg/controller.v2/controller_status.go, line 110 at r2 (raw file): Previously, yph152 (Penghui Yan) wrote…
Why is this function called updateStatusNew? Why aren't you just changing updateStatus?
Comments from Reviewable
Travis tests have failed. Hey @yph152,
3rd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
Most recent GPU failure looks like #640
I tried a job after patching #637, which adds support for termination policies. It doesn't look like the TFJob condition gets set to succeeded. Would this PR address that?
Some conflicts need to be resolved first. @yph152
Travis tests have failed. Hey @yph152,
2nd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
Hey @yph152, TravisCI finished with status
@yph152 Could you update the PR description with the status of a successful job so we can see what the conditions would look like?
@jlewi FYI, we do not plan to merge it before 0.2.
Travis tests have failed. Hey @yph152,
2nd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
Travis tests have failed. Hey @yph152,
2nd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
Travis tests have failed. Hey @yph152,
1st Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
2nd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Build: goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
Please rebase onto master and fix the linting errors.
So there's good news and bad news. 👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are OK with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state, meaning the
CLAs look good, thanks!
Travis tests have failed. Hey @yph152,
2nd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
3rd Build: gometalinter --config=linter_config.json --vendor ./...
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/testutil/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go"
@yph152 The v1alpha2 E2E tests have been updated. Can you update them to test for the changes made in this PR? For example, in test_runner.py, can we add checks to make sure that the expected conditions are present when the job is finished? Also, could you please update the PR description with the status of a completed job that shows the updated conditions?
Could you please update the unit tests too? I think the tests about status should be updated.
@jlewi This PR does not block the e2e tests. We add a restarting status, but it does not affect the succeeded case in e2e. I will merge it and begin to implement the delete logic based on this PR. In that PR, maybe we should check in e2e whether the pods are deleted.
LGTM, thanks for your contribution! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gaocegege
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
* add distributed state management
* update controller_status_test
* add unit test
* add startTime
/cc @gaocegege @ddysher @DjangoPeng
I have a submission for v1alpha2 status management:
#562