allow using WORKER:0 as chief #221

Merged
merged 19 commits into from
Dec 20, 2017

lluunn
Contributor

@lluunn lluunn commented Dec 13, 2017

Second part for this issue.



@k8s-ci-robot

Hi @lluunn. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi
Contributor

jlewi commented Dec 13, 2017

/ok-to-test

@jlewi
Contributor

jlewi commented Dec 15, 2017

Reviewed 4 of 4 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@jlewi
Contributor

jlewi commented Dec 15, 2017

This is awesome.
Thanks

@lluunn could you add


Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/trainer/training.go, line 216 at r1 (raw file):

	chief := j.job.Spec.TerminationPolicy.Chief
	if v, ok := replicaSetStates[spec.TfReplicaType(chief.ReplicaName)]; ok && v == spec.ReplicaStateSucceeded {

Can we check the specific state of the replica that is the chief rather than the overall state?



@lluunn
Contributor Author

lluunn commented Dec 15, 2017

What do you want to add?


Review status: all files reviewed at latest revision, 1 unresolved discussion.


pkg/trainer/training.go, line 216 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Can we check the specific state of the replica that is the chief rather than the overall state?

It is checking the replica state of the chief, right? Maybe I misunderstood; could you elaborate?



@jlewi
Contributor

jlewi commented Dec 16, 2017

The policy specifies the index of a single replica; e.g., if we have 10 workers, only worker 0 is the chief.
So we should take only the status of worker 0 into account when deciding what to do.

replicaSetStates is an aggregation over all the workers, which is not quite what we want.
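
To illustrate the distinction, here is a minimal Go sketch of looking up the state of one specific replica instead of an aggregate per replica type. Everything in it is an assumption made for illustration: `ReplicaState`, `ChiefSpec`, `chiefState`, and the `perReplica` map are hypothetical names, not the operator's actual types or the code in `pkg/trainer/training.go`.

```go
package main

import "fmt"

// ReplicaState and its constants are hypothetical stand-ins for the
// operator's own state type; they are not copied from the repository.
type ReplicaState string

const (
	StateRunning   ReplicaState = "Running"
	StateSucceeded ReplicaState = "Succeeded"
)

// ChiefSpec models a termination policy that names one replica type and
// one index within it, for example WORKER:0. (Hypothetical type.)
type ChiefSpec struct {
	ReplicaName  string
	ReplicaIndex int
}

// chiefState returns the state of the single replica named by the policy.
// perReplica maps a replica type to the states of its individual replicas,
// ordered by index. The second result is false when the chief is not found.
func chiefState(perReplica map[string][]ReplicaState, chief ChiefSpec) (ReplicaState, bool) {
	states, ok := perReplica[chief.ReplicaName]
	if !ok || chief.ReplicaIndex < 0 || chief.ReplicaIndex >= len(states) {
		return "", false
	}
	return states[chief.ReplicaIndex], true
}

func main() {
	perReplica := map[string][]ReplicaState{
		"WORKER": {StateSucceeded, StateRunning, StateRunning},
		"PS":     {StateRunning},
	}
	chief := ChiefSpec{ReplicaName: "WORKER", ReplicaIndex: 0}

	// Only worker 0 is consulted; the other workers still running does not matter.
	if s, ok := chiefState(perReplica, chief); ok && s == StateSucceeded {
		fmt.Println("chief succeeded; the job can be treated as succeeded")
	}
}
```

With an aggregate per replica type, the two workers that are still running could mask the chief's result; indexing into the per-replica states avoids that.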


Review status: all files reviewed at latest revision, 1 unresolved discussion.



@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@lluunn
Contributor Author

lluunn commented Dec 19, 2017

Done.


Review status: 3 of 5 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/trainer/training.go, line 216 at r1 (raw file):

Previously, lluunn wrote…

It is checking the replica state of the chief, right? Maybe I misunderstood; could you elaborate?

Done.



@jlewi
Contributor

jlewi commented Dec 20, 2017

:lgtm:


Review status: 3 of 5 files reviewed at latest revision, all discussions resolved, some commit checks failed.



@jlewi
Contributor

jlewi commented Dec 20, 2017

Looks good, but there are merge conflicts now. Can you fix them, please?

Also, can you open an issue to add an E2E test for the case where the chief isn't the master?

@lluunn
Contributor Author

lluunn commented Dec 20, 2017

Thanks for the review. I think it's fixed now. PTAL


Review status: 0 of 5 files reviewed at latest revision, all discussions resolved, some commit checks failed.



@coveralls

coveralls commented Dec 20, 2017

Coverage increased (+0.3%) to 37.782% when pulling 91d8a23 on lluunn:kai into 6a2fc9c on tensorflow:master.

@lluunn
Contributor Author

lluunn commented Dec 20, 2017

Filed this issue for the E2E test.
I'm interested in working on that after this one.

@jlewi jlewi merged commit cb1e053 into kubeflow:master Dec 20, 2017
jlewi added a commit to jlewi/k8s that referenced this pull request Jan 14, 2018
…orker.

  * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
jlewi added a commit that referenced this pull request Jan 16, 2018
…roken (#308)

* In syncTfJob, when checking whether a work queue item corresponds to a TrainingJob already in the map, we need to check the UID. Otherwise we will not properly handle the case where a training job is deleted and a new job is then created with the same name.

* We need to make sure that the Replicas field in TrainingJob is always properly set.

* We were only initializing replicas in setup, which was problematic when the TfJob controller gets restarted: after a restart, setup won't be invoked because the job is past that phase, so the replicas won't be reinitialized.

* test_runner needs to ignore case when checking whether the job succeeded; otherwise we conclude that successful jobs failed.

* The controller should only forget about a job after the job has been cleaned up, not when it is marked as succeeded or failed.

* Add back code to support termination policies that use the worker and not the master as the chief.
    * This was added in #221 and accidentally removed in the refactor in #234.
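
The UID check described in the first bullet of the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the controller's real code: `trainingJob`, `isSameJob`, the `namespace/name` cache key, and the UID strings are all hypothetical.

```go
package main

import "fmt"

// trainingJob is a hypothetical stand-in for the controller's in-memory record.
type trainingJob struct {
	Name string
	UID  string
}

// isSameJob reports whether the cached entry under key refers to the same
// object as the one just read from the API server. It compares UIDs, because
// a deleted and recreated job keeps its name but gets a new UID.
func isSameJob(cache map[string]trainingJob, key string, uid string) bool {
	cached, ok := cache[key]
	return ok && cached.UID == uid
}

func main() {
	cache := map[string]trainingJob{
		"default/mnist": {Name: "mnist", UID: "uid-1"},
	}

	// A job named "mnist" was deleted and recreated, so the work queue item
	// now points at an object with a different UID.
	if !isSameJob(cache, "default/mnist", "uid-2") {
		cache["default/mnist"] = trainingJob{Name: "mnist", UID: "uid-2"}
		fmt.Println("replaced stale entry for default/mnist")
	}
}
```

Matching on the name alone would reuse the stale entry for the recreated job, which is the failure mode the commit message describes.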
5 participants