allow using WORKER:0 as chief #221
Hi @lluunn. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/ok-to-test
Reviewed 4 of 4 files at r1. Comments from Reviewable

This is awesome. @lluunn could you add
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.
pkg/trainer/training.go, line 216 at r1 (raw file):
Can we check the specific state of the replica that is the chief rather than the overall state?

What do you want to add?
Review status: all files reviewed at latest revision, 1 unresolved discussion.
pkg/trainer/training.go, line 216 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
It is checking the replica state of chief right? Maybe I misunderstood, could you elaborate?

The policy specifies the index of a single replica; e.g. if we have 10 workers only worker 0 is the chief. replicaSetStates is some aggregation of all the workers but that's not quite what we want.
Review status: all files reviewed at latest revision, 1 unresolved discussion.
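
To illustrate the distinction drawn in the comment above, here is a minimal Go sketch of looking up the state of the one replica the policy names, instead of an aggregate over all workers. It is not the code from this PR: `ReplicaState`, `ChiefSpec`, and `chiefState` are hypothetical names, and the per-index state map is an assumed data layout.

```go
package main

import "fmt"

// ReplicaState is a hypothetical stand-in for the operator's replica state type.
type ReplicaState string

const (
	StateRunning   ReplicaState = "Running"
	StateSucceeded ReplicaState = "Succeeded"
	StateFailed    ReplicaState = "Failed"
)

// ChiefSpec is a hypothetical policy naming one replica as chief, e.g. WORKER:0.
type ChiefSpec struct {
	ReplicaName  string
	ReplicaIndex int
}

// chiefState returns the state of the single replica selected by the policy.
// states maps a replica type to the states of its replicas, ordered by index
// (an assumed layout for this sketch).
func chiefState(chief ChiefSpec, states map[string][]ReplicaState) (ReplicaState, error) {
	replicas, ok := states[chief.ReplicaName]
	if !ok {
		return "", fmt.Errorf("no replicas of type %q", chief.ReplicaName)
	}
	if chief.ReplicaIndex < 0 || chief.ReplicaIndex >= len(replicas) {
		return "", fmt.Errorf("replica index %d out of range for %q", chief.ReplicaIndex, chief.ReplicaName)
	}
	return replicas[chief.ReplicaIndex], nil
}

func main() {
	// Worker 0 has finished while the other workers are still running.
	states := map[string][]ReplicaState{
		"WORKER": {StateSucceeded, StateRunning, StateRunning},
	}
	state, err := chiefState(ChiefSpec{ReplicaName: "WORKER", ReplicaIndex: 0}, states)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("chief state:", state)
}
```

With this shape, the job's outcome follows worker 0 alone, so workers that are still running do not hold the job back once the chief has finished.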
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.

Done.
Review status: 3 of 5 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.
pkg/trainer/training.go, line 216 at r1 (raw file):
Previously, lluunn wrote…
Done.

Review status: 3 of 5 files reviewed at latest revision, all discussions resolved, some commit checks failed.

Looks good but there are merge conflicts now. Can you fix please? Also can you open up an issue to add an E2E test for the case where the chief isn't the master?

Thanks for the review. I think it's fixed now. PTAL
Review status: 0 of 5 files reviewed at latest revision, all discussions resolved, some commit checks failed.

Filed this issue for e2e test.
…orker. * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
…roken (#308)
* In syncTfJob, when checking whether a work queue item corresponds to a TrainingJob already in the map, we need to check the UID. Otherwise we will not properly handle the case where a training job is deleted and then a new job is recreated with the same name.
* We need to make sure that the Replicas field in TrainingJob is always properly set.
  * We were only initializing replicas in setup, which was problematic when the TfJob controller gets restarted: on restart, setup won't be invoked because the job is past that phase, and as a result the replicas won't be reinitialized.
* test_runner needs to ignore case when checking whether the job succeeded; otherwise we conclude that successful jobs failed.
* The controller should only forget about a job after the job has been cleaned up, not when it is marked as succeeded or failed.
* Add back code to support termination policies that use the worker and not the master as the chief.
  * This was added in #221 and accidentally removed in the refactor in #234.
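
The UID check described in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the controller's actual code: `trackedJob`, `jobCache`, and `sync` are hypothetical names, and the cache is assumed to be a plain map keyed by namespace/name.

```go
package main

import "fmt"

// trackedJob is a hypothetical stand-in for the controller's in-memory job entry.
type trackedJob struct {
	Name string
	UID  string
}

// jobCache is a hypothetical map of tracked jobs keyed by namespace/name.
type jobCache struct {
	jobs map[string]*trackedJob
}

func newJobCache() *jobCache {
	return &jobCache{jobs: make(map[string]*trackedJob)}
}

// sync returns the entry for key. An existing entry is reused only when its UID
// matches the observed object; a matching name with a different UID means the
// job was deleted and recreated, so a fresh entry replaces the stale one.
func (c *jobCache) sync(key, name, uid string) (job *trackedJob, created bool) {
	if existing, ok := c.jobs[key]; ok && existing.UID == uid {
		return existing, false
	}
	fresh := &trackedJob{Name: name, UID: uid}
	c.jobs[key] = fresh
	return fresh, true
}

func main() {
	cache := newJobCache()
	_, created := cache.sync("default/mnist", "mnist", "uid-1")
	fmt.Println("first sync created entry:", created)
	_, created = cache.sync("default/mnist", "mnist", "uid-1")
	fmt.Println("same UID created entry:", created)
	// The job is deleted and recreated under the same name with a new UID.
	_, created = cache.sync("default/mnist", "mnist", "uid-2")
	fmt.Println("new UID created entry:", created)
}
```

Comparing only the key would treat the recreated job as the old one and carry over its stale state, which is the failure mode the bullet describes.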
second part for this issue