-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create Pod instead of Job #344
Conversation
Do we keep the trainer and just use pod instead of job in this PR? |
Can you explain how you are going to deal with pod terminations? For example suppose, a pod gets terminated because the machine its running on becomes unavailable. How will this situation be handled? |
The pod event callback handler should be notified by the apiserver and we could handle the failure of pod, I think. |
Hi, @ScorpioCPH Any progress here? |
This is dependent on this discussion #333. |
If we just want to replace Maybe we can add event driven logic in another PR. |
37d348d
to
4033787
Compare
When the PR is ready please remove WIP from the title. |
4033787
to
be20406
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, and one question:
Do we need to update https://github.com/tensorflow/k8s/blob/master/examples/gke/TF%20on%20GKE.ipynb?
/cc @jlewi
pkg/trainer/replicas.go
Outdated
} | ||
} else { | ||
s.recorder.Eventf(s.Job.job, v1.EventTypeNormal, SuccessfulCreateReason, "Created job: %v", createdJob.Name) | ||
s.recorder.Eventf(s.Job.job, v1.EventTypeNormal, SuccessfulCreateReason, "Created Pod: %v", createdPod.Name) | ||
} | ||
} | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/tensorflow/k8s/pull/344/files#diff-d2bc8c1807fa25d2b911d2d781f48a07R236
err = s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).DeleteCollection(&meta_v1.DeleteOptions{}, options)
I think we should update here, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks, done.
pkg/trainer/replicas.go
Outdated
@@ -359,7 +341,7 @@ func replicaStatusFromPodList(l v1.PodList, name tfv1alpha1.ContainerName) tfv1a | |||
} | |||
|
|||
func (s *TFReplicaSet) GetSingleReplicaStatus(index int32) tfv1alpha1.ReplicaState { | |||
j, err := s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).Get(s.jobName(index), meta_v1.GetOptions{}) | |||
j, err := s.ClientSet.BatchV1().Jobs(s.Job.job.ObjectMeta.Namespace).Get(s.genName(index), meta_v1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should replace BatchV1().Jobs() to CoreV1().Pods()
be20406
to
6e1c2b6
Compare
We can probably just delete If its broken. |
What are the failure/retry semantics? Before we relied on the Job controller to create new pods as necessary. What happens now? It looks like genNames always creates a POD with the same name for replica index i. But what happens if a pod is terminated (e.g. a VM restarts)? We would either need to recreate the pod with a new name or delete the current pod before recreating it. |
/test all |
1 similar comment
/test all |
@jlewi We are celebrating Chinese New Year those days, and apologize that maybe we can not reply promptly.
I think it should be handled by the pod restartPolicy. |
Restart policy only applies to the containers within the pod docs. So I don't think it helps in the case where the pod itself is terminated; e.g. because the VM becomes unhealthy. |
Sorry for the misunderstanding. I think we have to handle it in the operator. We need to register the callback for pod and if pod status is changed we should do some work similar to job controller in kubernetes maybe. |
I think this is another topic |
Right now because we create Job controllers, we are resilient to unwanted pod terminations. If a pod is terminated (e.g. because a VM goes away) then the job controller will create a new pod to replace it. So don't we need to provide the same level of reliability? Otherwise it would be a regression? Why can't we handle this in the reconcile function? Reconcile should be called periodically; so in the reconcile function why can't we check that all the requisite pods are running and if not create new ones? This seems like a good idea even if we have event handlers. If we rely on event handlers and an event gets dropped the job would be stuck. But as long as we have logic in Renoncile to get a job back to a healthy state then we can always recover. |
See #309 for more info about how the informer periodically sends Update events that we can use to trigger reconcile. |
I'm not following. The lines of code you [linked to(]https://github.com/kubeflow/tf-operator/blob/master/pkg/controller/controller.go#L123-L125) show that the informer will generate an Update event periodically which will cause Reconcile to be called periodically. So if the Reconcile creates any missing pods then the code should be resilient to failures. But your Reconcile function always uses the same name for every pod it creates; i.e. the function is TFReplicaSet.genName is a deterministic function of replica type, job type, index, runtime id. So master-0 would always correspond to some pod name "master-abcd-0". So suppose we create pod "master-abcd-0" and then this pod is terminated because the VM restarts. This worked before when we were creating Job Controllers because the job controller would automatically create a new pod if a pod terminated and wasn't successful. Now that we are creating the pods ourselves we need to do that. |
5e342a1
to
979b20c
Compare
@jlewi Addressed your comments (creating pods with a random UUID & reuse SyncPods/SyncServices), PTAL :) |
Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed. pkg/trainer/replicas.go, line 420 at r4 (raw file):
Why not just call GetSingleReplicaStatus and then decide based on the result whether or not to create a new pod? Comments from Reviewable |
Review status: 0 of 3 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. pkg/trainer/replicas.go, line 510 at r4 (raw file):
Nit please add a "-" between the name and the random string. Also 5 digits of randomness is what K8s does and that's probably sufficient. Comments from Reviewable |
A couple minor comments. Biggest issue is that travis is still failing. It would be nice to reuse GetSingleReplicaStatus if it makes sense but I'm fine with the current code if you think its better. So fixing travis is the only blocking change at this point. |
61309e8
to
3e60600
Compare
3e60600
to
2a19325
Compare
@ScorpioCPH: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/retest |
@jlewi Thanks, i will rethink GetSingleReplicaStatus in v1alpha2 :) And i think CI is ok now. |
Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions. pkg/trainer/replicas.go, line 420 at r4 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Per discussion elsehwere will reconsider this as part of v1alpha2. Comments from Reviewable |
Reviewed 1 of 2 files at r5. Comments from Reviewable |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ScorpioCPH Thank you. |
Reviewed 1 of 2 files at r3, 2 of 2 files at r5. Comments from Reviewable |
Could we merge it now? @jlewi |
@gaocegege Yeah. I was hoping auto merge would work but I think it's being blocked by the reviewable failed status check. I'll merge it manually. |
This PR is a part of kubeflow#325: rename jobName() to genName() create Pod instead of Job TODOs (in another PR): use controller.PodControlInterface and CreatePodsWithControllerRef to create Pod Listen Pod CRUD and update TFJob status which descried in kubeflow#314
* A bug was introduced with getting the replica status in kubeflow#344 which switched to creating pods directly. * Our presubmits/postsubmits were failing but this went unnoticed because the git status check was improperly reported as succeeded. * The bug is because we try to get the pod status by name but the name doesn't include the random salt in the pod name. * The code in question is a legacy of when we were using job controllers and we first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below so we just need to delete some code. * Fix kubeflow#500
* A bug was introduced with getting the replica status in kubeflow#344 which switched to creating pods directly. * Our presubmits/postsubmits were failing but this went unnoticed because the git status check was improperly reported as succeeded. * The bug is because we try to get the pod status by name but the name doesn't include the random salt in the pod name. * The code in question is a legacy of when we were using job controllers and we first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below so we just need to delete some code. * Don't create any resources if the DeletionTimestamp is set. Creating resources at this point would end blocking deletion of the object because the controller would create resources while we are trying to delete them. * Use logrus in controller.go, trainer.go, and replicas.go to log with fields providing information about the job and repliac. This makes it easy to filter logs for a particular job. * Use logrus to log the name of the job in a field.
* Fix bug with jobs not being marked as completed. * A bug was introduced with getting the replica status in #344 which switched to creating pods directly. * Our presubmits/postsubmits were failing but this went unnoticed because the git status check was improperly reported as succeeded. * The bug is because we try to get the pod status by name but the name doesn't include the random salt in the pod name. * The code in question is a legacy of when we were using job controllers and we first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below so we just need to delete some code. * Don't create any resources if the DeletionTimestamp is set. Creating resources at this point would end blocking deletion of the object because the controller would create resources while we are trying to delete them. * Use logrus in controller.go, trainer.go, and replicas.go to log with fields providing information about the job and repliac. This makes it easy to filter logs for a particular job. * Use logrus to log the name of the job in a field. * Checking the deletiontime stamp doesn't appear to be sufficient. Use the Phase to determine whether we should create resources. * Run gofmt. * * Reset the rate limiter after every successful sync. * Otherwise the ratelimiter will end up delaying processing subsequent events which isn't what we want. * Run goimports to fix lint issues. * * Reconcile needs to update the TFJob stored in TrainingJob. This ensures TrainingJob has an up to date representation of the job. * Otherwise changes made to the spec won't be available to TrainingJob. For example, if the job is deleted by the user, the deletion timestamp will be set. But if we don't update the TFJob stored in TrainingJob this change won't be propogated. * * TrainingJob.update should log the value of the job not the pointer. * Add more comments to the code.
* Fix bug with jobs not being marked as completed. * A bug was introduced with getting the replica status in kubeflow#344 which switched to creating pods directly. * Our presubmits/postsubmits were failing but this went unnoticed because the git status check was improperly reported as succeeded. * The bug is because we try to get the pod status by name but the name doesn't include the random salt in the pod name. * The code in question is a legacy of when we were using job controllers and we first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below so we just need to delete some code. * Don't create any resources if the DeletionTimestamp is set. Creating resources at this point would end blocking deletion of the object because the controller would create resources while we are trying to delete them. * Use logrus in controller.go, trainer.go, and replicas.go to log with fields providing information about the job and repliac. This makes it easy to filter logs for a particular job. * Use logrus to log the name of the job in a field. * Checking the deletiontime stamp doesn't appear to be sufficient. Use the Phase to determine whether we should create resources. * Run gofmt. * * Reset the rate limiter after every successful sync. * Otherwise the ratelimiter will end up delaying processing subsequent events which isn't what we want. * Run goimports to fix lint issues. * * Reconcile needs to update the TFJob stored in TrainingJob. This ensures TrainingJob has an up to date representation of the job. * Otherwise changes made to the spec won't be available to TrainingJob. For example, if the job is deleted by the user, the deletion timestamp will be set. But if we don't update the TFJob stored in TrainingJob this change won't be propogated. * * TrainingJob.update should log the value of the job not the pointer. * Add more comments to the code.
This PR is a part of #325:
jobName()
togenName()
TODOs (in another PR):
controller.PodControlInterface
andCreatePodsWithControllerRef
to create Pod@jlewi @gaocegege
This change is