PS still running after tfjob is complete #774
Comments
Looks like we aren't verifying that pods are deleted |
How about your clean policy? |
What version of tfjob are you using? The cleanPodPolicy might not work in earlier versions of tfjob. I used cleanPodPolicy in my tfjob spec and it cleaned up all pods as expected after the job completed. Here is my complete tfjob spec:
|
@ankushagarwal what's the default CleanPodPolicy? I think the default should be to delete running pods as that is the most sensible thing. I thought that's what we were using as the default. |
The default is to leave pods running. |
I was using the default clean policy. |
My yaml looks like this:
|
@gaoning777 What are the events for the TFJob? I'd like to know if we tried to delete the pod. There are instructions here for dumping the events. You can ping us internally in the Kubeflow chat room with relevant stackdriver information. |
#750 is for E2E test for CleanPodPolicy. |
@ankushagarwal The default CleanPodPolicy is Running. That means we will delete pods that are still Running, which is the right default. |
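For readers following along, here is a minimal Python sketch of the cleanup semantics described above. It is not the operator's actual Go code. The function name `pods_to_delete` is hypothetical, and the `All` and `None` policy values are assumptions added for contrast; only `Running` is named in this thread. Pod phases use the Kubernetes phase names (`Running`, `Succeeded`), which `kubectl get pods` displays as `Running` and `Completed`.

```python
from typing import Dict, List


def pods_to_delete(pod_phases: Dict[str, str],
                   clean_pod_policy: str = "Running") -> List[str]:
    """Return the pods a finished job would clean up under a given policy.

    pod_phases maps pod name -> pod phase. The default of "Running"
    follows the comment above; "All" and "None" are assumed values.
    """
    if clean_pod_policy == "None":
        # Leave every pod in place.
        return []
    if clean_pod_policy == "All":
        # Delete every pod regardless of phase.
        return sorted(pod_phases)
    if clean_pod_policy == "Running":
        # Delete only pods that are still running (e.g. a PS that never exits).
        return sorted(name for name, phase in pod_phases.items()
                      if phase == "Running")
    raise ValueError(f"unknown clean pod policy: {clean_pod_policy!r}")


# Pod names taken from the report in this issue.
pods = {
    "trainer-180804-004108-master-0": "Succeeded",
    "trainer-180804-004108-ps-0": "Running",
    "trainer-180804-004108-worker-0": "Succeeded",
}
print(pods_to_delete(pods))
```

Under the `Running` policy only the PS pod would be selected, which is the pod reported as left behind in this issue.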
@jlewi @ankushagarwal @gaocegege |
@ChanYiLin I think #767 is ok as we used embedded field in |
Yes, it's my fault. I have just tested it and found the problem is not there... |
Can anyone reproduce the problem with the pods not being deleted? |
I've tested it using GCP with KUBEFLOW_VERSION=0.2.2 and tf-operator v1alpha2. The yaml file I used is as follows:
|
Where can I find the new version? The most recent v0.2.2 release (https://github.com/kubeflow/kubeflow/releases) was published on July 12th. |
@gaoning777 What is the docker image for tf-operator that you are using? |
@jlewi I also put some comments here: tensorflow/tensorflow#20833
|
This can be consistently reproduced if the tf operator is using image tf_operator:v0.2.0. The latest tf operator image (I was using tf_operator:v20180809-d2509aa) does not have this issue. |
This should be fixed in 0.2.3. If you have a dependency on Kubeflow 0.2.2, you can fix this by doing something like:
|
Thanks Richard for looking into this. |
@gaoning777 You can customize your ksonnet app if you want to override the image. We don't want to plumb through more options to deploy.sh. Instead the pattern is to create the ksonnet application, let the user customize it, and then deploy. |
@richardsliu Can we
|
The TFJob operator image on master is already pointing to the latest image. I believe @gaoning777 is depending on the 0.2.2 release. |
This release has the fix: https://github.com/kubeflow/kubeflow/releases/tag/v0.2.4-rc.0 |
Closing this since the issue is fixed. Will send out a separate PR for the e2etest. |
Copying kubeflow/kubeflow#1334
Hi,
I have an issue that the PS pod keeps running after the tfjob is complete, even after several days.
kubectl get pods returns
'''
trainer-180804-004108-master-0   0/1   Completed   0   4d
trainer-180804-004108-ps-0       1/1   Running     0   4d
trainer-180804-004108-worker-0   0/1   Completed   0   4d
'''
And, kubectl get tfjob returns
'''
tfReplicaStatuses:
  MASTER:
    succeeded: 1
  PS:
    active: 1
  Worker:
    succeeded: 1
'''
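A quick way to check for this symptom is to scan the `kubectl get pods` output for pods still in the Running state. The sketch below is only an illustration: the helper name `still_running` is hypothetical, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE) shown above.

```python
from typing import List


def still_running(kubectl_output: str) -> List[str]:
    """Return names of pods whose STATUS column reads 'Running'.

    Assumes the default `kubectl get pods` column layout:
    NAME READY STATUS RESTARTS AGE.
    """
    running = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # STATUS is the third column; a header row ("STATUS") never matches.
        if len(fields) >= 3 and fields[2] == "Running":
            running.append(fields[0])
    return running


# Output copied from the report above.
sample = """
trainer-180804-004108-master-0 0/1 Completed 0 4d
trainer-180804-004108-ps-0 1/1 Running 0 4d
trainer-180804-004108-worker-0 0/1 Completed 0 4d
"""
print(still_running(sample))
```

If the job has finished and this list is non-empty, the leftover pods are the ones the operator did not clean up.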