PS still running after tfjob is complete #774

Closed
jlewi opened this issue Aug 9, 2018 · 26 comments

Comments

@jlewi
Contributor

jlewi commented Aug 9, 2018

Copying kubeflow/kubeflow#1334

Hi,
The PS pod keeps running after the TFJob completes, even several days later.
kubectl get pods returns:
'''
trainer-180804-004108-master-0 0/1 Completed 0 4d
trainer-180804-004108-ps-0 1/1 Running 0 4d
trainer-180804-004108-worker-0 0/1 Completed 0 4d
'''

And kubectl get tfjob returns:
'''
tfReplicaStatuses:
  MASTER:
    succeeded: 1
  PS:
    active: 1
  Worker:
    succeeded: 1
'''
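
For anyone triaging the same symptom, here is a minimal sketch (not part of tf-operator) that picks out pods still in the Running state from the text output of kubectl get pods. The function name running_pods is hypothetical, and the sketch assumes the default column order NAME, READY, STATUS, RESTARTS, AGE.

```python
from typing import List


def running_pods(kubectl_output: str) -> List[str]:
    """Return the names of pods whose STATUS column reads 'Running'."""
    names = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Assumed columns: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 3 and fields[2] == "Running":
            names.append(fields[0])
    return names


# Pod listing copied from the report above.
sample = """\
trainer-180804-004108-master-0 0/1 Completed 0 4d
trainer-180804-004108-ps-0 1/1 Running 0 4d
trainer-180804-004108-worker-0 0/1 Completed 0 4d
"""

print(running_pods(sample))  # ['trainer-180804-004108-ps-0']
```

If the master and worker pods are Completed but this list is non-empty, the job's PS pods were probably not cleaned up.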

@jlewi
Contributor Author

jlewi commented Aug 9, 2018

@gaocegege
Member

@gaoning777

What is your clean pod policy (cleanPodPolicy) set to?

@ankushagarwal

@gaoning777

What version of TFJob are you using? cleanPodPolicy might not work in earlier versions of TFJob.

I used cleanPodPolicy in my TFJob spec, and it cleaned up all pods as expected after the job completed. Here is my complete TFJob spec:

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "linear09"
spec:
  cleanPodPolicy: All
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/agwlkubeflow/linear-regression-estimator:latest
              command: ["python", "/train.py", "--model_dir=/mnt/kubeflow-gcfs/linear09"]
              volumeMounts:
              - mountPath: /mnt/kubeflow-gcfs
                name: kubeflow-gcfs
          volumes:
          - name: kubeflow-gcfs
            persistentVolumeClaim:
              claimName: kubeflow-gcfs
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/agwlkubeflow/linear-regression-estimator:latest
              command: ["python", "/train.py", "--model_dir=/mnt/kubeflow-gcfs/linear09"]
              volumeMounts:
              - mountPath: /mnt/kubeflow-gcfs
                name: kubeflow-gcfs
          volumes:
          - name: kubeflow-gcfs
            persistentVolumeClaim:
              claimName: kubeflow-gcfs
    Chief:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/agwlkubeflow/linear-regression-estimator:latest
              command: ["python", "/train.py", "--model_dir=/mnt/kubeflow-gcfs/linear09"]
              volumeMounts:
              - mountPath: /mnt/kubeflow-gcfs
                name: kubeflow-gcfs
          volumes:
          - name: kubeflow-gcfs
            persistentVolumeClaim:
              claimName: kubeflow-gcfs

@jlewi
Contributor Author

jlewi commented Aug 9, 2018

@ankushagarwal what's the default CleanPodPolicy? I think the default should be to delete running pods as that is the most sensible thing.

I thought that's what we were using as the default.
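
To make the policy discussion concrete, here is a rough sketch of how a clean-pod policy could select pods for deletion once a job finishes. This is not the operator's actual code: only the value All appears in this thread, the values Running and None are assumptions, and pods_to_delete is a hypothetical helper. The phase names follow the standard Kubernetes pod phases.

```python
from typing import Dict, List


def pods_to_delete(policy: str, pod_phases: Dict[str, str]) -> List[str]:
    """Pick the pods to delete after a job finishes (illustrative only)."""
    if policy == "None":
        # Assumed value: leave every pod in place.
        return []
    if policy == "Running":
        # Assumed value: delete only the pods that are still running.
        return sorted(name for name, phase in pod_phases.items() if phase == "Running")
    if policy == "All":
        # Value used in the specs in this thread: delete every pod.
        return sorted(pod_phases)
    raise ValueError(f"unknown clean pod policy: {policy!r}")


# Phases matching the report: master and worker finished, PS still running.
phases = {
    "trainer-180804-004108-master-0": "Succeeded",
    "trainer-180804-004108-ps-0": "Running",
    "trainer-180804-004108-worker-0": "Succeeded",
}

print(pods_to_delete("Running", phases))
print(pods_to_delete("All", phases))
```

Under either the Running or the All policy in this sketch, the PS pod is selected for deletion, which is the behavior the reporter expected.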

@gaoning777

gaoning777 commented Aug 9, 2018

I was using the default clean policy.
I tried the 'cleanPodPolicy: All' just now, but to no avail. The PS is still running after 1 hour.
I'm using kubeflow v1alpha2.

@gaoning777

gaoning777 commented Aug 9, 2018

My yaml looks like this:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  generateName: trainer-
spec:
  cleanPodPolicy: All
  tfReplicaSpecs:
    MASTER:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - -m
            - trainer.task
            image: ******
            name: tensorflow
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - -m
            - trainer.task
            image: *****
            name: tensorflow
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - -m
            - trainer.task
            image: *****
            name: tensorflow

@jlewi
Contributor Author

jlewi commented Aug 10, 2018

@gaoning777 What are the events for the TFJob? I'd like to know if we tried to delete the pod.

There are instructions here for dumping the events
https://www.kubeflow.org/docs/guides/monitoring/#default-stackdriver

You can ping us internally in the Kubeflow chat room with relevant stackdriver information.

@jlewi
Contributor Author

jlewi commented Aug 10, 2018

#750 is for E2E test for CleanPodPolicy.

@jlewi
Contributor Author

jlewi commented Aug 10, 2018

@ChanYiLin
Member

ChanYiLin commented Aug 12, 2018

@jlewi @ankushagarwal @gaocegege
I think I found the root cause.
When we last refactored the code (#767), we put functions such as GetPodsForJob, GetServicesForJob, DeletePod, and DeleteService under JobController rather than TFJobController.
We cannot call these functions as tc.XXX(), which is why this issue happened.
See #776.
Thanks!

@ScorpioCPH
Member

@ChanYiLin I think #767 is OK because we use an embedded field in the TFJobController struct.

@ChanYiLin
Member

Yes, it's my fault. I just tested it and found that the problem is not there.
Sorry, guys.

@jlewi
Contributor Author

jlewi commented Aug 13, 2018

Can anyone reproduce the problem with the pods not being deleted?

@ChanYiLin
Member

I've tested it using GCP with KUBEFLOW_VERSION=0.2.2 and tf-operator v1alpha2.
Kubeflow killed all the pods after the job succeeded as expected.

The YAML file I used is as follows:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: test
spec:
  cleanPodPolicy: All
  tfReplicaSpecs:
    MASTER:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=8
            - --model=resnet50
            - --data_format=NHWC
            - --device=cpu
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --num_batches=10
            - --sync_on_finish=false
            - --cross_replica_sync=false
            - --num_warmup_batches=0
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=8
            - --model=resnet50
            - --data_format=NHWC
            - --device=cpu
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --num_batches=10
            - --sync_on_finish=false
            - --cross_replica_sync=false
            - --num_warmup_batches=0
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=8
            - --model=resnet50
            - --data_format=NHWC
            - --device=cpu
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --num_batches=10
            - --sync_on_finish=false
            - --cross_replica_sync=false
            - --num_warmup_batches=0
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks

@gaoning777

Where can I find the new version? The most recent release, v0.2.2 (https://github.com/kubeflow/kubeflow/releases), was published on July 12th.

@jlewi
Contributor Author

jlewi commented Aug 14, 2018

@gaoning777 What is the docker image for tf-operator that you are using?

@ashahba
Member

ashahba commented Aug 15, 2018

@jlewi I also put some comments in tensorflow/tensorflow#20833, but they may well apply to tf-operator.
Strangely enough, with TensorFlow 1.8.0 the TFJob is marked as Succeeded, but with TensorFlow 1.9.0 it remains Running indefinitely. However, even with 1.8.0 I see this in the kubetail logs for all Succeeded jobs 🤔

[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-1-wgdla] INFO:root:Session from worker 1 closed cleanly 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:tensorflow:Coordinator stopped with threads still running: QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] Exception in thread QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany: 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] Traceback (most recent call last): 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] self.run() 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/lib/python3.5/threading.py", line 862, in run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] self._target(*self._args, **self._kwargs) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 268, in _run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] coord.request_stop(e) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 213, in request_stop 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] six.reraise(*sys.exc_info()) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] raise value 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] enqueue_callable() 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1244, in _single_operation_run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] self._call_tf_sessionrun(None, {}, [], target_list, None) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] run_metadata) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`. 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy]  
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:root:Finished on task 0 in 262.2973277568817 seconds 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:root:Session from worker 0 closed cleanly 
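
Logs like the above are easier to compare across pods when grouped. Below is a small sketch that summarizes kubetail-style output per pod: whether the "closed cleanly" message appeared and which exception classes were logged. The function name summarize and the "[pod-name] message" prefix pattern are assumptions based only on the excerpt above.

```python
import re
from typing import Dict

# Assumed kubetail format: each line starts with "[<pod-name>] ".
PREFIX = re.compile(r"^\[([^\]]+)\]\s?(.*)$")
# A traceback's final line looks like "package.module.SomeError: text".
ERROR_LINE = re.compile(r"^[\w.]+Error: ")


def summarize(log_text: str) -> Dict[str, dict]:
    """Group log lines by pod and record clean-close and exception lines."""
    summary: Dict[str, dict] = {}
    for raw in log_text.splitlines():
        match = PREFIX.match(raw)
        if not match:
            continue
        pod, message = match.group(1), match.group(2).strip()
        entry = summary.setdefault(pod, {"closed_cleanly": False, "errors": []})
        if "closed cleanly" in message:
            entry["closed_cleanly"] = True
        if ERROR_LINE.match(message):
            # Keep only the exception class name, not the message text.
            entry["errors"].append(message.split(":", 1)[0])
    return summary


# Four lines copied from the log excerpt above.
sample = r"""
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-1-wgdla] INFO:root:Session from worker 1 closed cleanly
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] Exception in thread QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany:
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`.
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:root:Session from worker 0 closed cleanly
"""

for pod, info in summarize(sample).items():
    print(pod, info)
```

On this excerpt, both workers report a clean close, and only worker 0 also logs a CancelledError from the queue runner thread.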

@richardsliu
Contributor

This can be consistently reproduced if the tf operator is using image tf_operator:v0.2.0. The latest tf operator image (I was using tf_operator:v20180809-d2509aa) does not have this issue.

@richardsliu
Contributor

@gaoning777

This should be fixed in 0.2.3. If you have a dependency on Kubeflow 0.2.2, you can fix this by doing something like:

  1. export KUBEFLOW_DEPLOY=false
  2. Run deploy.sh
  3. ks param set tf-job-operator tfJobImage gcr.io/kubeflow-images-public/tf_operator:v20180809-d2509aa
  4. export KUBEFLOW_DEPLOY=true
  5. Run deploy.sh again
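
The steps above can be written down as a dry-run plan before running anything. The sketch below only builds and prints the command sequence; it does not execute it. The helper names and the ./deploy.sh path are hypothetical, and the thread does not say which directory each command should run in.

```python
from typing import Dict, List, Tuple

# One step = (extra environment variables, command argv).
Step = Tuple[Dict[str, str], List[str]]


def override_plan(image: str, deploy_script: str = "./deploy.sh") -> List[Step]:
    """Build the ordered steps for overriding the tf-operator image."""
    return [
        # Steps 1-2: run deploy.sh with deployment disabled.
        ({"KUBEFLOW_DEPLOY": "false"}, [deploy_script]),
        # Step 3: point the tf-job-operator component at the fixed image.
        ({}, ["ks", "param", "set", "tf-job-operator", "tfJobImage", image]),
        # Steps 4-5: run deploy.sh again with deployment enabled.
        ({"KUBEFLOW_DEPLOY": "true"}, [deploy_script]),
    ]


def render(plan: List[Step]) -> str:
    """Render the plan as shell-like lines for review."""
    lines = []
    for env, argv in plan:
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        lines.append((prefix + " " if prefix else "") + " ".join(argv))
    return "\n".join(lines)


plan = override_plan("gcr.io/kubeflow-images-public/tf_operator:v20180809-d2509aa")
print(render(plan))
```

Printing the plan first makes it easy to confirm the image tag before touching the cluster.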

@gaoning777

Thanks Richard for looking into this.
Ideally, the deploy script would let the user specify the tf-operator version, so that the two projects are clearly decoupled. Users would then not depend on a new Kubeflow release to upgrade their tf-operator version, on the assumption that we need one-click deployment.

@jlewi
Contributor Author

jlewi commented Aug 20, 2018

@gaoning777 You can customize your ksonnet app if you want to override the image. We don't want to plumb through more options to deploy.sh. Instead the pattern is to create the ksonnet application, let the user customize it, and then deploy.

@jlewi
Contributor Author

jlewi commented Aug 20, 2018

@richardsliu Can we

  1. Update the TFJob operator image on master
  2. Add an E2E test to verify the processes are being deleted correctly.

@richardsliu
Contributor

The TFJob operator image on master is already pointing to the latest image. I believe @gaoning777 is depending on the 0.2.2 release.

@richardsliu
Contributor

Closing this since the issue is fixed. I will send out a separate PR for the E2E test.
