Deleted jobs re-starting #156

Closed
cwbeitel opened this issue Nov 17, 2017 · 2 comments

@cwbeitel

I'm seeing deleted jobs restart themselves after cleanup, e.g.:

>> sh scripts/cleanup_clusters.sh
service "master-obxn-0" deleted
service "master-slhg-0" deleted
service "master-xfrm-0" deleted
service "ps-obxn-0" deleted
service "ps-slhg-0" deleted
service "ps-xfrm-0" deleted
service "tensorboard-obxn" deleted
service "tensorboard-slhg" deleted
service "tensorboard-xfrm" deleted
job "master-obxn-0" deleted
job "master-slhg-0" deleted
job "master-xfrm-0" deleted
job "ps-xfrm-0" deleted
pod "master-obxn-0-gd7g1" deleted
pod "master-obxn-0-x255h" deleted
pod "master-slhg-0-1zkz2" deleted
pod "master-slhg-0-n4vv9" deleted
pod "master-xfrm-0-tdnsx" deleted
pod "ps-xfrm-0-1wbps" deleted
pod "tensorboard-slhg-1876549404-zk8wz" deleted
pod "tensorboard-xfrm-270610881-fdkpj" deleted
>> kubectl get jobs
NAME            DESIRED   SUCCESSFUL   AGE
master-obxn-0   1         0            13s
master-slhg-0   1         0            6s
master-xfrm-0   1         0            1s

I'll note my cluster version is 1.7.8-gke.0 in case that's relevant.

For example, here is one of the job specs that causes the problem:

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "tensorflow-20171117090935"
  namespace: default
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/dev01-181118-181500/agents-cpu:2254d3f
              name: tensorflow
              args: ['--log_dir', 'gs://dev01-181118-181500-k8s/jobs/tensorflow-20171117090935', '--config', 'pybullet_ant']
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
  tensorBoard:
    logDir: gs://dev01-181118-181500-k8s/jobs/tensorflow-20171117090935
@jlewi

jlewi commented Nov 17, 2017

Did you delete the TfJob itself? As long as the TfJob exists, it will recreate all the resources (jobs, pods, services) needed to run the job, so deleting only those child resources is not enough.

@cwbeitel

Wow, my bad, it's right there in the README. For future reference:

>> kubectl get tfjobs
NAME                        AGE
tensorflow-20171116092830   2d
tensorflow-20171116093103   2d
tensorflow-20171116094619   2d
tensorflow-20171116100805   2d
tensorflow-20171116102903   2d
...
>> kubectl delete tfjob tensorflow-20171116092830
tfjob "tensorflow-20171116092830" deleted
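
With many TfJobs accumulated, deleting them one at a time gets tedious. A minimal sketch of a cleanup helper follows, assuming `kubectl get tfjobs -o name` lists the TfJobs in the current namespace in a form that `kubectl delete` accepts. The function name `delete_all_tfjobs` and the `KUBECTL` override variable are hypothetical, not part of the project's scripts.

```shell
#!/bin/sh
# Hypothetical helper (not from the project): delete every TfJob in the
# current namespace so nothing recreates their jobs, pods and services.

# KUBECTL is an assumed override hook, e.g. for a dry run with a stub command.
KUBECTL="${KUBECTL:-kubectl}"

delete_all_tfjobs() {
    # "-o name" prints one resource reference per line.
    "$KUBECTL" get tfjobs -o name | while read -r tfjob; do
        # Skip blank lines defensively.
        [ -n "$tfjob" ] || continue
        "$KUBECTL" delete "$tfjob"
    done
}
```

Calling `delete_all_tfjobs` before `sh scripts/cleanup_clusters.sh` should remove the parent TfJobs first, so that the child resources the cleanup script deletes are not recreated.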
