Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Events don't show up in kubectl describe tfjobs #763

Closed
jlewi opened this issue Jul 29, 2018 · 2 comments
Closed

Events don't show up in kubectl describe tfjobs #763

jlewi opened this issue Jul 29, 2018 · 2 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jul 29, 2018

Below is the output of running kubectl describe for a TFJob that is running.

The job is running successfully (pods exist) but events don't show up in output of kubectl describe

The events don't show up with kubectl get events; but they do show up in stackdriver. I wonder if the problem is that the events are tool old.

kubectl describe tfjobs tfjob
Name:         tfjob
Namespace:    kubeflow
Labels:       app.kubernetes.io/deploy-manager=ksonnet
Annotations:  ksonnet.io/managed={"pristine":"H4sIAAAAAAAA/+yRz27UMBDG7zzGnJ3NbkoFjZQTqEIcYEUrekBVNHEmWbOObY3HqcJq3x05UC1/ngCJHKKZbz6P5e93AgzmM3E03kENx9TRYP3TxvNYzju04YAVKDga10MN97fvfQcKJhLsURDqEzicCGqQ4es6ym0MqOmX...
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-07-29T00:31:12Z
  Generation:          1
  Resource Version:    25561
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob
  UID:                 b20c924b-92c6-11e8-b3ca-42010a80019c
Spec:
  Tf Replica Specs:
    PS:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
    Worker:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
Status:
  Conditions:
    Last Transition Time:  2018-07-29T00:31:48Z
    Last Update Time:      2018-07-29T00:31:48Z
    Message:               TFJob tfjob is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Start Time:              2018-07-29T02:38:43Z
  Tf Replica Statuses:
    PS:
      Active:  1
    Worker:
      Active:  1
Events:        <none>
@jlewi
Copy link
Contributor Author

jlewi commented Jul 29, 2018

Created a new job and the events show up; so looks like its a question of events being too old and not showing up.

kubectl describe tfjobs tfjob2
Name:         tfjob2
Namespace:    kubeflow
Labels:       app.kubernetes.io/deploy-manager=ksonnet
Annotations:  ksonnet.io/managed={"pristine":"H4sIAAAAAAAA/+yRz27UMBDG7zzGnJ3NbkoFjZQTqEIcYEUrekBVNHEmWbOObY3HqcJq3x05UC1/ngCJHKKZbz6P5e93AgzmM3E03kENx9TRYP3TxvNYzju04YAVKDga10MN97fvfQcKJhLsURDqEzicCGqQ4avvsjX3MaCm...
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-07-29T02:46:53Z
  Generation:          1
  Resource Version:    26872
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob2
  UID:                 a6bc7b6f-92d9-11e8-b3ca-42010a80019c
Spec:
  Tf Replica Specs:
    PS:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
    Worker:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
Status:
  Conditions:
    Last Transition Time:  2018-07-29T02:46:55Z
    Last Update Time:      2018-07-29T02:46:55Z
    Message:               TFJob tfjob2 is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Start Time:              2018-07-29T02:46:55Z
  Tf Replica Statuses:
    PS:
      Active:  1
    Worker:
      Active:  1
Events:
  Type     Reason                          Age                From         Message
  ----     ------                          ----               ----         -------
  Warning  SettedPodTemplateRestartPolicy  19s (x2 over 19s)  tf-operator  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-worker-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-worker-0
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-ps-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-ps-0

@jlewi jlewi closed this as completed Jul 29, 2018
@jlewi
Copy link
Contributor Author

jlewi commented Jul 29, 2018

See kubernetes/kubernetes#52521

It looks like events are garbage collected after 1 hour to avoid straining ETCD.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant