Events don't show up in kubectl describe tfjobs #763

jlewi · 2018-07-29T02:46:07Z

Below is the output of running kubectl describe for a TFJob that is running.

The job is running successfully (the pods exist), but no events show up in the output of kubectl describe.

The events don't show up with kubectl get events either, but they do show up in Stackdriver. I wonder if the problem is that the events are too old.

kubectl describe tfjobs tfjob
Name:         tfjob
Namespace:    kubeflow
Labels:       app.kubernetes.io/deploy-manager=ksonnet
Annotations:  ksonnet.io/managed={"pristine":"H4sIAAAAAAAA/+yRz27UMBDG7zzGnJ3NbkoFjZQTqEIcYEUrekBVNHEmWbOObY3HqcJq3x05UC1/ngCJHKKZbz6P5e93AgzmM3E03kENx9TRYP3TxvNYzju04YAVKDga10MN97fvfQcKJhLsURDqEzicCGqQ4es6ym0MqOmX...
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-07-29T00:31:12Z
  Generation:          1
  Resource Version:    25561
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob
  UID:                 b20c924b-92c6-11e8-b3ca-42010a80019c
Spec:
  Tf Replica Specs:
    PS:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
    Worker:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
Status:
  Conditions:
    Last Transition Time:  2018-07-29T00:31:48Z
    Last Update Time:      2018-07-29T00:31:48Z
    Message:               TFJob tfjob is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Start Time:              2018-07-29T02:38:43Z
  Tf Replica Statuses:
    PS:
      Active:  1
    Worker:
      Active:  1
Events:        <none>
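
To query the events recorded for one TFJob directly, rather than reading them off kubectl describe, kubectl get events can be filtered by the involved object. The sketch below only builds that command line. The helper name `tfjob_events_command` is hypothetical, and it assumes kubectl is on the PATH and that the events still exist on the API server.

```python
from typing import List


def tfjob_events_command(name: str, namespace: str = "kubeflow") -> List[str]:
    """Build a kubectl command that lists events attached to one TFJob.

    Hypothetical helper for illustration; it does not run kubectl itself.
    """
    # Events can be filtered on fields of the object they refer to.
    selector = f"involvedObject.kind=TFJob,involvedObject.name={name}"
    return [
        "kubectl", "get", "events",
        "--namespace", namespace,
        "--field-selector", selector,
    ]


# Print the command for the job described above.
print(" ".join(tfjob_events_command("tfjob")))
```

The resulting list can be passed to `subprocess.run` as-is. An empty result for a job that is clearly running suggests the events were deleted rather than never emitted.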


jlewi · 2018-07-29T02:48:06Z

Created a new job and the events show up, so it looks like it's a question of the events being too old and no longer showing up.

kubectl describe tfjobs tfjob2
Name:         tfjob2
Namespace:    kubeflow
Labels:       app.kubernetes.io/deploy-manager=ksonnet
Annotations:  ksonnet.io/managed={"pristine":"H4sIAAAAAAAA/+yRz27UMBDG7zzGnJ3NbkoFjZQTqEIcYEUrekBVNHEmWbOObY3HqcJq3x05UC1/ngCJHKKZbz6P5e93AgzmM3E03kENx9TRYP3TxvNYzju04YAVKDga10MN97fvfQcKJhLsURDqEzicCGqQ4avvsjX3MaCm...
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-07-29T02:46:53Z
  Generation:          1
  Resource Version:    26872
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob2
  UID:                 a6bc7b6f-92d9-11e8-b3ca-42010a80019c
Spec:
  Tf Replica Specs:
    PS:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
    Worker:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:   /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
Status:
  Conditions:
    Last Transition Time:  2018-07-29T02:46:55Z
    Last Update Time:      2018-07-29T02:46:55Z
    Message:               TFJob tfjob2 is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Start Time:              2018-07-29T02:46:55Z
  Tf Replica Statuses:
    PS:
      Active:  1
    Worker:
      Active:  1
Events:
  Type     Reason                          Age                From         Message
  ----     ------                          ----               ----         -------
  Warning  SettedPodTemplateRestartPolicy  19s (x2 over 19s)  tf-operator  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-worker-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-worker-0
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-ps-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-ps-0

jlewi · 2018-07-29T21:09:25Z

See kubernetes/kubernetes#52521

It looks like events are garbage collected after 1 hour to avoid straining etcd.
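
A rough check of that explanation against the timestamps in this thread, as a sketch. It assumes the one-hour retention (the kube-apiserver `--event-ttl` default), uses each job's creation timestamp as a stand-in for when its events were recorded, and uses the comment timestamps as a stand-in for when kubectl describe was run. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster keeps the default one-hour event retention.
EVENT_TTL = timedelta(hours=1)


def parse_k8s_time(stamp: str) -> datetime:
    """Parse a UTC timestamp in the form kubectl prints, e.g. 2018-07-29T00:31:12Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def events_expired(event_time: str, observed_at: str,
                   ttl: timedelta = EVENT_TTL) -> bool:
    """Return True if an event recorded at event_time is older than ttl at observed_at."""
    age = parse_k8s_time(observed_at) - parse_k8s_time(event_time)
    return age > ttl


# tfjob: created 00:31:12Z, described around 02:46:07Z (about 2h15m later).
print(events_expired("2018-07-29T00:31:12Z", "2018-07-29T02:46:07Z"))  # True

# tfjob2: created 02:46:53Z, described around 02:48:06Z (73 seconds later).
print(events_expired("2018-07-29T02:46:53Z", "2018-07-29T02:48:06Z"))  # False
```

Under those assumptions the first job's creation events were past the retention window by the time it was described, while the second job's were not, which matches the two outputs above.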

jlewi added priority/p2 area/0.3.0 area/monitoring area/tfjob labels Jul 29, 2018

jlewi closed this as completed Jul 29, 2018
