This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

[bug]: pytorchjob status conditions out-of-order #88

Closed
codeflitting opened this issue Oct 24, 2018 · 21 comments
Assignees
Labels
kind/bug problems/bug Something isn't working

Comments

@codeflitting
Member

[screenshot: PyTorchJob status.conditions, with the Created condition listed after Succeeded]

Created after Succeeded

@gaocegege
Member

Can you give us more details about it? I don't think the conditions are required to be in order; we can only rely on the time fields in them.
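
For illustration, a minimal sketch of what relying on the time fields looks like for a consumer of status.conditions; the JobCondition type below is a simplified stand-in for the operator's condition type, not its actual API:

```go
package main

import (
	"fmt"
	"time"
)

// JobCondition is a simplified stand-in for the condition type stored in
// status.conditions (only the fields needed for this example).
type JobCondition struct {
	Type           string
	LastUpdateTime time.Time
}

// latestCondition picks the condition with the newest LastUpdateTime,
// ignoring the order of the slice entirely.
func latestCondition(conds []JobCondition) JobCondition {
	latest := conds[0]
	for _, c := range conds[1:] {
		if c.LastUpdateTime.After(latest.LastUpdateTime) {
			latest = c
		}
	}
	return latest
}

func main() {
	conds := []JobCondition{
		{Type: "Running", LastUpdateTime: time.Date(2018, 10, 23, 10, 0, 0, 0, time.UTC)},
		{Type: "Succeeded", LastUpdateTime: time.Date(2018, 10, 23, 11, 0, 0, 0, time.UTC)},
		{Type: "Created", LastUpdateTime: time.Date(2018, 10, 24, 9, 0, 0, 0, time.UTC)},
	}
	// With timestamps like the ones in this report, the "latest" condition is
	// Created, so a re-bumped Created timestamp is misleading even when the
	// slice order is ignored.
	fmt.Println(latestCondition(conds).Type)
}
```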

@codeflitting
Member Author

When the job has just ended, the conditions are:

  • Created -> Running -> Succeeded

After a while, the conditions are:

  • Running -> Succeeded -> Created

Once the last condition is Succeeded, tf-operator won't update the conditions any further, so I think this is a pytorch-operator bug.

@gaocegege
Member

Gotcha, I think so too. And we should check it in tf-operator as well.

gaocegege added the problems/bug and kind/bug labels on Oct 24, 2018
@johnugeorge
Member

I haven't seen this before, and I couldn't reproduce the issue.
In the screenshot, I see that the lastUpdateTime of the Created condition (2018-10-24) is later than the other two. How did that happen?

@codeflitting
Member Author

@johnugeorge

spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            helm.sh/namespace: test
            helm.sh/path: pytorchjob-1540278456
            helm.sh/release: pytorchjob-1540278456
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - python /mnist-pytorch/train.py
            image: pytorch:v0.4.1-py36
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources:
              limits:
                nvidia.com/gpu: "0"
              requests:
                cpu: 500m
                memory: 512Mi

@johnugeorge
Member

> I haven't seen this before, and I couldn't reproduce the issue.
> In the screenshot, I see that the lastUpdateTime of the Created condition (2018-10-24) is later than the other two. How did that happen?

Did you make any manual changes?

@codeflitting
Member Author

> I haven't seen this before, and I couldn't reproduce the issue.
> In the screenshot, I see that the lastUpdateTime of the Created condition (2018-10-24) is later than the other two. How did that happen?
>
> Did you make any manual changes?

No, it was just a normal job.

@johnugeorge
Member

johnugeorge commented Oct 24, 2018

I tried it out with the repo examples and couldn't reproduce it. Are you seeing this consistently? Is the specified image public? I can try with your image.

@codeflitting
Member Author

> Are you seeing this consistently?

I didn't pay attention to it before. Not all PyTorch jobs behave like this, just a fraction of them.

@gaocegege
Member

I looked through the code. We only have one place that updates the Created condition:

pytorch-operator/pkg/controller.v2/pytorch/job.go
Line 84 in 46e8cb6

err = updatePyTorchJobConditions(job, v1alpha2.PyTorchJobCreated, pytorchJobCreatedReason, msg)

Maybe it is caused by multiple runs of the operator. For example, the operator runs and handles the create event, then it crashes and is restarted, addJob is called again, and the condition is updated again.

🤔
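
To make that hypothesis concrete, here is a minimal sketch of the failure mode. The setCondition helper below is hypothetical; it only approximates a condition helper that drops and re-appends a condition of the same type, and is not the operator's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the operator's status types.
type JobCondition struct {
	Type           string
	LastUpdateTime time.Time
}

type JobStatus struct {
	Conditions []JobCondition
}

// setCondition (hypothetical) removes any existing condition of the same type
// and appends a fresh one with a new timestamp. If updatePyTorchJobConditions
// behaves roughly like this, re-setting Created moves it to the end of the
// slice with the newest time.
func setCondition(status *JobStatus, condType string, now time.Time) {
	var kept []JobCondition
	for _, c := range status.Conditions {
		if c.Type != condType {
			kept = append(kept, c)
		}
	}
	status.Conditions = append(kept, JobCondition{Type: condType, LastUpdateTime: now})
}

func main() {
	status := &JobStatus{}
	start := time.Date(2018, 10, 23, 10, 0, 0, 0, time.UTC)

	// Normal lifecycle: Created -> Running -> Succeeded.
	setCondition(status, "Created", start)
	setCondition(status, "Running", start.Add(time.Minute))
	setCondition(status, "Succeeded", start.Add(10*time.Minute))

	// The operator restarts, sees the existing job, and addJob sets the
	// Created condition again with a fresh timestamp.
	setCondition(status, "Created", start.Add(24*time.Hour))

	for _, c := range status.Conditions {
		fmt.Println(c.Type, c.LastUpdateTime.Format(time.RFC3339))
	}
	// Output order: Running, Succeeded, Created -- matching the report above.
}
```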

@johnugeorge
Member

I am confused by the timestamps of the events, e.g. the lastTransitionTime and lastUpdateTime of the Created condition.

> I looked through the code. We only have one place that updates the Created condition:
>
> pytorch-operator/pkg/controller.v2/pytorch/job.go
> Line 84 in 46e8cb6
> err = updatePyTorchJobConditions(job, v1alpha2.PyTorchJobCreated, pytorchJobCreatedReason, msg)
>
> Maybe it is caused by multiple runs of the operator. For example, the operator runs and handles the create event, then it crashes and is restarted, addJob is called again, and the condition is updated again.
>
> 🤔

@codeflitting
Member Author

codeflitting commented Oct 25, 2018

pytorch-operator-operator-v1-0-745458d6f9-5tcth 1/1 Running 6 10d

It did restart a few times.

If the operator has more than one replica, it might also reproduce this issue, right?

> I looked through the code. We only have one place that updates the Created condition:
>
> pytorch-operator/pkg/controller.v2/pytorch/job.go
>
> Line 84 in 46e8cb6
>
> err = updatePyTorchJobConditions(job, v1alpha2.PyTorchJobCreated, pytorchJobCreatedReason, msg)
>
> Maybe it is caused by multiple runs of the operator. For example, the operator runs and handles the create event, then it crashes and is restarted, addJob is called again, and the condition is updated again.
>
> 🤔

@gaocegege
Member

I think so

@johnugeorge
Member

I feel this should be common across the operators, then.

@gaocegege
Member

Yeah, definitely

@gaocegege
Member

gaocegege commented Apr 25, 2019

I think the Created condition should be created once and never updated afterwards.

I will work on the issue.
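
A minimal sketch of that idea, using the same simplified stand-in types as the sketch above (not the operator's actual types, nor the change that eventually landed): set the Created condition only if it is not already present, so a restarted operator cannot refresh it past Running or Succeeded.

```go
package conditions

import "time"

// Simplified stand-ins, as in the earlier sketch.
type JobCondition struct {
	Type           string
	LastUpdateTime time.Time
}

type JobStatus struct {
	Conditions []JobCondition
}

// setCreatedOnce appends the Created condition only if it is not already
// recorded; a repeated addJob after an operator restart is then a no-op, so
// the Created timestamp and position never move past Running/Succeeded.
func setCreatedOnce(status *JobStatus, now time.Time) {
	for _, c := range status.Conditions {
		if c.Type == "Created" {
			return // already there; leave its timestamp and order untouched
		}
	}
	status.Conditions = append(status.Conditions, JobCondition{Type: "Created", LastUpdateTime: now})
}
```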

@gaocegege
Member

/assign

@johnugeorge
Member

This has already been fixed by #114.

@johnugeorge
Member

johnugeorge commented Jun 5, 2019

Closing the issue

@johnugeorge
Member

/close

@k8s-ci-robot

@johnugeorge: Closing this issue.

In response to this:

/close

