This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

[bug]: pytorchjob status conditions out-of-order #88

Closed
codeflitting opened this issue Oct 24, 2018 · 21 comments
Assignees
Labels
kind/bug problems/bug Something isn't working

Comments

@codeflitting
Member

[screenshot: PyTorchJob status.conditions, with the Created condition listed after Succeeded]

Created after Succeeded

@gaocegege
Member

Can you give us more details about it? I don't think the conditions are required to be in order; we can only rely on the time fields in them.
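
For illustration, a minimal sketch of what relying on the time fields looks like for a consumer of status.conditions; the JobCondition type below is a simplified stand-in for the operator's condition type, not its actual API:

```go
package main

import (
	"fmt"
	"time"
)

// JobCondition is a simplified stand-in for the condition type stored in
// status.conditions (only the fields needed for this example).
type JobCondition struct {
	Type           string
	LastUpdateTime time.Time
}

// latestCondition picks the condition with the newest LastUpdateTime,
// ignoring the order of the slice entirely.
func latestCondition(conds []JobCondition) JobCondition {
	latest := conds[0]
	for _, c := range conds[1:] {
		if c.LastUpdateTime.After(latest.LastUpdateTime) {
			latest = c
		}
	}
	return latest
}

func main() {
	conds := []JobCondition{
		{Type: "Running", LastUpdateTime: time.Date(2018, 10, 23, 10, 0, 0, 0, time.UTC)},
		{Type: "Succeeded", LastUpdateTime: time.Date(2018, 10, 23, 11, 0, 0, 0, time.UTC)},
		{Type: "Created", LastUpdateTime: time.Date(2018, 10, 24, 9, 0, 0, 0, time.UTC)},
	}
	// With timestamps like the ones in this report, the "latest" condition is
	// Created, so a re-bumped Created timestamp is misleading even when the
	// slice order is ignored.
	fmt.Println(latestCondition(conds).Type)
}
```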

@codeflitting
Member Author

When the job has just ended, the conditions are:

  • Created -> Running -> Succeeded

After a while, the conditions are:

  • Running -> Succeeded -> Created

Once the last condition is Succeeded, tf-operator won't update the conditions any further, so I think this is a pytorch-operator bug.

@gaocegege
Member

Gotcha, I think so too. And we should check it in tf-operator as well.

gaocegege added the problems/bug and kind/bug labels on Oct 24, 2018
@johnugeorge
Member

I haven't seen this before, and I couldn't reproduce the issue.
In the screenshot, I see that the lastUpdateTime of the Created condition (2018-10-24) is later than the other two. How did that happen?

@codeflitting
Member Author

@johnugeorge

spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            helm.sh/namespace: test
            helm.sh/path: pytorchjob-1540278456
            helm.sh/release: pytorchjob-1540278456
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - python /mnist-pytorch/train.py
            image: pytorch:v0.4.1-py36
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources:
              limits:
                nvidia.com/gpu: "0"
              requests:
                cpu: 500m
                memory: 512Mi

@johnugeorge
Member

> I haven't seen this before, and I couldn't reproduce the issue.
> In the screenshot, I see that the lastUpdateTime of the Created condition (2018-10-24) is later than the other two. How did that happen?

Did you make any manual changes?

@codeflitting
Member Author

> I haven't seen this before, and I couldn't reproduce the issue.
> In the screenshot, I see that the lastUpdateTime of the Created condition (2018-10-24) is later than the other two. How did that happen?
>
> Did you make any manual changes?

No, it was just a normal job.

@johnugeorge
Member

johnugeorge commented Oct 24, 2018

I tried it out with the repo examples and couldn't reproduce it. Are you seeing this consistently? Is the specified image public? I can try with your image.

@codeflitting
Member Author

> Are you seeing this consistently?

I didn't pay attention to it before. Not all PyTorch jobs behave like this, just a fraction of them.

@gaocegege
Member

I looked through the code. We only have one place that updates the Created condition:

pytorch-operator/pkg/controller.v2/pytorch/job.go
Line 84 in 46e8cb6

err = updatePyTorchJobConditions(job, v1alpha2.PyTorchJobCreated, pytorchJobCreatedReason, msg)

Maybe it is caused by multiple runs of the operator. For example, the operator runs and handles the create event, then it crashes and is restarted, addJob is called again, and the condition is updated again.

🤔
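
To make that hypothesis concrete, here is a minimal sketch of the failure mode. The setCondition helper below is hypothetical; it only approximates a condition helper that drops and re-appends a condition of the same type, and is not the operator's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the operator's status types.
type JobCondition struct {
	Type           string
	LastUpdateTime time.Time
}

type JobStatus struct {
	Conditions []JobCondition
}

// setCondition (hypothetical) removes any existing condition of the same type
// and appends a fresh one with a new timestamp. If updatePyTorchJobConditions
// behaves roughly like this, re-setting Created moves it to the end of the
// slice with the newest time.
func setCondition(status *JobStatus, condType string, now time.Time) {
	var kept []JobCondition
	for _, c := range status.Conditions {
		if c.Type != condType {
			kept = append(kept, c)
		}
	}
	status.Conditions = append(kept, JobCondition{Type: condType, LastUpdateTime: now})
}

func main() {
	status := &JobStatus{}
	start := time.Date(2018, 10, 23, 10, 0, 0, 0, time.UTC)

	// Normal lifecycle: Created -> Running -> Succeeded.
	setCondition(status, "Created", start)
	setCondition(status, "Running", start.Add(time.Minute))
	setCondition(status, "Succeeded", start.Add(10*time.Minute))

	// The operator restarts, sees the existing job, and addJob sets the
	// Created condition again with a fresh timestamp.
	setCondition(status, "Created", start.Add(24*time.Hour))

	for _, c := range status.Conditions {
		fmt.Println(c.Type, c.LastUpdateTime.Format(time.RFC3339))
	}
	// Output order: Running, Succeeded, Created -- matching the report above.
}
```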

@johnugeorge
Member

I am confused by the timestamps of the events, e.g. the lastTransitionTime and lastUpdateTime of the Created condition.

> I looked through the code. We only have one place that updates the Created condition:
>
> pytorch-operator/pkg/controller.v2/pytorch/job.go
> Line 84 in 46e8cb6
> err = updatePyTorchJobConditions(job, v1alpha2.PyTorchJobCreated, pytorchJobCreatedReason, msg)
>
> Maybe it is caused by multiple runs of the operator. For example, the operator runs and handles the create event, then it crashes and is restarted, addJob is called again, and the condition is updated again.
>
> 🤔

@codeflitting
Member Author

codeflitting commented Oct 25, 2018

pytorch-operator-operator-v1-0-745458d6f9-5tcth 1/1 Running 6 10d

It did restart a few times.

If the operator has more than one replica, it might also reproduce this issue, right?

> I looked through the code. We only have one place that updates the Created condition:
>
> pytorch-operator/pkg/controller.v2/pytorch/job.go
>
> Line 84 in 46e8cb6
>
> err = updatePyTorchJobConditions(job, v1alpha2.PyTorchJobCreated, pytorchJobCreatedReason, msg)
>
> Maybe it is caused by multiple runs of the operator. For example, the operator runs and handles the create event, then it crashes and is restarted, addJob is called again, and the condition is updated again.
>
> 🤔

@gaocegege
Member

I think so

@johnugeorge
Member

I feel this should be common across the operators, then.

@gaocegege
Member

Yeah, definitely

@gaocegege
Member

gaocegege commented Apr 25, 2019

I think the Created condition should be created once and never updated afterwards.

I will work on the issue.
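
A minimal sketch of that idea, using the same simplified stand-in types as the sketch above (not the operator's actual types, nor the change that eventually landed): set the Created condition only if it is not already present, so a restarted operator cannot refresh it past Running or Succeeded.

```go
package conditions

import "time"

// Simplified stand-ins, as in the earlier sketch.
type JobCondition struct {
	Type           string
	LastUpdateTime time.Time
}

type JobStatus struct {
	Conditions []JobCondition
}

// setCreatedOnce appends the Created condition only if it is not already
// recorded; a repeated addJob after an operator restart is then a no-op, so
// the Created timestamp and position never move past Running/Succeeded.
func setCreatedOnce(status *JobStatus, now time.Time) {
	for _, c := range status.Conditions {
		if c.Type == "Created" {
			return // already there; leave its timestamp and order untouched
		}
	}
	status.Conditions = append(status.Conditions, JobCondition{Type: "Created", LastUpdateTime: now})
}
```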

@gaocegege
Member

/assign

@johnugeorge
Member

This has already been fixed by #114.

@johnugeorge
Member

johnugeorge commented Jun 5, 2019

Closing the issue

@johnugeorge
Member

/close

@k8s-ci-robot

@johnugeorge: Closing this issue.

In response to this:

/close

