[bug]: pytorchjob status conditions out-of-order #88
Comments
Can you give us more details about it? I don't think conditions are required to be in order; we can only rely on the time fields in them. |
When the job has just ended, the conditions are:
After a while, the conditions are:
When the last condition is Succeeded, tf-operator won't update the conditions, so I think this is a pytorch-operator bug. |
Gotcha, then I think so too. We should check this in tf-operator as well. |
I haven't seen this before. I couldn't reproduce this issue. |
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            helm.sh/namespace: test
            helm.sh/path: pytorchjob-1540278456
            helm.sh/release: pytorchjob-1540278456
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - python /mnist-pytorch/train.py
            image: pytorch:v0.4.1-py36
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources:
              limits:
                nvidia.com/gpu: "0"
              requests:
                cpu: 500m
                memory: 512Mi |
Did you do any manual change? |
No, just a normal job. |
I tried it with the repo examples and couldn't reproduce it. Are you seeing this consistently? Is the specified image public? I can try it with your image. |
I hadn't paid attention to it before. Not all PyTorch jobs behave like this, just a fraction of them. |
I looked through the code. We only have one place that updates the Created condition:
Maybe it is caused by multiple runs of the operator. For example, you run the operator and it handles the create event. Then the operator crashes and restarts, addjob is called again, and the condition is updated again. 🤔 |
I am confused by the timestamps of the events, e.g. the lastTransitionTime and lastUpdatedTime of the Create event.
|
pytorch-operator-operator-v1-0-745458d6f9-5tcth 1/1 Running 6 10d
It did restart a few times. If the operator has more than one replica, that might also reproduce this issue, right?
|
I think so |
I feel this should be common across operators, then. |
Yeah, definitely |
I think the created condition should be created and never updated. I will work on the issue. |
/assign |
This has already been fixed by #114. |
Closing the issue |
/close |
@johnugeorge: Closing this issue. |
Created after Succeeded