Flaky test: [It] should delete job when expired time is up #1821
Similar flaky test: ------------------------------
• [FAILED] [3.037 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528
Timeline >>
STEP: preparing cases succeeded job with TTL 3s @ 07/03/23 15:40:21.929
STEP: creating a TFJob @ 07/03/23 15:40:21.929
STEP: getting a created TFJob @ 07/03/23 15:40:21.933
STEP: prepare pod @ 07/03/23 15:40:21.933
STEP: update job replica statuses @ 07/03/23 15:40:21.933
STEP: update job status @ 07/03/23 15:40:21.933
STEP: updating job status... @ 07/03/23 15:40:21.933
2023-07-03T15:40:21Z DEBUG events TFJob default/test-bof-0 successfully completed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "TFJobSucceeded"}
2023-07-03T15:40:21Z DEBUG events Created pod: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreatePod"}
2023-07-03T15:40:21Z DEBUG events Created service: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreateService"}
STEP: waiting for updating replicaStatus for workers @ 07/03/23 15:40:21.943
2023-07-03T15:40:21Z ERROR Reconciler error {"controller": "tfjob-controller", "object": {"name":"test-bof-0","namespace":"default"}, "namespace": "default", "name": "test-bof-0", "reconcileID": "ba1c3f09-182a-4c0e-a33f-38290f7a64db", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2023-07-03T15:40:22Z ERROR Reconciler error {"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-vbllp"}, "namespace": "tfjob-ns-vbllp", "name": "test-tfjob", "reconcileID": "981bba90-db95-4994-8374-7299bdf7d9dd", "error": "unable to create services: services \"test-tfjob-chief-0\" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2023-07-03T15:40:22Z DEBUG events Error creating: services "test-tfjob-chief-0" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated {"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-vbllp","name":"test-tfjob","uid":"9a2f3de8-890c-4071-a0ca-40a13fed22e8","apiVersion":"kubeflow.org/v1","resourceVersion":"372"}, "reason": "FailedCreateService"}
2023-07-03T15:40:23Z ERROR Reconciler error {"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-97b2d"}, "namespace": "tfjob-ns-97b2d", "name": "test-tfjob", "reconcileID": "9b5e3b4d-1154-4962-8a4d-a787579c87c0", "error": "pods \"test-tfjob-worker-0\" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2023-07-03T15:40:23Z DEBUG events Error creating: pods "test-tfjob-worker-0" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated {"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-97b2d","name":"test-tfjob","uid":"d9e1b604-4d09-4118-bc2c-70107156a8a5","apiVersion":"kubeflow.org/v1","resourceVersion":"376"}, "reason": "FailedCreatePod"}
2023-07-03T15:40:24Z DEBUG events Deleted job: test-bof-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"755"}, "reason": "SuccessfulDeleteJob"}
2023-07-03T15:40:24Z INFO TFJob.kubeflow.org "test-bof-0" not found {"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
2023-07-03T15:40:24Z INFO TFJob.kubeflow.org "test-bof-0" not found {"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
STEP: preparing cases failed job with TTL 3s @ 07/03/23 15:40:24.944
STEP: creating a TFJob @ 07/03/23 15:40:24.944
STEP: getting a created TFJob @ 07/03/23 15:40:24.949
STEP: prepare pod @ 07/03/23 15:40:24.949
STEP: update job replica statuses @ 07/03/23 15:40:24.949
STEP: update job status @ 07/03/23 15:40:24.949
STEP: updating job status... @ 07/03/23 15:40:24.949
2023-07-03T15:40:24Z DEBUG events TFJob default/test-bof-1 has failed because 1 Worker replica(s) failed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "TFJobFailed"}
2023-07-03T15:40:24Z DEBUG events Created pod: test-bof-1-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreatePod"}
2023-07-03T15:40:24Z DEBUG events Created service: test-bof-1-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreateService"}
[FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
<< Timeline
[FAILED] Expected success, but got an error:
<*errors.StatusError | 0xc0004e2f00>:
Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-1": the object has been modified; please apply your changes to the latest version and try again
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-1\": the object has been modified; please apply your changes to the latest version and try again",
Reason: "Conflict",
Details: {
Name: "test-bof-1",
Group: "kubeflow.org",
Kind: "tfjobs",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 409,
},
}
In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
------------------------------
Similar flaky test: ------------------------------
• [FAILED] [0.022 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:525
Timeline >>
STEP: preparing cases succeeded job with TTL 3s @ 07/04/23 22:10:21.22
STEP: creating a TFJob @ 07/04/23 22:10:21.22
STEP: getting a created TFJob @ 07/04/23 22:10:21.225
STEP: prepare pod @ 07/04/23 22:10:21.225
STEP: update job replica statuses @ 07/04/23 22:10:21.225
STEP: update job status @ 07/04/23 22:10:21.225
STEP: updating job status... @ 07/04/23 22:10:21.225
2023-07-04T22:10:21Z DEBUG events TFJob default/test-bof-0 successfully completed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "TFJobSucceeded"}
2023-07-04T22:10:21Z DEBUG events Created pod: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreatePod"}
2023-07-04T22:10:21Z DEBUG events Created service: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreateService"}
[FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241
<< Timeline
[FAILED] Expected success, but got an error:
<*errors.StatusError | 0xc0001546e0>:
Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
Reason: "Conflict",
Details: {
Name: "test-bof-0",
Group: "kubeflow.org",
Kind: "tfjobs",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 409,
},
}
In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241
This was referenced Jul 4, 2023
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
https://github.com/kubeflow/training-operator/actions/runs/5133950363/jobs/9237255986#step:4:208