Flaky test: [It] should delete job when expired time is up #1821

Open
tenzen-y opened this issue May 31, 2023 · 4 comments

@tenzen-y
Member

------------------------------
• [FAILED] [0.017 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528

  Timeline >>
  STEP: preparing cases succeeded job with TTL 3s @ 05/31/23 14:16:41.447
  STEP: creating a TFJob @ 05/31/23 14:16:41.447
  STEP: getting a created TFJob @ 05/31/23 14:16:41.451
  STEP: prepare pod @ 05/31/23 14:16:41.451
  STEP: update job replica statuses @ 05/31/23 14:16:41.451
  STEP: update job status @ 05/31/23 14:16:41.451
  STEP: updating job status... @ 05/31/23 14:16:41.451
  2023-05-31T14:16:41Z	DEBUG	events	TFJob default/test-bof-0 successfully completed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "TFJobSucceeded"}
  2023-05-31T14:16:41Z	DEBUG	events	Created pod: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "SuccessfulCreatePod"}
  2023-05-31T14:16:41Z	DEBUG	events	Created service: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "SuccessfulCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 05/31/23 14:16:41.464
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc001988960>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "test-bof-0",
                  Group: "kubeflow.org",
                  Kind: "tfjobs",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
      Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 05/31/23 14:16:41.464
------------------------------

https://github.com/kubeflow/training-operator/actions/runs/5133950363/jobs/9237255986#step:4:208

@tenzen-y
Member Author

tenzen-y commented Jul 3, 2023

Similar flaky test:

------------------------------
• [FAILED] [3.037 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528

  Timeline >>
  STEP: preparing cases succeeded job with TTL 3s @ 07/03/23 15:40:21.929
  STEP: creating a TFJob @ 07/03/23 15:40:21.929
  STEP: getting a created TFJob @ 07/03/23 15:40:21.933
  STEP: prepare pod @ 07/03/23 15:40:21.933
  STEP: update job replica statuses @ 07/03/23 15:40:21.933
  STEP: update job status @ 07/03/23 15:40:21.933
  STEP: updating job status... @ 07/03/23 15:40:21.933
  2023-07-03T15:40:21Z	DEBUG	events	TFJob default/test-bof-0 successfully completed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "TFJobSucceeded"}
  2023-07-03T15:40:21Z	DEBUG	events	Created pod: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreatePod"}
  2023-07-03T15:40:21Z	DEBUG	events	Created service: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreateService"}
  STEP: waiting for updating replicaStatus for workers @ 07/03/23 15:40:21.943
  2023-07-03T15:40:21Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-bof-0","namespace":"default"}, "namespace": "default", "name": "test-bof-0", "reconcileID": "ba1c3f09-182a-4c0e-a33f-38290f7a64db", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
  2023-07-03T15:40:22Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-vbllp"}, "namespace": "tfjob-ns-vbllp", "name": "test-tfjob", "reconcileID": "981bba90-db95-4994-8374-7299bdf7d9dd", "error": "unable to create services: services \"test-tfjob-chief-0\" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
  2023-07-03T15:40:22Z	DEBUG	events	Error creating: services "test-tfjob-chief-0" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated	{"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-vbllp","name":"test-tfjob","uid":"9a2f3de8-890c-4071-a0ca-40a13fed22e8","apiVersion":"kubeflow.org/v1","resourceVersion":"372"}, "reason": "FailedCreateService"}
  2023-07-03T15:40:23Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-97b2d"}, "namespace": "tfjob-ns-97b2d", "name": "test-tfjob", "reconcileID": "9b5e3b4d-1154-4962-8a4d-a787579c87c0", "error": "pods \"test-tfjob-worker-0\" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
  2023-07-03T15:40:23Z	DEBUG	events	Error creating: pods "test-tfjob-worker-0" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated	{"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-97b2d","name":"test-tfjob","uid":"d9e1b604-4d09-4118-bc2c-70107156a8a5","apiVersion":"kubeflow.org/v1","resourceVersion":"376"}, "reason": "FailedCreatePod"}
  2023-07-03T15:40:24Z	DEBUG	events	Deleted job: test-bof-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"755"}, "reason": "SuccessfulDeleteJob"}
  2023-07-03T15:40:24Z	INFO	TFJob.kubeflow.org "test-bof-0" not found	{"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
  2023-07-03T15:40:24Z	INFO	TFJob.kubeflow.org "test-bof-0" not found	{"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
  STEP: preparing cases failed job with TTL 3s @ 07/03/23 15:40:24.944
  STEP: creating a TFJob @ 07/03/23 15:40:24.944
  STEP: getting a created TFJob @ 07/03/23 15:40:24.949
  STEP: prepare pod @ 07/03/23 15:40:24.949
  STEP: update job replica statuses @ 07/03/23 15:40:24.949
  STEP: update job status @ 07/03/23 15:40:24.949
  STEP: updating job status... @ 07/03/23 15:40:24.949
  2023-07-03T15:40:24Z	DEBUG	events	TFJob default/test-bof-1 has failed because 1 Worker replica(s) failed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "TFJobFailed"}
  2023-07-03T15:40:24Z	DEBUG	events	Created pod: test-bof-1-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreatePod"}
  2023-07-03T15:40:24Z	DEBUG	events	Created service: test-bof-1-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc0004e2f00>: 
      Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-1": the object has been modified; please apply your changes to the latest version and try again
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-1\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "test-bof-1",
                  Group: "kubeflow.org",
                  Kind: "tfjobs",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
------------------------------

https://github.com/kubeflow/training-operator/actions/runs/5446286573/jobs/9906794119?pr=1843#step:4:480

@tenzen-y
Member Author

tenzen-y commented Jul 4, 2023

Similar flaky test:

------------------------------
• [FAILED] [0.022 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:525

  Timeline >>
  STEP: preparing cases succeeded job with TTL 3s @ 07/04/23 22:10:21.22
  STEP: creating a TFJob @ 07/04/23 22:10:21.22
  STEP: getting a created TFJob @ 07/04/23 22:10:21.225
  STEP: prepare pod @ 07/04/23 22:10:21.225
  STEP: update job replica statuses @ 07/04/23 22:10:21.225
  STEP: update job status @ 07/04/23 22:10:21.225
  STEP: updating job status... @ 07/04/23 22:10:21.225
  2023-07-04T22:10:21Z	DEBUG	events	TFJob default/test-bof-0 successfully completed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "TFJobSucceeded"}
  2023-07-04T22:10:21Z	DEBUG	events	Created pod: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreatePod"}
  2023-07-04T22:10:21Z	DEBUG	events	Created service: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc0001546e0>: 
      Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "test-bof-0",
                  Group: "kubeflow.org",
                  Kind: "tfjobs",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241

https://github.com/kubeflow/training-operator/actions/runs/5458679683/jobs/9934001719?pr=1849#step:4:793

@github-actions

github-actions bot commented Oct 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member Author

tenzen-y commented Oct 3, 2023

/lifecycle frozen
