fix the reconcile flow #1111

Merged
merged 1 commit into from
Dec 12, 2019

Conversation

ChanYiLin
Member

@ChanYiLin ChanYiLin commented Dec 9, 2019

If the TFJob has already terminated, we don't need to check ActiveDeadlineSeconds and BackoffLimit.

Originally, even after the job had terminated, the controller still checked ActiveDeadlineSeconds and appended a failure event to the job.
As a result, an event saying the job failed can appear after the job has already succeeded, as shown below,
and the tf-operator log keeps showing the failure message for the exceeded active deadline.

Events:
  Type    Reason                   Age                  From         Message
  ----    ------                   ----                 ----         -------
  Normal  SuccessfulCreatePod      27m                  tf-operator  Created pod: dist-mnist-for-e2e-test-worker-0
  Normal  SuccessfulCreateService  27m                  tf-operator  Created service: dist-mnist-for-e2e-test-worker-0
  Normal  ExitedWithCode           27m                  tf-operator  Pod: default.dist-mnist-for-e2e-test-worker-0 exited with code 0
  Normal  TFJobSucceeded           27m                  tf-operator  TFJob dist-mnist-for-e2e-test successfully completed.
  Normal  TFJobFailed              2m8s (x51 over 27m)  tf-operator  TFJob dist-mnist-for-e2e-test has failed because it was active longer than specified deadline

  {"filename":"record/event.go:221","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"default\", Name:\"dist-mnist-for-e2e-test\", UID:\"a0fca1b0-1a59-11ea-b297-42010af000e3\", APIVersion:\"kubeflow.org/v1\", ResourceVersion:\"49317\", FieldPath:\"\"}): type: 'Normal' reason: 'TFJobFailed' TFJob dist-mnist-for-e2e-test has failed because it was active longer than specified deadline","time":"2019-12-09T08:53:31Z"}

In this PR, I reorder the checks so that if the job has already terminated (Succeeded or Failed), reconcile returns early instead of continuing.


@ChanYiLin
Member Author

/assign @gaocegege @richardsliu
Can you help me to review the PR?
Thanks!

@TravisBuddy

Travis tests have failed

Hey @ChanYiLin,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

1st Build

hack/verify-codegen.sh
Generating deepcopy funcs
Generating clientset for tensorflow:v1 at github.com/kubeflow/tf-operator/pkg/client/clientset
Generating listers for tensorflow:v1 at github.com/kubeflow/tf-operator/pkg/client/listers
Generating informers for tensorflow:v1 at github.com/kubeflow/tf-operator/pkg/client/informers
Generating defaulters for tensorflow/v1
Generating OpenAPI specification for tensorflow/v1
diffing hack/../pkg against freshly generated codegen
diff -Naupr hack/../pkg/apis/tensorflow/v1/openapi_generated.go hack/../_tmp/pkg/apis/tensorflow/v1/openapi_generated.go
--- hack/../pkg/apis/tensorflow/v1/openapi_generated.go	2019-12-09 10:23:39.679629679 +0000
+++ hack/../_tmp/pkg/apis/tensorflow/v1/openapi_generated.go	2019-12-09 10:22:33.000000000 +0000
@@ -125,14 +125,14 @@ func GetOpenAPIDefinitions(ref common.Re
 					Properties: map[string]spec.Schema{
 						"activeDeadlineSeconds": {
 							SchemaProps: spec.SchemaProps{
-								Description: "Specifies the duration (in seconds) since startTime during which the job can remain active before it is terminated. Must be a positive integer.",
+								Description: "Specifies the duration (in seconds) since startTime during which the job can remain active before it is terminated. Must be a positive integer. This setting applies only to pods where restartPolicy is OnFailure or Always.",
 								Type:        []string{"integer"},
 								Format:      "int64",
 							},
 						},
 						"backoffLimit": {
 							SchemaProps: spec.SchemaProps{
-								Description: "Number of retries before marking this job as failed. This setting applies only to pods where restartPolicy is OnFailure or Always.",
+								Description: "Number of retries before marking this job as failed.",
 								Type:        []string{"integer"},
 								Format:      "int32",
 							},
hack/../pkg is out of date. Please run hack/update-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
=== RUN   TestSetTypeNames
--- PASS: TestSetTypeNames (0.00s)
=== RUN   TestSetDefaultTFJob
--- PASS: TestSetDefaultTFJob (0.00s)
=== RUN   TestIsChieforMaster
--- PASS: TestIsChieforMaster (0.00s)
PASS
coverage: 27.7% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1	0.044s	coverage: 27.7% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
=== RUN   TestValidateV1TFJobSpec
time="2019-12-09T10:26:01Z" level=error msg="TFJobSpec is not valid: Image is undefined in the container of Worker"
time="2019-12-09T10:26:01Z" level=error msg="TFJobSpec is not valid: There is no container named tensorflow in Worker"
--- PASS: TestValidateV1TFJobSpec (0.00s)
PASS
coverage: 20.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation	0.046s	coverage: 20.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1	[no test files]
=== RUN   TestGenGeneralName
--- PASS: TestGenGeneralName (0.00s)
PASS
coverage: 0.5% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/common/jobcontroller	0.023s	coverage: 0.5% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured	[no test files]
=== RUN   TestCreatePods
--- PASS: TestCreatePods (0.01s)
=== RUN   TestCreateService
time="2019-12-09T10:26:34Z" level=info msg="Controller test-tfjob created service empty_service"
--- PASS: TestCreateService (0.00s)
=== RUN   TestCreateServicesWithControllerRef
time="2019-12-09T10:26:34Z" level=info msg="Controller test-tfjob created service empty_service"
--- PASS: TestCreateServicesWithControllerRef (0.00s)
=== RUN   TestClaimServices
--- PASS: TestClaimServices (0.00s)
PASS
coverage: 24.2% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/control	0.068s	coverage: 24.2% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
=== RUN   TestNormalPath
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (4.214056ms)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: ps-0" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: ps-0" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-1" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-1" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (1.270164ms)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (611.509µs)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (673.01µs)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/worker-2 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=3, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (1.478247ms)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=4, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (588.495µs)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2019-12-09T10:26:41Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=1, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (1.102501ms)" job=default.test-tfjob
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/ps-1 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/worker-0 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/worker-1 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/worker-2 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/worker-3 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="Ignoring inactive pod default/ps-0 in state Succeeded, deletion time <nil>"
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:41Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
E1209 10:26:41.662494   10390 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc0006e3460), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"PS":(*v1.ReplicaSpec)(0xc00070a840), "Worker":(*v1.ReplicaSpec)(0xc00070ab00)}}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"PS":(*v1.ReplicaStatus)(0xc0003c2510), "Worker":(*v1.ReplicaStatus)(0xc0003c25c0)}, StartTime:(*v1.Time)(0xc0007161a0), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'TFJobSucceeded' 'TFJob test-tfjob successfully completed.'
time="2019-12-09T10:26:41Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (3.642967ms)" job=default.test-tfjob
--- PASS: TestNormalPath (0.03s)
=== RUN   TestRun
time="2019-12-09T10:26:41Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:41Z" level=info msg="Starting TFJob controller"
time="2019-12-09T10:26:41Z" level=info msg="Waiting for informer caches to sync"
time="2019-12-09T10:26:41Z" level=info msg="Starting 1 workers"
time="2019-12-09T10:26:41Z" level=info msg="Started workers"
time="2019-12-09T10:26:42Z" level=info msg="Shutting down workers"
--- PASS: TestRun (0.50s)
=== RUN   TestAddTFJob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Starting TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Waiting for informer caches to sync"
time="2019-12-09T10:26:42Z" level=info msg="TFJob test-tfjob is created." job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Starting 1 workers"
time="2019-12-09T10:26:42Z" level=info msg="Started workers"
--- PASS: TestAddTFJob (0.10s)
time="2019-12-09T10:26:42Z" level=info msg="Shutting down workers"
=== RUN   TestCopyLabelsAndAnnotation
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Starting TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Waiting for informer caches to sync"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Need to create new pod: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:42Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (356.553µs)" job=default.test-tfjob
--- PASS: TestCopyLabelsAndAnnotation (0.00s)
=== RUN   TestDeletePodsAndServices
time="2019-12-09T10:26:42Z" level=info msg="Starting 1 workers"
time="2019-12-09T10:26:42Z" level=info msg="Started workers"
time="2019-12-09T10:26:42Z" level=info msg="Shutting down workers"
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (298.459µs)" job=default.test-tfjob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (337.382µs)" job=default.test-tfjob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (287.694µs)" job=default.test-tfjob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (313.257µs)" job=default.test-tfjob
--- PASS: TestDeletePodsAndServices (0.01s)
=== RUN   TestCleanupTFJob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (305.242µs)" job=default.test-tfjob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:42Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:42Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (332.109µs)" job=default.test-tfjob
time="2019-12-09T10:26:42Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:42Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:44Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:44Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (578.563µs)" job=default.test-tfjob
--- PASS: TestCleanupTFJob (2.00s)
=== RUN   TestActiveDeadlineSeconds
time="2019-12-09T10:26:44Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:44Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:44Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:44Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:44Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=4, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:44Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:44Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
--- PASS: TestActiveDeadlineSeconds (2.00s)
=== RUN   TestBackoffForOnFailure
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=warning msg="The restart policy of replica PS of the job test-tfjob is not OnFailure or Always. Not counted in backoff limit." job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (611.912µs)" job=default.test-tfjob
--- PASS: TestBackoffForOnFailure (0.00s)
=== RUN   TestAddPod
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="Starting TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Waiting for informer caches to sync"
time="2019-12-09T10:26:46Z" level=info msg="Starting 1 workers"
time="2019-12-09T10:26:46Z" level=info msg="Started workers"
--- PASS: TestAddPod (0.10s)
=== RUN   TestClusterSpec
--- PASS: TestClusterSpec (0.00s)
=== RUN   TestIsDistributed
--- PASS: TestIsDistributed (0.00s)
time="2019-12-09T10:26:46Z" level=info msg="Shutting down workers"
=== RUN   TestRestartPolicy
--- PASS: TestRestartPolicy (0.00s)
=== RUN   TestExitCode
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="Starting TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Waiting for informer caches to sync"
time="2019-12-09T10:26:46Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Ignoring inactive pod default/worker-0 in state Failed, deletion time <nil>"
time="2019-12-09T10:26:46Z" level=info msg="Pod: default.worker-0 exited with code 130" job=default.test-tfjob replica-type=worker uid=
E1209 10:26:46.398240   10390 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc000414b70), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"Worker":(*v1.ReplicaSpec)(0xc000235080)}}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"Worker":(*v1.ReplicaStatus)(0xc000409f00)}, StartTime:(*v1.Time)(nil), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'ExitedWithCode' 'Pod: default.worker-0 exited with code 130'
time="2019-12-09T10:26:46Z" level=info msg="Need to restart the pod: default.worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=1" job=default.test-tfjob uid=
E1209 10:26:46.398365   10390 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc000414b70), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"Worker":(*v1.ReplicaSpec)(0xc000235080)}}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"Worker":(*v1.ReplicaStatus)(0xc000409f00)}, StartTime:(*v1.Time)(0xc0006dece0), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Warning' 'TFJobRestarting' 'TFJob test-tfjob is restarting because 1 Worker replica(s) failed.'
time="2019-12-09T10:26:46Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2019-12-09T10:26:46Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (531.426µs)" job=default.test-tfjob
--- PASS: TestExitCode (0.00s)
=== RUN   TestAddService
time="2019-12-09T10:26:46Z" level=info msg="Starting 1 workers"
time="2019-12-09T10:26:46Z" level=info msg="Started workers"
time="2019-12-09T10:26:46Z" level=info msg="Shutting down workers"
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="Starting TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Waiting for informer caches to sync"
time="2019-12-09T10:26:46Z" level=info msg="Starting 1 workers"
time="2019-12-09T10:26:46Z" level=info msg="Started workers"
--- PASS: TestAddService (0.10s)
=== RUN   TestFailed
time="2019-12-09T10:26:46Z" level=info msg="Shutting down workers"
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=3, running=0, failed=1" job=default.test-tfjob uid=
E1209 10:26:46.502398   10390 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc0006e3e90), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"Worker":(*v1.ReplicaSpec)(0xc0000f0580)}}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"Worker":(*v1.ReplicaStatus)(0xc00035c960)}, StartTime:(*v1.Time)(0xc000698940), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'TFJobFailed' 'TFJob test-tfjob has failed because 1 Worker replica(s) failed.'
--- PASS: TestFailed (0.00s)
=== RUN   TestStatus
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=2, failed=2" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=2, running=0, failed=2" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=3, running=3, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=4" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=1, failed=1" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=4" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="Creating TFJob controller"
time="2019-12-09T10:26:46Z" level=info msg="Creating Job controller"
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=4" job=default.test-tfjob uid=
time="2019-12-09T10:26:46Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
--- PASS: TestStatus (0.02s)
=== RUN   TestGenOwnerReference
--- PASS: TestGenOwnerReference (0.00s)
=== RUN   TestGenLabels
--- PASS: TestGenLabels (0.00s)
=== RUN   TestConvertTFJobToUnstructured
--- PASS: TestConvertTFJobToUnstructured (0.00s)
PASS
coverage: 53.2% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow	4.935s	coverage: 53.2% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/logger	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util/k8sutil	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util/signals	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util/train	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/version	[no test files]
ignoring pkg/apis/tensorflow/v1/openapi_generated.go
ignoring pkg/apis/tensorflow/v1/zz_generated.deepcopy.go
ignoring pkg/apis/tensorflow/v1/zz_generated.defaults.go
ignoring pkg/util/util.go
Bad response status from coveralls: 422
{"message":"Couldn't find Travis Job 622587028 from https://api.travis-ci.org (Service name: travis-ci). ","error":true}
TravisBuddy Request Identifier: 69e8a340-1a6e-11ea-8ac4-55ffd8dc6fb9

@coveralls

coveralls commented Dec 9, 2019

Coverage Status

Coverage remained the same at 96.512% when pulling 4d13e4f on ChanYiLin:master into f3509e6 on kubeflow:master.

@TravisBuddy

Hey @ChanYiLin,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 18b88040-1a72-11ea-8ac4-55ffd8dc6fb9

@ChanYiLin
Member Author

/retest

if tc.Config.EnableGangScheduling {
minAvailableReplicas := getTotalReplicas(tfjob)
_, err := tc.SyncPodGroup(tfjob, minAvailableReplicas)
err := updateTFJobConditions(tfjob, common.JobFailed, tfJobFailedReason, failureMessage)
if err != nil {
Member


Suggested change
if err != nil {
if err := updateTFJobConditions(
tfjob, common.JobFailed, tfJobFailedReason, failureMessage); err != nil {

Member Author


Thanks, Done!

@gaocegege
Member

Could you please update pytorch-operator, too?

@ChanYiLin
Member Author

Could you please update pytorch-operator, too?

Sure, no problem!

@TravisBuddy

Hey @ChanYiLin,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 52f49570-1af2-11ea-b9ce-6f43500ed087

@TravisBuddy

Hey @ChanYiLin,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: d26bab50-1af6-11ea-b9ce-6f43500ed087

Member

@gaocegege gaocegege left a comment


/lgtm

/assign @johnugeorge @richardsliu

@johnugeorge
Member

Since there is no more reconcile after completion, does it also solve #965?

@gaocegege
Member

I think so.

@ChanYiLin
Member Author

ChanYiLin commented Dec 11, 2019

Since there is no more reconcile after completion, does it also solve #965?

Yes. By moving the isSucceeded(tfjob.Status) || isFailed(tfjob.Status) check to the top of the reconcile flow,
this PR skips the backoffLimit and activeDeadline checks for completed jobs and prevents further events from being appended to them.

Completed jobs will only go through the cleanup process (if cleanup is configured) and then return nil (if there is no status update).
I think this can help with the performance problem mentioned in that issue.

@johnugeorge

@johnugeorge
Member

Does it improve the performance issue in #965? After another look, it seems the return conditions are exactly the same as before. The cleanup check happens on every reconcile call (both before and after the changes), and if the job status hasn't changed, control returns instead of reconciling further (both before and after the changes).
However, this PR does solve the specific ActiveDeadline events issue that was raised, since the succeeded/failed check now happens before the ActiveDeadline check.

I can merge this unless you have some thoughts on it.

@ChanYiLin
Member Author

ChanYiLin commented Dec 12, 2019

@johnugeorge Yes, you can merge this PR first; I think this is enough for this PR. I can then create the same PR for pytorch-operator. Thanks!

As for the performance issue, we can discuss it in #965. However, this PR does prevent checking backoffLimit and activeDeadline for completed jobs, which might help somewhat.

@johnugeorge
Member

Great. Thanks @ChanYiLin
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Jeffwan added a commit to Jeffwan/common that referenced this pull request May 17, 2020
k8s-ci-robot pushed a commit to kubeflow/common that referenced this pull request May 17, 2020
)

* Skip check activeDeadline or backoffLimit if job terminated

This is originally from kubeflow/training-operator#1111

Signed-off-by: Jiaxin Shan <[email protected]>

* Add PodGroup reconcile logic

This is missing in kubeflow/common. We need this to make sure minAvailableReplicas is correct in PodGroup for each training job

Signed-off-by: Jiaxin Shan <[email protected]>
7 participants