fix the reconcile flow #1111
Conversation
/assign @gaocegege @richardsliu
Travis tests have failed. Hey @ChanYiLin, 1st Build: hack/verify-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
TravisBuddy Request Identifier: 69e8a340-1a6e-11ea-8ac4-55ffd8dc6fb9
Hey @ChanYiLin, TravisBuddy Request Identifier: 18b88040-1a72-11ea-8ac4-55ffd8dc6fb9
/retest
if tc.Config.EnableGangScheduling {
    minAvailableReplicas := getTotalReplicas(tfjob)
    _, err := tc.SyncPodGroup(tfjob, minAvailableReplicas)

err := updateTFJobConditions(tfjob, common.JobFailed, tfJobFailedReason, failureMessage)
if err != nil {
Suggested change (replacing "if err != nil {"):

if err := updateTFJobConditions(
    tfjob, common.JobFailed, tfJobFailedReason, failureMessage); err != nil {
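The suggestion above uses Go's statement-scoped error idiom: declaring err inside the if keeps it from shadowing or leaking into the surrounding function. A minimal sketch of the pattern, where updateConditions is a hypothetical stand-in for updateTFJobConditions (not the operator's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// updateConditions is an illustrative stand-in for updateTFJobConditions.
// It fails when the message is empty, just to exercise the error path.
func updateConditions(reason, message string) error {
	if message == "" {
		return errors.New("empty failure message")
	}
	return nil
}

func main() {
	// err is scoped to the if statement, as the reviewer suggested.
	if err := updateConditions("TFJobFailed", ""); err != nil {
		fmt.Println("got error:", err)
	}
	if err := updateConditions("TFJobFailed", "exceeded backoff limit"); err != nil {
		fmt.Println("unexpected:", err)
	} else {
		fmt.Println("conditions updated")
	}
}
```

Besides being more compact, this form makes it impossible to accidentally check a stale err from an earlier call.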
Thanks, Done!
Could you please update pytorch-operator, too?
Sure, no problem!
Hey @ChanYiLin, TravisBuddy Request Identifier: 52f49570-1af2-11ea-b9ce-6f43500ed087
…e and backofflimit
Hey @ChanYiLin, TravisBuddy Request Identifier: d26bab50-1af6-11ea-b9ce-6f43500ed087
/lgtm
/assign @johnugeorge @richardsliu
Since there is no more reconcile after completion, does it also solve #965?
I think so.
Yes. Completed jobs will only do the cleanup process (if cleanup is set) and then return nil (if there is no status update).
Does it improve the performance issue in #965? After a relook, it looks like the return conditions are exactly the same as before. The cleanup check happens on every reconcile call (both before and after the changes), and if the job status hasn't changed, control returns instead of reconciling further (both before and after the changes). I can merge this unless you have some thoughts on it.
@johnugeorge Yes, you can merge this PR first; I think this is enough for this PR. I can also create the same PR for pytorch-operator then. Thanks! For the performance issue, we can discuss it in #965. However, this PR does prevent checking backoffLimit and activeDeadline for completed jobs, which might help somewhat.
Great. Thanks @ChanYiLin
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: johnugeorge. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This is originally from kubeflow/training-operator#1111. Signed-off-by: Jiaxin Shan <[email protected]>
* Skip check of activeDeadline or backoffLimit if job terminated. This is originally from kubeflow/training-operator#1111. Signed-off-by: Jiaxin Shan <[email protected]>
* Add PodGroup reconcile logic. This is missing in kubeflow/common. We need this to make sure minAvailableReplicas is correct in the PodGroup for each training job. Signed-off-by: Jiaxin Shan <[email protected]>
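The PodGroup commit above computes minAvailableReplicas from the job's total replica count, which the earlier diff passes to SyncPodGroup for gang scheduling. A rough sketch of that computation, using a simplified ReplicaSpec type (the real operator uses the common API types, not these):

```go
package main

import "fmt"

// ReplicaSpec is a simplified stand-in for the operator's replica spec type.
type ReplicaSpec struct {
	Replicas int32
}

// getTotalReplicas sums replicas across all roles (e.g. PS, Worker),
// mirroring how minAvailable is derived for the PodGroup: with gang
// scheduling, all of a job's pods should be schedulable together.
func getTotalReplicas(specs map[string]ReplicaSpec) int32 {
	var total int32
	for _, s := range specs {
		total += s.Replicas
	}
	return total
}

func main() {
	specs := map[string]ReplicaSpec{
		"PS":     {Replicas: 2},
		"Worker": {Replicas: 4},
	}
	// The resulting value would be passed as minAvailable when syncing
	// the PodGroup, so the scheduler waits until 6 pods can start.
	fmt.Println(getTotalReplicas(specs)) // 6
}
```

If minAvailable were smaller than the total, the gang scheduler could start a partial job that then deadlocks waiting for missing replicas, which is why keeping this value correct per job matters.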
If the TFJob has already terminated, we don't need to check activeDeadline and backoffLimit.
Originally, even after the job had terminated, the controller still checked activeDeadline and appended the resulting event to the job. As a result, an event saying the job failed could appear after it had already succeeded, and the tf-operator log would keep showing the failure message about the past activeDeadline.
In this PR, I reorder the checks so that if the job has terminated (Succeeded or Failed), reconcile returns early instead of continuing.
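The reordering described above can be sketched as follows. This is a minimal illustration of the control flow only; the status names and the reconcile signature are simplified stand-ins for the operator's real types:

```go
package main

import "fmt"

// JobStatus is a simplified stand-in for the job's condition type.
type JobStatus string

const (
	JobRunning   JobStatus = "Running"
	JobSucceeded JobStatus = "Succeeded"
	JobFailed    JobStatus = "Failed"
)

func isTerminated(s JobStatus) bool {
	return s == JobSucceeded || s == JobFailed
}

// reconcile sketches the fixed ordering: terminated jobs only run
// cleanup (if configured) and return, so the activeDeadline and
// backoffLimit checks are never reached and no spurious failure
// events are appended after the job has already succeeded.
func reconcile(status JobStatus, cleanup bool) string {
	if isTerminated(status) {
		if cleanup {
			return "cleanup"
		}
		return "no-op"
	}
	// Only live jobs fall through to the deadline/backoff checks
	// (pastActiveDeadline / exceedsBackoffLimit would run here).
	return "checked activeDeadline and backoffLimit"
}

func main() {
	fmt.Println(reconcile(JobSucceeded, true))  // cleanup
	fmt.Println(reconcile(JobFailed, false))    // no-op
	fmt.Println(reconcile(JobRunning, false))   // checked activeDeadline and backoffLimit
}
```

Before the fix, the deadline check ran unconditionally, which is what produced "job failed" events and log messages for jobs that had already completed.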