Add enableDynamicWorker flag #1142

Closed
zhujl1991 wants to merge 2 commits into kubeflow:master from zhujl1991:enableDynamicWorker

zhujl1991 (Member) commented Mar 12, 2020

This completes the first task under Implementation Details in #1141.
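
For context, here is a minimal sketch of how a flag like this might surface on the job spec. The field name `EnableDynamicWorker`, the JSON key `enableDynamicWorker`, and the description string come from the build log and the OpenAPI diff further down this page. The trimmed `TFJobSpec` type, the `omitempty` tag, and the `encodeSpec`/`decodeSpec` helpers are assumptions made for illustration and are not the operator's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TFJobSpec is a trimmed, hypothetical stand-in for the operator's v1.TFJobSpec.
// Only the new field is shown; the omitempty tag is an assumption.
type TFJobSpec struct {
	// A switch to enable dynamic worker
	EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`
}

// encodeSpec marshals a spec to its JSON form (illustrative helper).
func encodeSpec(s TFJobSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decodeSpec parses a JSON document into a spec (illustrative helper).
func decodeSpec(doc string) (TFJobSpec, error) {
	var s TFJobSpec
	err := json.Unmarshal([]byte(doc), &s)
	return s, err
}

func main() {
	on, _ := encodeSpec(TFJobSpec{EnableDynamicWorker: true})
	off, _ := encodeSpec(TFJobSpec{})
	fmt.Println(on)  // {"enableDynamicWorker":true}
	fmt.Println(off) // {} (the zero value is omitted because of omitempty)
}
```

With a tag like this, specs that do not set the flag serialize exactly as before. Note that the codegen check in the build log compares the committed generated files against a fresh generator run: the committed `openapi_generated.go` contains the `enableDynamicWorker` entry while the freshly generated copy does not, which is why the log asks for `hack/update-codegen.sh` to be rerun.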


@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign scorpiocph
You can assign the PR to them by writing /assign @scorpiocph in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @zhujl1991. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls

coveralls commented Mar 12, 2020

Coverage remained the same at 96.512% when pulling fa96bb1 on zhujl1991:enableDynamicWorker into f6433c5 on kubeflow:master.

@TravisBuddy

Travis tests have failed

Hey @zhujl1991,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

1st Build

View build log

hack/verify-codegen.sh
Generating deepcopy funcs
Generating clientset for tensorflow:v1 at github.com/kubeflow/tf-operator/pkg/client/clientset
Generating listers for tensorflow:v1 at github.com/kubeflow/tf-operator/pkg/client/listers
Generating informers for tensorflow:v1 at github.com/kubeflow/tf-operator/pkg/client/informers
Generating defaulters for tensorflow/v1
Generating OpenAPI specification for tensorflow/v1
diffing hack/../pkg against freshly generated codegen
diff -Naupr hack/../pkg/apis/tensorflow/v1/openapi_generated.go hack/../_tmp/pkg/apis/tensorflow/v1/openapi_generated.go
--- hack/../pkg/apis/tensorflow/v1/openapi_generated.go	2020-03-12 18:54:36.112918838 +0000
+++ hack/../_tmp/pkg/apis/tensorflow/v1/openapi_generated.go	2020-03-12 18:53:55.000000000 +0000
@@ -164,13 +164,6 @@ func GetOpenAPIDefinitions(ref common.Re
 								},
 							},
 						},
-						"enableDynamicWorker": {
-							SchemaProps: spec.SchemaProps{
-								Description: "A switch to enable dynamic worker",
-								Type:        []string{"boolean"},
-								Format:      "",
-							},
-						},
 					},
 					Required: []string{"tfReplicaSpecs"},
 				},
hack/../pkg is out of date. Please run hack/update-codegen.sh
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
=== RUN   TestSetTypeNames
--- PASS: TestSetTypeNames (0.00s)
=== RUN   TestSetDefaultTFJob
--- PASS: TestSetDefaultTFJob (0.00s)
=== RUN   TestIsChieforMaster
--- PASS: TestIsChieforMaster (0.00s)
PASS
coverage: 27.7% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1	0.037s	coverage: 27.7% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
=== RUN   TestValidateV1TFJobSpec
time="2020-03-12T18:56:20Z" level=error msg="TFJobSpec is not valid: Image is undefined in the container of Worker"
time="2020-03-12T18:56:20Z" level=error msg="TFJobSpec is not valid: There is no container named tensorflow in Worker"
--- PASS: TestValidateV1TFJobSpec (0.00s)
PASS
coverage: 20.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation	0.035s	coverage: 20.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1	[no test files]
=== RUN   TestGenGeneralName
--- PASS: TestGenGeneralName (0.00s)
PASS
coverage: 0.5% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/common/jobcontroller	0.016s	coverage: 0.5% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured	[no test files]
=== RUN   TestCreatePods
--- PASS: TestCreatePods (0.01s)
=== RUN   TestCreateService
time="2020-03-12T18:56:45Z" level=info msg="Controller test-tfjob created service empty_service"
--- PASS: TestCreateService (0.00s)
=== RUN   TestCreateServicesWithControllerRef
time="2020-03-12T18:56:45Z" level=info msg="Controller test-tfjob created service empty_service"
--- PASS: TestCreateServicesWithControllerRef (0.00s)
=== RUN   TestClaimServices
--- PASS: TestClaimServices (0.00s)
PASS
coverage: 24.4% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/control	0.051s	coverage: 24.4% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
=== RUN   TestNormalPath
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (716.003µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: ps-0" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: ps-0" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-1" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-1" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (883.371µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (399.877µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-2" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (663.243µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/worker-2 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=3, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (783.927µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=4, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (526.919µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=1, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: worker-3" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:51Z" level=info msg="Need to create new pod: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="need to create new service: ps-1" job=default.test-tfjob replica-type=ps uid=
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (731.892µs)" job=default.test-tfjob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/worker-2 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/worker-3 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/ps-0 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/ps-1 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/worker-0 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="Ignoring inactive pod default/worker-1 in state Succeeded, deletion time <nil>"
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
E0312 18:56:51.427823   10480 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc0005adcf0), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"PS":(*v1.ReplicaSpec)(0xc0006a8580), "Worker":(*v1.ReplicaSpec)(0xc0006a8840)}, EnableDynamicWorker:false}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"PS":(*v1.ReplicaStatus)(0xc00035cbc0), "Worker":(*v1.ReplicaStatus)(0xc00035cc10)}, StartTime:(*v1.Time)(0xc00067f3e0), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'TFJobSucceeded' 'TFJob test-tfjob successfully completed.'
time="2020-03-12T18:56:51Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (4.060718ms)" job=default.test-tfjob
--- PASS: TestNormalPath (0.02s)
=== RUN   TestRun
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="Starting TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Waiting for informer caches to sync"
time="2020-03-12T18:56:51Z" level=info msg="Starting 1 workers"
time="2020-03-12T18:56:51Z" level=info msg="Started workers"
time="2020-03-12T18:56:51Z" level=info msg="Shutting down workers"
--- PASS: TestRun (0.50s)
=== RUN   TestAddTFJob
time="2020-03-12T18:56:51Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:51Z" level=info msg="TFJob test-tfjob is created." job=default.test-tfjob uid=
time="2020-03-12T18:56:51Z" level=info msg="Starting TFJob controller"
time="2020-03-12T18:56:51Z" level=info msg="Waiting for informer caches to sync"
time="2020-03-12T18:56:52Z" level=info msg="Starting 1 workers"
time="2020-03-12T18:56:52Z" level=info msg="Started workers"
--- PASS: TestAddTFJob (0.10s)
=== RUN   TestCopyLabelsAndAnnotation
time="2020-03-12T18:56:52Z" level=info msg="Shutting down workers"
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Need to create new pod: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:52Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (269.503µs)" job=default.test-tfjob
--- PASS: TestCopyLabelsAndAnnotation (0.00s)
=== RUN   TestDeletePodsAndServices
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (296.037µs)" job=default.test-tfjob
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (267.971µs)" job=default.test-tfjob
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (249.148µs)" job=default.test-tfjob
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (272.488µs)" job=default.test-tfjob
--- PASS: TestDeletePodsAndServices (0.00s)
=== RUN   TestCleanupTFJob
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (252.724µs)" job=default.test-tfjob
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:52Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (236.558µs)" job=default.test-tfjob
time="2020-03-12T18:56:52Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:52Z" level=info msg="Starting TFJob controller"
time="2020-03-12T18:56:52Z" level=info msg="Waiting for informer caches to sync"
time="2020-03-12T18:56:52Z" level=info msg="Starting 1 workers"
time="2020-03-12T18:56:52Z" level=info msg="Started workers"
time="2020-03-12T18:56:52Z" level=info msg="Shutting down workers"
time="2020-03-12T18:56:54Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:54Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (401.84µs)" job=default.test-tfjob
--- PASS: TestCleanupTFJob (2.00s)
=== RUN   TestActiveDeadlineSeconds
time="2020-03-12T18:56:54Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:54Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:54Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:54Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=4, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:54Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:54Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:54Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
--- PASS: TestActiveDeadlineSeconds (2.00s)
=== RUN   TestBackoffForOnFailure
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=warning msg="The restart policy of replica PS of the job test-tfjob is not OnFailure or Always. Not counted in backoff limit." job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (842.161µs)" job=default.test-tfjob
--- PASS: TestBackoffForOnFailure (0.00s)
=== RUN   TestAddPod
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="Starting TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Waiting for informer caches to sync"
time="2020-03-12T18:56:56Z" level=info msg="Starting 1 workers"
time="2020-03-12T18:56:56Z" level=info msg="Started workers"
--- PASS: TestAddPod (0.11s)
=== RUN   TestClusterSpec
--- PASS: TestClusterSpec (0.00s)
=== RUN   TestIsDistributed
--- PASS: TestIsDistributed (0.00s)
=== RUN   TestRestartPolicy
--- PASS: TestRestartPolicy (0.00s)
=== RUN   TestExitCode
time="2020-03-12T18:56:56Z" level=info msg="Shutting down workers"
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="Reconcile TFJobs test-tfjob" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Ignoring inactive pod default/worker-0 in state Failed, deletion time <nil>"
time="2020-03-12T18:56:56Z" level=info msg="Pod: default.worker-0 exited with code 130" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:56Z" level=info msg="Starting TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Waiting for informer caches to sync"
E0312 18:56:56.158229   10480 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc0006ab590), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"Worker":(*v1.ReplicaSpec)(0xc0009838c0)}, EnableDynamicWorker:false}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"Worker":(*v1.ReplicaStatus)(0xc0004c7480)}, StartTime:(*v1.Time)(nil), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'ExitedWithCode' 'Pod: default.worker-0 exited with code 130'
time="2020-03-12T18:56:56Z" level=info msg="Need to restart the pod: default.worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=1" job=default.test-tfjob uid=
E0312 18:56:56.158441   10480 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc0006ab590), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"Worker":(*v1.ReplicaSpec)(0xc0009838c0)}, EnableDynamicWorker:false}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"Worker":(*v1.ReplicaStatus)(0xc0004c7480)}, StartTime:(*v1.Time)(0xc000242460), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Warning' 'TFJobRestarting' 'TFJob test-tfjob is restarting because 1 Worker replica(s) failed.'
time="2020-03-12T18:56:56Z" level=info msg="need to create new service: worker-0" job=default.test-tfjob replica-type=worker uid=
time="2020-03-12T18:56:56Z" level=info msg="Finished syncing tfjob \"default/test-tfjob\" (723.187µs)" job=default.test-tfjob
--- PASS: TestExitCode (0.00s)
=== RUN   TestAddService
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="Starting TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Waiting for informer caches to sync"
time="2020-03-12T18:56:56Z" level=info msg="Starting 1 workers"
time="2020-03-12T18:56:56Z" level=info msg="Started workers"
time="2020-03-12T18:56:56Z" level=info msg="Shutting down workers"
time="2020-03-12T18:56:56Z" level=info msg="Starting 1 workers"
time="2020-03-12T18:56:56Z" level=info msg="Started workers"
--- PASS: TestAddService (0.10s)
=== RUN   TestFailed
time="2020-03-12T18:56:56Z" level=info msg="Shutting down workers"
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=3, running=0, failed=1" job=default.test-tfjob uid=
E0312 18:56:56.261388   10480 event.go:259] Could not construct reference to: '&v1.TFJob{TypeMeta:v1.TypeMeta{Kind:"TFJob", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"test-tfjob", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.TFJobSpec{ActiveDeadlineSeconds:(*int64)(nil), BackoffLimit:(*int32)(nil), CleanPodPolicy:(*v1.CleanPodPolicy)(0xc0003a41b0), TTLSecondsAfterFinished:(*int32)(nil), TFReplicaSpecs:map[v1.TFReplicaType]*v1.ReplicaSpec{"Worker":(*v1.ReplicaSpec)(0xc0004d6840)}, EnableDynamicWorker:false}, Status:v1.JobStatus{Conditions:[]v1.JobCondition(nil), ReplicaStatuses:map[v1.ReplicaType]*v1.ReplicaStatus{"Worker":(*v1.ReplicaStatus)(0xc0005ffc10)}, StartTime:(*v1.Time)(0xc000365ac0), CompletionTime:(*v1.Time)(nil), LastReconcileTime:(*v1.Time)(nil)}}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'TFJobFailed' 'TFJob test-tfjob has failed because 1 Worker replica(s) failed.'
--- PASS: TestFailed (0.00s)
=== RUN   TestStatus
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=2, failed=2" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=2, running=0, failed=2" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=3, running=3, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=4" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=1, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=1, failed=1" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=0, running=0, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=4" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="Creating TFJob controller"
time="2020-03-12T18:56:56Z" level=info msg="Creating Job controller"
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Chief expected=1, running=0, failed=1" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=Worker expected=4, running=0, failed=4" job=default.test-tfjob uid=
time="2020-03-12T18:56:56Z" level=info msg="TFJob=test-tfjob, ReplicaType=PS expected=2, running=2, failed=0" job=default.test-tfjob uid=
--- PASS: TestStatus (0.01s)
=== RUN   TestGenOwnerReference
--- PASS: TestGenOwnerReference (0.00s)
=== RUN   TestGenLabels
--- PASS: TestGenLabels (0.00s)
=== RUN   TestConvertTFJobToUnstructured
--- PASS: TestConvertTFJobToUnstructured (0.00s)
PASS
coverage: 53.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow	4.905s	coverage: 53.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/logger	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util/k8sutil	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util/signals	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/util/train	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/version	[no test files]
ignoring pkg/apis/tensorflow/v1/openapi_generated.go
ignoring pkg/apis/tensorflow/v1/zz_generated.deepcopy.go
ignoring pkg/apis/tensorflow/v1/zz_generated.defaults.go
ignoring pkg/util/util.go
Job #2738.1
https://coveralls.io/jobs/60036467
TravisBuddy Request Identifier: 41560cf0-6493-11ea-b857-47bc0e3c428e

@ChanYiLin
Member

/lgtm

@zhujl1991
Member Author

/retest

@k8s-ci-robot

@zhujl1991: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zhujl1991
Member Author

Can anyone give some instructions to fix the test? Thanks. @gaocegege @ChanYiLin

@ChanYiLin
Member

I think the error is related to the Python SDK.
@jinchihe can you help?

@jinchihe
Member

/retest

@gaocegege
Member

/ok-to-test

@jinchihe
Member

This seems to be a Kubernetes Python client bug; I will investigate further.

           obj_dict = {obj.attribute_map[attr]: getattr(obj, attr)
>                       for attr, _ in six.iteritems(obj.openapi_types)
                       if getattr(obj, attr) is not None}
E           AttributeError: 'V1TFJob' object has no attribute 'openapi_types'
/usr/local/lib/python3.5/dist-packages/kubernetes/client/api_client.py:230: AttributeError
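
The traceback shows the client's serializer reading `obj.openapi_types` and failing because `V1TFJob` lacks that attribute. A minimal sketch of that failure mode follows, assuming (this thread does not confirm it) that the SDK model was generated with a tool that names the type mapping `swagger_types` instead. `LegacyModel`, `serialize_strict`, and `serialize_tolerant` are hypothetical names used only for illustration; they are not part of the Kubernetes client or the TFJob SDK, and this is not necessarily how #1143 resolved the problem.

```python
class LegacyModel:
    """Hypothetical stand-in for a generated model that exposes
    `swagger_types` rather than `openapi_types` (an assumption)."""

    swagger_types = {"api_version": "str", "kind": "str"}
    attribute_map = {"api_version": "apiVersion", "kind": "kind"}

    def __init__(self, api_version=None, kind=None):
        self.api_version = api_version
        self.kind = kind


def serialize_strict(obj):
    """Same shape as the comprehension in the traceback: it requires
    `openapi_types` and raises AttributeError when it is missing."""
    return {
        obj.attribute_map[attr]: getattr(obj, attr)
        for attr in obj.openapi_types
        if getattr(obj, attr) is not None
    }


def serialize_tolerant(obj):
    """Illustrative workaround: accept either attribute name."""
    types = getattr(obj, "openapi_types", None)
    if types is None:
        types = obj.swagger_types
    return {
        obj.attribute_map[attr]: getattr(obj, attr)
        for attr in types
        if getattr(obj, attr) is not None
    }


if __name__ == "__main__":
    job = LegacyModel(api_version="kubeflow.org/v1", kind="TFJob")
    try:
        serialize_strict(job)
    except AttributeError as err:
        print("strict serializer failed:", err)
    print(serialize_tolerant(job))
```

If this assumption holds, the mismatch can be removed either by aligning the client library version with the generator used for the SDK models, or by regenerating the models so that both sides agree on the attribute name.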

@jinchihe
Member

@zhujl1991 The CI problem has been fixed in #1143; please rebase, thanks.

@zhujl1991 force-pushed the enableDynamicWorker branch from 71ffee1 to fa96bb1 on March 13, 2020 20:06
@k8s-ci-robot

New changes are detected. LGTM label has been removed.

@k8s-ci-robot removed the lgtm label on Mar 13, 2020
@zhujl1991
Member Author

@zhujl1991 The CI problem has been fixed in #1143; please rebase, thanks.

I rebased, but the tests still fail.

@jinchihe
Member

The SDK CI tests have already passed.

/retest

@zhujl1991
Member Author

Looks like the tests are flaky.
/retest

@zhujl1991
Member Author

@gaocegege @richardsliu @johnugeorge Can you guys take a look? Thanks.

@zhujl1991
Member Author

I synced with @gaocegege offline. Instead of submitting multiple PRs to finish the feature here #1142, I'll do this in one PR with multiple commits.

@gaocegege
Member

@zhujl1991 Thanks. Then will you submit all changes in this PR?

@zhujl1991
Member Author

@zhujl1991 Thanks. Then will you submit all changes in this PR?

The changes are in #1149. Closing this PR.

@zhujl1991 closed this on Mar 23, 2020