Set completion time when job exceed specified deadline. #1150

Merged
merged 1 commit into from
Apr 9, 2020

Conversation

SimonCqk
Contributor

If a job stays in Created and all of its pods wait in the Pending state until the job exceeds its active deadline (if job.spec.ActiveDeadlineSeconds has been set), the job is reconciled as past its active deadline and cleaned up. However, job.status.completionTime remains a nil pointer, which causes a panic in the JobController.cleanupJob func.

@kubeflow-bot

This change is Reviewable

@k8s-ci-robot

Hi @SimonCqk. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SimonCqk
Contributor Author

/assign @johnugeorge

@TravisBuddy

Travis tests have failed

Hey @SimonCqk,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

1st Build

View build log

go build -o tf-operator.v1 github.com/kubeflow/tf-operator/cmd/tf-operator.v1
# github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow
pkg/controller.v1/tensorflow/controller.go:426:3: undefined: jobStatus
golangci-lint run ./...
pkg/controller.v1/tensorflow/controller.go:426:3: undeclared name: `jobStatus` (typecheck)
		jobStatus.CompletionTime = &now
		^
goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/util/*.go,pkg/util/*/*.go,pkg/apis/tensorflow/*/zz_generated.*.go,pkg/apis/tensorflow/*/*_generated.go,pkg/apis/common/*/zz_generated.*.go,pkg/apis/common/*/*_generated.go"
=== RUN   TestSetTypeNames
--- PASS: TestSetTypeNames (0.00s)
=== RUN   TestSetDefaultTFJob
--- PASS: TestSetDefaultTFJob (0.00s)
=== RUN   TestIsChieforMaster
--- PASS: TestIsChieforMaster (0.00s)
PASS
coverage: 27.7% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1	0.034s	coverage: 27.7% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
=== RUN   TestValidateV1TFJobSpec
time="2020-03-31T08:15:18Z" level=error msg="TFJobSpec is not valid: Image is undefined in the container of Worker"
time="2020-03-31T08:15:18Z" level=error msg="TFJobSpec is not valid: There is no container named tensorflow in Worker"
--- PASS: TestValidateV1TFJobSpec (0.00s)
PASS
coverage: 20.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation	0.032s	coverage: 20.1% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1	[no test files]
=== RUN   TestGenGeneralName
--- PASS: TestGenGeneralName (0.00s)
PASS
coverage: 0.5% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/common/jobcontroller	0.015s	coverage: 0.5% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
?   	github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil	[no test files]
?   	github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured	[no test files]
=== RUN   TestCreatePods
--- PASS: TestCreatePods (0.01s)
=== RUN   TestCreateService
time="2020-03-31T08:15:43Z" level=info msg="Controller test-tfjob created service empty_service"
--- PASS: TestCreateService (0.00s)
=== RUN   TestCreateServicesWithControllerRef
time="2020-03-31T08:15:43Z" level=info msg="Controller test-tfjob created service empty_service"
--- PASS: TestCreateServicesWithControllerRef (0.00s)
=== RUN   TestClaimServices
--- PASS: TestClaimServices (0.00s)
PASS
coverage: 24.4% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
ok  	github.com/kubeflow/tf-operator/pkg/control	0.052s	coverage: 24.4% of statements in github.com/kubeflow/tf-operator/pkg/apis/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/apis/tensorflow/validation, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/scheme, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/internalinterfaces, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow, github.com/kubeflow/tf-operator/pkg/client/informers/externalversions/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/client/listers/tensorflow/v1, github.com/kubeflow/tf-operator/pkg/common/jobcontroller, github.com/kubeflow/tf-operator/pkg/common/util/v1/testutil, github.com/kubeflow/tf-operator/pkg/common/util/v1/unstructured, github.com/kubeflow/tf-operator/pkg/control, github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow, github.com/kubeflow/tf-operator/pkg/logger, github.com/kubeflow/tf-operator/pkg/util, github.com/kubeflow/tf-operator/pkg/util/k8sutil, github.com/kubeflow/tf-operator/pkg/util/signals, github.com/kubeflow/tf-operator/pkg/util/train, github.com/kubeflow/tf-operator/pkg/version
FAIL	github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow [build failed]
exit status 2: warning: no packages being tested depend on matches for pattern github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/fake
warning: no packages being tested depend on matches for pattern github.com/kubeflow/tf-operator/pkg/client/clientset/versioned/typed/tensorflow/v1/fake
warning: no packages being tested depend on matches for pattern github.com/kubeflow/tf-operator/pkg/util
warning: no packages being tested depend on matches for pattern github.com/kubeflow/tf-operator/pkg/util/signals
warning: no packages being tested depend on matches for pattern github.com/kubeflow/tf-operator/pkg/version
# github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow [github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.test]
pkg/controller.v1/tensorflow/controller.go:426: undefined: jobStatus in jobStatus.CompletionTime
TravisBuddy Request Identifier: d3601670-7327-11ea-8d6e-5f9eb1f9039f

Member

@gaocegege gaocegege left a comment

/lgtm
/assign @ChanYiLin

Thanks for your contribution! 🎉 👍

@coveralls

coveralls commented Mar 31, 2020

Coverage Status

Coverage remained the same at 96.512% when pulling 21e7089 on SimonCqk:master into 95a0f62 on kubeflow:master.

@ChanYiLin
Member

ChanYiLin commented Apr 6, 2020

Hi,
I think I fixed this issue before.
If tfJobExceedsLimit is true, it will
  1. deletePodsAndServices
    https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/controller.go#L432
  2. record the CompletionTime
    https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/controller.go#L447

in the following if-else
so job.status.completionTime will not be a nil pointer in this case.
Please correct me if I missed something.

@SimonCqk
Contributor Author

SimonCqk commented Apr 7, 2020

Hi, the panic happens inside cleanupTFJob, while the completion time has not been set yet, so the job.status.completionTime = now assignment should be moved up.

@johnugeorge
Member

/lgtm

@ChanYiLin
Member

ChanYiLin commented Apr 7, 2020

OK, I got it!
Then the fix should look like this:

if tfJobExceedsLimit {
	// move the following code snippet to here
	if tfjob.Status.CompletionTime == nil {
		now := metav1.Now()
		tfjob.Status.CompletionTime = &now
	}

	// If the TFJob exceeds backoff limit or is past active deadline
	// delete all pods and services, then set the status to failed
	if err := tc.deletePodsAndServices(tfjob, pods); err != nil {
		return err
	}
	if err := tc.cleanupTFJob(tfjob); err != nil {
		return err
	}
	if tc.Config.EnableGangScheduling {
		if err := tc.DeletePodGroup(tfjob); err != nil {
			return err
		}
	}
	tc.Recorder.Event(tfjob, v1.EventTypeNormal, tfJobFailedReason, failureMessage)

	// remove the following code snippet
	// if tfjob.Status.CompletionTime == nil {
	//	now := metav1.Now()
	//	tfjob.Status.CompletionTime = &now
	// }
	...
	...
}

In your fix, there are two issues:

  1. It updates the finish time on every reconcile, because you didn't add the if tfjob.Status.CompletionTime == nil condition.
  2. You also have to do the same thing for the if exceedsBackoffLimit || pastBackoffLimit condition, because it is the same situation as tfJobExceedsLimit.
    Thus, I think adding the code snippet to the if tfJobExceedsLimit{...} scope as above is much better.
    Then you don't need to set the finish time in both branches, if exceedsBackoffLimit || pastBackoffLimit{...} and else if tc.pastActiveDeadline(tfjob){...}.

@SimonCqk
Contributor Author

SimonCqk commented Apr 7, 2020

I got you, I'll fix this soon. THX!

@ChanYiLin
Member

/LGTM
/approve
Thanks for your contribution !

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChanYiLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SimonCqk
Contributor Author

SimonCqk commented Apr 9, 2020

/retest

@k8s-ci-robot

@SimonCqk: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SimonCqk
Contributor Author

SimonCqk commented Apr 9, 2020

@gaocegege @jimexist The test failed and I am not a trusted user.

@ChanYiLin
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot merged commit a74b423 into kubeflow:master Apr 9, 2020
8 participants