Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TrainingJob.reconcile not called periodically #309

Closed
jlewi opened this issue Jan 15, 2018 · 7 comments
Closed

TrainingJob.reconcile not called periodically #309

jlewi opened this issue Jan 15, 2018 · 7 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jan 15, 2018

It looks like as a result of the refactor (#206) TrainingJob.Reconcile doesn't get called periodically and as a result jobs aren't being marked as completed which is why our tests are failing (#293).

The initial design of the TfJob controller relied on periodically polling the world to determine when the job was finished.

@wackxu @DjangoPeng @gaocegege is there a standard way to periodically invoke Reconcile using the informer and controller classes? Do we just need to run a go func to periodically generate events to trigger Reconcile?

@jlewi
Copy link
Contributor Author

jlewi commented Jan 15, 2018

Looks like we can just requeue the work item see #308

@gaocegege
Copy link
Member

@jlewi

In the implementation of kubernetes/sample-controller or caicloud/kubefow-controller, there is no goroutine to update the status. And I think it is not the best practice since if we restart the operator, the goroutine could not be restarted correspondingly.

And we update the status when there is a new event. For example, if there is a pod failed and our operator will get the event. Then the controller gets the TFJob which controls the pod and update its status. Then we do not need the map and goroutine to update status for the TFJob. The owner reference and event-driven controller could help us to do it.

As for the current design, I think it do not work since we do not know the pods' status. Then the TFJob's status is static since there are no input events to update it. This is a bug and I think we should find a place to add the goroutine or add a pod informer in the controller to get the pod events.

@ScorpioCPH
Copy link
Member

@jlewi Hi,

is there a standard way to periodically invoke Reconcile using the informer and controller classes?

I think the standard way is event-driven:

  • We need to set up an event handler for when TFJob resources changed.
  • It will enqueue that TFJob resource for processing.
  • syncTFJob will dequeue and process the single work item (TFJob) and call TrainingJob.Reconcile.

@mqliang
Copy link
Contributor

mqliang commented Jan 15, 2018

@jlewi

is there a standard way to periodically invoke Reconcile using the informer and controller classes?

The Informer framework accept a parameter which defines a "resync" period, and we set it as "30s", see https://github.com/tensorflow/k8s/blob/master/cmd/tf_operator/app/server.go#L83

The problem is: when "resync", the informer will send out "Update" events, but we currently only handle "Add" events and "Delete" events.

			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    controller.enqueueController,
				DeleteFunc: controller.handleDelete,
			},

We should also add a "UpdateFunc" to handle "Update" events like this:

			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    controller.enqueueController,
				UpdateFunc: func(oldObj, newObj interface{}) {
					controller.enqueueController(newObj)
				},
				DeleteFunc: controller.handleDelete,
			},

@jlewi
Copy link
Contributor Author

jlewi commented Jan 15, 2018

Thanks @mqliang that sounds exactly like what we want.

jlewi pushed a commit that referenced this issue Jan 15, 2018
* Add an Update function to the controller.

* The informer periodically generates an Update event but we aren't processing these because we don't have an update function.

* As a result, TrainingJob.reconcile doesn't get called periodically and we aren't properly updating the status of the job.

ref #309
ref #293
@mqliang
Copy link
Contributor

mqliang commented Jan 15, 2018

@jlewi

Some details about using the "UpdateFunc" handler function:

  • Even if a resource NOT actually get updated, the informer framework will still send out "Updated" event every "ResyncDuration" period, but in such a case "reflect.deepEqual(oldObj, newObj) == true". So, if we only want handle a "real" Updated event, we can use a handler like this:
UpdateFunc: func(oldObj, newObj interface{}) {
	if !reflect.DeepEqual(oldObj, newObj) {
		controller.enqueueController(newObj)
	}
}
  • If we only want to handle a updating in Spec(ignoring update in Status), we can use a handler like this:
UpdateFunc: func(oldObj, newObj interface{}) {
	oldTfJob, _ := oldObj.(*tfv1alpha1.TfJob)
	newTfJob, _ := newObj.(*tfv1alpha1.TfJob)
	if !reflect.DeepEqual(oldTfJob.Spec, newTfJob.Spec) {
		controller.enqueueController(newObj)
	}
}
  • if we want periodically reconcile even if there is no update, we can use a handler like this:
UpdateFunc: func(oldObj, newObj interface{}) {
	controller.enqueueController(newObj)
}

@jlewi
Copy link
Contributor Author

jlewi commented Jan 15, 2018

Thanks!.

#313 has been merged so I'm closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants