-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TrainingJob.reconcile not called periodically #309
Comments
Looks like we can just requeue the work item see #308 |
In the implementation of kubernetes/sample-controller or caicloud/kubefow-controller, there is no goroutine to update the status. And I think it is not the best practice since if we restart the operator, the goroutine could not be restarted correspondingly. And we update the status when there is a new event. For example, if there is a pod failed and our operator will get the event. Then the controller gets the TFJob which controls the pod and update its status. Then we do not need the map and goroutine to update status for the TFJob. The owner reference and event-driven controller could help us to do it. As for the current design, I think it do not work since we do not know the pods' status. Then the TFJob's status is static since there are no input events to update it. This is a bug and I think we should find a place to add the goroutine or add a pod informer in the controller to get the pod events. |
@jlewi Hi,
I think the standard way is
|
The Informer framework accept a parameter which defines a "resync" period, and we set it as "30s", see https://github.com/tensorflow/k8s/blob/master/cmd/tf_operator/app/server.go#L83 The problem is: when "resync", the informer will send out "Update" events, but we currently only handle "Add" events and "Delete" events.
We should also add a "UpdateFunc" to handle "Update" events like this:
|
Thanks @mqliang that sounds exactly like what we want. |
* Add an Update function to the controller. * The informer periodically generates an Update event but we aren't processing these because we don't have an update function. * As a result, TrainingJob.reconcile doesn't get called periodically and we aren't properly updating the status of the job. ref #309 ref #293
Some details about using the "UpdateFunc" handler function:
|
Thanks!. #313 has been merged so I'm closing this issue. |
It looks like as a result of the refactor (#206) TrainingJob.Reconcile doesn't get called periodically and as a result jobs aren't being marked as completed which is why our tests are failing (#293).
The initial design of the TfJob controller relied on periodically polling the world to determine when the job was finished.
@wackxu @DjangoPeng @gaocegege is there a standard way to periodically invoke Reconcile using the informer and controller classes? Do we just need to run a go func to periodically generate events to trigger Reconcile?
The text was updated successfully, but these errors were encountered: