Use K8s Garbage Collection #42
Thanks for reporting this. You are correct that the method FullyCollect won't work because it depends on the app label, which currently isn't applied. It looks like the function k8sutil.LabelsForJob is never called.
This is by design. We leave the pods around until the TfJob is deleted. This allows logs to be fetched via kubectl logs. This mirrors the behavior of the built-in K8s JobController.
Ideally we should be using K8s built-in support for Garbage Collection. If we set owner references and the deletion policy correctly, then I think K8s will automatically delete all resources when the TfJob is deleted. I think Garbage Collection originally didn't support custom resources; that's probably why the CoreOS etcd operator added gc.go. But it looks like this has been fixed, so I think we need to update the code to make sure OwnerReferences are properly set on all resources created by the TfJob.
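For illustration, here is a minimal sketch (not the actual tf-operator code) of how an OwnerReference pointing at the TfJob could be attached to a Pod the controller creates; the API group/version string, the helper names, and the pod layout are assumptions:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// asOwner builds an OwnerReference that points at the parent TfJob. Once this
// is set on a child object, the K8s garbage collector deletes the child when
// the TfJob is deleted.
func asOwner(jobName string, jobUID types.UID) metav1.OwnerReference {
	controller := true
	blockOwnerDeletion := true
	return metav1.OwnerReference{
		APIVersion:         "kubeflow.org/v1alpha1", // assumed group/version for TfJob
		Kind:               "TfJob",
		Name:               jobName,
		UID:                jobUID,
		Controller:         &controller,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}
}

// newWorkerPod shows where the reference would be attached: on every resource
// the controller creates on behalf of the TfJob (pods, services, and so on).
func newWorkerPod(jobName string, jobUID types.UID) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:            jobName + "-worker-0", // hypothetical naming scheme
			OwnerReferences: []metav1.OwnerReference{asOwner(jobName, jobUID)},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "tensorflow", Image: "tensorflow/tensorflow"},
			},
		},
	}
}
```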
Garbage collection for CRDs should be in 1.8.
@enisoc @kow3ns I could use your advice on how to properly clean up resources. Currently the TfJob creates a bunch of resources.
These resources are explicitly deleted by the TfJob CRD controller in response to a delete event. Questions:
Thanks
You just need to add an entry to OwnerReferences. Here's where ReplicaSet does that:
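As a rough sketch of the cascading behavior once ownerReferences are set (the function name here is made up, and the default propagation policy can differ by client and version), foreground deletion can be requested explicitly when the TfJob is deleted; the garbage collector then removes the dependents whose ownerReferences point at it:

```go
package controller

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cascadeDeleteOptions builds DeleteOptions requesting foreground cascading
// deletion: the API server deletes the dependents (objects whose
// ownerReferences point at the TfJob) before removing the TfJob itself.
// With background propagation the TfJob is removed first and the garbage
// collector cleans up the dependents afterwards.
func cascadeDeleteOptions() metav1.DeleteOptions {
	policy := metav1.DeletePropagationForeground
	return metav1.DeleteOptions{PropagationPolicy: &policy}
}
```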
@enisoc Thank you.
The code in gc.go collects pods, services, and deployments with the special label app=tensorflow-job.
But the resources in a TfJob, such as the job and its services, don't have that label.
So, will you add the label when creating the job in the future?
In addition, a TfJob doesn't have a K8s deployment, and deleting the pods of a job doesn't work because the job object hasn't been deleted.
Finally, where does the garbage come from? If the tf-operator crashes when a user deletes a TfJob, or if it receives the delete event but restarts before it deletes the TfJob, there will be some garbage left behind. Such circumstances rarely occur. Are there any other cases?
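For context, a label-based sweep of the kind described above might look roughly like the following; this is a simplified sketch, not the actual gc.go, and the clientset parameter, the tf_job_name label key, and the namespace handling are assumptions:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// fullyCollect sketches a label-based sweep: list every Pod carrying the
// collector's label and delete the ones whose owning job is no longer alive.
// As the issue points out, this only works if the label is actually applied
// when the resources are created.
func fullyCollect(ctx context.Context, kubeCli kubernetes.Interface, namespace string, liveJobs map[string]bool) error {
	pods, err := kubeCli.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=tensorflow-job", // the label gc.go filters on
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		// "tf_job_name" is an assumed label key identifying the owning TfJob.
		owner := p.Labels["tf_job_name"]
		if liveJobs[owner] {
			continue // the owning TfJob still exists; keep the Pod
		}
		if err := kubeCli.CoreV1().Pods(namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```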