Use K8s Garbage Collection #42

loadwiki · 2017-09-21T12:45:27Z

The code in gc.go collects pods, services, and deployments that carry the label app=tensorflow-job.
But the resources in a tfjob, such as jobs and services, don't have that label.
So, will you add the label when creating the jobs in the future?
In addition, a tfjob doesn't have a k8s deployment, and deleting the pods of a job doesn't work because the job object hasn't been deleted.
Finally, where does the garbage come from? If a user deletes a tfjob and the tf-operator crashes, or the tf-operator receives the delete event but restarts before it deletes the tfjob, some garbage will be left behind. Such circumstances rarely occur. Are there any other cases?

jlewi · 2017-09-26T13:30:59Z

Thanks for reporting this. You are correct that the method FullyCollect won't work, because it depends on the app label, which currently isn't applied. It looks like the function k8sutil.LabelsForJob is never called.
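
To illustrate the failure mode, here is a minimal sketch of equality-based label selection. It is not the code from gc.go: the `resource` type, the helper names, and the sample `job_type` label are hypothetical stand-ins. It only shows that a selector on app=tensorflow-job never sees objects that were created without that label.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a Kubernetes object's metadata.
type resource struct {
	Kind   string
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// selectByLabels returns the resources a label-based collector would find.
func selectByLabels(all []resource, selector map[string]string) []resource {
	var out []resource
	for _, r := range all {
		if matches(r.Labels, selector) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	selector := map[string]string{"app": "tensorflow-job"}
	resources := []resource{
		// Created without the app label, so the selector skips them.
		{Kind: "Job", Name: "worker-0", Labels: map[string]string{"job_type": "WORKER"}},
		{Kind: "Service", Name: "worker-0"},
		// Only an object that carries the label is found.
		{Kind: "Pod", Name: "labeled-pod", Labels: map[string]string{"app": "tensorflow-job"}},
	}
	for _, r := range selectByLabels(resources, selector) {
		fmt.Println(r.Kind, r.Name) // prints "Pod labeled-pod"
	}
}
```

The Job and the Service are silently left behind, which matches the behavior reported above.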

> In addition, a tfjob doesn't have a k8s deployment, and deleting the pods of a job doesn't work because the job object hasn't been deleted.

This is by design. We leave the pods around until the TfJob is deleted, so that logs can still be fetched via kubectl logs. This mirrors the behavior of the built-in K8s Job controller.

> Finally, where does the garbage come from? If a user deletes a tfjob and the tf-operator crashes, or the tf-operator receives the delete event but restarts before it deletes the tfjob, some garbage will be left behind. Such circumstances rarely occur. Are there any other cases?

Ideally we should be using K8s' built-in support for Garbage Collection. If we set owner references and the deletion policy correctly, then I think K8s will automatically delete all resources when the TfJob is deleted.

I think originally Garbage Collection didn't support custom resources. That's probably why the CoreOS etcd operator added gc.go. But it looks like this is fixed.

So I think we need to update the code to make sure OwnerReferences are properly set on all resources created by the TfJob.
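
As a rough illustration of why owner references make explicit cleanup unnecessary, the following toy model deletes a parent and then removes every object whose owners are all gone. It is a simplified sketch, not the Kubernetes garbage collector: the `object` and `cluster` types and the object names are hypothetical, and real GC also handles deletion policies and finalizers, which are omitted here.

```go
package main

import (
	"fmt"
	"sort"
)

// object is a hypothetical, simplified stand-in for a Kubernetes object:
// a UID, a name, and the UIDs of its owners.
type object struct {
	UID    string
	Name   string
	Owners []string
}

// cluster is a toy object store keyed by UID.
type cluster map[string]object

// deleteCascading removes the object with the given UID, then repeatedly
// removes every object that has owners of which none exists any more.
func (c cluster) deleteCascading(uid string) {
	delete(c, uid)
	for changed := true; changed; {
		changed = false
		for id, obj := range c {
			if len(obj.Owners) == 0 {
				continue // objects without owners are never collected
			}
			alive := false
			for _, owner := range obj.Owners {
				if _, ok := c[owner]; ok {
					alive = true
					break
				}
			}
			if !alive {
				delete(c, id)
				changed = true
			}
		}
	}
}

// names returns the names of the remaining objects in sorted order.
func (c cluster) names() []string {
	out := make([]string, 0, len(c))
	for _, obj := range c {
		out = append(out, obj.Name)
	}
	sort.Strings(out)
	return out
}

func main() {
	c := cluster{
		"tfjob": {UID: "tfjob", Name: "my-tfjob"},
		"job":   {UID: "job", Name: "worker-job", Owners: []string{"tfjob"}},
		"pod":   {UID: "pod", Name: "worker-pod", Owners: []string{"job"}},
		"svc":   {UID: "svc", Name: "worker-svc", Owners: []string{"tfjob"}},
		"other": {UID: "other", Name: "unrelated-svc"},
	}
	c.deleteCascading("tfjob")
	fmt.Println(c.names()) // prints [unrelated-svc]
}
```

In this model the Job, its Pod, and the Service disappear once the TfJob is gone, while the unowned Service survives. No label is involved, so the cleanup does not depend on the operator being alive when the delete happens.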

jlewi · 2017-09-26T16:03:25Z

Garbage collection for CRDs should be available in K8s 1.8.

jlewi · 2017-11-06T00:49:48Z

@enisoc @kow3ns I could use your advice on how to properly clean up resources.

Currently the TfJob creates a bunch of resources:

- Multiple Job controllers
- Services
- Deployments

Currently these resources are explicitly deleted by the TfJob CRD controller in response to a delete event.

Questions:

1. Is there any reason the CRD controller should explicitly delete these resources? Should I just rely on K8s Garbage Collection?
2. How can I set the default cascading deletion policy for my CRD?

Thanks

enisoc · 2017-11-06T19:33:05Z

1. If you want them to stick around as long as the parent TfJob still exists, then I suggest relying on GC. I would only suggest deleting a child object yourself if it's part of ongoing management rather than final cleanup. For example, if you support some kind of rollout process like Deployment, it would make sense for your controller to directly delete old things after the rollout finishes.
2. The default deletion policy for all resources defined through CRD is to cascade. However, as you noted above, this only works for CRDs in k8s 1.8+.

You just need to add an entry to metadata.ownerReferences pointing back to the parent, whenever you create a child object.

Here's where ReplicaSet does that:

https://github.com/kubernetes/kubernetes/blob/298c42bbcd95c1536e0dc5f7a0aed48cec91eaf1/pkg/controller/replicaset/replica_set.go#L462-L469
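
For readers who want to see the shape of such an entry, here is a minimal sketch using only the Go standard library. The structs are local stand-ins that mirror the JSON field names of `metadata.ownerReferences`; real controller code would use `metav1.OwnerReference` from k8s.io/apimachinery instead. The helper names, the `example.com/v1alpha1` API version, the object names, and the UID are placeholders chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ownerReference is a local stand-in mirroring one metadata.ownerReferences entry.
type ownerReference struct {
	APIVersion         string `json:"apiVersion"`
	Kind               string `json:"kind"`
	Name               string `json:"name"`
	UID                string `json:"uid"`
	Controller         *bool  `json:"controller,omitempty"`
	BlockOwnerDeletion *bool  `json:"blockOwnerDeletion,omitempty"`
}

// objectMeta is a local stand-in for the parts of object metadata used here.
type objectMeta struct {
	Name            string           `json:"name"`
	UID             string           `json:"uid,omitempty"`
	OwnerReferences []ownerReference `json:"ownerReferences,omitempty"`
}

// parent describes the owning object (hypothetical helper type).
type parent struct {
	APIVersion string
	Kind       string
	Meta       objectMeta
}

// ownerRefFor builds an owner reference pointing back at the parent.
func ownerRefFor(p parent) ownerReference {
	isController := true
	blockOwnerDeletion := true
	return ownerReference{
		APIVersion:         p.APIVersion,
		Kind:               p.Kind,
		Name:               p.Meta.Name,
		UID:                p.Meta.UID,
		Controller:         &isController,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}
}

// newChildMeta returns metadata for a child object owned by the parent.
func newChildMeta(name string, p parent) objectMeta {
	return objectMeta{
		Name:            name,
		OwnerReferences: []ownerReference{ownerRefFor(p)},
	}
}

func main() {
	tfJob := parent{
		APIVersion: "example.com/v1alpha1", // placeholder, not the real TfJob group/version
		Kind:       "TfJob",
		Meta:       objectMeta{Name: "my-tfjob", UID: "1234-abcd"},
	}
	child := newChildMeta("my-tfjob-worker-0", tfJob)

	out, err := json.MarshalIndent(child, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The important detail is that the reference carries the parent's UID, which is only known after the parent has been created in the cluster, so the child metadata has to be built from the live parent object.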

jlewi · 2017-11-07T02:58:29Z

@enisoc Thank you.

jlewi added the kind/enhancement label Sep 29, 2017

jlewi changed the title ~~The code in gc.go couldn't collect garbage (such as job svc)now.~~ Use K8s Garbage Collection Nov 2, 2017

jlewi mentioned this issue Nov 4, 2017

TensorBoard replica set not deleted when job deleted. #107

Closed

jlewi closed this as completed in ead44b0 Nov 7, 2017

oksanabaza pushed a commit to oksanabaza/training-operator that referenced this issue Jan 13, 2025

Merge pull request kubeflow#42 from red-hat-data-services/add-synk

dd11d51

Add synk-secret
