Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use K8s Garbage Collection #42

Closed
loadwiki opened this issue Sep 21, 2017 · 5 comments
Closed

Use K8s Garbage Collection #42

loadwiki opened this issue Sep 21, 2017 · 5 comments

Comments

@loadwiki
Copy link
Contributor

The code in gc.go collect pod、svc、deploy with special label app=tensorflow-job.
But the resource in a tfjob such as job svc doesn't have that label.
So, will you add the label when create job in the future.
In addition, tfjob doesn't have a k8s deploy,and delete pod of a job doesn't work because the job object haven't been deleted.
At the end, where does garbage come from? When a user deletes a tfjob, tf-operator crashes or the tf-operator receives delete event but restart before it deletes the tfjob, There will be some garbages.Such circumstance rare occur. Any other cases?

@jlewi
Copy link
Contributor

jlewi commented Sep 26, 2017

Thanks for reporting this. You are correct that the method FullyCollect won't work because it depends on the app label which currently isn't applied. It looks like the functions k8sutil.LabelsForJob is never called.

In addition, tfjob doesn't have a k8s deploy,and delete pod of a job doesn't work because the job object haven't been deleted.

This is by design. We leave the pods around until the TfJob is deleted. This allows logs to be fetched via kubectl logs. This mirrors the behavior of the built in K8s JobController.

At the end, where does garbage come from? When a user deletes a tfjob, tf-operator crashes or the tf-operator receives delete event but restart before it deletes the tfjob, There will be some garbages.Such circumstance rare occur. Any other cases?

Ideally we should be using K8s built in support for Garbage Collection. If we set owner references and the deletion policy correctly then I think K8s will automatically delete all resources when the TfJob is deleted.

I think originally Garbage Collection didn't support custom resources. That's probably why the CoreOS etcd added gc.go. But it looks like this is fixed.

So I think we need to update the code to make sure OwnerReferences are properly set on all resources created by the TfJob.

@jlewi
Copy link
Contributor

jlewi commented Sep 26, 2017

Garbage collection for CRD should be in 1.8.

@jlewi jlewi changed the title The code in gc.go couldn't collect garbage (such as job svc)now. Use K8s Garbage Collection Nov 2, 2017
@jlewi
Copy link
Contributor

jlewi commented Nov 6, 2017

@enisoc @kow3ns I could use your advice on how to properly clean up resources.

Currently the TfJob creates a bunch of resources

  • Multiple Job controllers
  • Services
  • Deployments

Currently these resources are explicitly deleted by the TfJob CRD controller in response to a delete event.

Questions:

  1. Is there any reason the CRD controller should explicitly delete these resources? Should I just rely on K8s Garbage Collection?
  2. How can I set the default cascading deletion policy for my CRD?

Thanks

@enisoc
Copy link

enisoc commented Nov 6, 2017

  1. If you want them to stick around as long as the parent TfJob still exists, then I suggest relying on GC. I would only suggest deleting a child object yourself if it's part of ongoing management rather than final cleanup -- for example if you support some kind of rollout process like Deployment, it would make sense for your controller to directly delete old things after the rollout finishes.

  2. The default deletion policy for all resources defined through CRD is to cascade. However, as you noted above, this only works for CRD as of k8s 1.8+.

You just need to add an entry to metadata.ownerReferences pointing back to the parent, whenever you create a child object.

Here's where ReplicaSet does that:

https://github.com/kubernetes/kubernetes/blob/298c42bbcd95c1536e0dc5f7a0aed48cec91eaf1/pkg/controller/replicaset/replica_set.go#L462-L469

@jlewi
Copy link
Contributor

jlewi commented Nov 7, 2017

@enisoc Thank you.

@jlewi jlewi closed this as completed in ead44b0 Nov 7, 2017
oksanabaza pushed a commit to oksanabaza/training-operator that referenced this issue Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants