[feature] Add Cleanup Policy to TFJob Spec #536
Yeah, that is a good point. Thanks for your issue. I will take a look at v1alpha2.
I don't think not cleaning up the pods is the right solution. I think the correct solution is to implement an appropriate cluster-level logging solution. Cluster logging doesn't require cloud storage; you could implement cluster-level logging that stores the logs in a PD or volume mount. Has anyone looked to see if there is an out-of-the-box cluster logger that would do this? My suspicion is there is a fluentd plugin that does this.
@jlewi Instead of deleting the resources (pods), could we work out something where the entry process completes (exits with 0) when receiving a signal (which could be a POST to an endpoint)? A log aggregator should obviously be in place, but we have found users to be pretty confused about losing access to the logs immediately.
I'm open to suggestions, but I haven't been able to come up with a good solution for forcing termination of the pod. The requirement here is that we don't want users to have to change their TF code. So how do we force the process to exit with a code of 0?
Re: not changing the code; completely agreed - we can look at this.
Adding a cleanup policy to the TFJob spec may help with this; it gives users the chance to choose the cleanup logic they prefer.
I don't think forcing users to explicitly set the behavior is all that useful. Users would likely only want to set this if they know ahead of time that their job isn't working and they want to capture logs. In practice, users won't know ahead of time that their job won't work, and forcing users to rerun their job in order to get logs is cumbersome. Can someone investigate to see if there is a simple fluentd plugin that we could use to write cluster logs to a file? I think providing a default cluster-logging backend, combined with tooling to fetch logs (kubeflow/community#54), seems like a better approach to me.
@jlewi I started to look at a file-based fluentd DaemonSet after coming across this problem last week. It pipes all the logs to /tmp/data on the host (a poor location, but it's a proof of concept). The current output is too noisy due to the core Kubernetes output; it's a mix of plain lines and JSON, which makes it hard to filter. Progress so far: I'm sure that with some updates to the fluent.conf (https://github.com/BenHall/fluentd-kubernetes-daemonset/blob/cb275459dcd707fafe88b35893911a33bc0182ba/docker-image/v0.12/alpine-file/conf/fluent.conf) we could break out the different namespaces into different log files on disk.
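For concreteness, a minimal sketch of what such a file-output fluentd DaemonSet could look like; the image tag, labels, and mount paths below are placeholders for illustration, not the actual manifest from the branch linked above.

```yaml
# Illustrative only: a fluentd DaemonSet that tails container logs on each
# node and writes the aggregated output to a hostPath (e.g. /tmp/data).
# Image tag and paths are placeholders, not the actual proof of concept.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-file
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd-file
  template:
    metadata:
      labels:
        app: fluentd-file
    spec:
      containers:
      - name: fluentd
        image: fluentd-kubernetes-daemonset:alpine-file   # placeholder image
        volumeMounts:
        - name: varlog
          mountPath: /var/log        # where the node keeps container logs
          readOnly: true
        - name: output
          mountPath: /tmp/data       # fluentd's file output is written here
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: output
        hostPath:
          path: /tmp/data
```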
Just bringing this up one more time, from the user perspective (and this is not inferred, but reported to us): it is confusing that getting logs from a TFJob run is different from getting logs from a plain Kubernetes job. The log aggregators are in place, but it is pretty common for users to fetch logs directly from the pods (e.g. with kubectl logs). If we can make it work, I would argue we should try to make the pods complete instead of deleting them.
@nqn I agree with you, but to my knowledge there is no good way to force termination of a pod. I was hoping that we could use CreateEviction to force pods to terminate, but I asked internally whether the pods would still be accessible via the API server so that we could fetch the logs, and I was told no and that this is a known issue. The original behavior of TFJob was to leave pods running until TFJob deletion as a way to make logs available, but users viewed this as a bug, in part because we ended up consuming resources. So my contention is that @BenHall's work providing a default backend, plus @kkasravi's work on using Kubeless to implement CLI/APIs, is the more promising direction.
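For reference, creating an eviction through the API looks roughly like the following; this is a generic sketch of the pods/eviction subresource mentioned above, not something TFJob does today, and the pod name is a placeholder.

```yaml
# Sketch of an Eviction request body. It is POSTed to the pod's eviction
# subresource, e.g. /api/v1/namespaces/<ns>/pods/<pod-name>/eviction.
# The pod name below is a placeholder.
apiVersion: policy/v1beta1
kind: Eviction
metadata:
  name: example-tfjob-worker-0
  namespace: default
deleteOptions:
  gracePeriodSeconds: 30
```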
We tinkered with this (abstracting logging) as well. For the termination: the operator controls the entry point for the worker pods, right? If so, we may have a few options for wrapping the worker command or adding sidecar containers to make it terminate. The main issue is that the Python process is blocking? If it is waiting for gRPC communication, maybe there is a way to make it shut down gracefully; if that hook isn't there, maybe it's something we could bring to the TensorFlow community.
@jlewi I agree with your argument here, but at the same time, when users are still in the template-development phase they may need to get a shell into the running Pod and inspect it, and there are many other scenarios that require Pods to remain available. For that reason I believe that leaving the decision on resource cleanup to users and their jobs is more appropriate here. So I believe a flag for the job completion policy, and what to do with resources after completion, is not a bad idea.
Moving my previous comment to a different thread, #558, because it is tackling resource preservation rather than just keeping logs around, which is what this issue is really about 🙂
If users don't want their job to terminate, they can do this by adding a sleep forever to their job, e.g. by wrapping their binary in a bash script that invokes their command and then just sleeps. I think that's much better than adding a flag to the TFJob operator to prevent resource cleanup. Importantly, leaving a container running so people can get a shell into it is very different from not deleting the pod to preserve logs.
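As an illustration of that wrapping approach (the image name and script path are placeholders, not part of any TFJob API):

```yaml
# Illustrative only: wrap the training command so the container keeps
# running after training exits, preserving kubectl logs / kubectl exec access.
containers:
- name: tensorflow
  image: my-training-image:latest          # placeholder image
  command: ["/bin/sh", "-c"]
  args: ["python /opt/train.py; while true; do sleep 3600; done"]
```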
@jlewi, my question is how we check and make sure the logs are collected completely before the pods are deleted; as far as I know, the pods are deleted once the TFJob is complete.
@cheyang I think ensuring logs are properly collected is a generic k8s problem; I don't have a good answer.
I like the solution in #685 of defining a CleanupPolicy per job. In particular, I like the idea of having a policy that terminates running pods but not completed pods. That seems like a good stopgap until K8s gives us a way to terminate pods. An extension of that idea would be to allow users to provide a handler ("/quit") that we could call in order to terminate their pods. It's possible there's a way we could auto-inject a sidecar that would issue SIGTERM to the process in the TF container. #685 implemented the cleanup policy for v1alpha1; we'll need to implement it for v1alpha2.
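A minimal sketch of what a per-job cleanup policy could look like in a v1alpha2 TFJob spec; the field name (`cleanPodPolicy`) and values (All / Running / None) are assumptions based on this discussion, and the exact API depends on what #685/#691 actually implement.

```yaml
# Sketch of a TFJob with a per-job cleanup policy; field name and
# allowed values are assumptions, not a confirmed API.
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: example-tfjob
spec:
  cleanPodPolicy: Running        # only clean up pods that are still running
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-training-image:latest   # placeholder image
```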
Tagging this 0.2. Even though it won't make 0.2.0, we should consider including it in a 0.2.1 release before 0.3.0.
I think we could close this since it is implemented by #691.
#128
I see that with the above fix, the pods are cleaned up when the job completes, to avoid resource issues.
Pain point: logs are no longer accessible for debugging once the pods are deleted.
Solution: make the pod cleanup behavior configurable, so that logs are persisted for those who want to debug. Please let me know your thoughts or any suggestions to support this.