
[feature] Add Cleanup Policy to TFJob Spec #536

Closed

karthikvadla opened this issue Apr 12, 2018 · 19 comments

@karthikvadla

#128

I see that with the above fix, the pods are cleaned up when the job completes, to avoid resource issues.

Pain points:

  • What if the user does not have any cluster-level logging? In the future, we are planning an in-lab setup where we don't want to use any cloud storage for logging.
  • While working on some POCs with tf-operator, the pods are getting wiped out immediately. It's really hard to debug what the issue is.

Solution:

  • Add a flag to disable the pod cleanup behavior, so that logs are persisted for those who want to debug.

Please let me know your thoughts or any suggestions to support this.

@gaocegege
Member

Yeah, that is a good point. Thanks for raising the issue.

I will take a look at v1alpha2.

@jlewi
Contributor

jlewi commented Apr 16, 2018

I don't think not cleaning up the pods is the right solution. I think the correct solution is to implement an appropriate cluster level logging solution.

Cluster logging doesn't require cloud storage. You could implement cluster level logging that stores the logs in a PD or volume mount.

Has anyone looked to see if there is an out-of-the-box cluster logger that would do this? My suspicion is that there is a fluentd plugin that does this.

@nqn

nqn commented Apr 17, 2018

@jlewi Instead of deleting the resources (pods), could we work out something where the entry process completes (exits with 0) when receiving a signal (it could be a POST to an endpoint)? A log aggregator should obviously be in place, but we have found users to be pretty confused about losing access to the logs immediately.

@jlewi
Contributor

jlewi commented Apr 18, 2018

I'm open to suggestions, but I haven't been able to come up with a good solution for forcing termination of the pod.

The requirement here is that we don't want users to have to change their TF code. So how do we force the process to exit with a code of 0?

@nqn

nqn commented Apr 19, 2018

Re: not changing the code; completely agreed - we can look at this.

@u2takey
Contributor

u2takey commented Apr 22, 2018

Adding a CleanupPolicy to the TFJob spec may help with this; it gives users the chance to choose the cleanup logic they prefer.
For example, CleanupPolicy could be cleanupAll/cleanupNone/cleanupSuccess, and so on
(cleanupNone cleans nothing; cleanupAll always cleans the remaining pods when the TFJob succeeds or fails).
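To make the proposal concrete, here is a minimal sketch (in Go, mirroring how the operator's API types are defined) of what such a field could look like. The type name, constant names, and default here are illustrative assumptions based on the values suggested above, not the actual tf-operator API.

```go
// Illustrative sketch only: a cleanup policy field for the TFJob spec.
// Names and values are hypothetical, based on the suggestion above.
package v1alpha2

// CleanupPolicy describes what the operator should do with a job's pods
// once the TFJob finishes.
type CleanupPolicy string

const (
	// CleanupAll deletes all remaining pods when the TFJob succeeds or fails.
	CleanupAll CleanupPolicy = "All"
	// CleanupNone deletes nothing, so logs stay reachable via kubectl logs.
	CleanupNone CleanupPolicy = "None"
	// CleanupSuccess deletes pods only when the TFJob succeeds, keeping
	// pods from failed runs around for debugging.
	CleanupSuccess CleanupPolicy = "Success"
)

// TFJobSpec is shown here with only the field under discussion.
type TFJobSpec struct {
	// CleanupPolicy controls pod cleanup after job completion.
	// If unset, the operator would keep today's behavior (clean up everything).
	CleanupPolicy *CleanupPolicy `json:"cleanupPolicy,omitempty"`
}
```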

@jlewi
Contributor

jlewi commented Apr 23, 2018

I don't think forcing users to explicitly set the behavior is all that useful. Users would likely only want to set this if they know ahead of time that their job isn't working and they want to capture logs.

In practice, users won't know ahead of time that their job won't work. Forcing users to rerun their job in order to get logs is cumbersome.

Can someone investigate to see if there is a simple fluentd plugin that we could use to write cluster logs to a file?

I think providing a default cluster logging backend combined with tooling to fetch logs (kubeflow/community#54) seems like a better approach to me.

@BenHall

BenHall commented Apr 23, 2018

@jlewi I started to look at a file-based fluentd DaemonSet after coming across this problem last week.

It pipes all the logs to /tmp/data on the host (a poor location, but it's a proof of concept). The current output is too noisy due to the core Kubernetes output: it's a mix of plain lines and JSON, making it hard to filter with something like jq.

Progress so far:
BenHall/fluentd-kubernetes-daemonset@cb27545

I'm sure with some updates to the Fluent conf (https://github.com/BenHall/fluentd-kubernetes-daemonset/blob/cb275459dcd707fafe88b35893911a33bc0182ba/docker-image/v0.12/alpine-file/conf/fluent.conf) we could break out the different namespaces into different log files on disk.

@nqn

nqn commented Apr 23, 2018

Just bringing this up one more time: from the user perspective (and this is not inferred, but reported to us), it is confusing that getting logs from a TFJob run is different from getting logs from a plain Kubernetes job. The log aggregators are in place, but it is pretty common to use kubectl logs to get the latest logs quickly.

If we can make it work, I would argue we should try to make the pods complete instead of delete them.

@jlewi
Contributor

jlewi commented Apr 23, 2018

@nqn I agree with you but to my knowledge there is no good way to force termination of a pod. I was hoping that we could use CreateEviction to force pods to terminate. I asked internally whether the pods would still be accessible via API server so that we could fetch the logs and I was told no and that this was a known issue.

The original behavior of TFJob was to leave pods running until TFJob deletion as a way to make logs available. But users viewed this as a bug; in part because we ended up consuming resources.

I would argue that kubectl logs <pod> is not really the ideal experience either. Users would probably prefer to do kubectl logs <tf job> as opposed to specifying individual pods.

So my contention is that @BenHall's work providing a default backend plus @kkasravi's work on using Kubeless to implement CLIs/APIs is the more promising direction.

@nqn

nqn commented Apr 23, 2018

We tinkered with this (abstracting logging) as well, but found that kubectl logs --selector=... wasn't complex enough to be worth hiding behind another layer. Folks run jobs other than tf-operator jobs on the cluster, so UI/UX consistency ends up mattering too. So our first path was just a documented "best known method". kubetail has been useful too, for tailing logs from multiple pods by matching the pod name pattern (where pods are prefixed with your TFJob name).

For the termination: the operator controls the entry point for the worker pods, right? If so, we may have a few options for wrapping the worker command or adding sidecar containers to make it terminate. The main issue is that the Python process is blocking, right? If it is waiting for gRPC communication, maybe there is a way to make it shut down gracefully? If that isn't there yet, it may be something we could bring to the TensorFlow community.
We still need to dig into whether this is a viable path, but it is where we (at least) could get a lot of value, as it makes the debugging experience more "Kubernetes-native".

If we want to enhance kubectl logs, it seems like a question for the Kubernetes community, as it won't be unique to Kubeflow-style jobs.

gaocegege changed the title from "Adding flag in tf-operator to preserve Logs" to "Adding flag in tf-operator to preserve logs" on Apr 24, 2018
@ashahba
Member

ashahba commented Apr 24, 2018

@jlewi I agree with your argument here, but at the same time, when users are still in the template development phase they may need to get a shell into the running Pod and inspect it, and there are many other scenarios that require Pods to be available.

For that reason I believe that leaving the decision on resource cleanup to users and their jobs is more appropriate here.
Cluster admins should always be able to define Resource Quotas per user and per namespace.
This way we encourage people to be mindful of the resources available to them and also give them the option to use those resources however they choose, rather than enforcing the policy on them 🙂.

So I believe a flag around Job completion policy and what to do with resources after that is not a bad idea.

@ashahba
Member

ashahba commented Apr 24, 2018

Moving my previous comment to a different thread, #558, because it is tackling resource preservation rather than just keeping logs around, which is what this issue is really about 🙂

@jlewi
Contributor

jlewi commented Apr 25, 2018

If users don't want their job to terminate, they can do this by adding a sleep-forever to their job, e.g. by wrapping their binary in a bash script that invokes their command and then just runs tail -f /dev/null.

I think that's much better than adding a flag to TFJob operator to prevent resource cleanup.

Importantly, leaving a container running so people can get a shell into it is very different from not deleting the pod to preserve logs.

gaocegege changed the title from "Adding flag in tf-operator to preserve logs" to "[feature] Add log backend to preserve logs" on Apr 26, 2018
@cheyang
Contributor

cheyang commented Jun 19, 2018

(quoting @BenHall's comment above about the file-based fluentd DaemonSet proof of concept)

@jlewi, my question is: how do we check and make sure the logs are collected completely before the pods are deleted? As far as I know, the pods are deleted once the TFJob is complete.

@jlewi
Contributor

jlewi commented Jun 20, 2018

@cheyang I think ensuring that logs are properly collected is a generic K8s problem; I don't have a good answer.

jlewi changed the title from "[feature] Add log backend to preserve logs" to "[feature] Add Cleanup Policy to TFJob Spec" on Jun 20, 2018
@jlewi
Contributor

jlewi commented Jun 20, 2018

I like the solution in #685 of defining a CleanupPolicy per job. In particular, I like the idea of having a policy that terminates running pods but not completed pods. That seems like a good stopgap until K8s gives us a way to terminate pods.

An extension of that idea would be to allow users to have a "/quit" handler that we could call in order to terminate their pods. It's possible there's a way we could auto-inject a sidecar that would issue SIGTERM to the process in the TF container.

#685 implemented the cleanup policy for v1alpha1; we'll need to implement that for v1alpha2.
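As a thought experiment on the "/quit" handler idea above, here is a rough, hypothetical sketch of a sidecar that sends SIGTERM to the TF process when its /quit endpoint is called. Nothing here is part of tf-operator; it assumes the pod runs with a shared process namespace and that the target process ID is passed in via a TARGET_PID environment variable.

```go
// Hypothetical sidecar sketch: POST /quit -> SIGTERM to the TF process.
// Assumes shareProcessNamespace: true on the pod and a TARGET_PID env var.
package main

import (
	"log"
	"net/http"
	"os"
	"strconv"
	"syscall"
)

func main() {
	http.HandleFunc("/quit", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "use POST", http.StatusMethodNotAllowed)
			return
		}
		pid, err := strconv.Atoi(os.Getenv("TARGET_PID"))
		if err != nil {
			http.Error(w, "TARGET_PID is not set to a valid pid", http.StatusInternalServerError)
			return
		}
		// Ask the TF process to shut down gracefully; the container then
		// exits on its own instead of the operator deleting the pod.
		if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```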

@jlewi
Contributor

jlewi commented Jun 20, 2018

Tagging this 0.2. Even though it won't make 0.2.0 we should consider including it in a 0.2.1 release before 0.3.0.

@gaocegege
Member

I think we can close this since it is implemented by #691.
