
[feature] Add Cleanup Policy to TFJob Spec #536

Closed

karthikvadla opened this issue Apr 12, 2018 · 19 comments

@karthikvadla

#128

I see that with the above fix, the pods are cleaned up when the job completes, to avoid resource issues.

Pain points:

  • What if the user does not have any cluster-level logging? In the future, we are planning an in-lab setup where we don't want to use any cloud storage for logging.
  • While working on some POCs with tf-operator, the pods are getting wiped out immediately. It's really hard to debug what the issue is.

Solution:

  • Add a flag to disable the pod cleanup behavior, so that logs are persisted for those who want to debug.

Please let me know your thoughts or any suggestions to support this.

@gaocegege
Member

Yeah, that is a good point. Thanks for raising the issue.

I will take a look at v1alpha2.

@jlewi
Contributor

jlewi commented Apr 16, 2018

I don't think not cleaning up the pods is the right solution. I think the correct solution is to implement an appropriate cluster level logging solution.

Cluster logging doesn't require cloud storage. You could implement cluster level logging that stores the logs in a PD or volume mount.

Has anyone looked to see if there is an out-of-the-box cluster logger that would do this? My suspicion is that there is a fluentd plugin that does this.

@nqn

nqn commented Apr 17, 2018

@jlewi Instead of deleting the resources (pods), could we work out something where the entry process completes (exits with 0) when receiving a signal (it could be a POST to an endpoint)? A log aggregator should obviously be in place, but we have found users to be pretty confused about losing access to the logs immediately.

@jlewi
Contributor

jlewi commented Apr 18, 2018

I'm open to suggestions, but I haven't been able to come up with a good solution for forcing termination of the pod.

The requirement here is that we don't want users to have to change their TF code. So how do we force the process to exit with a code of 0?

@nqn

nqn commented Apr 19, 2018

Re: not changing the code; completely agreed - we can look at this.

@u2takey
Contributor

u2takey commented Apr 22, 2018

Adding a CleanupPolicy to the TFJob spec may help with this; it gives users the chance to choose the cleanup logic they prefer.
For example, CleanupPolicy could be cleanupAll/cleanupNone/cleanupSuccess, and so on
(cleanupNone cleans nothing; cleanupAll always cleans the remaining pods when the TFJob succeeds or fails).
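To make the proposal concrete, here is a minimal sketch (in Go, mirroring how the operator's API types are defined) of what such a field could look like. The type name, constant names, and default here are illustrative assumptions based on the values suggested above, not the actual tf-operator API.

```go
// Illustrative sketch only: a cleanup policy field for the TFJob spec.
// Names and values are hypothetical, based on the suggestion above.
package v1alpha2

// CleanupPolicy describes what the operator should do with a job's pods
// once the TFJob finishes.
type CleanupPolicy string

const (
	// CleanupAll deletes all remaining pods when the TFJob succeeds or fails.
	CleanupAll CleanupPolicy = "All"
	// CleanupNone deletes nothing, so logs stay reachable via kubectl logs.
	CleanupNone CleanupPolicy = "None"
	// CleanupSuccess deletes pods only when the TFJob succeeds, keeping
	// pods from failed runs around for debugging.
	CleanupSuccess CleanupPolicy = "Success"
)

// TFJobSpec is shown here with only the field under discussion.
type TFJobSpec struct {
	// CleanupPolicy controls pod cleanup after job completion.
	// If unset, the operator would keep today's behavior (clean up everything).
	CleanupPolicy *CleanupPolicy `json:"cleanupPolicy,omitempty"`
}
```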

@jlewi
Contributor

jlewi commented Apr 23, 2018

I don't think forcing users to explicitly set the behavior is all that useful. Users would likely only want to set this if they know ahead of time that their job isn't working and they want to capture logs.

In practice, users won't know ahead of time that their job won't work. Forcing users to rerun their job in order to get logs is cumbersome.

Can someone investigate to see if there is a simple fluentd plugin that we could use to write cluster logs to a file?

I think providing a default cluster logging backend combined with tooling to fetch logs (kubeflow/community#54) seems like a better approach to me.

@BenHall

BenHall commented Apr 23, 2018

@jlewi I started to look at a file-based fluentd DaemonSet after coming across this problem last week.

It pipes all the logs to /tmp/data on the host (a poor location, but it's a proof of concept). The current output is too noisy due to the core Kubernetes output: it's a mix of plain lines and JSON, making it hard to filter with something like jq.

Progress so far:
BenHall/fluentd-kubernetes-daemonset@cb27545

I'm sure with some updates to the Fluent conf (https://github.com/BenHall/fluentd-kubernetes-daemonset/blob/cb275459dcd707fafe88b35893911a33bc0182ba/docker-image/v0.12/alpine-file/conf/fluent.conf) we could break out the different namespaces into different log files on disk.

@nqn

nqn commented Apr 23, 2018

Just bringing this up one more time: from the user perspective (and this is not inferred, but reported to us), it is confusing that getting logs from a TFJob run is different from getting logs from a plain Kubernetes job. The log aggregators are in place, but it is pretty common to use kubectl logs to get the latest logs quickly.

If we can make it work, I would argue we should try to make the pods complete instead of delete them.

@jlewi
Contributor

jlewi commented Apr 23, 2018

@nqn I agree with you but to my knowledge there is no good way to force termination of a pod. I was hoping that we could use CreateEviction to force pods to terminate. I asked internally whether the pods would still be accessible via API server so that we could fetch the logs and I was told no and that this was a known issue.

The original behavior of TFJob was to leave pods running until TFJob deletion as a way to make logs available. But users viewed this as a bug; in part because we ended up consuming resources.

I would argue that kubectl logs <pod> is not really the ideal experience either. Users would probably prefer to do kubectl logs <tf job> as opposed to specifying individual pods.

So my contention is that @BenHall's work providing a default backend plus @kkasravi's work on using Kubeless to implement CLIs/APIs is the more promising direction.

@nqn

nqn commented Apr 23, 2018

We tinkered with this (abstracting logging) as well, but found that kubectl logs --selector=... wasn't complex enough to be worth hiding behind another layer. Folks run jobs other than tf-operator jobs on the cluster, so UI/UX consistency ends up mattering too. So our first path was just a documented "best known method". kubetail has been useful too, for tailing logs from multiple pods by matching the pod name pattern (where pods are prefixed with your TFJob name).

For the termination: the operator controls the entry point for the worker pods, right? If so, we may have a few options for wrapping the worker command or adding sidecar containers to make it terminate. The main issue is that the Python process is blocking, right? If it is waiting for gRPC communication, maybe there is a way to make it shut down gracefully? If that isn't there yet, it may be something we could bring to the TensorFlow community.
We still need to dig into whether this is a viable path, but it is where we (at least) could get a lot of value, as it makes the debugging experience more "Kubernetes-native".

If we want to enhance kubectl logs, it seems like a question for the Kubernetes community, as it won't be unique to Kubeflow-style jobs.

gaocegege changed the title from "Adding flag in tf-operator to preserve Logs" to "Adding flag in tf-operator to preserve logs" on Apr 24, 2018
@ashahba
Member

ashahba commented Apr 24, 2018

@jlewi I agree with your argument here, but at the same time, when users are still in the template development phase they may need to get a shell into the running Pod and inspect it, and there are many other scenarios that require Pods to be available.

For that reason I believe that leaving the decision on resource cleanup to users and their jobs is more appropriate here.
Cluster admins should always be able to define Resource Quotas per user and per namespace.
This way we encourage people to be mindful of the resources available to them and also give them the option to use those resources however they choose, rather than enforcing the policy on them 🙂.

So I believe a flag around Job completion policy and what to do with resources after that is not a bad idea.

@ashahba
Member

ashahba commented Apr 24, 2018

Moving my previous comment to a different thread, #558, because it is tackling resource preservation rather than just keeping logs around, which is what this issue is really about 🙂

@jlewi
Contributor

jlewi commented Apr 25, 2018

If users don't want their job to terminate, they can do this by adding a sleep-forever to their job, e.g. by wrapping their binary in a bash script that invokes their command and then just runs tail -f /dev/null.

I think that's much better than adding a flag to TFJob operator to prevent resource cleanup.

Importantly, leaving a container running so people can get a shell into it is very different from not deleting the pod to preserve logs.

gaocegege changed the title from "Adding flag in tf-operator to preserve logs" to "[feature] Add log backend to preserve logs" on Apr 26, 2018
@cheyang
Contributor

cheyang commented Jun 19, 2018

(quoting @BenHall's comment above about the file-based fluentd DaemonSet proof of concept)

@jlewi, my question is: how do we check and make sure the logs are collected completely before the pods are deleted? As far as I know, the pods are deleted once the TFJob is complete.

@jlewi
Contributor

jlewi commented Jun 20, 2018

@cheyang I think ensuring that logs are properly collected is a generic K8s problem; I don't have a good answer.

jlewi changed the title from "[feature] Add log backend to preserve logs" to "[feature] Add Cleanup Policy to TFJob Spec" on Jun 20, 2018
@jlewi
Contributor

jlewi commented Jun 20, 2018

I like the solution in #685 of defining a CleanupPolicy per job. In particular, I like the idea of having a policy that terminates running pods but not completed pods. That seems like a good stopgap until K8s gives us a way to terminate pods.

An extension of that idea would be to allow users to have a "/quit" handler that we could call in order to terminate their pods. It's possible there's a way we could auto-inject a sidecar that would issue SIGTERM to the process in the TF container.

#685 implemented the cleanup policy for v1alpha1; we'll need to implement that for v1alpha2.
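As a thought experiment on the "/quit" handler idea above, here is a rough, hypothetical sketch of a sidecar that sends SIGTERM to the TF process when its /quit endpoint is called. Nothing here is part of tf-operator; it assumes the pod runs with a shared process namespace and that the target process ID is passed in via a TARGET_PID environment variable.

```go
// Hypothetical sidecar sketch: POST /quit -> SIGTERM to the TF process.
// Assumes shareProcessNamespace: true on the pod and a TARGET_PID env var.
package main

import (
	"log"
	"net/http"
	"os"
	"strconv"
	"syscall"
)

func main() {
	http.HandleFunc("/quit", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "use POST", http.StatusMethodNotAllowed)
			return
		}
		pid, err := strconv.Atoi(os.Getenv("TARGET_PID"))
		if err != nil {
			http.Error(w, "TARGET_PID is not set to a valid pid", http.StatusInternalServerError)
			return
		}
		// Ask the TF process to shut down gracefully; the container then
		// exits on its own instead of the operator deleting the pod.
		if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```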

@jlewi
Contributor

jlewi commented Jun 20, 2018

Tagging this 0.2. Even though it won't make 0.2.0 we should consider including it in a 0.2.1 release before 0.3.0.

@gaocegege
Member

I think we can close this since it is implemented by #691.
