Here's the issue related to changing the behavior in v1alpha1 to delete pods when the job finishes: #128
Per that issue, I think this is a real problem for anyone using TFJob. If you launch a job that uses GPUs, the job will continue to hold those GPUs even after it finishes.
I think this is a major barrier to doing actual work.
Bumping this to P0 for this reason and to match the priority we gave it in v1alpha1.
It doesn't look like the pods are being deleted when the job finishes.
This means the pods will still be running and continue to consume resources.
For v1alpha1 we decided that was undesirable and that pods should be deleted when the job finishes.
I think we should preserve that behavior even though it means access to logs depends on cluster level logging.
If a user wants to leave pods running until the job finishes, they could do this by keeping the processes alive via their code.
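For what it's worth, here is a rough sketch of what "keeping the processes alive via their code" could look like in a Python training script. This is only an illustration: `run_then_hold`, `train_fn`, `hold_seconds`, and `stop_event` are hypothetical names I made up, not part of TFJob or TensorFlow.

```python
import threading


def run_then_hold(train_fn, hold_seconds=None, stop_event=None):
    """Run train_fn, then block so the container's process stays alive.

    hold_seconds=None blocks until stop_event is set; a number caps the wait.
    All names here are hypothetical and not part of the TFJob API.
    """
    if stop_event is None:
        stop_event = threading.Event()
    result = train_fn()  # the actual training work
    # Event.wait returns as soon as the event is set or the timeout elapses.
    stop_event.wait(timeout=hold_seconds)
    return result
```

A script would call something like `run_then_hold(train, hold_seconds=600)` as its last step. Note that the process keeps holding its resources (including GPUs) for as long as it blocks, so this is an explicit opt-in by the user.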