
live-EKS: automatically remove completed Kubernetes Jobs created by a CronJob #3055

Closed
vijay-veeranki opened this issue Jul 16, 2021 · 3 comments

@vijay-veeranki
Contributor

When users create CronJobs, we clean up the completed Jobs, which also removes the Pods they create and helps the Kubernetes cluster use its CPU and memory resources efficiently.

This ticket is to work on the three issues below, related to clean-up in eks-live.

  1. We suggest users set ttlSecondsAfterFinished, but that field is only supported from Kubernetes v1.20 (see the example manifest after this list).
    https://user-guide.cloud-platform.service.justice.gov.uk/documentation/other-topics/Cronjobs.html#deploying-a-cronjob-to-your-namespace

Related issue:
aws/containers-roadmap#255

Investigate whether there are any workarounds, or whether we should wait until v1.20, and communicate this to users before the migration to EKS-Live.

  2. We have a delete-completed-jobs Concourse job which cleans up all completed Jobs that do not have ttlSecondsAfterFinished defined.

Set up this job in the eks-live cluster.

  3. Users can set the ".spec.successfulJobsHistoryLimit" and ".spec.failedJobsHistoryLimit" fields, which specify how many completed and failed Jobs should be kept, but these fields are not working as expected (they are also shown in the example manifest after this list).

Try to set up a cron job with these fields and figure out why they are not working on EKS when they work on live-1.
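
For reference, a minimal sketch of a CronJob manifest with both ttlSecondsAfterFinished and the history-limit fields set. The name, schedule and image are placeholders, and the TTL field is only honoured on clusters where the TTL-after-finished controller is available (v1.20+ per this ticket):

```yaml
apiVersion: batch/v1beta1        # CronJob API version on pre-1.21 clusters
kind: CronJob
metadata:
  name: example-cronjob          # placeholder name
spec:
  schedule: "0 1 * * *"
  successfulJobsHistoryLimit: 3  # default: keep the last 3 completed Jobs
  failedJobsHistoryLimit: 1      # default: keep the last failed Job
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600   # delete the Job 1 hour after it finishes (needs v1.20+)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox
              command: ["sh", "-c", "echo hello"]
```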

@poornima-krishnasamy
Contributor

poornima-krishnasamy commented Aug 4, 2021

Because ttlSecondsAfterFinished will not be honoured, there is a real need to delete completed Jobs as a form of garbage collection.

If users use a CronJob, they can set failedJobsHistoryLimit and successfulJobsHistoryLimit, which we need to respect rather than delete those Jobs with the delete-completed-jobs pipeline. Currently all Jobs are deleted irrespective of whether they are owned by a CronJob or not.
Hence we need a more robust approach, something like https://github.com/lwolf/kube-cleanup-operator, which ignores CronJobs but cleans up Jobs after completion.
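
For context, a Job created by a CronJob carries an ownerReference pointing back at the CronJob; an ownership-aware cleanup (such as kube-cleanup-operator) can filter on this and leave such Jobs to the history limits. A sketch of that metadata, with placeholder name and UID:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-cronjob-27099840   # placeholder; CronJobs name their Jobs <cronjob>-<schedule timestamp>
  ownerReferences:
    - apiVersion: batch/v1beta1
      kind: CronJob                # a cleanup pipeline can skip Jobs owned by a CronJob
      name: example-cronjob
      uid: 00000000-0000-0000-0000-000000000000   # placeholder; set by the API server
      controller: true
      blockOwnerDeletion: true
```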

By default the fields have these values:
.spec.successfulJobsHistoryLimit: 3
.spec.failedJobsHistoryLimit: 1

The ".spec.successfulJobsHistoryLimit" and ".spec.failedJobsHistoryLimit" fields work in combination with restartPolicy and backoffLimit: whether a Job is marked as failed depends on those other parameters. When testing these fields in EKS they worked as expected, with the same behaviour as in the kops test cluster.
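
As a concrete illustration of that interaction, a minimal Job sketch (name and image are placeholders) whose Pods always exit non-zero: with restartPolicy: Never, the Job controller retries by creating new Pods until backoffLimit is exhausted, at which point the Job is marked Failed; when the same spec is used in a CronJob's jobTemplate, that failed Job then counts against failedJobsHistoryLimit.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-failing-job   # placeholder name
spec:
  backoffLimit: 2             # give up after 2 retries, then mark the Job as Failed
  template:
    spec:
      restartPolicy: Never    # failures are retried with new Pods rather than container restarts
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "exit 1"]   # always fails, to exercise the failure path
```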

@poornima-krishnasamy
Contributor

After discussing with the team, we have decided:

  1. Not to run the delete-completed-jobs pipeline until the KubeTooManyPods alert is triggered in "live", or until "ttlSecondsAfterFinished" is enabled in EKS 1.20

  2. Let completed/failed Jobs be deleted according to the failedJobsHistoryLimit and successfulJobsHistoryLimit that users set up

  3. Update the migration guide to cover setting failedJobsHistoryLimit and successfulJobsHistoryLimit, noting that Jobs will be deleted based on the default limits of 3 and 1

