
added cleanup policy for deployments #8691

Merged

Conversation

stevekuznetsov
Contributor

@stevekuznetsov stevekuznetsov commented Apr 29, 2016

This PR adds cleanup of past deployments to the DeploymentController, which ensures that we don't have too many replication controllers lying around as deployment configurations evolve.

fixes #6731
/cc https://bugzilla.redhat.com/show_bug.cgi?id=1312433
Part of https://trello.com/c/Mljldox7/643-5-deployments-downstream-support-for-upstream-features

@Kargakis @deads2k PTAL

@stevekuznetsov changed the title from "added cleanup policy for deployments" to "[wip] added cleanup policy for deployments" on Apr 29, 2016
@deads2k
Contributor

deads2k commented Apr 29, 2016

determine why @deads2k convinced me I didn't need an informer for RCs: right now the impl doesn't watch for RC changes, and if we're going to watch and index on them we might as well use an IndexerInformer, right?

You need notifications on add, not update or delete.

@deads2k
Contributor

deads2k commented Apr 29, 2016

determine correct policy for when caches don't contain the object we are looking for - should we allow the controller to use the client calls for a fetch?

No, the work queue will be retriggered when it observes the next add.
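To make the pattern in these two answers concrete, here is a minimal sketch of "notifications on add only, with a cache miss left for the next add to requeue the work". It uses today's client-go shared informers rather than the cache packages in the 2016 origin tree, so the package paths, helper names, and kubeconfig loading are assumptions, not code from this PR:

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Out-of-cluster config; the kubeconfig path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	rcInformer := factory.Core().V1().ReplicationControllers().Informer()

	// Only AddFunc is registered: a newly created replication controller is the
	// signal to re-evaluate cleanup for its deployment config. Updates and
	// deletes never make an old deployment newly prunable, so they are ignored.
	// A cache miss is not retried via client calls; the next observed add
	// simply requeues the key.
	rcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, rcInformer.HasSynced)

	// Worker goroutines would pop keys from queue here and run the cleanup sync.
	select {}
}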

@deads2k
Contributor

deads2k commented Apr 29, 2016

determine how many of the controller options (number of workers, max retries, resync period) should bubble up and how far? Into MasterConfig options?

Probably none of those to start. Retry forever on a very slow interval isn't too bad.

@deads2k
Contributor

deads2k commented Apr 29, 2016

determine how to best test this - test-cmd with a forced config change?

You can manually kick off new deployments. You should have unit tests for your sync loop and probably an integration test for a series of actual cleanups.

@stevekuznetsov
Contributor Author

@deads2k: the man with all the answers. coming to a theatre near you ...

// TODO: determine what to do here - should we allow manual fetch using the client calls?
}

deploymentConfig := obj.(*deployapi.DeploymentConfig)
Contributor

@ironcladlou ironcladlou Apr 29, 2016

If deploymentConfig.Spec.RevisionHistoryLimit == nil, does everything beyond this point become unnecessary work?

Contributor Author

Yes. I can exit early if the work to fetch the deployments is going to be time-intensive. Currently that nil check occurs at the

if actual, max := len(deployments), deploymentConfig.Spec.RevisionHistoryLimit; max != nil && actual > *max

line
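To make the early exit concrete, here is a self-contained sketch of the filtering being discussed, with a nil limit short-circuiting before any sorting or further work; the type and function names are illustrative, not the helpers in this PR:

package main

import (
	"fmt"
	"sort"
)

// deployment is a stand-in for the replication controllers that back old
// deployments; only the fields needed here are included.
type deployment struct {
	name    string
	version int
}

// prunable returns the oldest deployments that exceed the history limit.
// A nil limit means the cleanup policy is unset, so everything beyond this
// check would be unnecessary work and we bail out immediately.
func prunable(limit *int, deployments []deployment) []deployment {
	if limit == nil {
		return nil
	}
	sort.Slice(deployments, func(i, j int) bool {
		return deployments[i].version < deployments[j].version
	})
	if extra := len(deployments) - *limit; extra > 0 {
		return deployments[:extra]
	}
	return nil
}

func main() {
	limit := 2
	old := prunable(&limit, []deployment{{"app-1", 1}, {"app-3", 3}, {"app-2", 2}})
	fmt.Println(old) // [{app-1 1}]
}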

@ironcladlou
Contributor

ironcladlou commented Apr 29, 2016

Here are the knobs for oadm prune deployments:

--keep-complete=5: Per DeploymentConfig, the number of deployments with status complete and replica size 0 that will be preserved.
--keep-failed=1: Per DeploymentConfig, the number of deployments with status failed and replica size 0 that will be preserved.
--keep-younger-than=1h0m0s: The minimum age of a deployment for it to be considered a candidate for pruning.

Is the proposed API here intentionally kept simpler than the prune capabilities? I can imagine how, provided equivalent functionality at the API/controller level, we could set limit ranges (or something) for these fields in hosted environments, eliminating the need for external jobs that periodically do pruning.

@stevekuznetsov
Contributor Author

Is the proposed API here intentionally kept simpler

@Kargakis pointed me to an upstream implementation of this concept, which used the reduced API. I'm not opposed to making it more complicated, especially as the extra complexity is simple to add on top of the controller framework.

@stevekuznetsov force-pushed the skuznets/deployment-cleanup branch 2 times, most recently from 6b83514 to c7739df on April 29, 2016 at 20:07
@0xmichalis
Contributor

I don't think we need a separate controller for this

@liggitt
Contributor

liggitt commented Apr 29, 2016

I don't think we need a separate controller for this

I don't think rolling this into the existing controller would make the system easier to reason about

@0xmichalis
Contributor

@ironcladlou RevisionHistoryLimit supersedes prune deployments, that's true.

@0xmichalis
Contributor

I don't think rolling this into the existing controller would make the system easier to reason about

Why not? Cleanup should be best-effort at best.

@deads2k
Contributor

deads2k commented Apr 29, 2016

I don't think we need a separate controller for this

Separating this out from the main controller:

  1. lowers risk for adding this change, any fixes we need for it, and any new features we want for it (see @ironcladlou's list as a for instance)
  2. does not increase cost since we'll be sharing existing informers and caches
  3. keeps separate concerns separate
  4. makes smaller controllers which are easier for people to reason about, especially new contributors.

Since this controller doesn't rely on any knowledge that is contained in the existing controller, I don't see a reason to combine them.

@deads2k
Contributor

deads2k commented Apr 29, 2016

Why not? Cleanup should be best-effort at best.

It should be best effort, but it should also be separate. I want it to be impossible for someone writing bad cleanup code to mess up one of our core controllers. I'm not seeing why I would want to link these together. I don't see any shared information.

@0xmichalis
Contributor

So we already have 5 controllers for handling deployments. Coupled with the fact that the main controller already has all the information available to clean up older deployments, I don't think I will ever consider having an additional controller just for this. It adds unnecessary maintenance and performance burden. The upstream controller has handled cleaning up fine so far.

@0xmichalis
Contributor

lowers risk for adding this change, any fixes we need for it, and any new features we want for it (see @ironcladlou's list as a for instance)

This is really just a filter.

does not increase cost since we'll be sharing existing informers and caches

I don't have any data to back this up, but I would guess an additional process is more expensive than not having one. What is the cost of launching a new controller? What impact does it have on the whole system?

keeps separate concerns separate

This is really a concern for the main controller.

makes smaller controllers which are easier for people to reason about, especially new contributors

We are in the middle of integrating with upstream deployments. What happens if a deployment is eligible to be handled by the upstream controller? Will both controllers try to clean it up? What happens if a cluster is running only Kubernetes deployments? Are we launching a separate process for nothing?

@0xmichalis
Contributor

0xmichalis commented Apr 29, 2016

Are there any other examples of controllers handling different parts of the spec for a single resource?

@smarterclayton
Contributor

So..... I had a question.

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

@deads2k
Contributor

deads2k commented Apr 29, 2016

We split service account maintenance tasks across multiple controllers to make each one easier to reason about.

@deads2k
Contributor

deads2k commented Apr 29, 2016

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

I don't have a vested interest either way, but I'd guess that making use of existing caches to avoid full lists of every RC and DC in the system is preferable. It also keeps it closer to being in sync.

@0xmichalis
Contributor

So..... I had a question.

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

Eventually we need the field for a faithful conversion with upstream. But a controller for running a loop that deletes replication controllers with zero replicas seems like overkill.

@smarterclayton
Contributor

I don't think that's measurable in the scheme of things, and this is way more expensive from a dev perspective.

On Apr 29, 2016, at 6:29 PM, David Eads [email protected] wrote:

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

I don't have a vested interest either way, but I'd guess that making use of existing caches to avoid full lists of every RC and DC in the system is preferable. It also keeps it closer to being in sync.



@0xmichalis
Contributor

Cleaning up after making sure the deploymentconfig has been reconciled is what makes sense to me. Having the cleanup running asynchronously in a separate process is a source of bugs waiting to happen.

@smarterclayton
Contributor

smarterclayton commented Apr 29, 2016 via email

@stevekuznetsov
Contributor Author

Utility method with side-effects ... we should try not to do this.

@stevekuznetsov
Contributor Author

will squash on green [test]s

@stevekuznetsov
Contributor Author

Test passed, @Kargakis ready for a look

// if cleanup policy is set but no successful deployments have happened, there will be
// no active deployment. We can consider all of the deployments in this case except for
// the latest one
return deployments[:len(deployments)-1]
Contributor

You need to make sure the last deployment is indeed the latest one. Otherwise, you are disregarding a possible candidate.

Contributor Author

The sort above accomplishes this, and it only exists there for the express purpose of this line

Contributor

You have a dc with version=5 but only 4 deployments exist (the new one is yet to be created). The last deployment is not the latest and is subject to the cleanup policy.
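A short sketch of the check being asked for here, reusing the illustrative deployment type from the earlier sketch: the last element is excluded only when it really is the latest deployment; otherwise every existing deployment is a candidate. The latestVersion parameter is an assumption standing in for the config's latest version field, not the exact code in this PR:

// candidates assumes deployments are sorted oldest-first by version.
func candidates(latestVersion int, deployments []deployment) []deployment {
	if len(deployments) == 0 {
		return nil
	}
	last := deployments[len(deployments)-1]
	if last.version == latestVersion {
		// The active/latest deployment is never eligible for cleanup.
		return deployments[:len(deployments)-1]
	}
	// e.g. the config is at version 5 but only 4 deployments exist because the
	// newest has not been created yet: all existing deployments are candidates.
	return deployments
}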

@0xmichalis
Contributor

Utility method with side-effects ... we should try not to do this.

Good catch

@0xmichalis
Contributor

LGTM [merge]

@stevekuznetsov
Contributor Author

Good catch

Only took four hours of head scratching and questioning my sanity to find it, too ...

@0xmichalis
Contributor

errors in package "github.com/openshift/origin/pkg/deploy/api/v1":
output for "v1/conversion_generated.go" differs; first existing/expected diff: 
  "if in.RevisionHistoryLimit != nil {\n\t\tin, out := &in.RevisionHistoryLimit, &out.RevisionHistoryLimit"
  "out.RevisionHistoryLimit = in.RevisionHistoryLimit\n\tout.Test = in.Test\n\tout.Paused = in.Paused\n\tout."

goroutine 1 [running]:
github.com/openshift/origin/vendor/github.com/golang/glog.stacks(0xa11600, 0x0, 0x0, 0x0)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:766 +0xb8
github.com/openshift/origin/vendor/github.com/golang/glog.(*loggingT).output(0x9f6880, 0xc800000003, 0xc82ed6a0c0, 0x9dc5bc, 0xd, 0x52, 0x0)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:717 +0x259
github.com/openshift/origin/vendor/github.com/golang/glog.(*loggingT).printf(0x9f6880, 0xc800000003, 0x80bfd0, 0x9, 0xc8219afcf0, 0x1, 0x1)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:655 +0x1d4
github.com/openshift/origin/vendor/github.com/golang/glog.Fatalf(0x80bfd0, 0x9, 0xc8219afcf0, 0x1, 0x1)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:1153 +0x5d
main.main()
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/tools/genconversion/conversion.go:82 +0x365
make: *** [verify] Error 1

@stevekuznetsov
Contributor Author

@Kargakis conversions seem to generate fine locally... hmmm

The new RevisionHistoryLimit field on a DeploymentConfig allows users to indicate how many old ReplicationControllers they would like OpenShift to keep around to allow for rollbacks. This field is intentionally less powerful than those in `oadm prune` as it drives an administrative cluster task rather than a user-focused task like `prune`.

Signed-off-by: Steve Kuznetsov <[email protected]>
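As a rough outline of the surface the commit message describes: the field name and the nil-means-keep-everything semantics come from this PR, while the JSON tag and the exact pointer width are assumptions on my part, not a verbatim excerpt of the tree:

package v1

// DeploymentConfigSpec (excerpt; all other fields elided).
type DeploymentConfigSpec struct {
	// RevisionHistoryLimit is the number of old ReplicationControllers to
	// retain to allow for rollbacks. When nil, no cleanup is performed and
	// all old deployments are kept.
	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`
}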
@stevekuznetsov
Contributor Author

Had to get a more up-to-date rebase in; @Kargakis, the genned code should be fine now

@openshift-bot
Contributor

Evaluated for origin test up to 95d647c

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/6168/)

@stevekuznetsov
Contributor Author

flakes on #9195 and an e2e app container not starting... @Kargakis I think this is ready

@0xmichalis
Contributor

[merge]

@0xmichalis
Contributor

#9195 [merge]

@0xmichalis
Contributor

#8701 [merge]

@0xmichalis
Contributor

#9775 [merge]

@openshift-bot
Contributor

openshift-bot commented Jul 14, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/6268/) (Image: devenv-rhel7_4586)

@0xmichalis
Contributor

@stevekuznetsov you are going to hit all the flakes there are :)

#8427 [merge]

@openshift-bot
Contributor

Evaluated for origin merge up to 95d647c

@openshift-bot openshift-bot merged commit 2579f79 into openshift:master Jul 14, 2016
@0xmichalis
Contributor

🎉

@stevekuznetsov
Contributor Author

🎊 🎆
