
added cleanup policy for deployments #8691

Merged

Conversation

stevekuznetsov
Contributor

@stevekuznetsov stevekuznetsov commented Apr 29, 2016

This PR adds cleanup of past deployments to the DeploymentController, which ensures that we don't have too many replication controllers lying around as deployment configurations evolve.

fixes #6731
/cc https://bugzilla.redhat.com/show_bug.cgi?id=1312433
Part of https://trello.com/c/Mljldox7/643-5-deployments-downstream-support-for-upstream-features

@Kargakis @deads2k PTAL

@stevekuznetsov changed the title from "added cleanup policy for deployments" to "[wip] added cleanup policy for deployments" on Apr 29, 2016
@deads2k
Contributor

deads2k commented Apr 29, 2016

determine why @deads2k convinced me I didn't need an informer for RCs: right now the impl doesn't watch for RC changes, and if we're going to watch and index on them we might as well use an IndexerInformer, right?

You need notifications on add, not update or delete.

@deads2k
Contributor

deads2k commented Apr 29, 2016

determine correct policy for when caches don't contain the object we are looking for - should we allow the controller to use the client calls for a fetch?

No, the work queue will be retriggered when it observes the next add.
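To make the pattern in these two answers concrete, here is a minimal sketch of "notifications on add only, with a cache miss left for the next add to requeue the work". It uses today's client-go shared informers rather than the cache packages in the 2016 origin tree, so the package paths, helper names, and kubeconfig loading are assumptions, not code from this PR:

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Out-of-cluster config; the kubeconfig path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	rcInformer := factory.Core().V1().ReplicationControllers().Informer()

	// Only AddFunc is registered: a newly created replication controller is the
	// signal to re-evaluate cleanup for its deployment config. Updates and
	// deletes never make an old deployment newly prunable, so they are ignored.
	// A cache miss is not retried via client calls; the next observed add
	// simply requeues the key.
	rcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, rcInformer.HasSynced)

	// Worker goroutines would pop keys from queue here and run the cleanup sync.
	select {}
}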

@deads2k
Contributor

deads2k commented Apr 29, 2016

determine how many of the controller options (number of workers, max retries, resync period) should bubble up and how far? Into MasterConfig options?

Probably none of those to start. Retry forever on a very slow interval isn't too bad.

@deads2k
Contributor

deads2k commented Apr 29, 2016

determine how to best test this - test-cmd with a forced config change?

You can manually kick off new deployments. You should have unit tests for your sync loop and probably an integration test for a series of actual cleanups.

@stevekuznetsov
Contributor Author

@deads2k: the man with all the answers. coming to a theatre near you ...

// TODO: determine what to do here - should we allow manual fetch using the client calls?
}

deploymentConfig := obj.(*deployapi.DeploymentConfig)
Contributor

@ironcladlou ironcladlou Apr 29, 2016

If deploymentConfig.Spec.RevisionHistoryLimit == nil, does everything beyond this point become unnecessary work?

Contributor Author

Yes. I can exit early if the work to fetch the deployments is going to be time-intensive. Currently that nil check occurs at the

if actual, max := len(deployments), deploymentConfig.Spec.RevisionHistoryLimit; max != nil && actual > *max

line
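To make the early exit concrete, here is a self-contained sketch of the filtering being discussed, with a nil limit short-circuiting before any sorting or further work; the type and function names are illustrative, not the helpers in this PR:

package main

import (
	"fmt"
	"sort"
)

// deployment is a stand-in for the replication controllers that back old
// deployments; only the fields needed here are included.
type deployment struct {
	name    string
	version int
}

// prunable returns the oldest deployments that exceed the history limit.
// A nil limit means the cleanup policy is unset, so everything beyond this
// check would be unnecessary work and we bail out immediately.
func prunable(limit *int, deployments []deployment) []deployment {
	if limit == nil {
		return nil
	}
	sort.Slice(deployments, func(i, j int) bool {
		return deployments[i].version < deployments[j].version
	})
	if extra := len(deployments) - *limit; extra > 0 {
		return deployments[:extra]
	}
	return nil
}

func main() {
	limit := 2
	old := prunable(&limit, []deployment{{"app-1", 1}, {"app-3", 3}, {"app-2", 2}})
	fmt.Println(old) // [{app-1 1}]
}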

@ironcladlou
Contributor

ironcladlou commented Apr 29, 2016

Here are the knobs for oadm prune deployments:

--keep-complete=5: Per DeploymentConfig, the number of deployments with status complete and replica size 0 that will be preserved.
--keep-failed=1: Per DeploymentConfig, the number of deployments with status failed and replica size 0 that will be preserved.
--keep-younger-than=1h0m0s: The minimum age of a deployment for it to be considered a candidate for pruning.

Is the proposed API here intentionally kept simpler than the prune capabilities? I can imagine how, provided equivalent functionality at the API/controller level, we could set limit ranges (or something) for these fields in hosted environments, eliminating the need for external jobs that periodically do pruning.

@stevekuznetsov
Contributor Author

Is the proposed API here intentionally kept simpler

@Kargakis pointed me to an upstream implementation of this concept, which used the reduced API. I'm not opposed to making it more complicated, especially as the extra complexity is simple to add on top of the controller framework.

@stevekuznetsov force-pushed the skuznets/deployment-cleanup branch 2 times, most recently from 6b83514 to c7739df on April 29, 2016 at 20:07
@0xmichalis
Contributor

I don't think we need a separate controller for this

@liggitt
Contributor

liggitt commented Apr 29, 2016

I don't think we need a separate controller for this

I don't think rolling this into the existing controller would make the system easier to reason about

@0xmichalis
Contributor

@ironcladlou RevisionHistoryLimit supersedes prune deployments, that's true.

@0xmichalis
Contributor

I don't think rolling this into the existing controller would make the system easier to reason about

Why not? Cleanup should be best-effort at best.

@deads2k
Contributor

deads2k commented Apr 29, 2016

I don't think we need a separate controller for this

Separating this out from the main controller:

  1. lowers risk for adding this change, any fixes we need for it, and any new features we want for it (see @ironcladlou's list as a for instance)
  2. does not increase cost since we'll be sharing existing informers and caches
  3. keeps separate concerns separate
  4. makes smaller controllers which are easier for people to reason about, especially new contributors.

Since this controller doesn't rely on any knowledge that is contained in the existing controller, I don't see a reason to combine them.

@deads2k
Contributor

deads2k commented Apr 29, 2016

Why not? Cleanup should be best-effort at best.

It should be best effort, but it should also be separate. I want it to be impossible for someone writing bad cleanup code to mess up one of our core controllers. I'm not seeing why I would want to link these together. I don't see any shared information.

@0xmichalis
Contributor

So we already have 5 controllers for handling deployments. Coupled with the fact that the main controller already has all the information available to clean up older deployments, I don't think I will ever consider having an additional controller just for this. It adds unnecessary maintenance and performance burden. The upstream controller has handled cleaning up fine so far.

@0xmichalis
Contributor

lowers risk for adding this change, any fixes we need for it, and any new features we want for it (see @ironcladlou's list as a for instance)

This is really just a filter.

does not increase cost since we'll be sharing existing informers and caches

I don't have any data to back this up, but I would guess an additional process is more expensive than not having one. What is the cost of launching a new controller? What impact does it have on the whole system?

keeps separate concerns separate

This is really a concern for the main controller.

makes smaller controllers which are easier for people to reason about, especially new contributors

We are in the middle of integrating with upstream deployments. What happens if a deployment is eligible to be handled by the upstream controller? Will both controllers try to clean it up? What happens if a cluster is running only Kubernetes deployments? Are we launching a separate process for nothing?

@0xmichalis
Contributor

0xmichalis commented Apr 29, 2016

Are there any other examples of controllers handling different parts of the spec for a single resource?

@smarterclayton
Contributor

So..... I had a question.

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

@deads2k
Contributor

deads2k commented Apr 29, 2016

We split service account maintenance tasks across multiple controllers to make each one easier to reason about.

@deads2k
Contributor

deads2k commented Apr 29, 2016

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

I don't have a vested interest either way, but I'd guess that making use of existing caches to avoid full lists of every RC and DC in the system is preferable. It also keeps it closer to being in sync.

@0xmichalis
Contributor

So..... I had a question.

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

Eventually we need the field for a faithful conversion with upstream. But a controller for running a loop that deletes replication controllers with zero replicas seems like overkill.

@smarterclayton
Contributor

I don't think that's measurable in the scheme of things, and this is way more expensive from a dev perspective.

On Apr 29, 2016, at 6:29 PM, David Eads [email protected] wrote:

Why wouldn't we just run oc prune deployments in a container, in a bash for loop? And also run oc prune builds. And oc prune images? Implementing this other controller.... seems... like more work than using the thing that already does this.

I don't have a vested interest either way, but I'd guess that making use of existing caches to avoid full lists of every RC and DC in the system is preferable. It also keeps it closer to being in sync.



@0xmichalis
Contributor

Cleaning up after making sure the deploymentconfig has been reconciled is what makes sense to me. Having the cleanup running asynchronously in a separate process is a source of bugs waiting to happen.

@smarterclayton
Contributor

smarterclayton commented Apr 29, 2016 via email

@stevekuznetsov
Contributor Author

Utility method with side-effects ... we should try not to do this.

@stevekuznetsov
Contributor Author

will squash on green [test]s

@stevekuznetsov
Contributor Author

Test passed, @Kargakis ready for a look

// if cleanup policy is set but no successful deployments have happened, there will be
// no active deployment. We can consider all of the deployments in this case except for
// the latest one
return deployments[:len(deployments)-1]
Contributor

You need to make sure the last deployment is indeed the latest one. Otherwise, you are disregarding a possible candidate.

Contributor Author

The sort above accomplishes this, and it only exists there for the express purpose of this line

Contributor

You have a dc with version=5 but only 4 deployments exist (the new one is yet to be created). The last deployment is not the latest and is subject to the cleanup policy.
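A short sketch of the check being asked for here, reusing the illustrative deployment type from the earlier sketch: the last element is excluded only when it really is the latest deployment; otherwise every existing deployment is a candidate. The latestVersion parameter is an assumption standing in for the config's latest version field, not the exact code in this PR:

// candidates assumes deployments are sorted oldest-first by version.
func candidates(latestVersion int, deployments []deployment) []deployment {
	if len(deployments) == 0 {
		return nil
	}
	last := deployments[len(deployments)-1]
	if last.version == latestVersion {
		// The active/latest deployment is never eligible for cleanup.
		return deployments[:len(deployments)-1]
	}
	// e.g. the config is at version 5 but only 4 deployments exist because the
	// newest has not been created yet: all existing deployments are candidates.
	return deployments
}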

@0xmichalis
Contributor

Utility method with side-effects ... we should try not to do this.

Good catch

@0xmichalis
Contributor

LGTM [merge]

@stevekuznetsov
Contributor Author

Good catch

Only took four hours of head scratching and questioning my sanity to find it, too ...

@0xmichalis
Contributor

errors in package "github.com/openshift/origin/pkg/deploy/api/v1":
output for "v1/conversion_generated.go" differs; first existing/expected diff: 
  "if in.RevisionHistoryLimit != nil {\n\t\tin, out := &in.RevisionHistoryLimit, &out.RevisionHistoryLimit"
  "out.RevisionHistoryLimit = in.RevisionHistoryLimit\n\tout.Test = in.Test\n\tout.Paused = in.Paused\n\tout."

goroutine 1 [running]:
github.com/openshift/origin/vendor/github.com/golang/glog.stacks(0xa11600, 0x0, 0x0, 0x0)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:766 +0xb8
github.com/openshift/origin/vendor/github.com/golang/glog.(*loggingT).output(0x9f6880, 0xc800000003, 0xc82ed6a0c0, 0x9dc5bc, 0xd, 0x52, 0x0)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:717 +0x259
github.com/openshift/origin/vendor/github.com/golang/glog.(*loggingT).printf(0x9f6880, 0xc800000003, 0x80bfd0, 0x9, 0xc8219afcf0, 0x1, 0x1)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:655 +0x1d4
github.com/openshift/origin/vendor/github.com/golang/glog.Fatalf(0x80bfd0, 0x9, 0xc8219afcf0, 0x1, 0x1)
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/golang/glog/glog.go:1153 +0x5d
main.main()
    /data/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/tools/genconversion/conversion.go:82 +0x365
make: *** [verify] Error 1

@stevekuznetsov
Contributor Author

@Kargakis conversions seem to generate fine locally... hmmm

The new RevisionHistoryLimit field on a DeploymentConfig allows users to indicate how many old ReplicationControllers they would like OpenShift to keep around to allow for rollbacks. This field is intentionally less powerful than those in `oadm prune` as it drives an administrative cluster task rather than a user-focused task like `prune`.

Signed-off-by: Steve Kuznetsov <[email protected]>
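As a rough outline of the surface the commit message describes: the field name and the nil-means-keep-everything semantics come from this PR, while the JSON tag and the exact pointer width are assumptions on my part, not a verbatim excerpt of the tree:

package v1

// DeploymentConfigSpec (excerpt; all other fields elided).
type DeploymentConfigSpec struct {
	// RevisionHistoryLimit is the number of old ReplicationControllers to
	// retain to allow for rollbacks. When nil, no cleanup is performed and
	// all old deployments are kept.
	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`
}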
@stevekuznetsov
Contributor Author

Had to get a more up-to-date rebase in; @Kargakis, the genned code should be fine now

@openshift-bot
Contributor

Evaluated for origin test up to 95d647c

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/6168/)

@stevekuznetsov
Contributor Author

flakes on #9195 and an e2e app container not starting... @Kargakis I think this is ready

@0xmichalis
Contributor

[merge]

@0xmichalis
Contributor

#9195 [merge]

@0xmichalis
Contributor

#8701 [merge]

@0xmichalis
Contributor

#9775 [merge]

@openshift-bot
Contributor

openshift-bot commented Jul 14, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/6268/) (Image: devenv-rhel7_4586)

@0xmichalis
Contributor

@stevekuznetsov you are going to hit all the flakes there are :)

#8427 [merge]

@openshift-bot
Contributor

Evaluated for origin merge up to 95d647c

@openshift-bot openshift-bot merged commit 2579f79 into openshift:master Jul 14, 2016
@0xmichalis
Contributor

🎉

@stevekuznetsov
Contributor Author

🎊 🎆
