
[WIP] Split controllers into thin workqueue controllers and "receivers". #1110

Closed

Conversation

Member

@mattmoor commented Jun 8, 2018

This is an experimental refactor to illustrate a proposal. I don't plan to land this as a monolithic commit, but wanted to show what things might look like in this new world.

The background of this change is that as ./pkg/controller/... has grown, it's become more difficult to follow. In addition, the way in which we watch assorted objects is inconsistent, which may mean there are varying degrees of subtle bugs waiting to bite us (related: #823). This change isn't intended to solve all of these problems, but will hopefully be a step in the right direction.

The general idea is illustrated in this diagram:
[Diagram: proposal for refactoring our controllers]

Legend:

  • Bold lines indicate the primary resource a given receiver watches.
  • Dotted lines reflect "watches" that we should probably have, but do not have today.
  • Solid lines indicate the secondary / sub-resources a given receiver watches.

pkg/controller/...

In this proposal, these directories would stop holding business logic and become thin workqueues for handling events we are sensitive to. So how does it work?

We would define a fairly boilerplate (possibly codegen-able) pkg/controller/foo for each type foo we are watching. Today each of these contains the same boilerplate skeleton we've been using. It also defines a trivial Receiver interface used to deliver events to suitable receivers, e.g.:

package foo

// Receiver is implemented by anything that wants to reconcile in response to Foo events.
type Receiver interface {
  SyncFoo(f *v37.Foo) error
}

At construction, each controller iterates over a list of potential receivers and keeps those that implement its Receiver interface:

for _, rcvr := range possibleReceivers {
  // Note: assert the Receiver interface itself, not a pointer to it.
  if fr, ok := rcvr.(Receiver); ok {
    controller.receivers = append(controller.receivers, fr)
  }
}

if len(controller.receivers) == 0 {
  log.Fatalf("This controller matched no receivers. Remove it if no longer needed.")
}

As events are received, the controller will delegate reconciliation to the matching receivers:

// This is an example; we could also fan out goroutines with errgroup, or something else.
for _, fr := range c.receivers {
  if err := fr.SyncFoo(foo); err != nil {
    // Returning requeues the item, so every receiver gets retried together.
    return err
  }
}
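
For concreteness, here is a minimal sketch of the errgroup variant mentioned in the comment above; the Foo, Receiver, and Controller types are stand-ins for the placeholders in this proposal, not code from this PR.

package foo

import "golang.org/x/sync/errgroup"

// Foo stands in for the watched resource type (v37.Foo above).
type Foo struct{ /* spec, status, ... */ }

// DeepCopy mirrors the generated deepcopy helpers on real API types.
func (f *Foo) DeepCopy() *Foo { c := *f; return &c }

type Receiver interface {
  SyncFoo(f *Foo) error
}

type Controller struct {
  receivers []Receiver
}

// syncAll fans reconciliation out to every receiver concurrently and surfaces
// the first error; a failure still requeues the whole item, so retries remain
// coarse-grained (see the review discussion below).
func (c *Controller) syncAll(original *Foo) error {
  var group errgroup.Group
  for _, fr := range c.receivers {
    fr := fr                  // capture the loop variable for the goroutine
    cp := original.DeepCopy() // give each receiver its own copy
    group.Go(func() error {
      return fr.SyncFoo(cp)
    })
  }
  return group.Wait()
}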

pkg/receiver/...

We will have pkg/receiver/bar 1:1 with today's notion of controllers, so {revision, route, service, configuration}. Each bar implements some set of foo.Receiver interfaces:

  • foo == bar: the bar receiver generally implements bar.Receiver (e.g. pkg/receiver/revision implements revision.Receiver).
  • foo != bar: the bar receiver may also implement foo.Receiver for other resources it wants to observe (e.g. pkg/receiver/revision implements deployment.Receiver); see the sketch below.
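
To make the two cases concrete, here is a rough, simplified sketch of the shape described here; the revctl/depctl interface package paths and the empty method bodies are illustrative guesses, not the actual diff.

package revision // i.e. pkg/receiver/revision

import (
  appsv1 "k8s.io/api/apps/v1"

  "github.com/knative/serving/pkg/apis/serving/v1alpha1"

  depctl "github.com/knative/serving/pkg/controller/deployment"
  revctl "github.com/knative/serving/pkg/controller/revision"
)

// Receiver holds the clients, listers, and config the business logic needs.
type Receiver struct {
  // kubeClient, elaClient, listers, ...
}

// Compile-time assertions: this receiver handles its own type (foo == bar)
// plus a sub-resource it watches (foo != bar).
var _ revctl.Receiver = (*Receiver)(nil)
var _ depctl.Receiver = (*Receiver)(nil)

// SyncRevision reconciles the primary resource.
func (r *Receiver) SyncRevision(rev *v1alpha1.Revision) error {
  // Reconcile the Revision's deployment, service, autoscaler, etc.
  return nil
}

// SyncDeployment reacts to changes in a Deployment owned by a Revision.
func (r *Receiver) SyncDeployment(d *appsv1.Deployment) error {
  // Reflect the Deployment's readiness into the owning Revision's status.
  return nil
}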

Within pkg/receiver/bar/... in this PR, the files are organized as follows:

  • receiver.go: defines the receiver type, interface assertions, and New() constructor.
  • foo.go: the portion of the receiver implementation that implements foo.Receiver; by convention, SyncFoo is at the top of the file.
  • helper.go: holds misc shared utilities within the package.
  • The existing sibling files from the controllers were simply moved over.

cc @evankanderson @vaikas-google @grantr

@google-prow-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) labels on Jun 8, 2018
@google-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 8, 2018
@google-prow-robot

@mattmoor: The following tests failed, say /retest to rerun them all:

Test name | Commit | Rerun command
pull-knative-serving-go-coverage | b0cf75f | /test pull-knative-serving-go-coverage
pull-knative-serving-unit-tests | b0cf75f | /test pull-knative-serving-unit-tests
pull-knative-serving-build-tests | b0cf75f | /test pull-knative-serving-build-tests
pull-knative-serving-integration-tests | b0cf75f | /test pull-knative-serving-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

return err
}

for _, dr := range c.receivers {
Contributor

IIUC, the effects of this loop are:

  • All receivers must succeed to mark this item successful in the workqueue.
  • If a single receiver returns an error, all following receivers are skipped and the item is retried later (using the default workqueue retry rules).
  • Receivers that return success before a later receiver returns an error will be tried again with the same item (using the default workqueue retry rules).
  • Only one receiver may process an item at a time.

I'm concerned about the conflation of success/failure of all receivers into the failure state of a single queue item. Issues that come to mind include:

  • If a receiver consistently fails to process an event successfully, receivers later in the list never see that event.
  • Later receivers must wait until all prior receivers have finished before processing an event, leading to higher reconcile latencies for those receivers.

Randomizing or reversing the order of the list would mitigate these issues somewhat, but I suspect they'd still cause hard-to-debug problems occasionally.

Member Author

Yes, I specifically included this ~disclaimer in the pseudo-code of the body above for this reason:

// This is an example, we could also fan out goroutines with errgroup, or something else.
for _, fr := range c.receivers {

It would certainly mean a coarser grain of retries to go this route, but I'm unsure if that's a thing we should optimize for.

Contributor

Good point, errgroup would fix the starvation and latency issues. The retries thing feels vaguely incorrect but I agree it doesn't seem like a blocker right now.

It's possible that in larger clusters the additional reconciles would cause unacceptable apiserver load, but at that point there's nothing stopping the internals from changing to one workqueue per receiver. The Receiver interface remains the same.
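
For illustration, a minimal sketch of that internal change (names invented here, not code from this PR): one rate-limited queue per receiver, fed by the same event handlers, with the Receiver interface untouched.

package foo

import "k8s.io/client-go/util/workqueue"

// Foo and Receiver are the placeholder types from the proposal's pseudo-code.
type Foo struct{}

type Receiver interface {
  SyncFoo(f *Foo) error
}

// perReceiverQueues pairs each receiver with its own workqueue so retries and
// backoff are tracked independently per receiver.
type perReceiverQueues struct {
  receivers []Receiver
  queues    []workqueue.RateLimitingInterface
}

func newPerReceiverQueues(receivers []Receiver) *perReceiverQueues {
  p := &perReceiverQueues{receivers: receivers}
  for range receivers {
    p.queues = append(p.queues,
      workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()))
  }
  return p
}

// enqueue fans a key out to every receiver's queue; a failure in one receiver
// then only requeues the key for that receiver.
func (p *perReceiverQueues) enqueue(key string) {
  for _, q := range p.queues {
    q.Add(key)
  }
}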


I'm not aware of any controller across the kube ecosystem that has succeeded in having shared workqueues, since it becomes really difficult to reason about the state of the caches relative to the action. Generally the most successful patterns are where you try to get all of your work into a single action (level driven) and queue up synchronizations of that. I've seen that scale to the largest kube clusters in the world, so you're unlikely to go too far.

Contributor

@vaikas left a comment

Just a quick question about the model before I dig deeper.


for _, dr := range c.receivers {
// Don't modify the informer's copy, and give each receiver a fresh copy.
cp := original.DeepCopy()
Contributor

So is this really what we want here? If a receiver (say the first one) modifies the object, then by copying the original we're sending an out-of-date object to the next receiver (say the second one). If that receiver tries to update it, the update will be rejected by k8s because the object is out of date, or the receiver will have to explicitly fetch a newer version of the object.

Member Author

This is more important in the case of parallelism (see the other thread about using goroutines here).

Member

It feels like it would be more correct to give each receiver the resource key and a lookup/update function. The controller (that name seems incorrect given this new pattern -- maybe "observer"?) could then perform caching, fetches, and possibly even merging as needed.

Unfortunately, if all of our controllers are actually using GetControllerOf, they'll probably all need to fetch the object in order to figure out if they should be acting on it.
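
One possible shape for the "key plus lookup/update function" idea, sketched with invented names (Observation, Lookup, Update); nothing below is from this PR.

package foo

// Foo stands in for the watched resource type from the proposal's pseudo-code.
type Foo struct{ /* spec, status, ... */ }

// Observation is what the observer would hand to a receiver instead of a
// concrete object: the key identifying the change, plus accessors the observer
// controls (cache reads, conflict-aware updates, ownership filtering).
type Observation struct {
  // Key is the namespace/name of the changed object.
  Key string
  // Lookup returns the current cached version of the object.
  Lookup func() (*Foo, error)
  // Update writes back a modified copy and returns the stored version.
  Update func(*Foo) (*Foo, error)
}

// Receiver would then take the Observation rather than a *Foo.
type Receiver interface {
  SyncFoo(obs Observation) error
}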

Member

@evankanderson left a comment

I think I'd rather see some sort of explicit registration of type -> observed object, rather than the implicit format we have here (mostly as I noticed that e.g. route implements a Receiver for 3-4 types, but only one of those is type-asserted).
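
For instance (illustrative only, not code from this PR), each receiver could explicitly declare the kinds it observes instead of relying on type assertions:

package receiver

import "k8s.io/apimachinery/pkg/runtime/schema"

// Registration explicitly pairs a receiver with the resource kinds it wants to
// observe, replacing the implicit "does it type-assert to foo.Receiver?" check.
type Registration struct {
  Receiver interface{}
  Observes []schema.GroupVersionKind
}

// Example: the route receiver would declare everything it watches in one place.
// The kinds listed here are only examples.
func routeRegistration(r interface{}) Registration {
  return Registration{
    Receiver: r,
    Observes: []schema.GroupVersionKind{
      {Group: "serving.knative.dev", Version: "v1alpha1", Kind: "Route"},
      {Group: "serving.knative.dev", Version: "v1alpha1", Kind: "Configuration"},
      {Group: "", Version: "v1", Kind: "Service"},
    },
  }
}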

revision.NewController(kubeClient, elaClient, kubeInformerFactory, elaInformerFactory, buildInformerFactory, cfg, &revControllerConfig, logger),
route.NewController(kubeClient, elaClient, kubeInformerFactory, elaInformerFactory, cfg, *controllerConfig, autoscaleEnableScaleToZero, logger),
service.NewController(kubeClient, elaClient, kubeInformerFactory, elaInformerFactory, cfg, *controllerConfig, logger),
// The receivers are what implement our core logic. Each of these subscribe to some subset of the resources for which we
Member

"Receiver" feels like a funny name for core logic.

What about "Implementation", "Director", "Manager", or "Producer"?

Member Author

Totally up for a naming 🚲🏠 once we settle on if/what we're naming :)

receivers := []interface{}{
revrcv.New(kubeClient, elaClient, kubeInformerFactory, elaInformerFactory, cfg, &revControllerConfig, logger),
rtrcv.New(kubeClient, elaClient, kubeInformerFactory, elaInformerFactory, cfg, *controllerConfig, autoscaleEnableScaleToZero, logger),
cfgrcv.New(kubeClient, elaClient, buildClient, kubeInformerFactory, elaInformerFactory, cfg, *controllerConfig, logger),
Member

Nit/note: it feels to me like a common "Options" struct would be useful here for configuring these receivers. In particular, it feels like there are small differences in the arguments (e.g. the configuration takes a buildClient as the 3rd argument, but the others take additional arguments around the 6th position, with logger always last) which makes this harder to use.

Member Author

Yeah, totally agree. We see the same thing with the controller constructors today (initially we were trying to use a common ctor and a loop to cut down on this boilerplate, but ultimately unrolled and diverged).

Following on @mdemirhan's initial refactoring, we should perhaps define controller.Options with the common arguments taken by controller.Base, and a majority of these argument lists would shrink dramatically. I can open an issue to track this discussion.
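
A rough sketch of what such a controller.Options could look like; the field set here is a guess based on the constructor arguments quoted above, not part of this PR.

package controller

import (
  "go.uber.org/zap"

  kubeinformers "k8s.io/client-go/informers"
  "k8s.io/client-go/kubernetes"
)

// Options bundles the dependencies shared by every controller/receiver
// constructor, so individual New() signatures shrink to Options plus whatever
// is genuinely type-specific.
type Options struct {
  KubeClientSet       kubernetes.Interface
  KubeInformerFactory kubeinformers.SharedInformerFactory
  Logger              *zap.SugaredLogger
  // The serving/build clientsets, informer factories, and config would live
  // here too; omitted rather than guessing their exact types.
}

Each constructor would then read something like New(opts Options, ...) with only the type-specific arguments spelled out.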

if err != nil {
logger.Fatalf("Error creating Ingress controller: %v", err)
}
controllers := []controller.Interface{rtc, svc, config, rc, dc, ec, bc, ic}
Member

It seems like there may be two different goals here:

  1. Reduce the number of informers/watches managed by the controller.
  2. Normalize the process of having changes delivered to the business logic.

I'm presuming the priority is on #2, and not on #1?

if dr, ok := rcv.(Receiver); ok {
controller.receivers = append(controller.receivers, dr)
}
}
Member

I'd suggest doing this block above all the informer setup; that way there's no need to do any cleanup in the error case. (In a possible future where we want to be able to flag-enable certain features/directors.)


// syncHandler compares the actual state with the desired, and attempts to
// converge the two. It then updates the Status block of the Build
// resource with the current status of the resource.
Member

Is this comment still accurate?

"If the Deployment controller is no longer needed it should be removed.")
}
return controller, nil
}
Member

It feels like there's a lot of duplicated code for these controllers/observers. I'm guessing that's so we can more easily see what's going on, and that the final code will be factored so that each observer is ~2 dozen lines?

  • Receiver interface
  • Type/lister functions
  • Registration of constructor

Member Author

@evankanderson Yes, in fact each is a copy/paste and search/replace of a stripped down implementation. I think the goal of these should be: so boilerplate we stop reviewing them, which to me says: "codegen" (thus the caption on the diagram). If we can reach that point through other means, even better :)

return err
}

revClient := c.ElaClientSet.ServingV1alpha1().Revisions(config.Namespace)
Member

This feels like it should go through the Revision observer, rather than doing direct gets against the clientset. WDYT?

Member

On the other hand, I worry about going full ORM here -- maybe standardizing the update interface more formally is sufficient?

)

// SyncRevision implements revision.Receiver
func (c *Receiver) SyncRevision(revision *v1alpha1.Revision) error {
Member

Slight packaging note -- I'm not sure that I like splitting the controller/manager logic across multiple files in this way. I feel like it becomes harder to actually grok what the controller is doing.

I do like defining the "Receiver" interfaces to create some common interfaces for managing updates for different types. Passing in some sort of RevisionObservation rather than the concrete Revision object would allow us to provide additional information later if needed, such as the previous state of the object from the EventHandler's UpdateFunc, or timing information.
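
A sketch of that RevisionObservation idea (the names and fields here are invented for illustration, not from this PR):

package revision

import (
  "time"

  "github.com/knative/serving/pkg/apis/serving/v1alpha1"
)

// RevisionObservation wraps the concrete object so additional context can be
// added later without changing every SyncRevision signature.
type RevisionObservation struct {
  // New is the current version of the object.
  New *v1alpha1.Revision
  // Old is the previous version from the EventHandler's UpdateFunc (nil on add).
  Old *v1alpha1.Revision
  // ObservedAt records when the event was seen, e.g. for latency measurement.
  ObservedAt time.Time
}

// Receiver would then accept the observation rather than the bare object.
type Receiver interface {
  SyncRevision(obs RevisionObservation) error
}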

Member Author

This organization was to try to improve the clarity of the entrypoints to reconciliation. We can avoid this in a follow-up.

// TODO(mattmoor): Move controller deps into receiver
"github.com/knative/serving/pkg/controller"
)

Member

"helper.go" seems like a kitchen sink -- I'd rather see these methods distributed to be close to their users, or next to the definition of the Receiver struct if most appropriate.

}

// Receiver implements ingress.Receiver
var _ ingress.Receiver = (*Receiver)(nil)
Member

I think it also implements configuration.Receiver?

Though I think these type assertions are optional anyway, right?

@mattmoor
Member Author

@evankanderson FYI, this was meant to be illustrative in its completeness for the purposes of high-level RFC. It is not my intention to check this in as a monolith, but to take the parts of it that folks find valuable and increment towards this sort of model (probably the same way I built this up with lots of functional checkpoints). Otherwise I'll live in merge hell until I put everyone else into merge hell, and nobody wants that :)

Totally ack your nits, and I'll try to keep these in mind if/when we pursue parts of this in follow up. Apologies for not making my intentions with this change clearer, and thanks for all of the feedback.

@mattmoor
Member Author

@mdemirhan suggested that we may want to add a context.Context as the first argument to the SyncFoo methods.
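
For illustration only (not part of this PR), the placeholder interface from the proposal body would then become:

package foo

import "context"

// Foo is the placeholder resource type from the proposal body (v37.Foo there).
type Foo struct{}

type Receiver interface {
  // ctx would carry deadlines, cancellation, and request-scoped values such as loggers.
  SyncFoo(ctx context.Context, f *Foo) error
}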

@mattmoor closed this Jun 13, 2018
skonto pushed a commit to skonto/serving that referenced this pull request May 4, 2022