Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Proposal for Broker relist API #1086

Closed
arschles opened this issue Aug 1, 2017 · 47 comments
Closed

Proposal for Broker relist API #1086

arschles opened this issue Aug 1, 2017 · 47 comments

Comments

@arschles
Copy link
Contributor

arschles commented Aug 1, 2017

Proposal: Broker Sync API

This document is a proposal for adding the API surface area and implementation to service-catalog for adding the to sync an individual broker server's resources.

It is a formalization of issue #705, which introduces most of these ideas.

Problem

There exists a need in the service-catalog system to keep the Kubernetes resources (i.e. ServiceClass) in sync with the broker servers that the system knows about (via Broker resources). This proposal calls this action "syncing".

As of this writing (8/1/2017), the service-catalog controller manager polls all broker servers on the same hard-coded time duration to keep resources up to date.

We need to add the ability for a user to sync a specific, individual broker via two methods:

  • On a specific frequency, defined by a time duration
  • Manually

High Level Solution

We propose the following additions to Broker resources:

  • spec.syncBehavior - either Duration or Manual
    • In the case that Duration is specified, the spec.syncDuration field may be set (see below)
    • In the case that Manual is specified, the broker's sync subresource (i.e /brokers/sync, a sibling to the current /brokers/status) will be used to trigger any future sync
    • If this field is omitted, Duration is the default
  • spec.syncDuration - the frequency, expressed in 1m5s format, by which the controller re-syncs with the broker server. If this field is omitted, the default is 15m. This field is only valid if spec.syncBehavior is set to Duration
  • spec.relistRequests - a strictly-increasing integer counter that can be manually incremented by a user to manually trigger a re-sync
  • A new sync subresource (i.e. similar to /brokers/status) - When this subresource is requested with a PUT, the referenced broker's spec.relistRequests field will be incremented, causing a re-sync (see the previous item)

Additional Details

Duration Sync

If a Broker resource has a spec.syncBehavior set to Duration, an operator may still trigger a manual re-sync. Regardless of how a re-sync was executed, the controller will always measure Duration based on the time since the last successful re-sync.

Sync Frequency Flag

The controller currently has a command line flag to specify the global sync frequency (as a time duration). This flag should be deprecated and ignored by the controller now, and removed before our beta release. We may optionally add a new flag to specify the default value to use when spec.syncDuration is missing but required.

Porcelain Command (kubectl plugin)

Additionally, since this proposal adds a /brokers/sync subresource, we will need to add a new "porcelain" command, implemented as a kubectl plugin (plugin work is ongoing in PR #840), that exposes the PUT call on thus subresource.

Failure Handling

Currently, when the controller encounters a failure, it begins trying to re-sync on an exponential backoff. This behavior should continue regardless of whether it is in Duration or Manual sync mode.

Error Reporting

Currently, when the controller does an initial or re-sync operation, and it encounters an error, it writes an error message to the Broker resource's conditions field. It should continue to do so regardless of manual or automatic resync

Implementation Notes

This section contains details on how we would go about implementing this specific proposal. It can be amended to accommodate changes to the proposal.

Validations

The following validations should be added to the service-catalog API server:

  • spec.syncBehavior must be either Duration or Manual
  • spec.syncDuration must not be set if spec.syncBehavior is set to Manual
  • spec.syncDuration must be a valid value if it is set. Note that the Go time.ParseDuration function is suitable for use in checking and parsing this value
  • spec.relistRequests must be greater than the value it was prior

Testing

  • Tests, including the fuzzer tests, should ensure that spec.syncBehavior defaults to Duration and spec.syncDuration defaults to 15m.

Work Items

We believe that the tasks represented to implement this proposal cleanly separate into the following pull requests (PRs) in order:

  1. A PR that deprecates and ignores the controller command line flag, while defaulting all brokers to spec.syncBehavior = Duration and spec.syncDuration = 15m
  2. A PR that adds the new API fields, validations and controller implementation that checks the new API fields
    • Note that the /brokers/sync subresource won't be implemented at this point, but spec.syncBehavior = Manual will still be possible because the controller still will pay attention to spec.relistRequests and users can still manually increase spec.relistRequests
  3. A PR that adds the new /brokers/sync subresource. This makes spec.syncBehavior = Manual possible, but not convenient because the "porcelain CLI command" ('kubectl` plugin) won't yet be available
  4. A PR that adds the "porcelain CLI command". This makes spec.syncBehavior = Manual easy to use

Requesting Reviews From

@pmorie
Copy link
Contributor

pmorie commented Aug 1, 2017

cc @eriknelson

@pmorie
Copy link
Contributor

pmorie commented Aug 1, 2017

We may optionally add a new flag to specify the default value to use when spec.syncDuration is missing but required.

I suggest that this not be in scope for the initial implementation of this proposal

@duglin
Copy link
Contributor

duglin commented Aug 2, 2017

just so I fully grok it, what is the type of /brokers/sync ?

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

spec.syncBehavior - either Duration or Manual

Just to be clear while the syncBehavior is set to Duration, I should still be able to use the /broker/sync endpoint to force a manual sync now (or as soon as possible). Which could mean that a sync triggered by a duration is either missed or skipped (depending on how long syncs take).

But I definitely need to be able to manually trigger AND have it sync on a regular interval.

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

adding the to sync an individual broker server's resources.

TYPO: adding sync to an individual...

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

exposes the PUT call on thus subresource.

TYPO: thus -> this

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

Additionally, since this proposal adds a /brokers/sync subresource, we will need to add a new "porcelain" command, implemented as a kubectl plugin (plugin work is ongoing in PR #840), that exposes the PUT call on thus subresource.

Does the PUT take any arguments or expect any input? Or is it enough to be a PUT /brokers/sync?

@eriknelson
Copy link
Contributor

eriknelson commented Aug 2, 2017

But I definitely need to be able to manually trigger AND have it sync on a regular interval

This would be my question: if syncBehavior is set to Duration, it's still possible to PUT /brokers/sync to force a sync?

Also, is Duration defined as a timeout from last sync, I.E., a manual sync will reset that timer?

@pmorie
Copy link
Contributor

pmorie commented Aug 2, 2017

Also, is Duration defined as a timeout from last sync, I.E., a manual sync will reset that timer?

Yes

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

We may optionally add a new flag to specify the default value to use when spec.syncDuration is missing but required.

While this might be not part of this PR (based on @pmorie comment). Won't this be a separate work item?

  1. A PR that adds the option to specify a default value to the "porcelain CLI command".

Or something like that.

@pmorie
Copy link
Contributor

pmorie commented Aug 2, 2017

A PR that adds the option to specify a default value to the "porcelain CLI command".

I'm not sure what you mean here, @jmrodri - can you clarify?

@pmorie pmorie changed the title Proposal for Syncing Brokers Proposal for Broker relist API Aug 2, 2017
@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

@pmorie work item #4 states to add the porcelain command. The proposal also mentioned an "optional" parameter which we deemed not part of this PR. But that optional parameter would still need work to add. So it would seem that would need its own work item. Or we can remove that entire sentence from this proposal.

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

Overall @arschles @pmorie this looks good, assuming the manual trigger can happen even with duration set.

@vaikas
Copy link
Contributor

vaikas commented Aug 2, 2017

I'm curious why /brokers/sync is not feasible. What are the blockers for it.
Also, spec.syncDuration is that the duration after successful sync. Seems like if a sync is set to 1 day, we make a call and it fails, will we retry some number of times (exp backoff, etc.), or do we simply try once and then wait until the next day.

Furthermore, I'd like to at least address what happens on error cases and how they are communicated to the user? For example, upon initial broker registration if a broker fails for one or another reason, what should happen here? Are we going to set the status to failed? Same if for example we get malformed response from the broker what should happen here.

@MHBauer
Copy link
Contributor

MHBauer commented Aug 2, 2017

I don't like the word sync as it implies to me two way communication. The platform could be pushing updates to the broker. I know that's not possible with the osbapi.

with respect to global frequency and validations, I don't understand why to get rid of the flag and not use it for the default.
If we're going to delete the flag, then specifying duration without providing a duration should be an error. There should never be any defaulting.

Agree on describing the error cases.

@liggitt
Copy link

liggitt commented Aug 2, 2017

A new sync subresource (i.e. similar to /brokers/status) - When this subresource is requested with a PUT, the referenced broker's status will be cleared and the controller will automatically re-sync with the broker server specified in the Broker resource

There are a few problems with using the cleared status as a signal to update:

  • Representing user intent ("sync now") only in the status doesn't match the spec/status split used elsewhere
  • This approach loses the information contained in the status while the sync is occurring
  • This approach does not make it possible to ensure that a sync occurred after the requested sync. For example:
    • t=1, user modifies the remote broker
    • t=2, user puts to /brokers/sync, which clears the status
    • t=3, controller observes cleared status, starts sync, and captures broker info
    • t=4, user modifies the remote broker again
    • t=5, user puts to /brokers/sync, which clears the status (but is a no-op since the status is already cleared)
    • t=6, controller updates status with results of sync started at t=3

@jmrodri
Copy link
Contributor

jmrodri commented Aug 2, 2017

@MHBauer the original issue #705 used the term relist. Is that better than sync or something entirely different?

@pmorie
Copy link
Contributor

pmorie commented Aug 2, 2017

Alright, so I learned something today about Kubernetes, thanks to @liggitt, and I think we need to reformat some of the particulars around manual syncing (as well as whether we should be using a checksum).

More information coming soon.

@pmorie
Copy link
Contributor

pmorie commented Aug 2, 2017

I think we need to talk about this in SIG architecture, because we're butting up against a problem other APIs have which is unsolved.

Here's the generic formulation of the problem:

What is the right API construction to allow:

  • a controller to know that it has reconciled the latest version of an object's spec
  • a user to manually indicate that an object should be re-reconciled

@pmorie
Copy link
Contributor

pmorie commented Aug 2, 2017

So, there is a field in ObjectMeta called Generation that represents a specific version of an object's spec.

This field is used to solve the 'have i reconciled this version of this object's spec yet?' problem as follows:

  • The resource's REST strategy bumps the Generation field when the the spec changes
  • The resource's status has a field called ObservedGeneration that contains the last successfully reconciled value of Generation
  • When a controller reconciles an object, it can check to see whether the Generation and Status.ObservedGeneration match; if they match there is no work to do.

This seems like a much better solution than using a checksum. However, with the Broker resource, there is an additional wrinkle that we need to be able to indicate that a manual reconciliation has been requested.

I think that we should go to SIG architecture to discuss this. @liggitt and I both agree that we would bet on this being the ultimate guidance:

  1. Use Generation and ObservedGeneration to detect spec changes
  2. When a resource requires that a user express the need for manual re-reconcilation:
    1. That resource should have a field called ManualResync whose value is meaningless, but that can be manipulated by the user (or the subresource) in order to change the spec
    2. Changing ManualResync forces the generation to increment, which forces the controller to re-reconcile.

I don't think this needs to block work from happening on this subject, and we should continue with the planned design, but we should hold off on the manual sync part until we get some guidance from SIG architecture (next Monday).

@liggitt - does this faithfully recreate what we discussed?

@MHBauer
Copy link
Contributor

MHBauer commented Aug 2, 2017

@jmrodri relist, refresh, rereconcile, re-something.

@MHBauer
Copy link
Contributor

MHBauer commented Aug 2, 2017

The updated understanding here, #1086 (comment) & #1086 (comment) matches my understanding of what the appropriate way to do this is.
The ManualResync field seems tricky, but the combination of a field with the spec 'generation' makes sense.

@MHBauer
Copy link
Contributor

MHBauer commented Aug 2, 2017

I may have been thinking of the ResourceVersion field. I'll have to investigate some more.

@duglin
Copy link
Contributor

duglin commented Aug 2, 2017

using something like the Generation field makes sense but I'd like to understand why checksum couldn't be used for the same purpose - meaning, just clear it to force a reconciliation.

@duglin
Copy link
Contributor

duglin commented Aug 10, 2017

On Aug 7th we discussed this and as part of that we touched on the checksum vs generation issue. Also #1095 was created - however the minutes do not show that we formally agreed, or disagree, with that approach. I suspect we do but I'll add it to an upcoming call just to get final confirmation from the group.

@pmorie
Copy link
Contributor

pmorie commented Aug 10, 2017

@duglin

using something like the Generation field makes sense but I'd like to understand why checksum couldn't be used for the same purpose - meaning, just clear it to force a reconciliation.

Checksum is in the status, which users don't have write access to.

@duglin
Copy link
Contributor

duglin commented Aug 10, 2017

yea, that comment was written before we chatted about Generation on the call. However, does it make sense for Generation to be outside of Status? It seems to me that that property is more of an internally processing thing and letting a user directly touch it seems a bit scary. I think @arschles's idea of having some kind of URL/route/poke-mechanism would be nicer. This way we also hide the specifics of how a reconcile is managed from the user - meaning, we can switch between checksum and Generation, or anything else, at will but the UX remains the same.

@pmorie
Copy link
Contributor

pmorie commented Aug 10, 2017

@duglin Generation is in object meta and not user-settable, I wonder if we have our wires crossed.

@duglin
Copy link
Contributor

duglin commented Aug 10, 2017

ah good, for some reason I thought it was outside of Status and therefore in user-data, didn't think about the meta stuff.

@arschles
Copy link
Contributor Author

I have updated the OP to add details on failure handling and error reporting

@eriknelson
Copy link
Contributor

I'll be coordinating with @pmorie on an implementation for this.

@pmorie
Copy link
Contributor

pmorie commented Aug 17, 2017

Circling back before the discussion in today's meeting.

Here's an option which I've alluded to but haven't finished writing up anywhere yet:

  • Add a RelistRequests field which is of type int to the BrokerSpec
  • When the user wants to relist the broker manually, they can bump the value of this field, which will trigger re-reconcile of the resource in the controller
  • If we want, we can make it so this field's value can only increase, or can only increase by one

Advantages of this approach:

  • No new subresource (and associated machinery code) required
  • Easy for users to understand
  • No need to allow ordinary users to clear the status

@vaikas
Copy link
Contributor

vaikas commented Aug 17, 2017

I think 'easy for users to understand' is a bit of a matter of opinion. For example in the GCP to reset a VM instance, you do a put to .../reset which resets the machine.
When a broker is set to automatically refresh, do we bump this field to indicate those relistrequests too?
Do we have any precedence in k8s for something like this?
I'm trying to understand (and might have missed) what is the downside of having a new subresource. You could certainly just have the subresource handler mechanically do the RelistRequests if that makes things easier to implement.

@pmorie
Copy link
Contributor

pmorie commented Aug 17, 2017

@vaikas-google

Yeah, we could definitely have the subresource bump RelistRequests. My main point is that I don't think we should have to grant a user the ability to wipe out status on the broker resource to force a relist. Giving users that ability is dangerous because they could easily interfere with the normal operation of the resource without realizing it ("oops, I wrote a script that clears the status every 10 minutes and now I can't add a new broker successfully because the controller's broker work queue never empties before the controller's resync interval elapses").

The only drawback I see of having a subresource that bumps RelistRequests is that it's extra code, that's all.

@vaikas
Copy link
Contributor

vaikas commented Aug 17, 2017

Agree on status wiping not being something I'm thrilled about. Just trying to understand why the original proposal of /sync (with possibility of rename to relist, or whatevs.) is not what we should be pursuing :)

@duglin
Copy link
Contributor

duglin commented Aug 17, 2017

my 2 cents...

  • erasing the status as the way to force a reconcile feels a bit heavy/extreme since it would drop any potential other data in there people may want to see. Granted today there isn't much there aside from the Conditions, but still....
  • as @vaikas-google said, we can make /sync map to whatever mechanism we want under the covers (including the counter idea @pmorie has proposed)
  • I just find one PUT to be an easier UX than GET, field++, PUT

@pmorie
Copy link
Contributor

pmorie commented Aug 17, 2017

Sounds like we're converging gradually on a subresource that bumps a RelistRequests field in the spec

@duglin
Copy link
Contributor

duglin commented Aug 17, 2017

I think so, my only lingering question is around the newly proposed "generated" flag (or whatever its called), and I'm wondering whether we can use that instead of having to create a new property which would seem to have similar effect on what action we take. For example, could the PUT to /sync tweak the "generated" property or the "observedGenerated" property to get us what we want? Perhaps just zero-ing out the observedGenerated one? I guess it depends on the semantics we want for reconcile() if a reconcile() is already underway. What's kube's normal m.o. on this?

@pmorie
Copy link
Contributor

pmorie commented Aug 17, 2017

I still need to finish the write-up for generation... I think that will make things clear. Going to try to do that for meeting today.

@duglin
Copy link
Contributor

duglin commented Aug 17, 2017

that would be cool - I'd like to have us get that one behind us since I think it touches a lot of places.

@pmorie
Copy link
Contributor

pmorie commented Aug 17, 2017

Actually i'm looking through the generation issue and I think I can just answer your questions here @duglin

I think so, my only lingering question is around the newly proposed "generated" flag (or whatever its called), and I'm wondering whether we can use that instead of having to create a new property which would seem to have similar effect on what action we take.

There are actually two pieces of the generation proposal:

  • The API server sets the Generation field of the object metadata when the spec changes
  • There will be a new field in the status called ReconciledGeneration

We can't clear the ReconciledGeneration flag because it's on status and it would mean that we would have to give users permission to change status to relist the broker.

Does that make sense?

@duglin
Copy link
Contributor

duglin commented Aug 17, 2017

when you say "give the user permission to change the status", wouldn't that only be true if we asked them to touch the field directly? If we modified the field via a PUT /sync then its not the user touching it, rather its the http handler (or controller). IOW, wouldn't Kube's RBAC (or whatever would block updates to the Status) see it as touching the /sync resource and not /Status ?

@arschles
Copy link
Contributor Author

arschles commented Aug 17, 2017

As of this writing, the original post of this issue is up-to-date and includes specification for spec.relistRequests and the purpose and implementation of a PUT .../sync subresource.

@eriknelson
Copy link
Contributor

It's a little strange to me we've kept relist from the original design pitch on only relistRequests. Should the field be named syncRequests to remain consistent with the other fields and subresource?

@duglin
Copy link
Contributor

duglin commented Nov 7, 2017

Can we close this now?

@MHBauer
Copy link
Contributor

MHBauer commented Jun 15, 2018

As implemented, this totally ignores the global field set on the CLI. It is also now set as a default in the apiserver, and not from the controller.

Set here:
https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/apis/servicecatalog/v1beta1/defaults.go#L43-L45

But not using this value:
https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/controller/controller.go#L99

@MHBauer
Copy link
Contributor

MHBauer commented Aug 1, 2018

/close

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants