Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Allow the same user to edit an instance #1872

Closed
wants to merge 3 commits into from

Conversation

duglin
Copy link
Contributor

@duglin duglin commented Mar 24, 2018

Unblocks #1755

Signed-off-by: Doug Davis [email protected]

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 24, 2018
Copy link
Contributor

@carolynvs carolynvs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -256,9 +257,23 @@ func validateServiceInstanceUpdate(instance *sc.ServiceInstance) field.ErrorList
// internalValidateServiceInstanceUpdateAllowed ensures there is not a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to update this comment to reflect how changes by the same user are treated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch- fixed

@n3wscott
Copy link
Contributor

I am wondering how we deal with the situation where a user updates the spec while the broker is committing the last update from that user. Do we have to worry about a race condition where two updates come in at the same time?

I am surprised there are no required changes to the reconciliation loops. I wonder if the update should reset the backoff for the current queued request if there is one so they do not have to wait the max timeout for an instance in a bad state. We just don't have any test coverage of this case in the current test framework...

@duglin
Copy link
Contributor Author

duglin commented Mar 26, 2018

I am wondering how we deal with the situation where a user updates the spec while the broker is committing the last update from that user. Do we have to worry about a race condition where two updates come in at the same time?

Won't the Generation stuff cover this?

@n3wscott
Copy link
Contributor

Won't the Generation stuff cover this?

I honestly do not know at the moment, I would have to look at the code again. @kibbles-n-bytes just refactored a bunch of this so I would like his input here. I don't know if the lock was fundamental to know which spec/generation was the thing we sent to the broker. Especially for async? I am just not sure...

@duglin
Copy link
Contributor Author

duglin commented Mar 26, 2018

One of life's mysteries :-)

@kibbles-n-bytes
Copy link
Contributor

@n3wscott On the reconcilation-loop-side of things, there are a couple things we should do.

  1. We should reset the instance rate limiter when the generation has changed.
  2. During provision, we need to initiate orphan mitigation if the resource has its DeprovisionStatus as Required when a generation bump has occurred.

As far as unlocking the spec for updates, though, same-user or not we have a bit of an issue... If an update request is in-flight, and someone updates the spec, the next status update will fail, so we'll lose all information of what the broker responded with. With a locked spec we can resend the request and (as long as the secret parameters haven't changed) we'll get the same response back. The odds of this response being different from the last one are small (if it was a retry), but it does hurt us in that we have no way of verifying what the true state of the world is, so our ExternalProperties could become wrong.

}

if ctx != nil {
if user, _ := genericapirequest.UserFrom(ctx); user != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any idea if this works at all with our apiserver? Does this code only work when originating-identity is turned on? Is that the only case we're concerned with. I think so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the lock is on all the time. This does seem to work with our apiserver since the test appears to pass :-)


if ctx != nil {
if user, _ := genericapirequest.UserFrom(ctx); user != nil {
newUID = user.GetUID()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be getting the UID of the new user directly from the ServiceInstance rather than from the context
The new user is set in the ServiceInstance as part of preparing the update, which happens before validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would certainly be easier! :-) trying it now....

@duglin
Copy link
Contributor Author

duglin commented Mar 27, 2018

@staebler thanks for the comment - I've updated it and it much smaller change now - thanks

@nilebox
Copy link
Contributor

nilebox commented Mar 28, 2018

Adding do-not-merge label since it's related to #1755 that we haven't agreed on yet.

@nilebox
Copy link
Contributor

nilebox commented Mar 28, 2018

  1. We should reset the instance rate limiter when the generation has changed.

@kibbles-n-bytes IIRC you have merged this change recently?

  1. During provision, we need to initiate orphan mitigation if the resource has its DeprovisionStatus as Required when a generation bump has occurred.

When a generation bump has occured, the updated spec will be picked up only after the current operation has cleaned up, i.e. either succeeded or failed and orphan mitigation finished. This code is part of #1765

So I think we have already fixed these 2 issues in master (unless there is some bug in implementation).

newUID = new.Spec.UserInfo.UID
}

disableLocking := utilfeature.DefaultFeatureGate.Enabled(scfeatures.DisableInstanceLocking)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if disableInstanceLocking is true, we can just return true at the beginning of this func

Doug Davis added 2 commits March 28, 2018 13:27
@duglin
Copy link
Contributor Author

duglin commented Mar 28, 2018

Added a feature gate to disable all locks, but if we are locking we allow edits from the same user.

@pmorie
Copy link
Contributor

pmorie commented Mar 28, 2018

I'm liking this as a way forward on this subject

@@ -73,4 +79,5 @@ var defaultServiceCatalogFeatureGates = map[utilfeature.Feature]utilfeature.Feat
AsyncBindingOperations: {Default: false, PreRelease: utilfeature.Alpha},
NamespacedServiceBroker: {Default: false, PreRelease: utilfeature.Alpha},
ResponseSchema: {Default: false, PreRelease: utilfeature.Alpha},
DisableInstanceLocking: {Default: false, PreRelease: utilfeature.Alpha},
Copy link
Contributor

@n3wscott n3wscott Mar 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend changing this feature flag to be called InstanceLocking (or something else) and default it to true.

The reason why is there are so many bugs in the history of feature flags that relate to people not understanding what it means to false a disableFoo flag. People are super bad at understanding the negative in the flag name. It is cleaner to just call it a flag for a feature and enable (true) flag means on, disable (false) means off. The feature is Instance Locking, and we default the feature on (true) (enabled).

Copy link
Contributor

@pmorie pmorie Mar 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @n3wscott, how about OriginatingIdentityLocking, since this will eventually apply to bindings too?

@duglin
Copy link
Contributor Author

duglin commented Mar 28, 2018

@pmorie @n3wscott ok I renamed the flag. See how it looks now.

if old.Generation != new.Generation && old.Status.CurrentOperation != "" {
errors = append(errors, field.Forbidden(field.NewPath("spec"), "Another update for this service instance is in progress"))

// If the OriginatingIdentityLocking feature is not set then only allow the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong comment, the condition is reverse: is not set -> is set

newUID = new.Spec.UserInfo.UID
}

if old.Generation != new.Generation && old.Status.CurrentOperation != "" && oldUID != newUID {
Copy link
Contributor

@nilebox nilebox Mar 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duglin I'm not sure if this condition is sufficient for you. The old.Status.CurrentOperation is being set at the beginning of reconciling the new generation. This means that:

  1. User B could slip through his changes before User A's changes are picked up by controller-manager
  2. If the controller-manager is down (while apiserver is not), there could be a lot of spec changes from different users collected in the meantime.

The condition for locking should not be based on whether there is an operation in progress. It should be based on whether controller has finished processing the current generation (i.e. either succeeded, or failed and won't retry).

So I would suggest to change the condition to something more strict.
I suggest to copy the isServiceInstanceProcessedAlready method from controller and change the condition to

if old.Generation != new.Generation && oldUID != newUID && !isServiceInstanceProcessedAlready(old) { ...

and copy (will also need to change v1beta1 -> sc package)
https://github.com/kubernetes-incubator/service-catalog/blob/e15b73719911853d3755b71f3d8b26b21296d0a3/pkg/controller/controller_instance.go#L818-L826

P.S. I know that the condition was written this way before, but given that we decided to change it, it's worth fixing it as well.

@nilebox
Copy link
Contributor

nilebox commented Mar 29, 2018

@duglin see my comments above. once you fix them, I am happy with merging this change and rejecting mine.

@@ -58,6 +58,17 @@ const (
// owner: @luksa
// alpha: v0.1.12
ResponseSchema utilfeature.Feature = "ResponseSchema"

// OriginatingIdentityLocking controls whether we lock OSB API resources
// that are being updated while we are still processing the update request.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processing the update request -> haven't finished processing the current spec
(and rewrite the whole comment to reflect that)

@nilebox
Copy link
Contributor

nilebox commented Mar 29, 2018

@kibbles-n-bytes
Copy link
Contributor

kibbles-n-bytes commented Mar 30, 2018

@nilebox

@kibbles-n-bytes IIRC you have merged this change recently?

That was for the instance polling queue, but the regular instance reconciliation queue also has a similar issue here.

When a generation bump has occured, the updated spec will be picked up only after the current operation has cleaned up, i.e. either succeeded or failed and orphan mitigation finished. This code is part of #1765

As far as I can tell, not if there's a spec change while a request is in flight. In that case, the API server update with the response from that request would fail, and the reconciler would reattempt reconcileServiceInstanceAdd from the top, preparing the new ObservedGeneration and reattempting with the new parameters.

@kibbles-n-bytes
Copy link
Contributor

kibbles-n-bytes commented Mar 30, 2018

Well, actually, if we merge this with current master we may not actually have the issue with the instance reconciliation queue; I haven't tested it, but I'm pretty sure the rate limiter for regular reconciliation is actually broken today lol.

The worker will call c.instanceQueue.AddRateLimited(key) if the reconciler returns an error, but at the same time the status update to record that error in our instance's conditions will trigger instanceUpdate, which calls c.instanceQueue.Add(key), circumventing the rate limiter.

@kibbles-n-bytes
Copy link
Contributor

(The other issues I believe are still valid, though.)

For the instance update issue, I've been talking with Doug offline, and it seems like changing our status updates to do a GET -> edit -> PUT for the resources should soothe my concerns in 99.9% of situations. This way, failures to update resource statuses would be minimized, and our surface area for getting into a wonky state is vastly reduced. For provisioning, though, I think we should still be defensive, because we have the ability to fix ourselves with orphan mitigation.

@nilebox
Copy link
Contributor

nilebox commented Mar 30, 2018

@kibbles-n-bytes I will need to look at the code again, but my understanding is that we store all inline parameters in the InProgressProperties before sending an OSB request, as part of "record start of the new operation" method. So changing the spec after invoking recording InProgressProperties has no effect.

So you're right of course that we can fail to record ExternalProperties, but that's how distributed systems work - there is no "exactly once" delivery, it's either "at least once" or "at most once", and Service Catalog should always do "at least once".

I don't see how unlocking makes things worse there, and will need to read the code again next week.

@nilebox
Copy link
Contributor

nilebox commented Mar 30, 2018

@kibbles-n-bytes also even if this issue is valid, we should have a test for reproducing it.

@carolynvs carolynvs removed the LGTM1 label Apr 2, 2018
@kibbles-n-bytes
Copy link
Contributor

kibbles-n-bytes commented Apr 2, 2018

So you're right of course that we can fail to record ExternalProperties, but that's how distributed systems work - there is no "exactly once" delivery, it's either "at least once" or "at most once", and Service Catalog should always do "at least once".

I don't see how unlocking makes things worse there, and will need to read the code again next week.

@nilebox It makes it worse because the number of circumstances that would cause us to fail to record ExternalProperties grow dramatically if we unlock the spec. Currently there are two situations this could happen: if the controller goes down, or if communication with the API server fails. Both of these are infrequent and we would resend the same request during the next reconciliation, so no real issue.

With the spec unlocked, updating the spec during a reconciliation in which we're communicating with the broker would cause this behavior every time. Unless something has changed from what I remember, spec updates will cause the resource version of the instance to change, which will cause the API server to reject the status update request when we go to record the result of the operation. And then the new reconciliation would pick up and have no idea as to whether we had or hadn't actually sent a request for the resource.

I see two options out of this:

  1. Make the status update not fail, which will mean we should GET the latest instance and then update that status to match the intended one

AND/OR

  1. Make the reconciler smarter about the state we're in, and detect that if its Ready reason is set to UpdateInstanceRequestInFlight but its spec parameters/plan don't match the InProgressProperties parameters/plan, we (re)try with the old parameters/plan first to get a definitive result before moving on to the new spec.

@kibbles-n-bytes also even if this issue is valid, we should have a test for reproducing it.

Do you mean we should have a test for it today? it's not an issue currently, but will be once the unlock is introduced. The integration test would involve blocking the broker's response until the test controller makes an update to the instance spec, which currently couldn't be implemented due to the lock.

@nilebox
Copy link
Contributor

nilebox commented Apr 3, 2018

@kibbles-n-bytes after examining the code, I see the problem, and I think it's a bug in the existing code. The current behavior is the following

  1. Prepare OSB request via prepareRequestHelper method, which always reads parameters from the spec.
  2. If current operation is not set, set it (and copy the properties we are going to send to InProgressProperties:
if instance.Status.CurrentOperation == "" {
		instance, err = c.recordStartOfServiceInstanceOperation(instance, v1beta1.ServiceInstanceOperationProvision, inProgressProperties)
		...
}

which means that InProgressProperties we store in status might be stale, and even after succeeding to send an OSB request with new parameters, we will copy a stale value to ExternalProperties!

My proposal is to make both OSB request preparation and updating InProgressProperties conditional on CurrentOperation. This way we can guarantee that whatever we have recorded as part of recordStartOfServiceInstanceOperation() will be sent to the broker, until/unless we explicitly reset the current operation.

While agree that this bug becomes more critical with removing locks, I still think it's independent and needs fixing ASAP (probably as a separate PR).

@nilebox
Copy link
Contributor

nilebox commented Apr 5, 2018

I suggest to split this PR into small PRs as follow-ups to #1916 and #1917

@carolynvs carolynvs added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge labels Jun 12, 2018
@k8s-ci-robot
Copy link
Contributor

@duglin: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jberkhahn
Copy link
Contributor

Due to lack of activity, we're closing this. If you still have an interest in this, please reopen with these changes.

@jberkhahn jberkhahn closed this Apr 22, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants