Allow the same user to edit an instance #1872

duglin · 2018-03-24T15:37:22Z

Unblocks #1755

Signed-off-by: Doug Davis [email protected]

carolynvs

LGTM

carolynvs · 2018-03-26T15:45:57Z

pkg/apis/servicecatalog/validation/instance.go

@@ -256,9 +257,23 @@ func validateServiceInstanceUpdate(instance *sc.ServiceInstance) field.ErrorList
 // internalValidateServiceInstanceUpdateAllowed ensures there is not a


You may want to update this comment to reflect how changes by the same user are treated.

good catch- fixed

n3wscott · 2018-03-26T16:23:35Z

I am wondering how we deal with the situation where a user updates the spec while the broker is committing the last update from that user. Do we have to worry about a race condition where two updates come in at the same time?

I am surprised there are no required changes to the reconciliation loops. I wonder if the update should reset the backoff for the current queued request if there is one so they do not have to wait the max timeout for an instance in a bad state. We just don't have any test coverage of this case in the current test framework...

duglin · 2018-03-26T16:26:00Z

I am wondering how we deal with the situation where a user updates the spec while the broker is committing the last update from that user. Do we have to worry about a race condition where two updates come in at the same time?

Won't the Generation stuff cover this?

n3wscott · 2018-03-26T16:35:04Z

Won't the Generation stuff cover this?

I honestly do not know at the moment, I would have to look at the code again. @kibbles-n-bytes just refactored a bunch of this so I would like his input here. I don't know if the lock was fundamental to know which spec/generation was the thing we sent to the broker. Especially for async? I am just not sure...

duglin · 2018-03-26T17:07:35Z

One of life's mysteries :-)

kibbles-n-bytes · 2018-03-26T21:13:01Z

@n3wscott On the reconcilation-loop-side of things, there are a couple things we should do.

We should reset the instance rate limiter when the generation has changed.
During provision, we need to initiate orphan mitigation if the resource has its DeprovisionStatus as Required when a generation bump has occurred.

As far as unlocking the spec for updates, though, same-user or not we have a bit of an issue... If an update request is in-flight, and someone updates the spec, the next status update will fail, so we'll lose all information of what the broker responded with. With a locked spec we can resend the request and (as long as the secret parameters haven't changed) we'll get the same response back. The odds of this response being different from the last one are small (if it was a retry), but it does hurt us in that we have no way of verifying what the true state of the world is, so our ExternalProperties could become wrong.

MHBauer · 2018-03-26T21:19:13Z

pkg/apis/servicecatalog/validation/instance.go

+	}
+
+	if ctx != nil {
+		if user, _ := genericapirequest.UserFrom(ctx); user != nil {


do we have any idea if this works at all with our apiserver? Does this code only work when originating-identity is turned on? Is that the only case we're concerned with. I think so.

I assume the lock is on all the time. This does seem to work with our apiserver since the test appears to pass :-)

staebler · 2018-03-27T16:18:39Z

pkg/apis/servicecatalog/validation/instance.go

+
+	if ctx != nil {
+		if user, _ := genericapirequest.UserFrom(ctx); user != nil {
+			newUID = user.GetUID()


We should be getting the UID of the new user directly from the ServiceInstance rather than from the context
The new user is set in the ServiceInstance as part of preparing the update, which happens before validation.

that would certainly be easier! :-) trying it now....

duglin · 2018-03-27T20:59:06Z

@staebler thanks for the comment - I've updated it and it much smaller change now - thanks

nilebox · 2018-03-28T10:11:22Z

Adding do-not-merge label since it's related to #1755 that we haven't agreed on yet.

nilebox · 2018-03-28T10:59:56Z

We should reset the instance rate limiter when the generation has changed.

@kibbles-n-bytes IIRC you have merged this change recently?

During provision, we need to initiate orphan mitigation if the resource has its DeprovisionStatus as Required when a generation bump has occurred.

When a generation bump has occured, the updated spec will be picked up only after the current operation has cleaned up, i.e. either succeeded or failed and orphan mitigation finished. This code is part of #1765

So I think we have already fixed these 2 issues in master (unless there is some bug in implementation).

pmorie · 2018-03-28T20:18:11Z

pkg/apis/servicecatalog/validation/instance.go

+		newUID = new.Spec.UserInfo.UID
+	}
+
+	disableLocking := utilfeature.DefaultFeatureGate.Enabled(scfeatures.DisableInstanceLocking)


if disableInstanceLocking is true, we can just return true at the beginning of this func

Signed-off-by: Doug Davis <[email protected]>

duglin · 2018-03-28T20:29:15Z

Added a feature gate to disable all locks, but if we are locking we allow edits from the same user.

pmorie · 2018-03-28T21:31:58Z

I'm liking this as a way forward on this subject

n3wscott · 2018-03-28T22:03:41Z

pkg/features/features.go

@@ -73,4 +79,5 @@ var defaultServiceCatalogFeatureGates = map[utilfeature.Feature]utilfeature.Feat
 	AsyncBindingOperations:  {Default: false, PreRelease: utilfeature.Alpha},
 	NamespacedServiceBroker: {Default: false, PreRelease: utilfeature.Alpha},
 	ResponseSchema:          {Default: false, PreRelease: utilfeature.Alpha},
+	DisableInstanceLocking:  {Default: false, PreRelease: utilfeature.Alpha},


I would recommend changing this feature flag to be called InstanceLocking (or something else) and default it to true.

The reason why is there are so many bugs in the history of feature flags that relate to people not understanding what it means to false a disableFoo flag. People are super bad at understanding the negative in the flag name. It is cleaner to just call it a flag for a feature and enable (true) flag means on, disable (false) means off. The feature is Instance Locking, and we default the feature on (true) (enabled).

Good point @n3wscott, how about OriginatingIdentityLocking, since this will eventually apply to bindings too?

Signed-off-by: Doug Davis <[email protected]>

duglin · 2018-03-28T23:40:45Z

@pmorie @n3wscott ok I renamed the flag. See how it looks now.

nilebox · 2018-03-29T04:51:31Z

pkg/apis/servicecatalog/validation/instance.go

-	if old.Generation != new.Generation && old.Status.CurrentOperation != "" {
-		errors = append(errors, field.Forbidden(field.NewPath("spec"), "Another update for this service instance is in progress"))
+
+	// If the OriginatingIdentityLocking feature is not set then only allow the


Wrong comment, the condition is reverse: is not set -> is set

nilebox · 2018-03-29T05:05:49Z

pkg/apis/servicecatalog/validation/instance.go

+			newUID = new.Spec.UserInfo.UID
+		}
+
+		if old.Generation != new.Generation && old.Status.CurrentOperation != "" && oldUID != newUID {


@duglin I'm not sure if this condition is sufficient for you. The old.Status.CurrentOperation is being set at the beginning of reconciling the new generation. This means that:

User B could slip through his changes before User A's changes are picked up by controller-manager

If the controller-manager is down (while apiserver is not), there could be a lot of spec changes from different users collected in the meantime.

The condition for locking should not be based on whether there is an operation in progress. It should be based on whether controller has finished processing the current generation (i.e. either succeeded, or failed and won't retry).

So I would suggest to change the condition to something more strict.
I suggest to copy the isServiceInstanceProcessedAlready method from controller and change the condition to

if old.Generation != new.Generation && oldUID != newUID && !isServiceInstanceProcessedAlready(old) { ...

and copy (will also need to change v1beta1 -> sc package)
https://github.com/kubernetes-incubator/service-catalog/blob/e15b73719911853d3755b71f3d8b26b21296d0a3/pkg/controller/controller_instance.go#L818-L826

P.S. I know that the condition was written this way before, but given that we decided to change it, it's worth fixing it as well.

nilebox · 2018-03-29T05:06:45Z

@duglin see my comments above. once you fix them, I am happy with merging this change and rejecting mine.

nilebox · 2018-03-29T05:08:58Z

pkg/features/features.go

@@ -58,6 +58,17 @@ const (
 	// owner: @luksa
 	// alpha: v0.1.12
 	ResponseSchema utilfeature.Feature = "ResponseSchema"
+
+	// OriginatingIdentityLocking controls whether we lock OSB API resources
+	// that are being updated while we are still processing the update request.


processing the update request -> haven't finished processing the current spec
(and rewrite the whole comment to reflect that)

nilebox · 2018-03-29T06:47:30Z

@duglin what about ServiceBinding resources? They have the same issue.
See https://github.com/kubernetes-incubator/service-catalog/blob/master/pkg/apis/servicecatalog/validation/binding.go#L217-L226

kibbles-n-bytes · 2018-03-30T17:33:46Z

@nilebox

@kibbles-n-bytes IIRC you have merged this change recently?

That was for the instance polling queue, but the regular instance reconciliation queue also has a similar issue here.

When a generation bump has occured, the updated spec will be picked up only after the current operation has cleaned up, i.e. either succeeded or failed and orphan mitigation finished. This code is part of #1765

As far as I can tell, not if there's a spec change while a request is in flight. In that case, the API server update with the response from that request would fail, and the reconciler would reattempt reconcileServiceInstanceAdd from the top, preparing the new ObservedGeneration and reattempting with the new parameters.

kibbles-n-bytes · 2018-03-30T17:44:42Z

Well, actually, if we merge this with current master we may not actually have the issue with the instance reconciliation queue; I haven't tested it, but I'm pretty sure the rate limiter for regular reconciliation is actually broken today lol.

The worker will call c.instanceQueue.AddRateLimited(key) if the reconciler returns an error, but at the same time the status update to record that error in our instance's conditions will trigger instanceUpdate, which calls c.instanceQueue.Add(key), circumventing the rate limiter.

kibbles-n-bytes · 2018-03-30T17:52:05Z

(The other issues I believe are still valid, though.)

For the instance update issue, I've been talking with Doug offline, and it seems like changing our status updates to do a GET -> edit -> PUT for the resources should soothe my concerns in 99.9% of situations. This way, failures to update resource statuses would be minimized, and our surface area for getting into a wonky state is vastly reduced. For provisioning, though, I think we should still be defensive, because we have the ability to fix ourselves with orphan mitigation.

nilebox · 2018-03-30T22:00:07Z

@kibbles-n-bytes I will need to look at the code again, but my understanding is that we store all inline parameters in the InProgressProperties before sending an OSB request, as part of "record start of the new operation" method. So changing the spec after invoking recording InProgressProperties has no effect.

So you're right of course that we can fail to record ExternalProperties, but that's how distributed systems work - there is no "exactly once" delivery, it's either "at least once" or "at most once", and Service Catalog should always do "at least once".

I don't see how unlocking makes things worse there, and will need to read the code again next week.

nilebox · 2018-03-30T22:29:38Z

@kibbles-n-bytes also even if this issue is valid, we should have a test for reproducing it.

kibbles-n-bytes · 2018-04-02T17:31:07Z

So you're right of course that we can fail to record ExternalProperties, but that's how distributed systems work - there is no "exactly once" delivery, it's either "at least once" or "at most once", and Service Catalog should always do "at least once".

I don't see how unlocking makes things worse there, and will need to read the code again next week.

@nilebox It makes it worse because the number of circumstances that would cause us to fail to record ExternalProperties grow dramatically if we unlock the spec. Currently there are two situations this could happen: if the controller goes down, or if communication with the API server fails. Both of these are infrequent and we would resend the same request during the next reconciliation, so no real issue.

With the spec unlocked, updating the spec during a reconciliation in which we're communicating with the broker would cause this behavior every time. Unless something has changed from what I remember, spec updates will cause the resource version of the instance to change, which will cause the API server to reject the status update request when we go to record the result of the operation. And then the new reconciliation would pick up and have no idea as to whether we had or hadn't actually sent a request for the resource.

I see two options out of this:

Make the status update not fail, which will mean we should GET the latest instance and then update that status to match the intended one

AND/OR

Make the reconciler smarter about the state we're in, and detect that if its Ready reason is set to UpdateInstanceRequestInFlight but its spec parameters/plan don't match the InProgressProperties parameters/plan, we (re)try with the old parameters/plan first to get a definitive result before moving on to the new spec.

@kibbles-n-bytes also even if this issue is valid, we should have a test for reproducing it.

Do you mean we should have a test for it today? it's not an issue currently, but will be once the unlock is introduced. The integration test would involve blocking the broker's response until the test controller makes an update to the instance spec, which currently couldn't be implemented due to the lock.

nilebox · 2018-04-03T00:33:30Z

@kibbles-n-bytes after examining the code, I see the problem, and I think it's a bug in the existing code. The current behavior is the following

Prepare OSB request via prepareRequestHelper method, which always reads parameters from the spec.
If current operation is not set, set it (and copy the properties we are going to send to InProgressProperties:

if instance.Status.CurrentOperation == "" {
		instance, err = c.recordStartOfServiceInstanceOperation(instance, v1beta1.ServiceInstanceOperationProvision, inProgressProperties)
		...
}

which means that InProgressProperties we store in status might be stale, and even after succeeding to send an OSB request with new parameters, we will copy a stale value to ExternalProperties!

My proposal is to make both OSB request preparation and updating InProgressProperties conditional on CurrentOperation. This way we can guarantee that whatever we have recorded as part of recordStartOfServiceInstanceOperation() will be sent to the broker, until/unless we explicitly reset the current operation.

While agree that this bug becomes more critical with removing locks, I still think it's independent and needs fixing ASAP (probably as a separate PR).

nilebox · 2018-04-05T14:36:46Z

I suggest to split this PR into small PRs as follow-ups to #1916 and #1917

k8s-ci-robot · 2018-07-18T21:14:18Z

@duglin: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jberkhahn · 2019-04-22T20:24:57Z

Due to lack of activity, we're closing this. If you still have an interest in this, please reopen with these changes.

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 24, 2018

carolynvs approved these changes Mar 26, 2018

View reviewed changes

carolynvs added the LGTM1 label Mar 26, 2018

duglin force-pushed the unlockUser branch from 97fb08c to a323ae3 Compare March 26, 2018 15:49

MHBauer reviewed Mar 26, 2018

View reviewed changes

staebler reviewed Mar 27, 2018

View reviewed changes

nilebox added the do-not-merge label Mar 28, 2018

pmorie reviewed Mar 28, 2018

View reviewed changes

duglin force-pushed the unlockUser branch from ce81589 to 00ea8f4 Compare March 28, 2018 20:26

Doug Davis added 2 commits March 28, 2018 13:27

Allow the same user to edit an instance

ed006ae

Signed-off-by: Doug Davis <[email protected]>

store under in instance

f25cb23

Signed-off-by: Doug Davis <[email protected]>

duglin force-pushed the unlockUser branch from 00ea8f4 to 2f1fc0c Compare March 28, 2018 20:28

n3wscott reviewed Mar 28, 2018

View reviewed changes

duglin force-pushed the unlockUser branch from 2f1fc0c to 46712fd Compare March 28, 2018 23:38

add a feature gate to disable instance locking

c3f08dc

Signed-off-by: Doug Davis <[email protected]>

duglin force-pushed the unlockUser branch from 46712fd to c3f08dc Compare March 28, 2018 23:40

nilebox reviewed Mar 29, 2018

View reviewed changes

duglin mentioned this pull request Mar 30, 2018

Bindings need locks - as long as Instances do #1895

Closed

carolynvs removed the LGTM1 label Apr 2, 2018

This was referenced Apr 3, 2018

Handle InProgressProperties for retries properly #1903

Closed

Keep InProgressProperties up-to-date #1916

Merged

Add feature flag for making instance/binding locks optional #1917

Merged

carolynvs added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge labels Jun 12, 2018

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2018

carlory mentioned this pull request Nov 14, 2018

should allow service catalog client to update a binding's spec field if the reconciler didn't send a binding request to a broker #2494

Closed

jberkhahn closed this Apr 22, 2019

		@@ -256,9 +257,23 @@ func validateServiceInstanceUpdate(instance *sc.ServiceInstance) field.ErrorList
		// internalValidateServiceInstanceUpdateAllowed ensures there is not a

Allow the same user to edit an instance #1872

Allow the same user to edit an instance #1872

Conversation

duglin commented Mar 24, 2018 • edited Loading

carolynvs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

n3wscott commented Mar 26, 2018

duglin commented Mar 26, 2018

n3wscott commented Mar 26, 2018

duglin commented Mar 26, 2018

kibbles-n-bytes commented Mar 26, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

duglin commented Mar 27, 2018

nilebox commented Mar 28, 2018

nilebox commented Mar 28, 2018

Choose a reason for hiding this comment

duglin commented Mar 28, 2018

pmorie commented Mar 28, 2018

n3wscott Mar 28, 2018 • edited Loading

Choose a reason for hiding this comment

pmorie Mar 28, 2018 • edited Loading

Choose a reason for hiding this comment

duglin commented Mar 28, 2018

Choose a reason for hiding this comment

nilebox Mar 29, 2018 • edited Loading

Choose a reason for hiding this comment

nilebox commented Mar 29, 2018 • edited Loading

Choose a reason for hiding this comment

nilebox commented Mar 29, 2018 • edited Loading

kibbles-n-bytes commented Mar 30, 2018 • edited Loading

kibbles-n-bytes commented Mar 30, 2018 • edited Loading

kibbles-n-bytes commented Mar 30, 2018

nilebox commented Mar 30, 2018

nilebox commented Mar 30, 2018

kibbles-n-bytes commented Apr 2, 2018 • edited Loading

nilebox commented Apr 3, 2018 • edited Loading

nilebox commented Apr 5, 2018

k8s-ci-robot commented Jul 18, 2018

jberkhahn commented Apr 22, 2019

duglin commented Mar 24, 2018 •

edited

Loading

n3wscott Mar 28, 2018 •

edited

Loading

pmorie Mar 28, 2018 •

edited

Loading

nilebox Mar 29, 2018 •

edited

Loading

nilebox commented Mar 29, 2018 •

edited

Loading

nilebox commented Mar 29, 2018 •

edited

Loading

kibbles-n-bytes commented Mar 30, 2018 •

edited

Loading

kibbles-n-bytes commented Mar 30, 2018 •

edited

Loading

kibbles-n-bytes commented Apr 2, 2018 •

edited

Loading

nilebox commented Apr 3, 2018 •

edited

Loading