Global and local operators ignore locks and contend resources #2508

Closed
nicolaferraro opened this issue Feb 6, 2020 · 11 comments
@nicolaferraro
Contributor

nicolaferraro commented Feb 6, 2020

Bug Report

What did you do?

Installed a global operator (WATCH_NAMESPACE= empty, i.e. watching all namespaces) in a global namespace, and a local operator (WATCH_NAMESPACE=default) in the default namespace.

Both operators try to interact with my custom resource. There's a lock ConfigMap in the namespace, but the global operator just ignores it.
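
For reference, this is roughly how a typical operator-sdk v0.15 main.go wires up the watch namespace and the leader lock; it's only a sketch, and the lock name "camel-k-lock" is illustrative rather than taken from the actual project:

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/k8sutil"
	"github.com/operator-framework/operator-sdk/pkg/leader"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
	// WATCH_NAMESPACE=""        -> watch all namespaces (global operator)
	// WATCH_NAMESPACE="default" -> watch only "default"  (local operator)
	namespace, err := k8sutil.GetWatchNamespace()
	if err != nil {
		log.Fatal(err)
	}

	// leader.Become creates the lock ConfigMap in the namespace the operator
	// pod runs in, not in the namespaces it watches, so the global and the
	// local operator end up holding two unrelated locks.
	if err := leader.Become(context.TODO(), "camel-k-lock"); err != nil {
		log.Fatal(err)
	}

	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}
	mgr, err := manager.New(cfg, manager.Options{Namespace: namespace})
	if err != nil {
		log.Fatal(err)
	}

	// Controllers watching the custom resource would be registered here.
	if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```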

What did you expect to see?
The local operator takes precedence and wins; the global one ignores the namespace.

What did you see instead? Under which circumstances?
They interact with the same resource in parallel and generate lots of conflicts and misbehavior in the operand.

Environment

  • operator-sdk version: v0.15.0, commit: 21a93ca
  • go version: go version go1.13.3 linux/amd64

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.6", GitCommit:"abdda3f9fefa29172298a2e42f5102e777a8ec25", GitTreeState:"clean", BuildDate:"2019-05-08T13:53:53Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

OpenShift 4.2

  • Are you writing your operator in ansible, helm, or go?

Go

Possible Solution

The leader package could expose an API to check for the existence of a lock. The same package could also expose an API to let a local operator gracefully steal a lock owned by a global operator.
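
For illustration only, a rough sketch of what such additions could look like; neither function exists in the SDK today, and the names and signatures are made up:

```go
package leader

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Exists reports whether a lock ConfigMap with the given name is already
// present in the given namespace (for example, one held by a local operator).
func Exists(ctx context.Context, c client.Client, namespace, lockName string) (bool, error) {
	cm := &corev1.ConfigMap{}
	err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: lockName}, cm)
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	return err == nil, err
}

// Steal deletes an existing lock so that a local operator can gracefully take
// over a namespace currently "owned" by a global operator.
func Steal(ctx context.Context, c client.Client, namespace, lockName string) error {
	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: lockName}, cm); err != nil {
		if apierrors.IsNotFound(err) {
			return nil
		}
		return err
	}
	return c.Delete(ctx, cm)
}
```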

Additional context

Looking for a way to solve this problem or overcome it.

@camilamacedo86
Contributor

camilamacedo86 commented Feb 7, 2020

Hi @nicolaferraro,

Thanks a lot for your input. To make it easier for us to address, here is a summary of the scenario.

WHAT: Concurrency errors occur when the operators deal with the same resources.
WHEN: The same operator is deployed twice in the cluster. Note that the operator is cluster-scoped; however, one instance is configured to WATCH the whole cluster while the other watches just the operator's namespace.

So, I am afraid this is not currently supported. However, I am flagging it as an RFE with needs discussion so we can check whether we can address it. It would probably require a re-design.

@camilamacedo86 camilamacedo86 added the kind/feature, needs discussion, and design labels Feb 7, 2020
@camilamacedo86
Contributor

@joelanford, @dmesser I think it's good for you to be aware of this one.

@joelanford
Member

@nicolaferraro You're likely running into this issue because the global operator and the local operator are in different namespaces and therefore create separate locks in their respective namespaces.

The intention of the locking mechanism is that operator instances operating in the same scope (i.e. watching the same namespaces) should share the same lock. So you could run a deployment for your operator that has 3 replicas, and only 1 would run at a time.

It is not recommended to have different operator instances running that have overlapping watches because of the exact issue you're running into.

Can you explain what your use case is for running a global and local operator that overlap on their watched namespaces? If you have a global operator already deployed that watches all namespaces, what is the purpose of running an additional local operator that watches just one of those namespaces?

@nicolaferraro
Contributor Author

It's not a use case, but some people recently experienced "deployments rolling like crazy" and discovered that the cause was a duplicate operator.

In particular, in one case the user installed "knative-camel-sources" from the operator hub, not knowing that that package depends on "camel-k", so it also installs "camel-k" globally under the hood (and you're not aware of this if you don't carefully review the install plan).

Later, the same user wanted to try "camel-k" but didn't install it via the operator hub; instead she instantiated the operator in the namespace directly (that install mode is available as an alternative to the operator hub, mostly because the operator hub is not common on vanilla Kubernetes or OpenShift 3.x).

So the new operator was created in the local namespace because the user didn't know there was already one.

Since this is technically possible, I'd like to avoid conflicts and "crazy deployments" and gracefully signal the problem via warning events.

I was thinking of extending the lock API with a method that checks whether a local operator already "owns" a namespace, to be used as a preliminary check by the global operator only. That should prevent at least the situation described above.

If that is ok, I can provide a PR.

@joelanford
Member

Before moving forward with a discussion of implementation options, I'd like to get some feedback on this issue from the OLM maintainers. It does seem like there could be a UX improvement to be made here; I'm just not sure exactly what the best approach would be.

I was thinking of extending the lock API with a method that checks whether a local operator already "owns" a namespace, to be used as a preliminary check by the global operator only. That should prevent at least the situation described above.

This might work if the global operator starts after the local operator. But if the global operator starts first and gets a lock (which would be in its namespace), it is unlikely that the local operator would know about that since it isn't aware of anything outside of its namespace, and furthermore may not even have permissions to access objects in other namespaces to even try to make that check.

/cc @ecordell @kevinrizza @njhale

@joelanford joelanford self-assigned this Feb 7, 2020
@kevinrizza
Member

Yeah, from the perspective of OLM this is something we are actively looking to improve UX-wise. Right now (especially from the UI) there isn't a good indicator of what dependencies are installed when you create a subscription. It also seems like there's some good feedback here around how to provide safeguards against installing things in multiple places (especially with overlapping watches like this).

@nicolaferraro
Contributor Author

Before moving forward with a discussion of implementation options, I'd like to get some feedback on this issue from the OLM maintainers. It does seem like there could be a UX improvement to be made here; I'm just not sure exactly what the best approach would be.

I was thinking of extending the lock API with a method that checks whether a local operator already "owns" a namespace, to be used as a preliminary check by the global operator only. That should prevent at least the situation described above.

This might work if the global operator starts after the local operator. But if the global operator starts first and gets a lock (which would be in its namespace), it is unlikely that the local operator would know about that since it isn't aware of anything outside of its namespace, and furthermore may not even have permissions to access objects in other namespaces to even try to make that check.

/cc @ecordell @kevinrizza @njhale

To be clear, the approach I was proposing is:

  • The local operator in namespace ns1 takes the lock in ns1 at startup, and the lock is automatically deleted when the operator is deleted (current approach, no changes).
  • The global operator does not take any lock in ns1, but programmatically (via this proposed API method) or automatically (embedded inside SDK internals) checks for the existence of a local lock before touching any resource in ns1 (sketched at the end of this comment).

This guarantees that the two operators eventually stop overlapping and that the local operator always takes precedence, no matter when it starts.

I understand that the solution does not cover cases like two operators running outside ns1 but both having ns1 in scope, but that seems a particular situation that is difficult to set up (especially by accident).

Given that the current approach to locking is based on OwnerReferences, and a resource cannot be owned by a resource in a different namespace (there is no namespace field in owner references), I see little possibility of doing something better without changing the locking mechanism.
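
A minimal sketch of that preliminary check, as it could look inside the global operator's reconciler; the lock name, the event reason, and the skip logic are all hypothetical, not existing SDK behaviour:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Illustrative lock name; a local operator would create this ConfigMap in ns1.
const localLockName = "camel-k-lock"

type GlobalReconciler struct {
	Client   client.Client
	Recorder record.EventRecorder
}

func (r *GlobalReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	ctx := context.TODO()

	// Preliminary check: does a local operator already "own" this namespace?
	// If so, back off instead of contending for the custom resource.
	lock := &corev1.ConfigMap{}
	err := r.Client.Get(ctx, client.ObjectKey{Namespace: req.Namespace, Name: localLockName}, lock)
	if err == nil {
		// A Warning event makes the overlap visible instead of silently fighting.
		r.Recorder.Event(lock, corev1.EventTypeWarning, "NamespaceOwnedLocally",
			"a local operator owns this namespace; the global operator is skipping it")
		return reconcile.Result{}, nil
	}
	if !apierrors.IsNotFound(err) {
		return reconcile.Result{}, err
	}

	// ... normal reconciliation of the custom resource would follow here ...
	return reconcile.Result{}, nil
}
```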

@dmesser
Contributor

dmesser commented Feb 7, 2020

The scenario described here is what OLM was designed to avoid: intersecting sets of watched namespaces between two Operators. By installing the operator manually in a namespace, the user bypassed that safeguard. I don't think we should re-implement this safeguard at the SDK level but rather explain that this is exactly what OLM is for.

@nicolaferraro
Contributor Author

I agree that OLM prevents this problem from happening. But is OLM a prerequisite for using operator-sdk?

@dmesser
Contributor

dmesser commented Feb 10, 2020

OLM is not a prerequisite for using the SDK. But managing the Operator's lifecycle is not in the scope of the SDK.

@joelanford
Member

Closing as not a bug. Happy to continue the discussion to make sure we incorporate this (and any other feedback).
