Work around Kubernetes bug causing garbage collection across namespaces to happen when ownerReferences are set #3986
Comments
IIUC the GC would delete the Secret in the cluster namespace and the copied …
I'm afraid that we are trading a very serious but rare bug for a mostly harmless one that affects 100% of users. IIUC, all …
#3992 proposes approach number 3 (don't set the ownerRef + garbage collect orphan secrets). I feel like that's a very acceptable trade-off? Edit: seems easy enough to also implement gc on operator startup to cover the operator not running case.
@sebgl, I think I wasn't clear by just quoting - I was referring only to solution 1, in which we would ignore the orphaned secret. As to your PR (approach 3), I'm all for it 👍
This Kubernetes bug leads to very dangerous situations for ECK users currently.
As we explained in the ECK documentation, the following may happen:
- Users copy the elastic user password from one namespace to another, in order to use it in other applications
- If they use the delete retention policy, all Elasticsearch data gets immediately deleted. Even if they use the retain retention policy, it is likely that new PVCs will get created along with empty fresh PVs
This bug will be fixed in Kubernetes 1.20.
In general, we try to not work around Kubernetes bugs in ECK, and instead encourage people to upgrade their Kubernetes version, but here the impact is dangerous enough to deserve being addressed in ECK directly.
I see several solutions to this problem:
1. Don't set ownerReferences on Secrets users are likely to copy around
I think there are 2 Secrets users can be tempted to copy across namespaces:
- the elastic user password
We could also potentially add secrets of other stack components to that list:
Those additional secrets are also impacted by the GC bug, but will not cause Elasticsearch data loss.
On those secrets, we could decide not to set an ownerReference, which means they will remain as orphan Kubernetes resources if the user decides to delete the Elasticsearch cluster.
Overall it means replacing the ownerReference bug with another bug (not cleaning secrets on deletion), but that second bug is much less dangerous.
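As a rough illustration of approach 1, here is a minimal Python sketch (ECK itself is written in Go, so this is a model rather than operator code). It builds Secret metadata and skips the ownerReference for Secrets considered copy-prone. The suffix list, the helper name `build_secret_metadata`, and the example object names are assumptions made for illustration.

```python
# Assumption: illustrative list of name suffixes for Secrets that users are
# likely to copy across namespaces. The real list would be decided in ECK.
COPY_PRONE_SUFFIXES = ("-es-elastic-user",)


def build_secret_metadata(name, namespace, owner):
    """Return Secret metadata, omitting ownerReferences for copy-prone Secrets.

    `owner` is a dict with apiVersion, kind, name and uid of the parent
    resource (hypothetical shape, mirroring a Kubernetes ownerReference).
    """
    metadata = {"name": name, "namespace": namespace}
    if name.endswith(COPY_PRONE_SUFFIXES):
        # No ownerReference: a copy of this Secret in another namespace can
        # no longer point the garbage collector at the parent resource.
        # The Secret becomes an orphan when the cluster is deleted.
        return metadata
    metadata["ownerReferences"] = [{
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["name"],
        "uid": owner["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }]
    return metadata
```

The trade-off described above is visible in the early return: the copy-prone Secret is safe to copy, but nothing cleans it up on deletion.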
2. Set the ownerReference of those Secrets to target a different, harmless resource
To be tested; I'm not sure it works.
We could decide to bind the ownerReference of those secrets to a different resource, and make that different resource have an ownerReference to the Elasticsearch resource. So the parent-child tree looks as follows:
Elasticsearch resource -> intermediate resource -> Secrets
That intermediate resource could either be an existing resource we already manage or a new placeholder resource, for example an empty ConfigMap whose only purpose is to work around the ownerRef bug.
When the k8s garbage collector decides to clean up parent resources, I think it may (to be tested) only delete that intermediate ConfigMap, which we don't care much about, and not touch other siblings in the dependency tree (i.e. PersistentVolumeClaims).
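The ownership chain in approach 2 could be wired roughly as in the Python sketch below, which models objects as plain dicts. It only shows how the references would be set and walked; it does not simulate the garbage collector, whose behaviour with this layout is still untested. All object names, UIDs, and helper names are hypothetical.

```python
def owner_ref(owner):
    """Build an ownerReference entry pointing at the given object."""
    meta = owner["metadata"]
    return {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": meta["name"],
        "uid": meta["uid"],
    }


def set_owner(child, owner):
    """Return a copy of `child` whose only owner is `owner`."""
    metadata = dict(child["metadata"], ownerReferences=[owner_ref(owner)])
    return dict(child, metadata=metadata)


def owner_chain(obj, objects_by_uid):
    """Follow the first ownerReference upward; return the kinds encountered."""
    chain = []
    refs = obj["metadata"].get("ownerReferences", [])
    while refs:
        parent = objects_by_uid[refs[0]["uid"]]
        chain.append(parent["kind"])
        refs = parent["metadata"].get("ownerReferences", [])
    return chain


# Hypothetical objects: names and UIDs are made up for illustration.
es = {"apiVersion": "elasticsearch.k8s.elastic.co/v1", "kind": "Elasticsearch",
      "metadata": {"name": "quickstart", "namespace": "default", "uid": "uid-es"}}
placeholder = set_owner(
    {"apiVersion": "v1", "kind": "ConfigMap",
     "metadata": {"name": "quickstart-owner-placeholder",
                  "namespace": "default", "uid": "uid-cm"}},
    es)
secret = set_owner(
    {"apiVersion": "v1", "kind": "Secret",
     "metadata": {"name": "quickstart-es-elastic-user",
                  "namespace": "default", "uid": "uid-secret"}},
    placeholder)
objects_by_uid = {o["metadata"]["uid"]: o for o in (es, placeholder, secret)}
```

With this layout, a copied Secret carries only the placeholder's UID, never the UID of the Elasticsearch resource that the PersistentVolumeClaims reference.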
3. Don't set an ownerRef on those secrets, and handle the garbage collection in ECK
We already garbage collect some secrets related to resource associations (e.g. credentials for the Kibana user across Elasticsearch namespace + Kibana namespace).
We could do the same with those orphan secrets.
This may be trickier than it sounds, since we need to ensure we properly deal with cache inconsistencies (for example: deleting the Secret while we don't see the Elasticsearch resource yet in the cache).
It's also not a "perfect" solution in the sense that deleting ECK would preserve the orphan secrets.
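To make the cache-inconsistency concern in approach 3 concrete, here is a minimal Python sketch of an orphan check that never trusts a cache miss alone. The label keys, the function name, and the `owner_exists_uncached` callback (standing in for a direct, uncached API read) are assumptions for illustration, not ECK's actual implementation.

```python
# Assumption: hypothetical label keys recording which resource a Secret
# belongs to, since no ownerReference is set in this approach.
OWNER_NAMESPACE_LABEL = "example.com/owner-namespace"
OWNER_NAME_LABEL = "example.com/owner-name"


def find_orphan_secrets(secrets, cached_owners, owner_exists_uncached):
    """Return the Secrets whose owning resource no longer exists.

    secrets: list of Secret dicts with metadata.labels.
    cached_owners: set of (namespace, name) tuples seen in the local cache.
    owner_exists_uncached: callable(namespace, name) -> bool that performs a
        direct read, used to confirm a cache miss before declaring an orphan.
    """
    orphans = []
    for secret in secrets:
        labels = secret["metadata"].get("labels", {})
        owner_key = (labels.get(OWNER_NAMESPACE_LABEL),
                     labels.get(OWNER_NAME_LABEL))
        if None in owner_key:
            continue  # not tracked by this garbage collection
        if owner_key in cached_owners:
            continue  # owner is known to exist
        # The cache may simply not have seen the owner yet, so confirm
        # with an uncached read before treating the Secret as orphaned.
        if owner_exists_uncached(*owner_key):
            continue
        orphans.append(secret)
    return orphans
```

The same function could be run once at operator startup to cover deletions that happened while the operator was not running, as suggested in the comments.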
Order of operations. A reasonable plan to handle the above: