Optimizing Gatekeeper policy with large inventory #563
Hi Paul! This is the main page I'm aware of WRT Rego performance optimization: https://www.openpolicyagent.org/docs/latest/policy-performance/

Here is the entry point for the Rego code that the constraint framework generates, in case that affects optimizations (note the use of the …). As far as I can tell, Rego does not perform any kind of indexing over cached data itself, merely over Rego code. Happy to be corrected if I'm wrong there.

Another idea could be to write your own TargetHandler. That would trade the infrastructure complexity of maintaining a separate external data cache, watches, etc. for the complexity of maintaining a Gatekeeper fork. It's not clear to me which would be less complexity overall over time.
Hi, folks.
I have implemented a Gatekeeper Constraint Template that allows us to create Constraints that prevent tenants of a Kubernetes cluster from creating Pods if they have too many Pods in a Pending state, or too many Pods in a Running state, etc.
In order to get the counts, the Rego must (as best I can tell) be written such that it enumerates the entire pod inventory. On our large multi-tenant clusters that can have several hundred namespaces and tens of thousands of Pods, this runs very slowly, taking several seconds to execute in the best case, and a few tens of seconds to execute in the worst case (and thus falls afoul of our validating webhook timeout).
I'd like to understand how best to optimize this. Here's a snippet of the policy I have now:
As best I can tell, the evaluation and assignment of `project_pods` is what takes the longest. But as you can see, the objective is not to actually look at any specific details of the Pods (aside from seeing what phase they're in). The goal is just to count the Pods and trigger a violation if those counts exceed limits defined in the Constraint.
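For reference, a counting rule of this shape over Gatekeeper's cached inventory looks roughly like the sketch below. This is illustrative only, not the exact snippet from the policy: the `input.parameters.limits` name is assumed, the per-project aggregation is left out, and the `data.inventory.namespace[<ns>]["v1"]["Pod"]` layout follows Gatekeeper's data-replication docs.

```rego
package k8spodphaselimits

# Illustrative sketch, not the exact policy from this discussion.
# Assumes Pods are synced into Gatekeeper's cache, so they appear under
# data.inventory.namespace[<namespace>]["v1"]["Pod"][<name>], and that the
# Constraint supplies limits such as {"Pending": 10, "Running": 50} in
# input.parameters.limits.

# Every cached Pod in the given namespace that is in the given phase.
# Walking the whole inventory here is the expensive part (the equivalent
# of the `project_pods` assignment mentioned above).
pods_in_phase(ns, phase) = pods {
    pods := [p |
        p := data.inventory.namespace[ns]["v1"]["Pod"][_]
        p.status.phase == phase
    ]
}

violation[{"msg": msg}] {
    some phase
    limit := input.parameters.limits[phase]
    ns := input.review.namespace
    count(pods_in_phase(ns, phase)) >= limit
    msg := sprintf("namespace %v already has %v or more pods in phase %v", [ns, limit, phase])
}
```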
So I'm considering setting up an external data provider that supplies these pod counts directly. The provider would periodically (probably once a minute) fetch the list of all Pods from the kube-apiserver, generate per-project and per-namespace counts, and serve those (cached) values as key-value pairs that Gatekeeper can retrieve. Gatekeeper would further cache the provider's API responses for a minute or two, so the provider would not need to be particularly powerful. The policy could then be simplified to a single external data lookup for the counts it needs, making it run in constant time.
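Sketching that idea, the policy side would collapse to something like the following. Hedged heavily: the provider name `pod-count-provider`, its `"<namespace>/<phase>"` key format, and the parameter names are all hypothetical; only the `external_data` built-in and its response fields (`responses`, `errors`, `system_error`) follow Gatekeeper's documented external data interface.

```rego
package k8spodphaselimits

# Illustrative sketch. "pod-count-provider" and its key/value format are
# hypothetical; the external_data call and response fields follow
# Gatekeeper's external data interface.

violation[{"msg": msg}] {
    some phase
    limit := input.parameters.limits[phase]
    ns := input.review.namespace

    # Ask the provider for the pre-computed count, e.g. key "team-a/Pending".
    resp := external_data({
        "provider": "pod-count-provider",
        "keys": [sprintf("%v/%v", [ns, phase])]
    })
    count(resp.system_error) == 0
    count(resp.errors) == 0

    # resp.responses is a list of [key, value] pairs from the provider;
    # assume the value is the count, possibly returned as a string.
    current := to_number(resp.responses[_][1])
    current >= limit
    msg := sprintf("namespace %v already has %v pods in phase %v (limit %v)", [ns, current, phase, limit])
}
```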
Of course, the downside to this approach is the complexity of adding and maintaining an external data provider. There's also the non-trivial extra load on the kube-apiserver and etcd from a controller querying all the Pods every minute. I could perhaps have the controller establish a watch on Pods and increment/decrement counters at runtime to make that more efficient.
I would prefer not to go to all this trouble if there's a Rego trick that would optimize this sort of computation. Any suggestions?
Aside: you know ResourceQuotas exist, right? Why not just deploy ResourceQuotas in every namespace to control this?
Yes, I'm aware of ResourceQuotas. The problem is, the quota for pod counts is just a static count. You can't set more complex limits like "you may have only 10 Pending Pods in this namespace, after which no new Pods may be created; you may run up to 50 Pods if they are all Running; you may not create any more Pods if you have more than 5 Pods in CrashLoopBackOff" -- these sorts of rules are not expressible in a ResourceQuota, but they are essential for keeping our multi-tenant, on-prem, bare-metal clusters healthy.
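The CrashLoopBackOff case in particular can only be expressed by inspecting container statuses, which is roughly why a policy engine is needed here at all. A sketch, building on the same assumed inventory layout as above, with `max_crashloop_pods` as a hypothetical Constraint parameter:

```rego
package k8spodphaselimits

# Sketch of the CrashLoopBackOff case, which ResourceQuota cannot express.
# Same assumed inventory layout and hypothetical parameter names as the
# earlier sketches.

# A Pod is considered crash-looping if any of its containers is waiting
# with reason CrashLoopBackOff.
pod_in_crashloop(p) {
    cs := p.status.containerStatuses[_]
    cs.state.waiting.reason == "CrashLoopBackOff"
}

violation[{"msg": msg}] {
    ns := input.review.namespace
    crashing := [p |
        p := data.inventory.namespace[ns]["v1"]["Pod"][_]
        pod_in_crashloop(p)
    ]
    count(crashing) > input.parameters.max_crashloop_pods
    msg := sprintf("namespace %v has %v pods in CrashLoopBackOff (max %v)", [ns, count(crashing), input.parameters.max_crashloop_pods])
}
```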