Strimzi makes repeated LIST requests causing instability of Kubernetes Control Plane #7793

marseel · 2022-12-13T12:25:01Z

Describe the bug
strimzi-cluster-operator/0.31.1 does not use the Kubernetes API server cache when listing resources. All LIST calls go directly to etcd, which puts significant load on etcd and destabilizes the Kubernetes control plane.

Example logs from Kubernetes API Server:

"HTTP" verb="LIST" URI="/api/v1/namespaces/<PII>/secrets?labelSelector=..." latency="1.492406837s" userAgent="strimzi-cluster-operator/0.31.1" <PII>  apf_pl="workload-low" apf_fs="service-accounts" resp=200

Strimzi issues the same kind of LIST calls for configmaps/pods/persistentvolumeclaims/services/...

Short term mitigation:
For each LIST/GET request, set resourceVersion=0 so that the request is served from the Kubernetes API server cache without any interaction with etcd.
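
To make the mitigation concrete, here is a minimal sketch of how the query string changes. It only builds the request path; the helper name `build_list_path`, its parameters, and the namespace and cluster names are hypothetical and are not part of Strimzi or Fabric8. Note that a response served with resourceVersion=0 may be slightly stale, because it comes from the API server cache rather than from a fresh etcd read.

```python
from urllib.parse import urlencode


def build_list_path(namespace, resource, label_selector=None, from_cache=True):
    """Build a core-API LIST path (hypothetical helper, for illustration only)."""
    params = {}
    if label_selector:
        params["labelSelector"] = label_selector
    if from_cache:
        # resourceVersion=0 allows the API server to answer from its cache
        # instead of reading from etcd.
        params["resourceVersion"] = "0"
    path = f"/api/v1/namespaces/{namespace}/{resource}"
    return f"{path}?{urlencode(params)}" if params else path


# Placeholder namespace and cluster names.
cached_path = build_list_path(
    "my-namespace", "secrets", "strimzi.io/cluster=my-cluster"
)
uncached_path = build_list_path(
    "my-namespace", "secrets", "strimzi.io/cluster=my-cluster", from_cache=False
)
print(cached_path)
print(uncached_path)
```

The only difference between the two paths is the trailing `resourceVersion=0` parameter, so the change on the client side should be small.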

Long term solution:
Migrate to the List and Watch pattern.
Relevant documentation: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
https://cloud.google.com/kubernetes-engine/docs/concepts/planning-scalability#use_list_and_watch_pattern_instead_of_periodic_listing
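
As a rough illustration of the pattern, the sketch below keeps a local cache that is seeded by a single LIST and then updated from watch events, so later lookups never hit the API server. The `ResourceCache` class, the `pod` helper, and the simulated events are assumptions made for this example; a real client would receive the events from a watch connection and would also need to handle bookmarks, errors, and reconnects, which are omitted here.

```python
class ResourceCache:
    """Local cache fed by one LIST plus watch events (hypothetical class)."""

    def __init__(self):
        self.objects = {}             # object name -> object
        self.resource_version = None  # where to resume the watch

    def replace(self, items, resource_version):
        """Seed the cache from the initial LIST response."""
        self.objects = {item["metadata"]["name"]: item for item in items}
        self.resource_version = resource_version

    def apply(self, event):
        """Apply one watch event of type ADDED, MODIFIED, or DELETED."""
        obj = event["object"]
        name = obj["metadata"]["name"]
        if event["type"] in ("ADDED", "MODIFIED"):
            self.objects[name] = obj
        elif event["type"] == "DELETED":
            self.objects.pop(name, None)
        # Track the latest version so a reconnecting watch can resume from it.
        self.resource_version = obj["metadata"]["resourceVersion"]

    def select(self, labels):
        """Filter cached objects by labels locally, without calling the API server."""
        return [
            obj for obj in self.objects.values()
            if all(obj["metadata"].get("labels", {}).get(k) == v
                   for k, v in labels.items())
        ]


def pod(name, resource_version, labels):
    """Build a minimal pod-like object for the simulation."""
    return {"metadata": {"name": name,
                         "resourceVersion": resource_version,
                         "labels": labels}}


labels = {"strimzi.io/cluster": "my-cluster"}  # placeholder cluster name
cache = ResourceCache()
cache.replace([pod("kafka-0", "100", labels)], resource_version="100")

# Simulated watch stream; a real client would read these from the API server.
for event in [
    {"type": "ADDED", "object": pod("kafka-1", "101", labels)},
    {"type": "MODIFIED", "object": pod("kafka-0", "102", labels)},
    {"type": "DELETED", "object": pod("kafka-1", "103", labels)},
]:
    cache.apply(event)

names = sorted(p["metadata"]["name"] for p in cache.select(labels))
print(names, cache.resource_version)
```

With this approach, the label filtering that each LIST request currently triggers on the server happens against local state instead.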

Expected behavior
By default, Strimzi should use the Kubernetes API server cache and, ideally, the List and Watch pattern instead of repeated LIST calls.

Environment (please complete the following information):

Strimzi version: 0.31.1
Installation method: N/A
Kubernetes cluster: 1.23
Infrastructure: GKE

Additional context
Related issue that I've opened in Fabric8: fabric8io/kubernetes-client#4670

marseel · 2022-12-13T13:13:46Z

More precisely, LIST requests for pods, for example, have the following format:

/api/v1/namespaces/namespace-name/pods?labelSelector=strimzi.io/cluster=cluster-name,strimzi.io/name=some-name,strimzi.io/kind=Kafka

Requests of this type make a full LIST call to etcd, and the Kubernetes API server then filters the results. I've observed 10 QPS, which is significant for such expensive calls (not counting LISTs for other resources such as secrets), especially when there are tens of thousands of pods in the cluster.

marseel added the bug label Dec 13, 2022

marseel changed the title ~~Strimzi makes repeated causing instability of Kubernetes Control Plane~~ Strimzi makes repeated LIST requests causing instability of Kubernetes Control Plane Dec 13, 2022

marseel mentioned this issue Dec 13, 2022

Set resourceVersion=0 for GET/LIST calls by default fabric8io/kubernetes-client#4670

Closed

strimzi locked and limited conversation to collaborators Dec 13, 2022

scholzj converted this issue into discussion #7794 Dec 13, 2022

This issue was moved to a discussion.
