
[VC-35377] Reduce memory by excluding Helm release Secrets and some standard Secret types #554

Merged
merged 7 commits into jetstack:master on Aug 22, 2024

Conversation

@wallrj (Member) commented Aug 9, 2024

This PR aims to solve the problem of large startup memory use in the agent.
The agent lists all Secrets when it starts up and this balloons the startup memory usage, because all the data of all the Secrets is returned from the API server in one SecretList which must be held in memory while it is unmarshalled.
This is a symptom of the List then Watch behaviour of the client-go informer abstraction. (See "Why does agent request all the Secrets from the K8S API server when it starts up?" below.)

The venafi-kubernetes-agent lists, watches and caches all Secret resources in the cluster.
It sends some metadata of every Secret to Venafi,
plus the data of the tls.crt and ca.crt files if those keys exist in the Secret.

// SecretSelectedFields is the list of fields sent from Secret objects to the
// backend
var SecretSelectedFields = []string{
	"kind",
	"apiVersion",
	"metadata.annotations",
	"metadata.name",
	"metadata.namespace",
	"metadata.ownerReferences",
	"metadata.selfLink",
	"metadata.uid",
	"type",
	"/data/tls.crt",
	"/data/ca.crt",
}

Supplying a field selector to the client causes the server to return only the resources which match the selector. This reduces the size of the response, and therefore the size of the SecretList which is in memory after the initial listing.

In this PR I add a field selector to exclude Helm release Secrets and all the standard Kubernetes Secret types which are not expected to contain tls.crt or ca.crt files.
I wanted to exclude Helm Secrets in particular because they are often large and often numerous.
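For illustration, here is a minimal sketch (not this PR's actual code) of building such a selector with k8s.io/apimachinery/pkg/fields and supplying it to a List call; the excluded types come from this PR, everything else is assumed boilerplate:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Exclude Secret types which are not expected to contain tls.crt / ca.crt.
	selector := fields.AndSelectors(
		fields.OneTermNotEqualSelector("type", "helm.sh/release.v1"),
		fields.OneTermNotEqualSelector("type", "kubernetes.io/service-account-token"),
	)

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The server applies the selector, so excluded Secrets never appear in
	// the SecretList that the client must unmarshal and hold in memory.
	list, err := client.CoreV1().Secrets(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: selector.String(),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(list.Items), "Secrets after server-side filtering")
}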

This approach was also considered for cert-manager cainjector but was rejected there in favour of using the metadata-only API.

There is a chance that this change will cause the agent to ignore some public TLS certificates which it previously reported, because users might add tls.crt and ca.crt files to kubernetes.io/dockerconfigjson Secrets; @hawksight demonstrates this in the comments below.

Why does agent request all the Secrets from the K8S API server when it starts up?

To populate its in-memory cache. It sends the following requests:

GET /api/v1/secrets?limit=500&resourceVersion=0
GET /api/v1/secrets?allowWatchBookmarks=true&resourceVersion=1360&timeoutSeconds=493&watch=true

The first request is a List request, which returns all the Secret resources, to populate the cache.
The second request is a Watch request, which streams all changes to all Secrets in the cluster, so that the agent can keep its cache up-to-date.

ℹ️ This list-then-watch logic is provided by the Kubernetes Go client library.
The API is called a Reflector and is located in the k8s.io/client-go/tools/cache package.
See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

ℹ️ The limit=500 parameter is actually ignored when resourceVersion=0.
This is a known contradiction and cause of confusion.
See kubernetes/kubernetes#102672
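The agent builds this list-then-watch machinery via a shared informer for each data gatherer. A minimal sketch of the general pattern using client-go's dynamicinformer package, with a field selector applied to both the List and the Watch (the PR's actual wiring differs in detail):

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func watchSecrets(config *rest.Config, fieldSelector string, stopCh <-chan struct{}) error {
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		return err
	}
	// The tweak function is applied to both the initial List and the
	// subsequent Watch issued by the informer's Reflector.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client,
		60*time.Second,
		metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			opts.FieldSelector = fieldSelector
		},
	)
	secrets := schema.GroupVersionResource{Version: "v1", Resource: "secrets"}
	informer := factory.ForResource(secrets).Informer()
	factory.Start(stopCh)
	// Block until the initial List has populated the in-memory cache.
	cache.WaitForCacheSync(stopCh, informer.HasSynced)
	return nil
}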

Alternatives considered

Use metadata-only API

We could in theory list only the metadata of the Secrets, inspect the field-ownership (managedFields) metadata for the tls.crt and ca.crt keys, and then get the full data of only those Secrets. But this would be more complex to implement and would become obsolete as soon as the WatchList feature becomes GA.
And once WatchList is usable, the Agent can efficiently get the data of all Secrets and do deeper inspection of all the files for public certificates, not just the tls.crt and ca.crt files.

Use the WatchList feature

In future, the agent can use the WatchList feature, but it is not yet stable.
This will remove the initial request to List all resources; instead, the agent will be able to stream and incrementally decode all the resources that it discovers, which should dramatically reduce the startup memory.
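For reference, the streaming list described in the Kubernetes API concepts documentation replaces the initial List with a single watch request of roughly this shape (illustrative; not a request made by this PR):

GET /api/v1/secrets?watch=true&sendInitialEvents=true&resourceVersionMatch=NotOlderThan&allowWatchBookmarks=true

The server streams synthetic ADDED events for the existing Secrets, then a BOOKMARK event marking the end of the initial state, after which the stream behaves like an ordinary Watch.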

Only list Opaque and kubernetes.io/tls Secret types

Because those are the Secret types which commonly contain a tls.crt or ca.crt file.

Rejected for the following reasons:

  1. Field selectors do not support "set" operators, so it is not possible to write a selector that only selects Secrets of type Opaque OR kubernetes.io/tls (see the sketch after this list).
  2. This would exclude Secrets having a bespoke type but which contain tls.crt and ca.crt keys.
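A quick demonstration of the missing set operators, using fields.ParseSelector (an illustrative sketch, not code from this PR):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/fields"
)

func main() {
	// Set operators are not part of the field selector grammar, so this fails
	// to parse.
	if _, err := fields.ParseSelector("type in (Opaque, kubernetes.io/tls)"); err != nil {
		fmt.Println("set operator rejected:", err)
	}

	// Only =, ==, and != are supported, so exclusion terms must be ANDed.
	sel, err := fields.ParseSelector("type!=helm.sh/release.v1,type!=kubernetes.io/dockercfg")
	if err != nil {
		panic(err)
	}
	fmt.Println(sel.String())
}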

Testing

# examples/one-shot-secrets.yaml
organization_id: "my-organization"
cluster_id: "my_cluster"
period: 1m
data-gatherers:
- kind: "k8s-dynamic"
  name: "k8s/secrets.v1"
  config:
    resource-type:
      version: v1
      resource: secrets
export TIME='{
    "cpu_seconds_user": %U,
    "cpu_seconds_system": %S,
    "time_elapsed": "%E",
    "cpu_percent": "%P",
    "text_kilobytes": %X,
    "data_kilobytes": %D,
    "rss_max_kilobytes": %M,
    "fs_inputs": %I,
    "fs_outputs": %O,
    "page_faults_major": %F,
    "page_faults_minor": %R,
    "swaps": %W
}'
git checkout origin/master
make build
/usr/bin/time -o resources.master.json builds/preflight agent -c examples/one-shot-secrets.yaml --one-shot --output-path output.master.json

git checkout memory-optimization
make build
/usr/bin/time -o resources.branch.json builds/preflight agent -c examples/one-shot-secrets.yaml --one-shot --output-path output.branch.json
$ diff -u resources.{master,branch}.json
--- resources.master.json       2024-08-13 09:58:55.272480838 +0100
+++ resources.branch.json       2024-08-13 10:00:16.592478779 +0100
@@ -1,14 +1,14 @@
 {
-    "cpu_seconds_user": 0.08,
-    "cpu_seconds_system": 0.04,
-    "time_elapsed": "0:00.13",
-    "cpu_percent": "89%",
+    "cpu_seconds_user": 0.03,
+    "cpu_seconds_system": 0.00,
+    "time_elapsed": "0:00.12",
+    "cpu_percent": "27%",
     "text_kilobytes": 0,
     "data_kilobytes": 0,
-    "rss_max_kilobytes": 41116,
-    "fs_inputs": 16,
-    "fs_outputs": 16,
+    "rss_max_kilobytes": 33892,
+    "fs_inputs": 0,
+    "fs_outputs": 8,
     "page_faults_major": 0,
-    "page_faults_minor": 5073,
+    "page_faults_minor": 1316,
     "swaps": 0
 }

Given a cluster containing the following Secrets:

$ kubectl get secrets -A
NAMESPACE     NAME                                                            TYPE                             DATA   AGE
kube-system   bootstrap-token-abcdef                                          bootstrap.kubernetes.io/token    6      5h24m
kube-system   sh.helm.release.v1.metrics-server.v1                            helm.sh/release.v1               1      5h24m
kube-system   sh.helm.release.v1.metrics-server.v2                            helm.sh/release.v1               1      5h21m
kube-system   sh.helm.release.v1.metrics-server.v3                            helm.sh/release.v1               1      5h19m
kube-system   sh.helm.release.v1.metrics-server.v4                            helm.sh/release.v1               1      5h18m
kube-system   sh.helm.release.v1.metrics-server.v5                            helm.sh/release.v1               1      5h15m
prometheus    default-grafana                                                 Opaque                           3      5h24m
prometheus    default-kube-prometheus-st-admission                            Opaque                           3      5h24m
prometheus    prometheus-default-kube-prometheus-st-prometheus                Opaque                           1      5h24m
prometheus    prometheus-default-kube-prometheus-st-prometheus-tls-assets-0   Opaque                           1      5h24m
prometheus    prometheus-default-kube-prometheus-st-prometheus-web-config     Opaque                           1      5h24m
prometheus    sh.helm.release.v1.default.v1                                   helm.sh/release.v1               1      5h24m
prometheus    sh.helm.release.v1.default.v2                                   helm.sh/release.v1               1      5h21m
prometheus    sh.helm.release.v1.default.v3                                   helm.sh/release.v1               1      5h19m
prometheus    sh.helm.release.v1.default.v4                                   helm.sh/release.v1               1      5h18m
prometheus    sh.helm.release.v1.default.v5                                   helm.sh/release.v1               1      5h15m
venafi        agent-credentials                                               Opaque                           2      5h23m
venafi        cert-manager-webhook-ca                                         Opaque                           3      5h17m
venafi        sh.helm.release.v1.cert-manager-csi-driver-spiffe.v1            helm.sh/release.v1               1      5h17m
venafi        sh.helm.release.v1.cert-manager.v1                              helm.sh/release.v1               1      5h18m
venafi        sh.helm.release.v1.tlspk-monitoring.v1                          helm.sh/release.v1               1      5h15m
venafi        venafi-credentials                                              Opaque                           2      5h23m
venafi        venafi-image-pull-secret                                        kubernetes.io/dockerconfigjson   1      5h23m

@wallrj wallrj force-pushed the memory-optimization branch from b24ffa5 to 6eed6f4 Compare August 9, 2024 14:47
@wallrj wallrj changed the title WIP: Reduce memory by excluding Helm release Secrets and standard Secret types other than tls and Opaque Reduce memory by excluding Helm release Secrets and standard Secret types other than tls and Opaque Aug 9, 2024
@wallrj wallrj requested a review from maelvls August 9, 2024 16:11
@maelvls (Member) commented Aug 9, 2024

Your changes are sensible and I don't think anything will break as a consequence.

I can confirm that the Agent no longer receives/unmarshals Helm secrets. To double check that, I used mitmproxy and looked at the responses to the requests sent to the API server:

Before:

GET https://me:58453/api/v1/secrets?limit=500&resourceVersion=0 HTTP/2.0
accept: application/json
user-agent: preflight/v0.0.0 (darwin/arm64) kubernetes/$Format
cert-manager-webhook-ca                  Opaque
example                                  kubernetes.io/tls
sh.helm.release.v1.cert-manager.v1       helm.sh/release.v1
sh.helm.release.v1.cert-manager.v2       helm.sh/release.v1
cert-manager-webhook-ca                  Opaque
sh.helm.release.v1.cert-manager.v1       helm.sh/release.v1
apikey                                   Opaque
sh.helm.release.v1.venafi-connection.v1  helm.sh/release.v1

After:

GET https://localhost:58453/api/v1/secrets?fieldSelector=%2Ctype%21%3Dkubernetes.io%2Fservice-account-token%2Ctype%21%3Dkubernetes.io%2Fdockercfg%2Ctype%21%3Dkubernetes.io%2Fdockerconfigjson%2Ctype%21%3Dkubernetes.io%2Fbasic-auth%2Ctype%21%3Dkubernetes.io%2Fssh-auth%2Ctype%21%3Dbootstrap.kubernetes.io%2Ftoken%2Ctype%21%3Dhelm.sh%2Frelease.v1&limit=500&resourceVersion=0 HTTP/2.0
accept: application/json
user-agent: preflight/v0.0.0 (darwin/arm64) kubernetes/$Format
cert-manager-webhook-ca  Opaque
example                  kubernetes.io/tls
cert-manager-webhook-ca  Opaque
apikey                   Opaque

This should dramatically reduce the memory used at boot time when the Agent unmarshals the whole response at once.

--- resources.branch.json       2024-08-09 15:22:15.129798018 +0100
+++ resources.master.json       2024-08-09 15:23:22.344022251 +0100
@@ -1,14 +1,14 @@

I got a little confused by the ordering of your diff; shouldn't the master branch be - and your branch be + such that it reads "before/after"?

listing all Secrets places great load on the K8S API server

Do you think using selectors decreases the memory usage (and CPU usage) of the API server when the Agent lists the secrets?

My belief was that the API server still had to keep all the secrets in memory before filtering them, meaning that the memory load on the API server would be the same with and without the field selectors.

@maelvls (Member) commented Aug 9, 2024

Additional info on how to use mitmproxy with the agent:

First, install kubectl-incluster. It helps transform kubeconfigs in a way that works with mitmproxy; I use it to inject mitmproxy's CA into the kubeconfig so that the Agent doesn't complain about not trusting the proxy.

Get the script that prevents mitmproxy from blocking on long-lasting requests:

curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py

Then, set HTTPS_PROXY and KUBECONFIG:

export HTTPS_PROXY=http://localhost:9090
export KUBECONFIG=/tmp/kube
KUBECONFIG= HTTPS_PROXY= kubectl incluster --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem | sed 's|127.0.0.1|me|' >/tmp/kube

Then, make sure to have in your /etc/hosts the following line:

127.0.0.1 me

(Read https://maelvls.dev/go-ignores-proxy-localhost/ to learn why)

Create a key pair for the Private Key JWT authentication:

venctl iam service-account agent create --name "temp" \
  --vcp-region US \
  --output json \
  --output-file agent-credentials.json \
  --api-key $APIKEY

Finally, run the agent:

go run . agent -c examples/one-shot-secrets.yaml \
  --client-id $(jq -r .client_id agent-credentials.json) \
  --private-key-path <(jq -r .private_key agent-credentials.json) \
  -p 5m

@hawksight (Contributor) commented

@wallrj - would we want to make this filtering optional, and off by default?

I think that discovering all identities regardless of their secret type is a key part of the product.
Yes we want performance and in large clusters this might be a good configuration option.

But from a security point of view, you don't want to filter any out? You actually want to find all those certs that might have been hidden on purpose. For example I was able to just apply whatever secret type I wanted here:

apiVersion: v1
data:
  tls.crt: <REDACT>
  tls.key: <REDACT>
kind: Secret
metadata:
  name: definitely-not-a-cert
  namespace: default
type: my-company.io/durp

It seems some types require certain fields, but I'm wondering if it's easy to just hide a cert in some other secret. In those cases we would be missing certs and not giving full visibility?

@maelvls (Member) commented Aug 13, 2024

I think that discovering all identities regardless of their secret type is a key part of the product. It's easy to just hide a cert in some other secret. In those cases we would be missing certs and not giving full visibility?

Thanks for pointing it out. Here is the selector Richard went with:

fieldSelector=
  type!=kubernetes.io/service-account-token,
  type!=kubernetes.io/dockercfg,
  type!=kubernetes.io/dockerconfigjson,
  type!=kubernetes.io/basic-auth,
  type!=kubernetes.io/ssh-auth,
  type!=bootstrap.kubernetes.io/token,
  type!=helm.sh/release.v1

Richard went with a query parameter that selects the "open" set defined by the complement of the set of excluded types. This is a good thing as it doesn't exclude unknown types that may appear in the future. In your example, the custom type my-company.io/durp would be picked up by the agent.

Is it easy to just hide a cert in some other secret? In those cases we would be missing certs and not giving full visibility?

Imagining that a person really wants to hide a cert, you are right, they could very well use one of the excluded types above, e.g. helm.sh, and hide the certificate in there by using a special key such as mycert. It should work as "types" don't prevent people from adding extra keys (example: the type kubernetes.io/tls lets you add additional keys).

But I don't think people aim to hide certificates; why would they?

Unfortunately, there isn't a good way of filtering out types based on what they contain server-side. For example, we can't include a dockercfg in which there is a tls.crt key or some random cert key containing a PEM-encoded certificate, unless we do it on the client-side, which is what we are trying to avoid doing.

I think excluding these types is a risk worth taking compared to the downside of a huge memory consumption (more than 3 GiB) in the first 30 seconds after starting the agent.

Side note: if we were to parse the YAML contained in these helm.sh secrets, we would surely find certificates... but they would obviously appear elsewhere in a secret or configmap, so the agent would pick them up anyways.

@maelvls (Member) commented Aug 13, 2024

By the way, I think we should also look at the other resources (besides secrets). On large clusters with tons of Pods, listing all the pods is also a problem... But I can't see a good way of filtering out "some" pods server-side; the only filtering that can happen is client-side, e.g. only reporting pods that mount secrets (see the sketch below).
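For illustration, such a client-side filter might look like this (a hypothetical sketch, not part of this PR):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podMountsSecret reports whether a Pod references a Secret, either as a
// volume or via envFrom. Hypothetical; no such filter exists in this PR.
func podMountsSecret(pod *corev1.Pod) bool {
	for _, vol := range pod.Spec.Volumes {
		if vol.Secret != nil {
			return true
		}
	}
	for _, c := range pod.Spec.Containers {
		for _, ef := range c.EnvFrom {
			if ef.SecretRef != nil {
				return true
			}
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{{
				Name: "certs",
				VolumeSource: corev1.VolumeSource{
					Secret: &corev1.SecretVolumeSource{SecretName: "tls"},
				},
			}},
		},
	}
	fmt.Println(podMountsSecret(pod)) // true
}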

@hawksight (Contributor) commented Aug 13, 2024

echo "{}" > dockerconfig.json
k create secret docker-registry no-looky-here --from-file=.dockerconfigjson=dockerconfig.json --from-file=tls.crt=a-cert.crt

For example:

> k get secret no-looky-here
NAME            TYPE                             DATA   AGE
no-looky-here   kubernetes.io/dockerconfigjson   2      23m

@hawksight (Contributor) commented

Thanks @wallrj & @maelvls. As discussed at the standup, and as Mael states above, there's nothing to stop someone adding extra keys to those kubernetes.io/xxx secret types, as in the comment above.

That said, right now, someone could actually just change the data field key name and we also wouldn't detect that, because, let's say, token doesn't match tls.crt or ca.crt?

» k create secret docker-registry real-docker-config-promise --from-file=.dockerconfigjson=dockerconfig.json --from-file=token=a-cert.crt
secret/real-docker-config-promise created
» k get secret real-docker-config-promise -o yaml | yq '.data | keys'
- .dockerconfigjson
- token

So actually, regardless of the type, it should be fairly easy to hide a cert from our agent, if someone really wanted to.
Regarding performance vs coverage, let's talk with @achuchev on that topic and put that in another thread.

Technically your PR looks like it solves the problem in mind.
I do think it would be beneficial to make it configurable now and set the default to what you have here:

ignoreSecretTypes = [ "kubernetes.io/service-account-token", "kubernetes.io/dockercfg", "kubernetes.io/dockerconfigjson", "kubernetes.io/basic-auth", "kubernetes.io/ssh-auth", "bootstrap.kubernetes.io/token", "helm.sh/release.v1" ]

It could be useful for testing purposes (without code change), and to give customers the option to include / exclude secret types depending on their performance needs.

@wallrj (Member Author) commented Aug 13, 2024

@maelvls

Do you think using selectors decreases the memory usage (and CPU usage) of the API server when the Agent lists the secrets?

I believe it should, for the reasons given in the WatchList design document. But I haven't measured it and therefore I've removed that as a motivation from the PR description.

@wallrj (Member Author) commented Aug 13, 2024

@hawksight

there's nothing to stop someone adding extra keys to those kubernetes.io/xxx secret types as in the comment above

I take your point, but it seems unlikely that someone would add a tls.crt or ca.crt file to a dockerconfigjson Secret, and even if they did, would it show up in the TLSPK web UI?

The UI does show leaf (not CA) Certificates found in any type of Kubernetes Secret.

And with the following config and dockerconfigjson Secret containing the leaf cert from example.com, the cert does show up in the UI.

organization_id: "richardw"
cluster_id: "richardw-1"
period: 1m
data-gatherers:
- kind: "k8s-dynamic"
  name: "k8s/secrets"
  config:
    include-namespaces: ["default"]
    resource-type:
      version: v1
      resource: secrets
server: "https://api.venafi.cloud/"
venafi-cloud:
  uploader_id: "uploader-id-1"
  upload_path: "/v1/tlspk/upload/clusterdata"
apiVersion: v1
kind: Secret
metadata:
  generateName: test-
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {
    "auths": {
      "registry.example.com": {
        "username": "user-1",
        "password": "password-1"
      }
    }
    }
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    MIIHbjCCBlagAwIBAgIQB1vO8waJyK3fE+Ua9K/hhzANBgkqhkiG9w0BAQsFADBZ
    MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMTMwMQYDVQQDEypE
    aWdpQ2VydCBHbG9iYWwgRzIgVExTIFJTQSBTSEEyNTYgMjAyMCBDQTEwHhcNMjQw
    MTMwMDAwMDAwWhcNMjUwMzAxMjM1OTU5WjCBljELMAkGA1UEBhMCVVMxEzARBgNV
    BAgTCkNhbGlmb3JuaWExFDASBgNVBAcTC0xvcyBBbmdlbGVzMUIwQAYDVQQKDDlJ
    bnRlcm5ldMKgQ29ycG9yYXRpb27CoGZvcsKgQXNzaWduZWTCoE5hbWVzwqBhbmTC
    oE51bWJlcnMxGDAWBgNVBAMTD3d3dy5leGFtcGxlLm9yZzCCASIwDQYJKoZIhvcN
    AQEBBQADggEPADCCAQoCggEBAIaFD7sO+cpf2fXgCjIsM9mqDgcpqC8IrXi9wga/
    9y0rpqcnPVOmTMNLsid3INbBVEm4CNr5cKlh9rJJnWlX2vttJDRyLkfwBD+dsVvi
    vGYxWTLmqX6/1LDUZPVrynv/cltemtg/1Aay88jcj2ZaRoRmqBgVeacIzgU8+zmJ
    7236TnFSe7fkoKSclsBhPaQKcE3Djs1uszJs8sdECQTdoFX9I6UgeLKFXtg7rRf/
    hcW5dI0zubhXbrW8aWXbCzySVZn0c7RkJMpnTCiZzNxnPXnHFpwr5quqqjVyN/aB
    KkjoP04Zmr+eRqoyk/+lslq0sS8eaYSSHbC5ja/yMWyVhvMCAwEAAaOCA/IwggPu
    MB8GA1UdIwQYMBaAFHSFgMBmx9833s+9KTeqAx2+7c0XMB0GA1UdDgQWBBRM/tAS
    TS4hz2v68vK4TEkCHTGRijCBgQYDVR0RBHoweIIPd3d3LmV4YW1wbGUub3Jnggtl
    eGFtcGxlLm5ldIILZXhhbXBsZS5lZHWCC2V4YW1wbGUuY29tggtleGFtcGxlLm9y
    Z4IPd3d3LmV4YW1wbGUuY29tgg93d3cuZXhhbXBsZS5lZHWCD3d3dy5leGFtcGxl
    Lm5ldDA+BgNVHSAENzA1MDMGBmeBDAECAjApMCcGCCsGAQUFBwIBFhtodHRwOi8v
    d3d3LmRpZ2ljZXJ0LmNvbS9DUFMwDgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQG
    CCsGAQUFBwMBBggrBgEFBQcDAjCBnwYDVR0fBIGXMIGUMEigRqBEhkJodHRwOi8v
    Y3JsMy5kaWdpY2VydC5jb20vRGlnaUNlcnRHbG9iYWxHMlRMU1JTQVNIQTI1NjIw
    MjBDQTEtMS5jcmwwSKBGoESGQmh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdp
    Q2VydEdsb2JhbEcyVExTUlNBU0hBMjU2MjAyMENBMS0xLmNybDCBhwYIKwYBBQUH
    AQEEezB5MCQGCCsGAQUFBzABhhhodHRwOi8vb2NzcC5kaWdpY2VydC5jb20wUQYI
    KwYBBQUHMAKGRWh0dHA6Ly9jYWNlcnRzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEds
    b2JhbEcyVExTUlNBU0hBMjU2MjAyMENBMS0xLmNydDAMBgNVHRMBAf8EAjAAMIIB
    fQYKKwYBBAHWeQIEAgSCAW0EggFpAWcAdABOdaMnXJoQwzhbbNTfP1LrHfDgjhuN
    acCx+mSxYpo53wAAAY1b0vxkAAAEAwBFMEMCH0BRCgxPbBBVxhcWZ26a8JCe83P1
    JZ6wmv56GsVcyMACIDgpMbEo5HJITTRPnoyT4mG8cLrWjEvhchUdEcWUuk1TAHYA
    fVkeEuF4KnscYWd8Xv340IdcFKBOlZ65Ay/ZDowuebgAAAGNW9L8MAAABAMARzBF
    AiBdv5Z3pZFbfgoM3tGpCTM3ZxBMQsxBRSdTS6d8d2NAcwIhALLoCT9mTMN9OyFz
    IBV5MkXVLyuTf2OAzAOa7d8x2H6XAHcA5tIxY0B3jMEQQQbXcbnOwdJA9paEhvu6
    hzId/R43jlAAAAGNW9L8XwAABAMASDBGAiEA4Koh/VizdQU1tjZ2E2VGgWSXXkwn
    QmiYhmAeKcVLHeACIQD7JIGFsdGol7kss2pe4lYrCgPVc+iGZkuqnj26hqhr0TAN
    BgkqhkiG9w0BAQsFAAOCAQEABOFuAj4N4yNG9OOWNQWTNSICC4Rd4nOG1HRP/Bsn
    rz7KrcPORtb6D+Jx+Q0amhO31QhIvVBYs14gY4Ypyj7MzHgm4VmPXcqLvEkxb2G9
    Qv9hYuEiNSQmm1fr5QAN/0AzbEbCM3cImLJ69kP5bUjfv/76KB57is8tYf9sh5ik
    LGKauxCM/zRIcGa3bXLDafk5S2g5Vr2hs230d/NGW1wZrE+zdGuMxfGJzJP+DAFv
    iBfcQnFg4+1zMEKcqS87oniOyG+60RMM0MdejBD7AS43m9us96Gsun/4kufLQUTI
    FfnzxLutUV++3seshgefQOy5C/ayi8y1VTNmujPCxPCi6Q==
    -----END CERTIFICATE-----

@wallrj wallrj changed the title Reduce memory by excluding Helm release Secrets and standard Secret types other than tls and Opaque Reduce memory by excluding Helm release Secrets and some standard Secret types Aug 13, 2024
@wallrj wallrj changed the title Reduce memory by excluding Helm release Secrets and some standard Secret types [VC-35377] Reduce memory by excluding Helm release Secrets and some standard Secret types Aug 13, 2024
@hawksight (Contributor) commented

@wallrj I think my "objection" can be taken offline. I think my focus is on complete coverage, but it seems that we already have gaps that can be exploited and therefore it can be a separate conversation.

I was about to suggest something but see you have already done it here, you must have read my mind!

Thanks @wallrj for making the effort to have this configurable.

@wallrj (Member Author):

Generated by make update-helm-docs.

@wallrj (Member Author):

@hawksight I added ignoredSecretTypes to the Helm chart values, as you suggested.
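Hypothetically, overriding the default exclusions from the chart might then look something like this; the ignoredSecretTypes key name comes from the comment above, but the exact nesting in values.yaml is an assumption:

# values.yaml (sketch; nesting under `config` is assumed, not confirmed)
config:
  ignoredSecretTypes:
    - kubernetes.io/service-account-token
    - kubernetes.io/dockercfg
    - kubernetes.io/dockerconfigjson
    - kubernetes.io/basic-auth
    - kubernetes.io/ssh-auth
    - bootstrap.kubernetes.io/token
    - helm.sh/release.v1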

@@ -64,7 +64,7 @@ resource referenced in the `kind` for that datagatherer.
 There is an example `ClusterRole` and `ClusterRoleBinding` which can be found in
 [`./deployment/kubernetes/base/00-rbac.yaml`](./deployment/kubernetes/base/00-rbac.yaml).
 
-# Secrets
+## Secrets
@wallrj (Member Author):

Drive-by fix. This was the wrong heading level.

@@ -52,6 +54,7 @@ func (c *ConfigDynamic) UnmarshalYAML(unmarshal func(interface{}) error) error {
 	} `yaml:"resource-type"`
 	ExcludeNamespaces []string `yaml:"exclude-namespaces"`
 	IncludeNamespaces []string `yaml:"include-namespaces"`
+	FieldSelectors    []string `yaml:"field-selectors"`
@wallrj (Member Author):

It's a bit weird that the yaml field name hints are added in two places, but I decided to follow the existing pattern.
Perhaps the reason is so that for example, the configuration file can contain strings which are unmarshalled and then converted into some other rich type?

Comment on lines +86 to +94
for i, selectorString := range c.FieldSelectors {
	if selectorString == "" {
		errors = append(errors, fmt.Sprintf("invalid field selector %d: must not be empty", i))
	}
	_, err := fields.ParseSelector(selectorString)
	if err != nil {
		errors = append(errors, fmt.Sprintf("invalid field selector %d: %s", i, err))
	}
}
@wallrj (Member Author):

The selectors are validated here but the validation is not very good.
ParseSelector will happily parse strings like "type != foo" into:

fields.Requirement{
	Operator: "!=",
	Field:    "type ", // with trailing space
	Value:    " foo",  // with leading space
}

And then later the client-go Reflector will fail with an error like "unknown field: type ", but you won't be able to see the trailing space, because it's not quoted.
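A stricter check (a sketch, not what this PR ships) could inspect the parsed requirements and reject padded fields or values, quoting them so the whitespace becomes visible:

package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/fields"
)

// validateFieldSelector parses the selector, then rejects any requirement
// whose field or value carries leading/trailing whitespace.
func validateFieldSelector(s string) error {
	sel, err := fields.ParseSelector(s)
	if err != nil {
		return err
	}
	for _, r := range sel.Requirements() {
		if strings.TrimSpace(r.Field) != r.Field || strings.TrimSpace(r.Value) != r.Value {
			return fmt.Errorf("invalid field selector %q: surrounding whitespace in %q %s %q", s, r.Field, r.Operator, r.Value)
		}
	}
	return nil
}

func main() {
	// The quoted error makes the padded field "type " visible.
	fmt.Println(validateFieldSelector("type != foo"))
}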

@@ -150,7 +164,15 @@ func (c *ConfigDynamic) newDataGathererWithClient(ctx context.Context, cl dynami
 		return nil, err
 	}
 	// init shared informer for selected namespaces
-	fieldSelector := generateFieldSelector(c.ExcludeNamespaces)
+	fieldSelector := generateExcludedNamespacesFieldSelector(c.ExcludeNamespaces)
@wallrj (Member Author):

I gave this function a much more specific name to avoid any confusion about how it fits in with the user-supplied field-selectors.
Also changed the return type to fields.Selector so that the user-supplied selectors can be easily ANDed to it.

// generateFieldSelector creates a field selector string from a list of
// namespaces to exclude.
func generateFieldSelector(excludeNamespaces []string) string {
	fieldSelector := fields.Nothing()
@wallrj (Member Author):

I think there was a bug in this function. The initial fields.Nothing() caused the returned selector to always have a trailing comma, which was ugly when I appended the user-supplied field selectors.
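To make the bug concrete: fields.Nothing() serializes to an empty string, and fields.AndSelectors joins its operands' strings with commas, so the empty term leaves a dangling comma. A sketch reconstructing the before/after (the exact original code may differ):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/fields"
)

// buggy reconstructs the old shape: seeding with Nothing() means its empty
// String() gets joined in, yielding e.g. "metadata.namespace!=kube-system,".
func buggy(excludeNamespaces []string) string {
	sel := fields.Nothing()
	for _, ns := range excludeNamespaces {
		sel = fields.AndSelectors(fields.OneTermNotEqualSelector("metadata.namespace", ns), sel)
	}
	return sel.String()
}

// fixed collects only real terms and returns a fields.Selector, so callers
// can AND user-supplied selectors onto it without string surgery.
func fixed(excludeNamespaces []string) fields.Selector {
	var terms []fields.Selector
	for _, ns := range excludeNamespaces {
		if ns != "" {
			terms = append(terms, fields.OneTermNotEqualSelector("metadata.namespace", ns))
		}
	}
	return fields.AndSelectors(terms...)
}

func main() {
	fmt.Println(buggy([]string{"kube-system"})) // metadata.namespace!=kube-system,
	fmt.Println(fixed([]string{"kube-system"}).String())
}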

@@ -105,8 +105,12 @@ func sortGatheredResources(list []*api.GatheredResource) {
 func TestNewDataGathererWithClientAndDynamicInformer(t *testing.T) {
 	ctx := context.Background()
 	config := ConfigDynamic{
-		IncludeNamespaces: []string{"a"},
+		ExcludeNamespaces: []string{"kube-system"},
@wallrj (Member Author):

The validate function prevents you from having both Include and Exclude namespaces, so I removed the IncludeNamespaces in this test because I wanted to see how the exclude-namespaces field selector got merged with the user-supplied field selectors.

@@ -300,19 +342,19 @@ func TestGenerateFieldSelector(t *testing.T) {
 		ExcludeNamespaces: []string{
 			"kube-system",
 		},
-		ExpectedFieldSelector: "metadata.namespace!=kube-system,",
+		ExpectedFieldSelector: "metadata.namespace!=kube-system",
@wallrj (Member Author):

This is the trailing comma bug that I referred to in an earlier comment.

- kubernetes.io/dockercfg
- kubernetes.io/dockerconfigjson
- kubernetes.io/basic-auth
- kubernetes.io/ssh-auth,


Is the trailing comma expected here?

@wallrj (Member Author):

Fixed.

@wallrj (Member Author) commented Aug 22, 2024

Here's the difference you'll see in the rendered Helm manifests as a result of this feature:

$ diff -u \
  <(helm template oci://registry.venafi.cloud/charts/venafi-kubernetes-agent --version 0.1.49) \
  <(helm template deploy/charts/venafi-kubernetes-agent/)

Pulled: registry.venafi.cloud/charts/venafi-kubernetes-agent:0.1.49
Digest: sha256:0cc0244761fbc8fb48b3c7a94e58c4f7611b24432481f1ecf95ab4aa399033b9
--- /dev/fd/63  2024-08-22 11:27:00.292015674 +0100
+++ /dev/fd/62  2024-08-22 11:27:00.292015674 +0100
@@ -106,6 +106,14 @@
         resource-type:
           version: v1
           resource: secrets
+        field-selectors:
+        - type!=kubernetes.io/service-account-token
+        - type!=kubernetes.io/dockercfg
+        - type!=kubernetes.io/dockerconfigjson
+        - type!=kubernetes.io/basic-auth
+        - type!=kubernetes.io/ssh-auth
+        - type!=bootstrap.kubernetes.io/token
+        - type!=helm.sh/release.v1
     - kind: "k8s-dynamic"
       name: "k8s/certificates"
       config:

@maelvls (Member) left a review comment:

I've done a cursory reading of the changes and found nothing out of the ordinary.

I haven't tested the changes myself, either by running the binary locally or through the Helm chart.

@wallrj (Member Author) commented Aug 22, 2024

To give an idea of the size of Helm release secrets, I installed every version of cert-manager since 1.4 and realised that Helm does prune all but the last 10 Secrets:

curl -fsSL https://charts.jetstack.io/index.yaml \
| yq '.entries.cert-manager[] | select(.annotations["artifacthub.io/prerelease"] == "false" and .version == "v1.*" ) | .version' \
| tac  \
| while read version; do \
  helm upgrade cert-manager cert-manager \
     --repo https://charts.jetstack.io \
      --install \
      --create-namespace \
      --namespace=cert-manager \
      --set installCRDs=true \
      --wait \
      --version "${version}"; \
  done
$ kubectl get secret --field-selector=type=helm.sh/release.v1 -A
NAMESPACE      NAME                                            TYPE                 DATA   AGE
cert-manager   sh.helm.release.v1.cert-manager.v58             helm.sh/release.v1   1      46m
cert-manager   sh.helm.release.v1.cert-manager.v59             helm.sh/release.v1   1      46m
cert-manager   sh.helm.release.v1.cert-manager.v60             helm.sh/release.v1   1      45m
cert-manager   sh.helm.release.v1.cert-manager.v61             helm.sh/release.v1   1      45m
cert-manager   sh.helm.release.v1.cert-manager.v62             helm.sh/release.v1   1      45m
cert-manager   sh.helm.release.v1.cert-manager.v63             helm.sh/release.v1   1      44m
cert-manager   sh.helm.release.v1.cert-manager.v64             helm.sh/release.v1   1      44m
cert-manager   sh.helm.release.v1.cert-manager.v65             helm.sh/release.v1   1      43m
cert-manager   sh.helm.release.v1.cert-manager.v66             helm.sh/release.v1   1      43m
cert-manager   sh.helm.release.v1.cert-manager.v67             helm.sh/release.v1   1      43m
venafi         sh.helm.release.v1.venafi-kubernetes-agent.v1   helm.sh/release.v1   1      26h
venafi         sh.helm.release.v1.venafi-kubernetes-agent.v2   helm.sh/release.v1   1      3h27m
venafi         sh.helm.release.v1.venafi-kubernetes-agent.v3   helm.sh/release.v1   1      3h21m
$ kubectl get secret --field-selector=type=helm.sh/release.v1 -A -o json  | dd of=/dev/null
7437+1 records in
7437+1 records out
3807929 bytes (3.8 MB, 3.6 MiB) copied, 0.61285 s, 6.2 MB/s

@wallrj (Member Author) commented Aug 22, 2024

I've done a cursory reading of the changes and found nothing out of the ordinary.

I haven't tested the changes myself, either by running the binary locally or through the Helm chart.

Thanks @maelvls

I also rebased on top of #552 and re-ran the hack/e2e/test.sh script on the cluster where I'd installed all the cert-manager releases. It succeeded and the memory dropped from 69MiB to 54MiB as follows:

# Before
$ grep HWM /proc/$(pidof preflight)/status
VmHWM:     72944 kB

# After
$ grep HWM /proc/$(pidof preflight)/status
VmHWM:     57260 kB

@wallrj wallrj merged commit 803b953 into jetstack:master Aug 22, 2024
4 checks passed
@wallrj wallrj deleted the memory-optimization branch August 22, 2024 14:53