VC-35568: Feature: Send errors to the pod's event to increase visibility #589

Merged — 6 commits into master from VC-35568-events-to-pod on Oct 18, 2024

Conversation

@maelvls maelvls (Member) commented Oct 7, 2024

Ref: VC-35568

This feature aims to give the user another way of knowing when the Agent isn't sending data to the VCP API. With this PR, the user will have two ways of noticing that the Agent isn't working properly:

  • By looking at the logs (hard, since the logs are overwhelming; cf. VC-33564),
  • By looking at the Kubernetes events on the Agent pod.
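For reference, here is a minimal, self-contained sketch of the mechanism the second bullet relies on: an event broadcaster plus a recorder that attaches events to the agent's own pod via the downward-API env vars. It mirrors the wiring in pkg/agent/run.go but is not the actual code; the error string and backoff value are illustrative.

package main

import (
	"errors"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/record"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// POD_NAMESPACE, POD_NAME, and POD_UID are injected by the Helm chart
	// through the downward API.
	ns, name, uid := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME"), os.Getenv("POD_UID")

	// The broadcaster fans events out to the API server; the recorder
	// stamps them with the source component shown by `kubectl describe`.
	// In the real agent the broadcaster lives for the whole process.
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(ns),
	})
	rec := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "venafi-kubernetes-agent"})

	// Attaching the event to the pod (including its UID) is what makes it
	// show up under the pod's Events section.
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name, UID: types.UID(uid)}}
	pushErr := errors.New("post to server failed")
	rec.Eventf(pod, corev1.EventTypeWarning, "PushingErr", "retrying in %s after error: %v", "31s", pushErr)
}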

POSSIBLE BREAKING CHANGE 1: Previously, --install-namespace was only needed when using the VenafiConnection mode (i.e., using --venafi-connection). Starting with this PR, all modes need to know which namespace the pod is running in. For context, the agent reads the namespace from the POD_NAMESPACE env var defined in the deployment (previously it was looked up in /var/run/secrets/kubernetes.io/serviceaccount/namespace, which works whenever the pod has a service account mounted to it, always the case for the agent); otherwise, it uses the value of --install-namespace (meant for running the agent out-of-cluster for testing purposes).

Is this really a breaking change? Very low chance of breakage since POD_NAMESPACE has been added to the Helm chart regardless of the auth mode.

POSSIBLE BREAKING CHANGE 2: Previously, the namespace would be loaded from the file /var/run/secrets/kubernetes.io/serviceaccount/namespace. Now, it is loaded from the env var POD_NAMESPACE. The Helm chart has been updated with this env var.

Is this really a breaking change? Low chance of breakage since folks will upgrade using the updated Helm chart which will contain POD_NAMESPACE.
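Taken together, the namespace lookup after this PR boils down to roughly the following. This is a sketch only: POD_NAMESPACE and --install-namespace are from this PR, but resolveNamespace is a made-up helper name.

package agent

import (
	"errors"
	"os"
)

// resolveNamespace sketches the lookup order described above.
// POD_NAMESPACE is set by the Helm chart via the downward API; the flag
// covers out-of-cluster runs.
func resolveNamespace(installNamespaceFlag string) (string, error) {
	if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
		return ns, nil
	}
	if installNamespaceFlag != "" {
		return installNamespaceFlag, nil
	}
	return "", errors.New("namespace unknown: set POD_NAMESPACE or pass --install-namespace")
}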

Testing

There are no automated tests for this change.

I tested the "events" feature with the following:

I've used the tenant https://ven-tlspk.venafi.cloud/. To get the API key, sign in as the user [email protected]; the password is visible on the Production Accounts page (private to Venafi). Then go to the settings and find the API key.

export APIKEY=...

I then deployed the agent to a Kind cluster:

# Create a local cluster and a service account for the agent.
kind create cluster
venctl iam service-account agent create --name "$USER temp" \
  --vcp-region US \
  --output json \
  --owning-team $(curl -sS https://api.venafi.cloud/v1/teams -H "tppl-api-key: $APIKEY" | jq '.teams[0].id') \
  --output-file /tmp/agent-credentials.json \
  --api-key $APIKEY

# Build and push a dev image and Helm chart, then install the chart.
make oci-push-preflight oci_preflight_image_name=ttl.sh/mael/venafi-agent oci_preflight_image_tag=v0.0.0-dev oci_platforms=linux/arm64,linux/amd64
make helm-chart oci_preflight_image_tag=v0.0.0-dev helm_chart_version=0.0.0-dev oci_preflight_image_name=ttl.sh/mael/venafi-agent
helm push _bin/scratch/image/venafi-kubernetes-agent-0.0.0-dev.tgz oci://ttl.sh/mael/charts
helm upgrade -i -n venafi --create-namespace venafi-kubernetes-agent oci://ttl.sh/mael/charts/venafi-kubernetes-agent --version 0.0.0-dev \
  --set config.clusterName="$USER temp" --set config.clientId="$(jq -r .client_id /tmp/agent-credentials.json)"

# Provide the service account's private key to the agent.
kubectl create secret generic -n venafi agent-credentials --from-literal=privatekey.pem="$(jq -r .private_key /tmp/agent-credentials.json)" \
  --dry-run=client -o yaml | kubectl apply -f -

Now, cause an error by re-deploying with a broken client ID, for example:

helm upgrade -i -n venafi --create-namespace venafi-kubernetes-agent oci://ttl.sh/mael/charts/venafi-kubernetes-agent --version 0.0.0-dev \
  --set config.clusterName="$USER temp" --set config.clientId=bogus

You should see in the pod's events:

$ k describe -n venafi pod -l app.kubernetes.io/name=venafi-kubernetes-agent
Events:
  Type     Reason      Age   From                     Message
  ----     ------      ----  ----                     -------
  Normal   Scheduled   65s   default-scheduler        Successfully assigned venafi/venafi-kubernetes-agent-585989f47-4kj5g to kind-control-plane
  Normal   Pulled      65s   kubelet                  Container image "ttl.sh/mael/venafi-agent:v0.0.0-dev" already present on machine
  Normal   Created     65s   kubelet                  Created container venafi-kubernetes-agent
  Normal   Started     65s   kubelet                  Started container venafi-kubernetes-agent
  Warning  PushingErr  59s   venafi-kubernetes-agent  retrying in 31.071441118s after error: post to server failed: failed to execute http request to Venafi Control Plane. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"Not found"}
]
  Warning  PushingErr  28s  venafi-kubernetes-agent  retrying in 1m4.430613764s after error: post to server failed: failed to execute http request to Venafi Control Plane. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"Not found"}
]

@maelvls maelvls marked this pull request as draft October 7, 2024 13:19
@maelvls maelvls changed the title VC-35568: feature: send errors to the pod's event to increase visibility VC-35568: Feature: Send errors to the pod's event to increase visibility Oct 7, 2024
@maelvls maelvls force-pushed the VC-35568-events-to-pod branch from dcf7f5b to 5760881 on October 7, 2024 14:16
@maelvls maelvls marked this pull request as ready for review October 7, 2024 14:16
@maelvls maelvls force-pushed the VC-35568-events-to-pod branch from 5760881 to 05e8dcb on October 7, 2024 14:18
@maelvls maelvls requested a review from wallrj October 11, 2024 10:36
maelvls and others added 2 commits October 11, 2024 16:19
WARNING: starting with this commit, the Agent must know about the
namespace in which the agent's pod is running:

- If the pod has a service account mounted, then the agent will load the
  namespace from the file
  /var/run/secrets/kubernetes.io/serviceaccount/namespace.
- Otherwise, you can provide the namespace using --install-namespace
  (meant for running the agent out-of-cluster for testing purposes).

Previously, --install-namespace was only needed when using the
VenafiConnection mode (i.e., using --venafi-connection).
pkg/agent/run.go Outdated
broadcaster.StartRecordingToSink(&clientgocorev1.EventSinkImpl{Interface: eventClient.CoreV1().Events(installNS)})
eventRec := broadcaster.NewRecorder(scheme, corev1.EventSource{})
eventf = func(eventType, reason, msg string, args ...interface{}) {
	eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS}}, eventType, reason, msg, args...)
}
@wallrj wallrj (Member) commented Oct 11, 2024

I expected to see the events included in the output of kubectl describe pod -l app.kubernetes.io/name=venafi-kubernetes-agent but they were missing.
Perhaps because this metadata does not include the POD UID?

I applied this patch and it allowed me to see the events with kubectl describe pod

diff --git a/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml b/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml
index cb780e7..0115c1a 100644
--- a/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml
+++ b/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml
@@ -98,6 +98,10 @@ spec:
             valueFrom:
               fieldRef:
                 fieldPath: metadata.name
+          - name: POD_UID
+            valueFrom:
+              fieldRef:
+                fieldPath: metadata.uid
           {{- if .Values.metrics.enabled }}
           ports:
             - containerPort: 8081
diff --git a/pkg/agent/run.go b/pkg/agent/run.go
index 8877c5e..0fa8038 100644
--- a/pkg/agent/run.go
+++ b/pkg/agent/run.go
@@ -21,6 +21,7 @@ import (
        corev1 "k8s.io/api/core/v1"
        v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime"
+       "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        clientgocorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
        "k8s.io/client-go/tools/record"
@@ -231,7 +232,7 @@ func newEventf(installNS string) (Eventf, error) {
                broadcaster.StartRecordingToSink(&clientgocorev1.EventSinkImpl{Interface: eventClient.CoreV1().Events(installNS)})
                eventRec := broadcaster.NewRecorder(scheme, corev1.EventSource{})
                eventf = func(eventType, reason, msg string, args ...interface{}) {
-                       eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS}}, eventType, reason, msg, args...)
+                       eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS, UID: types.UID(os.Getenv("POD_UID"))}}, eventType, reason, msg, args...)
                }
        }

Name:             venafi-kubernetes-agent-7c4ccb8464-87x89
Namespace:        venafi
Priority:         0
Service Account:  venafi-kubernetes-agent
Node:             gke-eu-cluster-1-default-pool-f9b6e891-htsp/10.154.15.205
Start Time:       Fri, 11 Oct 2024 16:56:59 +0100
Labels:           app.kubernetes.io/instance=venafi-kubernetes-agent
                  app.kubernetes.io/name=venafi-kubernetes-agent
                  pod-template-hash=7c4ccb8464
Annotations:      kubectl.kubernetes.io/restartedAt: 2024-10-11T16:32:27+01:00
Status:           Running
IP:               10.40.1.8
IPs:
  IP:           10.40.1.8
Controlled By:  ReplicaSet/venafi-kubernetes-agent-7c4ccb8464
Containers:
  venafi-kubernetes-agent:
    Container ID:  containerd://60455abd6d8003a4a6e4d6a544c145472ba21a7d9a61842696385bc1b3c8fe73
    Image:         ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent:v1.1.0-6-g1bc78f07f48245-dirty
    Image ID:      ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent@sha256:37f4b7a5def2d95ffc07a77dc4d4fa3b4788131762c72b0b559124ee99657900
    Port:          8081/TCP
    Host Port:     0/TCP
    Args:
      agent
      -c
      /etc/venafi/agent/config/config.yaml
      --venafi-connection
      venafi-components
      --venafi-connection-namespace
      venafi
      --venafi-cloud
      --enable-metrics
    State:          Running
      Started:      Fri, 11 Oct 2024 16:57:04 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  500Mi
    Requests:
      cpu:      200m
      memory:   200Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:  venafi (v1:metadata.namespace)
      POD_NAME:       venafi-kubernetes-agent-7c4ccb8464-87x89 (v1:metadata.name)
      POD_UID:         (v1:metadata.uid)
    Mounts:
      /etc/venafi/agent/config from config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pdpg8 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      agent-config
    Optional:  false
  kube-api-access-pdpg8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason      Age   From               Message
  ----     ------      ----  ----               -------
  Normal   Scheduled   69s   default-scheduler  Successfully assigned venafi/venafi-kubernetes-agent-7c4ccb8464-87x89 to gke-eu-cluster-1-default-pool-f9b6e891-htsp
  Normal   Pulling     68s   kubelet            Pulling image "ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent:v1.1.0-6-g1bc78f07f48245-dirty"
  Normal   Pulled      65s   kubelet            Successfully pulled image "ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent:v1.1.0-6-g1bc78f07f48245-dirty" in 3.469s (3.469s including waiting). Image size: 19813240 bytes.
  Normal   Created     65s   kubelet            Created container venafi-kubernetes-agent
  Normal   Started     64s   kubelet            Started container venafi-kubernetes-agent
  Warning  PushingErr  59s                      retrying in 20.99698131s after error: post to server failed: while loading the VenafiConnection venafi/venafi-components: connection is not ready yet (building connection failed): chain element 1 (VCPOAuth) error: Post "https://api.venafi.eu.example/v1/oauth2/v2.0/9a0cab61-2b00-11ee-ba09-0733b0fe5adc/token": dial tcp: lookup api.venafi.eu.example on 34.118.224.10:53: no such host
  Warning  PushingErr  38s                      retrying in 26.917017503s after error: post to server failed: while loading the VenafiConnection venafi/venafi-components: connection is not ready yet (building connection failed): chain element 1 (VCPOAuth) error: Post "https://api.venafi.eu.example/v1/oauth2/v2.0/9a0cab61-2b00-11ee-ba09-0733b0fe5adc/token": dial tcp: lookup api.venafi.eu.example on 34.118.224.10:53: no such host
  Warning  PushingErr  11s                      retrying in 55.902139985s after error: post to server failed: while loading the VenafiConnection venafi/venafi-components: connection is not ready yet (building connection failed): chain element 1 (VCPOAuth) error: Post "https://api.venafi.eu.example/v1/oauth2/v2.0/9a0cab61-2b00-11ee-ba09-0733b0fe5adc/token": dial tcp: lookup api.venafi.eu.example on 34.118.224.10:53: no such host

Interestingly, the part of the kubectl describe output showing the environment variables does not show the UID from the downward API, but it is supported.

@maelvls maelvls (Member Author)

I had never tried using eventRec.Eventf with a manually-crafted runtime.Object... in the past, I have always used client-go to get a runtime.Object, and then I'd use eventRec.Eventf on that object to send events.

Do we actually need POD_NAME if we provide POD_UID?

@wallrj wallrj (Member)

Not sure. What did you find when you tested it?

pkg/agent/run.go Outdated (resolved)
@wallrj wallrj (Member) left a comment

Thanks @maelvls

It seems to work, but I was expecting to be able to see the events in the output of kubectl describe pod -n venafi -l app.kubernetes.io/name=venafi-kubernetes-agent.

I will push a few commits to make that work.

And I noticed a bug in the RBAC, which I've also fixed and will push.

@wallrj wallrj force-pushed the VC-35568-events-to-pod branch from 05e8dcb to c1b54d3 on October 11, 2024 16:28
@wallrj wallrj force-pushed the VC-35568-events-to-pod branch from c1b54d3 to 601b594 on October 11, 2024 16:28
// Usually, the namespace is guessed from the file
// /var/run/secrets/kubernetes.io/serviceaccount/namespace. But since we
// can't realistically set that file in tests, we pass the flag
// --install-namespace in all the tests.
@wallrj wallrj (Member)

I added POD_NAMESPACE to deployment.yaml.
So perhaps instead of getting the namespace from the file it could get it from the environment.
Then maybe you could just add a t.Setenv("POD_NAMESPACE", "venafi") to these tests and avoid all the churn in the test code below?
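For illustration, a minimal sketch of that suggestion; the test and sub-test names are made up. t.Setenv applies to the test and its sub-tests and restores the previous value when they finish.

package agent

import (
	"os"
	"testing"
)

func TestGetConfiguration(t *testing.T) {
	// Setenv is restored automatically at the end of the test. Tests
	// that use Setenv must not also call t.Parallel.
	t.Setenv("POD_NAMESPACE", "venafi")

	t.Run("namespace comes from the environment", func(t *testing.T) {
		if got := os.Getenv("POD_NAMESPACE"); got != "venafi" {
			t.Fatalf("POD_NAMESPACE = %q, want %q", got, "venafi")
		}
	})
}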

@maelvls maelvls (Member Author) commented Oct 14, 2024

I added POD_NAMESPACE to deployment.yaml. So perhaps instead of getting the namespace from the file it could get it from the environment.

Good point, that seems like a better idea. This file-based discovery of the namespace was implicit, and I'd very much prefer to clearly see what is being passed to the process in the deployment, like what you did with POD_NAMESPACE, POD_NAME, and POD_UID.

That said, I've already implemented the discovery of the namespace using the file-based approach in the Venafi Connection auth mode. I guess I should migrate that to using POD_NAMESPACE too. I'll do that in a follow-up PR, though. I've fixed this in bc2958c.

@wallrj wallrj (Member)

I thought you'd be able to use t.Setenv in the parent test scope and have it applied to all the sub-tests.
If that's not possible then perhaps my POD_NAMESPACE suggestion was misleading.
I'm honestly not sure what is the best option.

@maelvls maelvls (Member Author)

You are correct, I could have used t.Setenv on the top-level t.

The reason I haven't done so is that I am somewhat reluctant to do that: it would hide (by hide, I mean it wouldn't be obvious) an important input to these test cases.

Thinking about it, I guess it isn't that big of a deal if I set it on the top-level t...

@maelvls maelvls (Member Author) commented Oct 14, 2024

I expected to see the events included in the output of kubectl describe pod -l app.kubernetes.io/name=venafi-kubernetes-agent but they were missing. Perhaps because this metadata does not include the POD UID?

I realized that I hadn't finished testing with Helm. Although the PR description mentions that the event shows up as expected, that wasn't actually the case. Thanks for pushing commits to fix this.

@maelvls maelvls (Member Author) commented Oct 14, 2024

By the way, Olu and I maintain this unofficial API documentation for the clusterdata endpoint: API Documentation for /v1/tlspk/upload/clusterdata. At the bottom, you can see a list of error messages and how to remediate them.

Before, the namespace used to be guessed by looking up the service
account's `namespace` file at

  /var/run/secrets/kubernetes.io/serviceaccount/namespace

Although this way is "OK" since the agent will always have a service
account token mounted to the pod, we decided that passing the namespace
to the pod using an explicit env var would be better.
@maelvls maelvls (Member Author) commented Oct 14, 2024

  • I moved away from using /var/run/secrets/kubernetes.io/serviceaccount/namespace to load the namespace. The namespace is now loaded from POD_NAMESPACE.
  • I finished testing (see instructions in description) and was able to show that the events feature works.

It should be good to go.

broadcaster.StartRecordingToSink(&clientgocorev1.EventSinkImpl{Interface: eventClient.CoreV1().Events(installNS)})
eventRec := broadcaster.NewRecorder(scheme, corev1.EventSource{Component: "venafi-kubernetes-agent", Host: os.Getenv("POD_NODE")})
eventf = func(eventType, reason, msg string, args ...interface{}) {
	eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS, UID: types.UID(os.Getenv("POD_UID"))}}, eventType, reason, msg, args...)
}
@maelvls maelvls (Member Author)

Is POD_UID optional? If it is mandatory, shouldn't I disable sending events when it isn't set, similar to what I am doing when POD_NAME is empty?

@wallrj wallrj (Member)

In my testing, I found that the UID could be an empty string. If it's an empty string, then kubectl describe pod won't show the events.
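A sketch of such a guard follows; guardEventf is a made-up helper name, while the Eventf signature matches the one in pkg/agent/run.go. The idea is to fall back to a no-op recorder when the downward-API values needed to attach events to the pod are missing.

package agent

import (
	"log"
	"os"
)

// Eventf matches the signature used in pkg/agent/run.go.
type Eventf func(eventType, reason, msg string, args ...interface{})

// guardEventf (hypothetical) swaps in a no-op recorder when POD_NAME or
// POD_UID is missing, since an event attached to a pod with an empty UID
// is recorded but not shown by `kubectl describe pod`.
func guardEventf(eventf Eventf) Eventf {
	if os.Getenv("POD_NAME") == "" || os.Getenv("POD_UID") == "" {
		log.Print("POD_NAME or POD_UID is not set; pod events are disabled")
		return func(eventType, reason, msg string, args ...interface{}) {}
	}
	return eventf
}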

@wallrj wallrj (Member) left a comment

/lgtm

I haven't tested it again, but I think the changes look OK.
It looks like using the POD_NAMESPACE env var still resulted in a lot of churn,
so perhaps that was bad advice.

@wallrj wallrj (Member) commented Oct 16, 2024

I stumbled upon the Kubernetes Event Style guide, and the events you're adding in this PR seem to match its guidance.

@maelvls maelvls (Member Author) commented Oct 18, 2024

I'm somewhat unhappy with the if os.Getenv("POD_NAME") condition since it might not be enough (we might also want to make sure other env vars are set, such as POD_UID).

That said, POD_NAME seems to be the only mandatory field needed on the object for the event to be successfully received. However, I'm unsure which fields of the object need to be present for the event to be "correct".

Regardless, I think this PR is good to go.

@maelvls maelvls merged commit 3e33b61 into master Oct 18, 2024
2 checks passed
@maelvls maelvls deleted the VC-35568-events-to-pod branch October 18, 2024 19:30