VC-35568: Feature: Send errors to the pod's events to increase visibility #589
Conversation
WARNING: starting with this commit, the agent must know the namespace in which its pod is running:
- If the pod has a service account mounted, the agent loads the namespace from the file /var/run/secrets/kubernetes.io/serviceaccount/namespace.
- Otherwise, you can provide the namespace using --install-namespace (meant for running the agent out-of-cluster for testing purposes).
Previously, --install-namespace was only needed when using the VenafiConnection mode (i.e., using --venafi-connection).
Signed-off-by: Richard Wall <[email protected]>
pkg/agent/run.go
Outdated
broadcaster.StartRecordingToSink(&clientgocorev1.EventSinkImpl{Interface: eventClient.CoreV1().Events(installNS)})
eventRec := broadcaster.NewRecorder(scheme, corev1.EventSource{})
eventf = func(eventType, reason, msg string, args ...interface{}) {
	eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS}}, eventType, reason, msg, args...)
I expected to see the events included in the output of kubectl describe pod -l app.kubernetes.io/name=venafi-kubernetes-agent, but they were missing.
Perhaps because this metadata does not include the pod UID?
I applied this patch, and it allowed me to see the events with kubectl describe pod:
diff --git a/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml b/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml
index cb780e7..0115c1a 100644
--- a/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml
+++ b/deploy/charts/venafi-kubernetes-agent/templates/deployment.yaml
@@ -98,6 +98,10 @@ spec:
valueFrom:
fieldRef:
fieldPath: metadata.name
+ - name: POD_UID
+ valueFrom:
+ fieldRef:
+ fieldPath: metadata.uid
{{- if .Values.metrics.enabled }}
ports:
- containerPort: 8081
diff --git a/pkg/agent/run.go b/pkg/agent/run.go
index 8877c5e..0fa8038 100644
--- a/pkg/agent/run.go
+++ b/pkg/agent/run.go
@@ -21,6 +21,7 @@ import (
corev1 "k8s.io/api/core/v1"
v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
+ "k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/kubernetes"
clientgocorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
"k8s.io/client-go/tools/record"
@@ -231,7 +232,7 @@ func newEventf(installNS string) (Eventf, error) {
broadcaster.StartRecordingToSink(&clientgocorev1.EventSinkImpl{Interface: eventClient.CoreV1().Events(installNS)})
eventRec := broadcaster.NewRecorder(scheme, corev1.EventSource{})
eventf = func(eventType, reason, msg string, args ...interface{}) {
- eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS}}, eventType, reason, msg, args...)
+ eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS, UID: types.UID(os.Getenv("POD_UID"))}}, eventType, reason, msg, args...)
}
}
Name: venafi-kubernetes-agent-7c4ccb8464-87x89
Namespace: venafi
Priority: 0
Service Account: venafi-kubernetes-agent
Node: gke-eu-cluster-1-default-pool-f9b6e891-htsp/10.154.15.205
Start Time: Fri, 11 Oct 2024 16:56:59 +0100
Labels: app.kubernetes.io/instance=venafi-kubernetes-agent
app.kubernetes.io/name=venafi-kubernetes-agent
pod-template-hash=7c4ccb8464
Annotations: kubectl.kubernetes.io/restartedAt: 2024-10-11T16:32:27+01:00
Status: Running
IP: 10.40.1.8
IPs:
IP: 10.40.1.8
Controlled By: ReplicaSet/venafi-kubernetes-agent-7c4ccb8464
Containers:
venafi-kubernetes-agent:
Container ID: containerd://60455abd6d8003a4a6e4d6a544c145472ba21a7d9a61842696385bc1b3c8fe73
Image: ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent:v1.1.0-6-g1bc78f07f48245-dirty
Image ID: ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent@sha256:37f4b7a5def2d95ffc07a77dc4d4fa3b4788131762c72b0b559124ee99657900
Port: 8081/TCP
Host Port: 0/TCP
Args:
agent
-c
/etc/venafi/agent/config/config.yaml
--venafi-connection
venafi-components
--venafi-connection-namespace
venafi
--venafi-cloud
--enable-metrics
State: Running
Started: Fri, 11 Oct 2024 16:57:04 +0100
Ready: True
Restart Count: 0
Limits:
memory: 500Mi
Requests:
cpu: 200m
memory: 200Mi
Liveness: http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAMESPACE: venafi (v1:metadata.namespace)
POD_NAME: venafi-kubernetes-agent-7c4ccb8464-87x89 (v1:metadata.name)
POD_UID: (v1:metadata.uid)
Mounts:
/etc/venafi/agent/config from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pdpg8 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: agent-config
Optional: false
kube-api-access-pdpg8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 69s default-scheduler Successfully assigned venafi/venafi-kubernetes-agent-7c4ccb8464-87x89 to gke-eu-cluster-1-default-pool-f9b6e891-htsp
Normal Pulling 68s kubelet Pulling image "ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent:v1.1.0-6-g1bc78f07f48245-dirty"
Normal Pulled 65s kubelet Successfully pulled image "ttl.sh/63773370-0bcf-4ac0-bd42-5515616089ff/images/venafi-agent:v1.1.0-6-g1bc78f07f48245-dirty" in 3.469s (3.469s including waiting). Image size: 19813240 bytes.
Normal Created 65s kubelet Created container venafi-kubernetes-agent
Normal Started 64s kubelet Started container venafi-kubernetes-agent
Warning PushingErr 59s retrying in 20.99698131s after error: post to server failed: while loading the VenafiConnection venafi/venafi-components: connection is not ready yet (building connection failed): chain element 1 (VCPOAuth) error: Post "https://api.venafi.eu.example/v1/oauth2/v2.0/9a0cab61-2b00-11ee-ba09-0733b0fe5adc/token": dial tcp: lookup api.venafi.eu.example on 34.118.224.10:53: no such host
Warning PushingErr 38s retrying in 26.917017503s after error: post to server failed: while loading the VenafiConnection venafi/venafi-components: connection is not ready yet (building connection failed): chain element 1 (VCPOAuth) error: Post "https://api.venafi.eu.example/v1/oauth2/v2.0/9a0cab61-2b00-11ee-ba09-0733b0fe5adc/token": dial tcp: lookup api.venafi.eu.example on 34.118.224.10:53: no such host
Warning PushingErr 11s retrying in 55.902139985s after error: post to server failed: while loading the VenafiConnection venafi/venafi-components: connection is not ready yet (building connection failed): chain element 1 (VCPOAuth) error: Post "https://api.venafi.eu.example/v1/oauth2/v2.0/9a0cab61-2b00-11ee-ba09-0733b0fe5adc/token": dial tcp: lookup api.venafi.eu.example on 34.118.224.10:53: no such host
Interestingly, the part of the kubectl describe output showing the environment variables does not show the UID from the downward API, even though it is supported:
I had never tried using eventRec.Eventf with a manually-crafted runtime.Object... in the past, I have always used client-go to get a runtime.Object, and then I'd use eventRec.Eventf on that object to send events.
Do we actually need POD_NAME in the case where we provide POD_UID?
Not sure. What did you find when you tested it?
Thanks @maelvls
It seems to work, but I was expecting to be able to see the events in the output of kubectl describe pod -n venafi -l app.kubernetes.io/name=venafi-kubernetes-agent.
I will push a few commits to make that work.
And I noticed a bug in the RBAC, which I've also fixed and will push.
pkg/agent/config_test.go
Outdated
// Usually, the namespace is guessed from the file
// /var/run/secrets/kubernetes.io/serviceaccount/namespace. But since we
// can't realistically set that file in tests, we pass the flag
// --install-namespace in all the tests.
I added POD_NAMESPACE to deployment.yaml.
So perhaps instead of getting the namespace from the file it could get it from the environment.
Then maybe you could just add a t.Setenv("POD_NAMESPACE", "venafi") to these tests and avoid all the churn in the test code below?
I added POD_NAMESPACE to deployment.yaml. So perhaps instead of getting the namespace from the file it could get it from the environment.
Good point, it seems like a better idea. The file-based discovery of the namespace was implicit, and I'd much prefer to clearly see what is being passed to the process in the deployment, like what you did with POD_NAMESPACE, POD_NAME, and POD_UID.
That said, I've already implemented the discovery of the namespace using the file-based approach in the Venafi Connection auth mode. I guess I should migrate that to POD_NAMESPACE too; I'll do that in a follow-up PR, though. I've fixed this in bc2958c.
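For reference, the downward-API wiring being discussed looks like this in the chart's deployment.yaml. This is an illustrative fragment matching the POD_NAMESPACE, POD_NAME, and POD_UID entries visible in the kubectl describe output above, not necessarily the chart's exact layout:

```yaml
env:
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_UID
    valueFrom:
      fieldRef:
        fieldPath: metadata.uid
```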
I thought you'd be able to use t.Setenv in the parent test scope and have it applied to all the sub-tests.
If that's not possible, then perhaps my POD_NAMESPACE suggestion was misleading.
I'm honestly not sure what the best option is.
You are correct, I could have used t.Setenv on the top-level t.
The reason I haven't done so is that I am somewhat reluctant to do that, because it would hide (by hide, I mean it wouldn't be obvious) an important input to these test cases.
Thinking about it, I guess it isn't that big of a deal if I set it on the top-level t...
I realized that I hadn't finished testing with Helm. Although the PR description mentions that the event shows up as expected, that wasn't actually the case. Thanks for pushing commits to fix this.
By the way, Olu and I maintain this unofficial API documentation for the clusterdata endpoint: API Documentation for /v1/tlspk/upload/clusterdata. At the bottom, you can see a list of error messages and how to remediate them.
Before, the namespace used to be guessed by looking up the service account's `namespace` file at /var/run/secrets/kubernetes.io/serviceaccount/namespace. Although this approach is "OK", since the agent will always have a service account token mounted to the pod, we decided that passing the namespace to the pod using an explicit env var would be better.
It should be good to go.
broadcaster.StartRecordingToSink(&clientgocorev1.EventSinkImpl{Interface: eventClient.CoreV1().Events(installNS)})
eventRec := broadcaster.NewRecorder(scheme, corev1.EventSource{Component: "venafi-kubernetes-agent", Host: os.Getenv("POD_NODE")})
eventf = func(eventType, reason, msg string, args ...interface{}) {
	eventRec.Eventf(&corev1.Pod{ObjectMeta: v1.ObjectMeta{Name: podName, Namespace: installNS, UID: types.UID(os.Getenv("POD_UID"))}}, eventType, reason, msg, args...)
Is POD_UID optional? If it is mandatory, shouldn't I disable sending events when it isn't set, similarly to what I am doing when POD_NAME is empty?
In my testing, I found that the UID could be an empty string. If it's empty, then kubectl describe pod won't show the events.
/lgtm
I haven't tested it again, but I think the changes look OK.
It looks like using the POD_NAMESPACE env var still resulted in a lot of churn, so perhaps that was bad advice.
I stumbled upon this Kubernetes Event Style guide, and the Events you're adding in this PR seem to match their guidance:
I'm somewhat unhappy with the … That said, … Regardless, I think this PR is good to go.
Ref: VC-35568
This feature aims to give the user another way of knowing when the Agent isn't sending data to the VCP API. With this PR, the user will have two ways of noticing that the Agent isn't working properly:
POSSIBLE BREAKING CHANGE 1: Previously, --install-namespace was only needed when using the VenafiConnection mode (i.e., using --venafi-connection). Starting with this PR, all modes require knowing what the pod's namespace is. For context, the agent looks for the contents of POD_NAMESPACE defined in the deployment; otherwise, it uses the contents of --install-namespace (meant for running the agent out-of-cluster for testing purposes).
Is this really a breaking change? Very low chance of breakage, since POD_NAMESPACE has been added to the Helm chart regardless of the auth mode.
POSSIBLE BREAKING CHANGE 2: Previously, the namespace would be loaded from the file /var/run/secrets/kubernetes.io/serviceaccount/namespace. Now, it is loaded from the env var POD_NAMESPACE. The Helm chart has been updated with this env var.
Is this really a breaking change? Low chance of breakage, since folks will upgrade using the updated Helm chart, which contains POD_NAMESPACE.
.Testing
There are no automated tests for this change.
I tested the "events" feature with the following:
I've used the tenant https://ven-tlspk.venafi.cloud/. To access the API key, use the user [email protected]; the password is visible in the page Production Accounts (private to Venafi). Then go to the settings and find the API key.
export APIKEY=...
I've then deployed the agent to a Kind cluster:
Now, cause an error by re-deploying with a broken client ID, for example:
helm upgrade -i -n venafi --create-namespace venafi-kubernetes-agent \
  oci://ttl.sh/mael/charts/venafi-kubernetes-agent --version 0.0.0-dev \
  --set config.clusterName="$USER temp" \
  --set config.clientId=bogus
You should see in the pod's events: