Is Slack integration actually working? #66

Closed · kubebn opened this issue Jan 22, 2024 · 33 comments · Fixed by #69

Labels: enhancement (New feature or request)

kubebn commented Jan 22, 2024

Hi, I have set values this way:

aks-node-termination-handler:
  image: image
  imagePullPolicy: Always

  args:
    - "-webhook.url=https://myhook"
    - "-webhook.template='node_termination_event{node=\"{{ .Node }}\"} 1'"
  env: []

  priorityClassName: "system-node-critical"
k get pod
NAME                                 READY   STATUS    RESTARTS   AGE
aks-node-termination-handler-4r776   1/1     Running   0          8m16s
aks-node-termination-handler-g6x25   1/1     Running   0          8m16s
aks-node-termination-handler-ncccj   1/1     Running   0          8m16s
aks-node-termination-handler-tgrzr   1/1     Running   0          8m16s
aks-node-termination-handler-wc2kf   1/1     Running   0          8m17s
aks-node-termination-handler-xc6dt   1/1     Running   0          8m16s
---
k get pod aks-node-termination-handler-tgrzr -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
...
  containers:
  - args:
    - -webhook.url=https://myhook
    - -webhook.template='node_termination_event{node="{{ .Node }}"} 1'
    env:
    - name: MY_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName

In the logs I am getting:

aks-node-termination-handler-r49nz aks-node-termination-handler {"error":"error in sending to webhook: StatusCode=400: http result not OK","file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:140","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.readEndpoint","level":"error","msg":"error in alerts.Send","time":"2024-01-22T13:47:52Z"}

When I send a message to the channel via curl, it succeeds:

curl -X POST --data-urlencode "payload={\"channel\": \"#mychannel\", \"username\": \"webhookbot\", \"text\": \"This is posted to #my-channel-here and comes from a bot named webhookbot.\", \"icon_emoji\": \":ghost:\"}" https://myhoook
ok%
maksim-paskal (Owner) commented

Hi @kubebn, please try passing your actual Slack payload to your pods via the -webhook.template flag, for example:

k get pod aks-node-termination-handler-tgrzr -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
...
  containers:
  - args:
    - -webhook.url=https://myhook
    - '-webhook.template="{\"channel\": \"#mychannel\", \"username\": \"webhookbot\", \"text\": \"This is posted to #mychannel-here and comes from a bot named webhookbot.\", \"icon_emoji\": \":ghost:\"}"'
    env:
    - name: MY_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName

This change will make a POST request to -webhook.url=https://myhook, using the content of -webhook.template as the payload.

I haven't tested it, but I think it should work. Please let me know if it helps.

kubebn commented Jan 22, 2024

> Hi @kubebn, please try passing your actual Slack payload to your pods via the -webhook.template flag […]

I tried like this:

#values.yaml
  args:
    - '-webhook.url=https://myhook'
    - >
      '-webhook.template="payload={
        "channel": "#mychannel",
        "username": "webhookbot",
        "text": "This is posted to #my-channel-here and comes from a bot named webhookbot.",
        "icon_emoji": ":ghost:"
      }"'
---
  containers:
  - args:
    - -webhook.url=https://myhook
    - '-webhook.template="payload={ "channel": #mychannel", "username":
      webhookbot", "text": "This is posted to #my-channel-here and comes from a bot
      named webhookbot.", "icon_emoji": ":ghost:" }"'
    env:
    - name: MY_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName

It's not working.

aks-node-termination-handler-v544h aks-node-termination-handler {"error":"error in sending to webhook: StatusCode=400: http result not OK","file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:140","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.readEndpoint","level":"error","msg":"error in alerts.Send","time":"2024-01-22T17:02:01Z"}

maksim-paskal (Owner) commented

OK, let's try a simple message. Do exactly as in the YAML below:

apiVersion: v1
kind: Pod
metadata:
  annotations:
...
  containers:
  - args:
    - -webhook.url=https://hooks.slack.com/services/T0000000/B0000000000/000000000000000
    - '-webhook.template={"text": "test message"}'

You need -webhook.url to point to your Slack webhook URL, and for the second argument you need exactly this value:

'-webhook.template={"text": "test message"}'

You don't need to add anything else to -webhook.template (such as a payload= prefix): just JSON with a single text element, wrapped in single quotation marks.
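
As a sanity check, the same payload can be sent to the webhook directly with curl. Slack incoming webhooks accept a plain JSON body with a Content-type: application/json header (the URL below is the placeholder from above):

curl -X POST \
  -H 'Content-type: application/json' \
  --data '{"text": "test message"}' \
  https://hooks.slack.com/services/T0000000/B0000000000/000000000000000
# prints "ok" on success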

kubebn commented Jan 22, 2024

    spec:
      containers:
      - args:
        - -webhook.url=https://myhook
        - '-webhook.template={"text": test message"}'

That actually worked, so basically it's a matter of what is inside the message rather than the hook itself. Thank you.

However, given that the node_termination_event{node="{{ .Node }}"} 1 template does not work, is there any way to use variables in webhook.template at all?

For example, with the AWS spot instance handler, we use webhookTemplate to see which nodes, AZs, instance IDs, and pods are about to be evicted:

  webhookTemplate: |-
    {
          "fields": [
            {
              "title": "Node",
              "value": "{{ .NodeName }}",
              "short": true
            },
            {
              "title": "InstanceType",
              "value": "{{ .InstanceType }}",
              "short": true
            },
            {
              "title": "AvailabilityZone",
              "value": "{{ .AvailabilityZone }}",
              "short": true
            },
            {
              "title": "InstanceID",
              "value": "{{ .InstanceID }}",
              "short": true
            },
            {
              "title": "Pods",
              "value": "{{ .Pods }}",
              "short": true
            }
          ]
    }

maksim-paskal added the enhancement (New feature or request) label on Jan 23, 2024
maksim-paskal self-assigned this on Jan 23, 2024
maksim-paskal (Owner) commented

@kubebn yes, it's an interesting idea for aks-node-termination-handler; I will try to implement these changes.

I've created dev changes that should close your issue. Please help me test these new features. You can now compose your payload as a file, and you can use these variables:

type MessageType struct {
    Event        types.ScheduledEventsEvent
    Template     string
    NewLine      string            // Used to make new lines in templating results. Read-only.
    NodeLabels   map[string]string // metadata.labels
    NodeName     string            // metadata.name
    InstanceType string            // node.kubernetes.io/instance-type
    NodeArch     string            // kubernetes.io/arch
    NodeOS       string            // kubernetes.io/os
    NodeRole     string            // kubernetes.io/role
    NodeRegion   string            // topology.kubernetes.io/region
    NodeZone     string            // topology.kubernetes.io/zone
    AzureMeta    map[string]string // kubernetes.azure.com/*
}
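
For intuition on how the payload file is used: the variables above suggest the file is rendered with Go's standard text/template package against this struct. A minimal standalone sketch (not the handler's actual code; the struct subset and values below are illustrative only):

package main

import (
	"os"
	"text/template"
)

// message mirrors a subset of the MessageType fields listed above.
type message struct {
	NodeName     string
	InstanceType string
	NodeRegion   string
}

func main() {
	// Same {{ .Field }} syntax as in slack-config.json below.
	tmpl := template.Must(template.New("payload").Parse(
		`{"text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}"}`))

	// Hypothetical values, for illustration only.
	m := message{NodeName: "aks-spot-123", InstanceType: "Standard_D4s_v3", NodeRegion: "westeurope"}
	if err := tmpl.Execute(os.Stdout, m); err != nil {
		panic(err)
	}
}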

Please follow these instructions (change -webhook.url to your actual URL):

# create request json for Slack, file can be templated
cat <<EOF | tee slack-config.json
{
  "channel": "#mychannel",
  "username": "webhookbot",
  "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}",
  "icon_emoji": ":ghost:"
}
EOF

# create configmap
kubectl -n kube-system create configmap aks-node-termination-handler-files --from-file=slack-config.json

# install/upgrade helm chart
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
https://github.com/maksim-paskal/aks-node-termination-handler/releases/download/v1.0.9/f095104.tgz \
--set priorityClassName=system-node-critical \
--set image=paskalmaksim/aks-node-termination-handler:dev \
--set imagePullPolicy=Always \
--set configMap.create=false \
--set configMap.name=aks-node-termination-handler-files \
--set 'args[1]=-webhook.url=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX' \
--set 'args[2]=-webhook.template-file=/files/slack-config.json'
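
To double-check the rendered flags before a real eviction, you can inspect the DaemonSet pods; the app label below is an assumption based on the pod-delete command used later in this thread:

# list handler pods (label assumed)
kubectl -n kube-system get pods -l app=aks-node-termination-handler
# print the container args of one pod
kubectl -n kube-system get pod <pod-name> -o jsonpath='{.spec.containers[0].args}'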

kubebn commented Jan 23, 2024

> kubectl -n kube-system create configmap aks-node-termination-handler-files --from-file=slack-config.json

Tried it. The message did not arrive in Slack. Likewise, there were no webhook error logs in the pods.

cat slack-config.json
{
    "channel": "#aks-spot-termination",
    "username": "spot-terminator",
    "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}",
    "icon_emoji": ":ghost:"
}%
---
kubectl -n daemonsets create configmap aks-node-termination-handler-files --from-file=slack-config.json
configmap/aks-node-termination-handler-files created
---
helm upgrade aks-node-termination-handler \
--install \
--namespace daemonsets \
https://github.com/maksim-paskal/aks-node-termination-handler/releases/download/v1.0.9/f095104.tgz \
--set priorityClassName=system-node-critical \
--set image=paskalmaksim/aks-node-termination-handler:dev \
--set imagePullPolicy=Always \
--set configMap.create=false \
--set configMap.name=aks-node-termination-handler-files \
--set 'args[1]=-webhook.url=https://myhook' \
--set 'args[2]=-webhook.template-file=/files/slack-config.json'

Release "aks-node-termination-handler" does not exist. Installing it now.
NAME: aks-node-termination-handler
LAST DEPLOYED: Tue Jan 23 18:10:40 2024
NAMESPACE: daemonsets
STATUS: deployed
REVISION: 1
TEST SUITE: None
---
k get po aks-node-termination-handler-6bw7w -o yaml

  containers:
  - args:
    - ""
    - -webhook.url=https://myhook
    - -webhook.template-file=/files/slack-config.json
    env:
    - name: MY_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
...
    volumeMounts:
    - mountPath: /files
      name: files
      readOnly: true
  volumes:
  - configMap:
      defaultMode: 420
      name: aks-node-termination-handler-files
    name: files
---
aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:107","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.readEndpoint","level":"info","msg":"{\"DocumentIncarnation\":1,\"Events\":[{\"EventId\":\"0904504D-A41D-48B0-BC16-74037597BAB7\",\"EventStatus\":\"Scheduled\",\"EventType\":\"Preempt\",\"ResourceType\":\"VirtualMachine\",\"Resources\":[\"aks-lmd8spot1e4d-22030174-vmss_22\"],\"NotBefore\":\"Tue, 23 Jan 2024 18:12:49 GMT\",\"Description\":\"\",\"EventSource\":\"Platform\",\"DurationInSeconds\":-1}]}","time":"2024-01-23T18:12:32Z"}
aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:62","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode","level":"info","msg":"Draining node aks-lmd8spot1e4d-22030174-vmss00000m","time":"2024-01-23T18:12:32Z"}

P.S. Do you think there would be a possibility of adding pod names which are going to be gracefully evicted as well?

aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:85","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode.func1","level":"info","msg":"evicting pod monitoring/pushgateway-cleaner-28433850-gpslv\n","time":"2024-01-23T18:12:35Z"}
aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:85","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode.func1","level":"info","msg":"evicting pod devx/docker-cache-registry-docker-live-system-warmer-28433723-2877x\n","time":"2024-01-23T18:12:35Z"}
aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:85","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode.func1","level":"info","msg":"evicting pod monitoring/prometheus-prometheus-devops-0\n","time":"2024-01-23T18:12:35Z"}
aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:85","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode.func1","level":"info","msg":"evicting pod kafka/kafka-cluster-zookeeper-0\n","time":"2024-01-23T18:12:35Z"}
aks-node-termination-handler-kvqwq aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:85","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode.func1","level":"info","msg":"evicting pod kafka/kafka-cluster-kafka-0\n","time":"2024-01-23T18:12:35Z"}
...

maksim-paskal (Owner) commented

I think the problem is with the helm installation: --set 'args[X]=...' must start at zero. Please run the helm installation again with these commands (you must have already created the ConfigMap with the actual payload, as I mentioned in my previous message):

# install/upgrade helm chart
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
https://github.com/maksim-paskal/aks-node-termination-handler/releases/download/v1.0.9/f095104.tgz \
--set priorityClassName=system-node-critical \
--set image=paskalmaksim/aks-node-termination-handler:dev \
--set imagePullPolicy=Always \
--set configMap.create=false \
--set configMap.name=aks-node-termination-handler-files \
--set 'args[0]=-webhook.url=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX' \
--set 'args[1]=-webhook.template-file=/files/slack-config.json'

> P.S. Do you think there would be a possibility of adding pod names which are going to be gracefully evicted as well?

Do you want the Slack message to include all pod names that were on the node during the drain?

kubebn commented Jan 23, 2024

> I think the problem is with the helm installation: --set 'args[X]=...' must start at zero […]

Hi, yes, you are right: the wrong args were in place. It works fine now.

[screenshot: Slack notification received]

> Do you want the Slack message to include all pod names that were on the node during the drain?

Yes, we use it for visibility. If there is ever a question like "why was this pod deleted", we can see that it was evicted because the node was a spot instance and that this was expected. Example:

[screenshot: example Slack notification]

maksim-paskal (Owner) commented

Yes, it's possible. I have added .NodePods to the variables; it will output the pods in the payload as [pod1 pod2 ...]. This variable contains all pods that were running on the node during the drain.

type MessageType struct {
    Event        types.ScheduledEventsEvent
    Template     string
    NewLine      string            // Used to make new lines in templating results. Read-only.
    NodeLabels   map[string]string // metadata.labels
    NodeName     string            // metadata.name
    InstanceType string            // node.kubernetes.io/instance-type
    NodeArch     string            // kubernetes.io/arch
    NodeOS       string            // kubernetes.io/os
    NodeRole     string            // kubernetes.io/role
    NodeRegion   string            // topology.kubernetes.io/region
    NodeZone     string            // topology.kubernetes.io/zone
    NodePods     []string          // list of pods on the node
}

Please test; you will need to modify your payload:

# create request json for Slack, file can be templated
cat <<EOF | tee slack-config.json
{
  "channel": "#mychannel",
  "username": "webhookbot",
  "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}, pods {{ .NodePods }}",
  "icon_emoji": ":ghost:"
}
EOF

# delete current configmap
kubectl -n kube-system delete configmap aks-node-termination-handler-files

# create configmap
kubectl -n kube-system create configmap aks-node-termination-handler-files --from-file=slack-config.json

# restart all pods to apply new payload
kubectl -n kube-system delete pods -lapp=aks-node-termination-handler

kubebn commented Jan 24, 2024

> delete pods -lapp=aks-node-termination-handler

Recreated configmap with:

cat slack-config.json
{
    "channel": "#aks-spot-termination",
    "username": "spot-terminator",
    "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}, pods {{ .NodePods }}",
    "icon_emoji": ":ghost:"
}%

deleted pods:

k delete pod --all
pod "aks-node-termination-handler-2m29f" deleted
pod "aks-node-termination-handler-57ttj" deleted
pod "aks-node-termination-handler-g9qxj" deleted
pod "aks-node-termination-handler-hk86b" deleted
pod "aks-node-termination-handler-nln4q" deleted
pod "aks-node-termination-handler-rr4z7" deleted

Logs:

aks-node-termination-handler-rvxbp aks-node-termination-handler {"error":"error in client.Do: Post \"https://myhook\": error making roundtrip: context deadline exceeded","file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events/events.go:174","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/events.sendEvent","level":"error","msg":"error in webhook.SendWebHook","time":"2024-01-24T09:30:40Z"}

P.S. It would also be nice to add "kubernetes.azure.com/cluster" so the cluster name can be included in the text.

maksim-paskal (Owner) commented

The error says that the webhook address is incorrect; please reinstall the chart with the correct webhook address (#66 (comment)).

I will add .ClusterName next week.

kubebn commented Jan 24, 2024

> The error says that the webhook address is incorrect; please reinstall the chart with the correct webhook address […]

Hi Maksim,

I have already reinstalled it multiple times, and I am using the same webhook as before :) I manually redacted it to "myhook" in my messages. The logs show the correct webhook is set:

{"error":"error in client.Do: Post \"https://hooks.slack.com/services/DDDDD/DDDDD/DDDDD\": error making roundtrip: context deadline exceeded"`

Is it possible that the webhook and the Slack text can't handle that many pods in the message?

NodePods:[csi-secrets-store-provider-azure-9zrd6 secrets-store-csi-driver-895j6 aks-node-termination-handler-rtwlv shared-os-cluster-ingest-68b6c8fd9d-vpp5v nvidia-device-plugin-daemonset-x7w8v kafka-cluster-kafka-0 kafka-cluster-zookeeper-0 azure-ip-masq-agent-kgc6l cloud-node-manager-sthz9 csi-azuredisk-node-b295m csi-azurefile-node-9gh8z kube-proxy-9l9fd microsoft-defender-collector-ds-crkkf microsoft-defender-publisher-ds-tm87n cost-analyzer-network-costs-jvf6s kyverno-cleanup-admission-reports-28434960-82tmx kyverno-cleanup-cluster-admission-reports-28434960-hhsjs filebeat-k8s-filebeat-6qccd alertmanager-alertmanager-devops-0 grafana-agent-ebpf-pk27s grafana-mimir-alertmanager-1 metrics-thanos-shard-9-0 prometheus-node-exporter-999fv prometheus-prometheus-devops-0 prometheus-prometheus-k8s-0 prometheus-prometheus-nginx-1 prometheus-pushgateway-0 node-agent-h4tm6]}

maksim-paskal (Owner) commented

It seems the request to Slack is taking longer than 5 seconds (the default timeout; I will raise this default to 30s soon) and it errored. You can raise this limit by adding a new value to your chart installation:

...
--set 'args[2]=-webhook.timeout=30s'

kubebn commented Jan 24, 2024

> It seems the request to Slack is taking longer than 5 seconds (the default timeout) and it errored. You can raise this limit […]

Hi Maksim,

The timeout helped and the message was sent. Thanks for your help.

Am I right in saying that if I use {{ .NewLine }}, I can add line breaks to the message and make it look nicer?


maksim-paskal (Owner) commented

Yes, compose your Slack payload as best you can and test it. Please share that payload with me; I will add it to the README as an example. Slack has rich functionality for composing messages; see https://api.slack.com/reference/surfaces/formatting#advanced

I don't think you need {{ .NewLine }}; I use this marker only when templates are loaded from a string. Since you load the payload from a file, this option is not for your use case.
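
That said, if you want one pod per line in the file-based payload, standard Go template syntax should be enough. A sketch, assuming the file is rendered with text/template as above; the \n here is a JSON string escape, so the rendered output stays valid JSON:

{
  "text": "*Evicted Pods:*\n{{ range .NodePods }}{{ . }}\n{{ end }}"
}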

kubebn commented Jan 24, 2024

> Yes, compose your Slack payload as best you can and test it […] I don't think you need {{ .NewLine }} […]

{
  "channel": "#aks-spot-termination",
  "username": "spot-terminator",
  "text": "Test",
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "Spot Instance Eviction"
      }
    },
    {
      "type": "section",
      "fields": [
        {
          "type": "mrkdwn",
          "text": "*Node Name*\n{{ .NodeName }}"
        },
        {
          "type": "mrkdwn",
          "text": "*Instance Type:*\n{{ .InstanceType }}"
        }
      ]
    },
    {
      "type": "section",
      "fields": [
        {
          "type": "mrkdwn",
          "text": "*Zone:*\n{{ .NodeZone }}"
        },
        {
          "type": "mrkdwn",
          "text": "*Cluster Name:*\n{{ .ClusterName }}"
        }
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Evicted Pods:*\n{{ .NodePods }}"
      }
    }
  ],
  "icon_emoji": ":k8s:"
}

Looks like this:

[screenshot: rendered Slack message]

The cluster name shows as NodeZone on the screenshot. @maksim-paskal, do you think it would be possible to exclude DaemonSets from the .NodePods info? It seems their info can be taken from the drain output. Thanks

aks-node-termination-handler-4kvjf aks-node-termination-handler {"file":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api/api.go:85","func":"github.com/maksim-paskal/aks-node-termination-handler/pkg/api.DrainNode.func1","level":"info","msg":"WARNING: ignoring DaemonSet-managed Pods: csi/csi-secrets-store-provider-azure-plxkk, csi/secrets-store-csi-driver-mq8wb, daemonsets/aks-node-termination-handler-4kvjf, gpu-resources/nvidia-device-plugin-daemonset-l7bgx, kube-system/azure-ip-masq-agent-rkt6m, kube-system/cloud-node-manager-4gkw2, kube-system/csi-azuredisk-node-gqbsm, kube-system/csi-azurefile-node-nzp4k, kube-system/kube-proxy-8m8gp, kube-system/microsoft-defender-collector-ds-6j5tq, kube-system/microsoft-defender-publisher-ds-ht9zt, kubecost/cost-analyzer-network-costs-8qtw8, logging/filebeat-k8s-filebeat-glj5k, monitoring/grafana-agent-ebpf-sdq4k, monitoring/prometheus-node-exporter-526sv, velero/node-agent-lwnh4\n","time":"2024-01-24T15:30:58Z"}

maksim-paskal (Owner) commented

Good catch, thanks. Yes, I will remove DaemonSet pods from this list next week.

My TODO list for next release:

  1. Add .ClusterName
  2. .NodePods list pods without daemonsets
  3. Increase default webhook timeout to 30s
  4. Update README for Slack integration

I will try to release these new features next week.

maksim-paskal commented Jan 30, 2024

@kubebn these changes have been released; I recommend you delete your current installation and move to the stable releases:

helm repo add aks-node-termination-handler https://maksim-paskal.github.io/aks-node-termination-handler/
helm repo update

# delete old configmap
kubectl -n kube-system delete configmap aks-node-termination-handler-files

# I recommend using values.yaml rather than creating the ConfigMap manually
# https://github.com/maksim-paskal/aks-node-termination-handler?tab=readme-ov-file#send-notification-events
cat <<EOF | tee values.yaml
priorityClassName: system-node-critical

args:
- -webhook.url=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
- -webhook.template-file=/files/slack-payload.json
- -webhook.contentType=application/json
- -webhook.method=POST
- -webhook.timeout=30s

configMap:
  data:
    slack-payload.json: |
      {
        "channel": "#mychannel",
        "username": "webhookbot",
        "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}",
        "icon_emoji": ":ghost:"
      }
EOF

# install/upgrade helm chart
helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--values values.yaml
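
After the upgrade, the applied values and rendered container args can be double-checked with standard helm/kubectl commands (the release and DaemonSet names below are assumed from this thread):

# show user-supplied values for the release
helm -n kube-system get values aks-node-termination-handler
# print the rendered container args (DaemonSet name assumed)
kubectl -n kube-system get daemonset aks-node-termination-handler \
  -o jsonpath='{.spec.template.spec.containers[0].args}'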

kubebn commented Jan 31, 2024

Hi @maksim-paskal,

I have set these values:

priorityClassName: system-node-critical

labels:
  Product: DevOps
  ProductComponents: aks-node-termination-handler

metrics:
  addAnnotations: false

securityContext:
  runAsNonRoot: true
  privileged: false
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  windowsOptions:
    runAsUserName: "ContainerUser"

tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"

nodeSelector: {}
# if you want handle events only from spot instances
# nodeSelector:
#   kubernetes.azure.com/scalesetpriority: spot

resources:
  limits:
    cpu: 20m
    memory: 100Mi
  requests:
    cpu: 20m
    memory: 100Mi

args:
- -webhook.url=https://hooks
- -webhook.template-file=/files/slack-payload.json
- -webhook.contentType=application/json
- -webhook.method=POST
- -webhook.timeout=30s

configMap:
  create: true
  name: aks-node-termination-handler-files
  mountPath: /files
  data:
    slack-payload.json: |
      {
          "channel": "#aks-spot-termination",
          "username": "spot-terminator",
          "text": "Test",
        "blocks": [
          {
            "type": "header",
            "text": {
              "type": "plain_text",
              "text": "Spot Instance Eviction"
            }
          },
          {
            "type": "section",
            "fields": [
              {
                "type": "mrkdwn",
                "text": "*Node Name*\n{{ .NodeName }}"
              },
              {
                "type": "mrkdwn",
                "text": "*Instance Type:*\n{{ .InstanceType }}"
              }
            ]
          },
          {
            "type": "section",
            "fields": [
              {
                "type": "mrkdwn",
                "text": "*Zone:*\n{{ .NodeZone }}"
              },
              {
                "type": "mrkdwn",
                "text": "*Cluster Name:*\n{{ .ClusterName }}"
              }
            ]
          },
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*Evicted Pods:*\n{{ .NodePods }}"
            }
          }
        ],
          "icon_emoji": ":k8s:"
      }

Simulated an eviction; there are no webhook error logs, but no message in Slack either: https://paste.openstack.org/show/brLN9Az0Iqgr0MCcYFxf/

ConfigMap seems to be fine:

~/temp/aks-node-termination-handler main* 3m 51s ❯ k get pod
NAME                                 READY   STATUS    RESTARTS   AGE
aks-node-termination-handler-f6hs6   1/1     Running   0          5m34s
aks-node-termination-handler-lthbh   1/1     Running   0          5m35s
aks-node-termination-handler-mxb5s   1/1     Running   0          5m34s
aks-node-termination-handler-nkhq5   1/1     Running   0          116s
aks-node-termination-handler-s8krx   1/1     Running   0          5m34s
aks-node-termination-handler-s8zsk   1/1     Running   0          78s
aks-node-termination-handler-xgwc6   1/1     Running   0          5m34s
~/temp/aks-node-termination-handler main* ❯ k get cm
NAME                                 DATA   AGE
aks-node-termination-handler-files   1      5m39s
kube-root-ca.crt                     1      8d
~/temp/aks-node-termination-handler main* ❯ k exec -it aks-node-termination-handler-f6hs6 ls /files
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
slack-payload.json
~/temp/aks-node-termination-handler main* ❯ k get cm aks-node-termination-handler-files -o yaml
apiVersion: v1
data:
  slack-payload.json: |
    {
        "channel": "#aks-spot-termination",
        "username": "spot-terminator",
        "text": "Test",
      "blocks": [
        {
          "type": "header",
          "text": {
            "type": "plain_text",
            "text": "Spot Instance Eviction"
          }
        },
        {
          "type": "section",
          "fields": [
            {
              "type": "mrkdwn",
              "text": "*Node Name*\n{{ .NodeName }}"
            },
            {
              "type": "mrkdwn",
              "text": "*Instance Type:*\n{{ .InstanceType }}"
            }
          ]
        },
        {
          "type": "section",
          "fields": [
            {
              "type": "mrkdwn",
              "text": "*Zone:*\n{{ .NodeZone }}"
            },
            {
              "type": "mrkdwn",
              "text": "*Cluster Name:*\n{{ .ClusterName }}"
            }
          ]
        },
        {
          "type": "section",
          "text": {
            "type": "mrkdwn",
            "text": "*Evicted Pods:*\n{{ .NodePods }}"
          }
        }
      ],
        "icon_emoji": ":k8s:"
    }
kind: ConfigMap
---
k get po aks-node-termination-handler-f6hs6 -o yaml

  containers:
  - args:
    - -webhook.url=https://hooks
    - -webhook.template-file=/files/slack-payload.json
    - -webhook.contentType=application/json
    - -webhook.method=POST
    - -webhook.timeout=30s
....
    volumeMounts:
    - mountPath: /files
      name: files
      readOnly: true
  volumes:
  - configMap:
      defaultMode: 420
      name: aks-node-termination-handler-files
    name: files

kubebn commented Jan 31, 2024

I have tried a simpler payload like:

          {
            "channel": "#aks-spot-termination",
            "username": "webhookbot",
            "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}",
            "icon_emoji": ":ghost:"
          }

Same thing.

maksim-paskal (Owner) commented

@kubebn maybe you are not using the latest chart; try pinning the latest version, 1.1.3:

helm uninstall aks-node-termination-handler --namespace kube-system

helm repo update

helm upgrade aks-node-termination-handler \
--install \
--namespace kube-system \
aks-node-termination-handler/aks-node-termination-handler \
--version 1.1.3 \
--values=/tmp/values.yaml

Also, please remove the resources section from values.yaml. I noticed that it behaves strangely on Windows nodes, so I have removed resources.limits.cpu by default:

resources:
  limits:
    cpu: 20m
    memory: 100Mi
  requests:
    cpu: 20m
    memory: 100Mi

maksim-paskal (Owner) commented

Please add -log.level=debug to args, and if the webhook still does not send, please share the pod logs with me:

args:
- -log.level=debug
- -webhook.url=https://hooks
- -webhook.template-file=/files/slack-payload.json
- -webhook.contentType=application/json
- -webhook.method=POST
- -webhook.timeout=30s
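
With debug logging enabled, the webhook-related lines can be pulled from all handler pods with something like this (label assumed from the pod-delete command earlier in this thread):

kubectl -n kube-system logs -l app=aks-node-termination-handler --tail=200 | grep -i webhook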

kubebn commented Jan 31, 2024

> @kubebn maybe you are not using the latest chart; try pinning the latest version, 1.1.3 […]

Hi, yeah, I definitely use the latest chart version. I noticed that when I removed the "text": "*Evicted Pods:*\n{{ .NodePods }}" part, the Slack message was sent without issues. Let me try with the log level.


maksim-paskal (Owner) commented

I will extend the webhook logs in #71, for future webhook debugging.

kubebn commented Jan 31, 2024

Okay, that config worked: https://paste.openstack.org/show/bfRvfT5iZurBLsPT1gG0/


  args:
  - '-webhook.url=https://hooks'
  - '-webhook.template-file=/files/slack-payload.json'
  - '-webhook.contentType=application/json'
  - '-webhook.method=POST'
  - '-webhook.timeout=30s'
  - '-log.level=debug'

  configMap:
    create: true
    name: aks-node-termination-handler-files
    mountPath: /files
    data:
      slack-payload.json: |
        {
            "channel": "#aks-spot-termination",
            "username": "spot-terminator",
            "text": "Test",
          "blocks": [
            {
              "type": "header",
              "text": {
                "type": "plain_text",
                "text": "Spot Instance Eviction"
              }
            },
            {
              "type": "section",
              "fields": [
                {
                  "type": "mrkdwn",
                  "text": "*Node Name*\n{{ .NodeName }}"
                },
                {
                  "type": "mrkdwn",
                  "text": "*Instance Type:*\n{{ .InstanceType }}"
                }
              ]
            },
            {
              "type": "section",
              "fields": [
                {
                  "type": "mrkdwn",
                  "text": "*Zone:*\n{{ .NodeZone }}"
                },
                {
                  "type": "mrkdwn",
                  "text": "*Cluster Name:*\n{{ .ClusterName }}"
                }
              ]
            },
            {
              "type": "section",
              "text": {
                "type": "mrkdwn",
                "text": "*Evicted Pods:*\n{{ .NodePods }}"
              }
            }
          ],
            "icon_emoji": ":k8s:"
        }
[screenshot: resulting Slack message]

kubebn commented Feb 1, 2024

> @kubebn maybe you are not using the latest chart […] Also, please remove the resources section from values.yaml; I noticed that it behaves strangely on Windows nodes […]

Hi @maksim-paskal, apologies for the off-topic question, but what actually is wrong with requests/limits on Windows nodes? Is it really better to avoid using them at all in Windows containers?

maksim-paskal (Owner) commented

When I tested on a Windows node, the pod didn't show any logs, while the same configuration works on Linux. When I removed resources.limits.cpu, the pod started to log messages on Windows.

It seems Windows nodes have a different CPU limiter than Linux. If you need to set a pod CPU limit, you can, but the value needs to be greater than 20m.

ShashankV007 commented Nov 13, 2024

Hi @maksim-paskal, I am facing a similar issue: the Slack notification is not getting triggered. We are running the application in an AKS (Linux) cluster through the helm chart. Below is my values.yaml file for reference:

image: paskalmaksim/aks-node-termination-handler:latest
imagePullPolicy: Always
imagePullSecrets: []

args:
  - -webhook.url=https://hooks.slack.com/services/token**@@##
  - -webhook.template-file=/files/slack-payload.json
  - -webhook.contentType=application/json
  - -webhook.method=POST
  - -webhook.timeout=30s
  - -log.level=debug
env: []

priorityClassName: system-node-critical
annotations: {}
labels: {}

configMap:
  create: true
  name: "{{ .Release.Name }}-files"
  mountPath: /files
  data:
    slack-payload.json: |
      {
        "channel": "#channel",
        "username": "slacknot",
        "text": "This is message for {{ .NodeName }}, {{ .InstanceType }} from {{ .NodeRegion }}",
        "icon_emoji": ":ghost:"
      }

extraVolumes: []
extraVolumeMounts: []

metrics:
  addAnnotations: true

hostNetwork: false

securityContext:
  runAsNonRoot: true
  privileged: false
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  windowsOptions:
    runAsUserName: "ContainerUser"
  seccompProfile:
    type: RuntimeDefault

affinity: {}

tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"

nodeSelector: {}
# if you want handle events only from spot instances
# nodeSelector:
#   kubernetes.azure.com/scalesetpriority: spot

resources:
  limits:
    memory: 100Mi
  requests:
    cpu: 20m
    memory: 100Mi

maksim-paskal (Owner) commented

Hi @ShashankV007, can you share the logs of the aks-node-termination-handler pod from when a node terminates? You can try to simulate a node eviction and grab the logs from aks-node-termination-handler; the logs will help resolve your issue.
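
For reference, one way to trigger a test event on a spot node is the Azure CLI's simulate-eviction command; the resource group, scale set name, and instance ID below are placeholders:

az vmss simulate-eviction \
  --resource-group <node-resource-group> \
  --name <vmss-name> \
  --instance-id <instance-id>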

ShashankV007 commented

Hi @maksim-paskal, please find the logs attached here.
Explore-logs-2024-11-14.txt

maksim-paskal (Owner) commented

@ShashankV007 thanks for raising this problem. It will be fixed today in #90.

maksim-paskal (Owner) commented

@ShashankV007 please restart all aks-node-termination-handler pods in your cluster; this will apply the fix to your workload.

ShashankV007 commented

Thanks @maksim-paskal, the issue is fixed.
