AAW Infra: scale down general nodepool #1965

Open
7 tasks
Souheil-Yazji opened this issue Sep 13, 2024 · 1 comment
Assignees
Labels
area/engineering (Requires attention from engineering: focus on foundational components or platform DevOps) · kind/feature (New feature or request) · priority/soon

Comments


Souheil-Yazji commented Sep 13, 2024

Is your feature request related to a problem? Please link issue ticket

Let's look at scaling down the General Nodepool to saturate the nodes better.
Currently, from observation in Grafana/Lens, node resource usage peaks at roughly 30% for both CPU and memory.
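
For a quick spot check alongside the dashboards, kubectl can report live usage per node. A minimal sketch, assuming the pool keeps the default AKS agentpool=general node label:

```sh
# Live CPU/memory usage per general node, as reported by metrics-server.
kubectl top nodes -l agentpool=general
```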

Another issue is that most daemonsets schedule a pod onto the general nodes even when they don't need to run there. This adds cost at no added value.
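
To see which daemonsets are actually landing pods on the general nodes, a rough check (the node name below is a placeholder):

```sh
# Desired/current pod counts for every daemonset in the cluster.
kubectl get daemonsets -A

# All pods on one general node; daemonset pods carry their daemonset's name prefix.
kubectl get pods -A -o wide --field-selector spec.nodeName=<general-node-name>
```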

Describe the solution you'd like

  1. Investigate the resource saturation on the average general-nodepool node.
  2. Investigate the daemonsets which deploy to the general nodes.
  3. Identify possible clean-up for those daemonsets; we can use taints/tolerations to prevent them from scheduling pods onto the general nodes (see the first sketch after this list).
    --- Maybe for next sprint ---
  4. Apply the clean-up from step 3.
  5. Collect metrics on the new resource utilization.
  6. Adjust the size of the nodes by changing the VMSS VM type (see the second sketch after this list).
  7. Save lots of money.
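
For step 3, a minimal sketch of the taint/toleration approach. It assumes the default agentpool=general node label; the taint key/value and the daemonset name are hypothetical. Note that once the taint is in place, the workloads we do want on the pool need a matching toleration:

```sh
# Taint the general nodes so that only pods tolerating the taint schedule there.
# (On AKS, setting the taint on the nodepool itself, e.g. with
# `az aks nodepool update --node-taints`, persists it across scale-ups.)
kubectl taint nodes -l agentpool=general general-workloads-only=true:NoSchedule

# Alternatively, steer an individual daemonset away from the pool with node affinity.
kubectl -n kube-system patch daemonset <some-daemonset> --type merge -p \
  '{"spec":{"template":{"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"agentpool","operator":"NotIn","values":["general"]}]}]}}}}}}}'
```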
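
For the VM-size step, AKS can't resize a nodepool in place, so the usual pattern is to add a smaller pool, drain the old nodes, then delete the old pool. A sketch with assumed resource-group, pool names, and target size:

```sh
# Add a replacement pool at the smaller VM size.
az aks nodepool add \
  --resource-group <rg> --cluster-name aaw-dev-cc-00-aks \
  --name general2 --node-vm-size Standard_D4s_v3 \
  --enable-cluster-autoscaler --min-count 0 --max-count 8

# Drain each old general node so its workloads reschedule onto the new pool.
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data

# Remove the old pool once it is empty.
az aks nodepool delete --resource-group <rg> \
  --cluster-name aaw-dev-cc-00-aks --name general
```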

Describe alternatives you've considered

N/A

Additional context

(Two screenshots of node utilization were attached.)

@Souheil-Yazji added the kind/feature, area/engineering, and priority/soon labels on Sep 13, 2024
@jacek-dudek commented:

Resource saturation per node on dev and prod clusters:
cluster: aaw-dev-cc-00-aks
nodepool name: general
machine type: Standard D8s v3 (8 cores, 32 GiB)
autoscaling: enabled
minimum node count: 0
maximum node count: 8

metrics sourced from the Azure dashboard:
time period: last 7 days
cpu metric: percentage of total cpu utilized on node
mem metric: percentage of memory working set utilized on node

node: vmss000000
uptime: 100%
cpu: avg of 10%
mem: 70%

node: vmss000001
uptime: 100%
cpu: avg of 30%
mem: 80%

node: vmss000005
uptime: 100%
cpu: avg 13% with a peak of 17%
mem: 95%

node: vmss000007
uptime: 100%
cpu: avg of 13% with a peak of 16%
mem: ?

node: vmss00004r
uptime: 100%
cpu: avg of 30% with a peak of 36%
mem: avg of 105%

node: vmss00004s
uptime: 100%
cpu: avg of 16%
mem: 79%

node: vmss00004t
uptime: 100%
cpu: 12% ramping up to 18% for a day
mem: 53%

cluster: aaw-prod-cc-00-aks
nodepool name: general
machine type: unknown
I can't drill down to the nodepool level in Azure; my permissions seem to be misconfigured. But from Grafana metrics I'm seeing 32 cores and 126 GiB of memory per node.
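
Per-node capacity can also be confirmed from kubectl without portal access; a sketch assuming the agentpool=general label:

```sh
# Capacity as reported by the kubelet for each general node.
kubectl get nodes -l agentpool=general \
  -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEM:.status.capacity.memory
```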

metrics sourced from the Grafana dashboard: Kubernetes/Compute Resources/Node (Pods)
time period: last 7 days
cpu metric: total cpu utilization on node in cores
mem metric: total memory utilization on node in GiB

node: vmss00001c
cpu: avg of 0.75 cores with peaks of 1.5 cores, of 32 cores total
mem: avg of 4.7 GiB of 126 GiB total

node: vmss00001n
cpu: avg of 0.6 cores with peaks of 1 core, of 32 cores total
mem: avg of 33 GiB of 126 GiB total

node: vmss00001r
cpu: avg of 2.3 cores of 32 cores total
mem: avg of 8 GiB of 126 GiB total

node: vmss00001v
cpu: avg of 0.4 cores with peaks of 1 core, of 32 cores total
mem: avg of 21 GiB of 126 GiB total

node: vmss00001z
cpu: avg of 0.6 cores with peaks of 1.5 cores, of 32 cores total
mem: avg of 8 GiB of 126 GiB total
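
For reproducibility, roughly the same figures can be pulled straight from Prometheus instead of the dashboards. A sketch assuming node-exporter metrics and a placeholder Prometheus URL (exact metric/label names depend on the scrape config):

```sh
# Average non-idle CPU per node over the last 7 days, in cores.
curl -G 'http://<prometheus>/api/v1/query' --data-urlencode \
  'query=sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[7d]))'

# Memory currently in use per node, in bytes.
curl -G 'http://<prometheus>/api/v1/query' --data-urlencode \
  'query=node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes'
```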
