AAW Infra: scale down general nodepool #1965

Open
7 tasks
Souheil-Yazji opened this issue Sep 13, 2024 · 1 comment
Assignees
Labels
area/engineering (Requires attention from engineering: focus on foundational components or platform DevOps) · kind/feature (New feature or request) · priority/soon

Comments


Souheil-Yazji commented Sep 13, 2024

Is your feature request related to a problem? Please link issue ticket

Let's look at scaling down the General Nodepool to saturate the nodes better.
Currently, from observation in Grafana/Lens, node resource usage peaks at roughly 30% for both CPU and memory.
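
For a quick spot check alongside the dashboards, kubectl can report live usage per node. A minimal sketch, assuming the pool keeps the default AKS agentpool=general node label:

```sh
# Live CPU/memory usage per general node, as reported by metrics-server.
kubectl top nodes -l agentpool=general
```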

Another issue is that most daemonsets schedule a pod onto the general nodes even when they don't need to run there. This adds cost at no added value.
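
To see which daemonsets are actually landing pods on the general nodes, a rough check (the node name below is a placeholder):

```sh
# Desired/current pod counts for every daemonset in the cluster.
kubectl get daemonsets -A

# All pods on one general node; daemonset pods carry their daemonset's name prefix.
kubectl get pods -A -o wide --field-selector spec.nodeName=<general-node-name>
```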

Describe the solution you'd like

  1. Investigate the resource saturation on the average general-nodepool node.
  2. Investigate the daemonsets which deploy to the general nodes.
  3. Identify possible clean-up for those daemonsets; we can use taints/tolerations to prevent them from scheduling pods onto the general nodes (see the first sketch after this list).
    --- Maybe for next sprint ---
  4. Apply the clean-up from step 3.
  5. Collect metrics on the new resource utilization.
  6. Adjust the size of the nodes by changing the VMSS VM type (see the second sketch after this list).
  7. Save lots of money.
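
For step 3, a minimal sketch of the taint/toleration approach. It assumes the default agentpool=general node label; the taint key/value and the daemonset name are hypothetical. Note that once the taint is in place, the workloads we do want on the pool need a matching toleration:

```sh
# Taint the general nodes so that only pods tolerating the taint schedule there.
# (On AKS, setting the taint on the nodepool itself, e.g. with
# `az aks nodepool update --node-taints`, persists it across scale-ups.)
kubectl taint nodes -l agentpool=general general-workloads-only=true:NoSchedule

# Alternatively, steer an individual daemonset away from the pool with node affinity.
kubectl -n kube-system patch daemonset <some-daemonset> --type merge -p \
  '{"spec":{"template":{"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"agentpool","operator":"NotIn","values":["general"]}]}]}}}}}}}'
```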
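
For the VM-size step, AKS can't resize a nodepool in place, so the usual pattern is to add a smaller pool, drain the old nodes, then delete the old pool. A sketch with assumed resource-group, pool names, and target size:

```sh
# Add a replacement pool at the smaller VM size.
az aks nodepool add \
  --resource-group <rg> --cluster-name aaw-dev-cc-00-aks \
  --name general2 --node-vm-size Standard_D4s_v3 \
  --enable-cluster-autoscaler --min-count 0 --max-count 8

# Drain each old general node so its workloads reschedule onto the new pool.
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data

# Remove the old pool once it is empty.
az aks nodepool delete --resource-group <rg> \
  --cluster-name aaw-dev-cc-00-aks --name general
```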

Describe alternatives you've considered

N/A

Additional context

(Two screenshots of node utilization were attached.)

@Souheil-Yazji added the kind/feature, area/engineering, and priority/soon labels on Sep 13, 2024
@jacek-dudek commented:

Resource saturation per node on dev and prod clusters:
cluster: aaw-dev-cc-00-aks
nodepool name: general
machine type: Standard D8s v3 (8 cores, 32 GiB)
autoscaling: enabled
minimum node count: 0
maximum node count: 8

metrics sourced from the Azure dashboard:
time period: last 7 days
cpu metric: percentage of total cpu utilized on node
mem metric: percentage of memory working set utilized on node

node: vmss000000
uptime: 100%
cpu: avg of 10%
mem: 70%

node: vmss000001
uptime: 100%
cpu: avg of 30%
mem: 80%

node: vmss000005
uptime: 100%
cpu: avg 13% with a peak of 17%
mem: 95%

node: vmss000007
uptime: 100%
cpu: avg of 13% with a peak of 16%
mem: ?

node: vmss00004r
uptime: 100%
cpu: avg of 30% with a peak of 36%
mem: avg of 105%

node: vmss00004s
uptime: 100%
cpu: avg of 16%
mem: 79%

node: vmss00004t
uptime: 100%
cpu: 12% ramping up to 18% for a day
mem: 53%

cluster: aaw-prod-cc-00-aks
nodepool name: general
machine type: unknown
I can't drill down to the nodepool level in Azure; my permissions seem to be misconfigured. But from Grafana metrics I'm seeing 32 cores and 126 GiB of memory per node.
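
Per-node capacity can also be confirmed from kubectl without portal access; a sketch assuming the agentpool=general label:

```sh
# Capacity as reported by the kubelet for each general node.
kubectl get nodes -l agentpool=general \
  -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEM:.status.capacity.memory
```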

metrics sourced from the Grafana dashboard: Kubernetes/Compute Resources/Node (Pods)
time period: last 7 days
cpu metric: total cpu utilization on node in cores
mem metric: total memory utilization on node in GiB

node: vmss00001c
cpu: avg of 0.75 cores with peaks of 1.5 cores, of 32 cores total
mem: avg of 4.7 GiB of 126 GiB total

node: vmss00001n
cpu: avg of 0.6 cores with peaks of 1 core, of 32 cores total
mem: avg of 33 GiB of 126 GiB total

node: vmss00001r
cpu: avg of 2.3 cores of 32 cores total
mem: avg of 8 GiB of 126 GiB total

node: vmss00001v
cpu: avg of 0.4 cores with peaks of 1 core, of 32 cores total
mem: avg of 21 GiB of 126 GiB total

node: vmss00001z
cpu: avg of 0.6 cores with peaks of 1.5 cores, of 32 cores total
mem: avg of 8 GiB of 126 GiB total
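
For reproducibility, roughly the same figures can be pulled straight from Prometheus instead of the dashboards. A sketch assuming node-exporter metrics and a placeholder Prometheus URL (exact metric/label names depend on the scrape config):

```sh
# Average non-idle CPU per node over the last 7 days, in cores.
curl -G 'http://<prometheus>/api/v1/query' --data-urlencode \
  'query=sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[7d]))'

# Memory currently in use per node, in bytes.
curl -G 'http://<prometheus>/api/v1/query' --data-urlencode \
  'query=node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes'
```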
