
Core node pool revision to minimize cost - towards n2-highmem-2 and r5.large machines #2212

Open
consideRatio opened this issue Feb 16, 2023 · 4 comments
Labels
tech:cloud-infra Optimization of cloud infra to reduce costs etc.

Comments

@consideRatio
Member

We have a few clusters, and a few core node pool configurations. I suggest that we inspect what we have, and transition to 1:8 CPU:RAM ratio nodes like n2-highmem-2 and r5.large (2:16) or n2-highmem-4 and r5.xlarge (4:32).

I think we should make a decision for each cluster, and then start to learn what we need to consider so that we can establish a ruleset to follow.

Consider

  1. What is the minimal number of core nodes we can have?
  2. What is the minimal CPU:RAM we can have per node?
    • For practical purposes in 2i2c, 2:16 CPU:RAM should be the lower bound
      • The cost of our time is high compared to saving one more CPU per month, and we would have to invest time to optimize this and keep it reliably at 1 CPU while still providing reliable operation of services
    • As an exception, some clusters may need 4:32 CPU:RAM machines if prometheus-server memory usage peaks very high

Disruptive maintenance consideration

This will disrupt all pods on the core node, so ideally there are no active users on the cluster: restarting the hub and proxy pods will for example disrupt users, even if they can recover their user server sessions.

Related cloud operations

  1. EKS
  • To switch to a new core node pool type, you should adjust files in the eksctl folder (a command sketch follows after this list)
    • Update the <clustername>.jsonnet template to declare one new core node pool, and render the jsonnet template into an eksctl config
    • Use eksctl to create the new core node pool
    • Use kubectl to drain the old core node pool
    • Update the <clustername>.jsonnet template to remove the old core node pool, and render the jsonnet template into an eksctl config
    • Use eksctl to delete the old core node pool
  2. GKE
  • To switch to a new core node pool, you should adjust the files in the terraform folder (a command sketch follows after this list)
    • I'm not confident about this, but I think the steps involve terraform init, terraform workspace list, terraform workspace select, updating the terraform variable files, and terraform plan + terraform apply
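
As a rough sketch of the EKS commands, assuming a cluster config named <clustername> and placeholder node group names core-a (old) and core-b (new); none of these names are taken from our actual configs:

```bash
# Render the jsonnet template, with the new core node group declared, into an eksctl config
jsonnet <clustername>.jsonnet > <clustername>.eksctl.yaml

# Create only the new core node group
eksctl create nodegroup --config-file=<clustername>.eksctl.yaml --include=core-b

# Drain the old core node group's nodes so core pods reschedule onto the new one
# (as far as I know, eksctl labels nodes with alpha.eksctl.io/nodegroup-name)
kubectl drain --selector=alpha.eksctl.io/nodegroup-name=core-a \
    --ignore-daemonsets --delete-emptydir-data

# After removing the old core node group from the jsonnet template and re-rendering,
# delete node groups that exist in the cluster but are no longer declared in the config
eksctl delete nodegroup --config-file=<clustername>.eksctl.yaml --only-missing --approve
```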
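
And a corresponding sketch of the GKE side via terraform. The projects/<clustername>.tfvars path and the assumption that the core node machine type is set by a variable in that file are for illustration only, I haven't verified them against our terraform folder:

```bash
# Initialize and select the terraform workspace associated with the cluster
terraform init
terraform workspace list
terraform workspace select <clustername>

# After updating the core node machine type in the cluster's variables file
# (the variable name is an assumption, something like core_node_machine_type = "n2-highmem-2"),
# review and apply the change
terraform plan -var-file=projects/<clustername>.tfvars
terraform apply -var-file=projects/<clustername>.tfvars
```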

Related

@yuvipanda
Member

Ref #2005 as well

@consideRatio
Member Author

consideRatio commented Mar 19, 2023

I concluded that r5.large nodes on AWS only allow for 29 pods each, so having just one may cause pods to fail to schedule with an "out of pods" reason. We either need two of them, or one r5.xlarge with support for 58 pods.
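
For reference, as I understand it the default EKS pod limit follows from the instance's ENI capacity under the AWS VPC CNI: max pods = ENIs × (IPv4 addresses per ENI − 1) + 2. An r5.large has 3 ENIs with 10 addresses each, giving 3 × 9 + 2 = 29, while an r5.xlarge has 4 ENIs with 15 addresses each, giving 4 × 14 + 2 = 58.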

I recently ended up needing two nodes on a cluster using dask-gateway, which also takes up a few pods. Due to this, I suggest using r5.large only for basehub projects.

Summary of current opinion

  • EKS basehub: use r5.large
  • EKS daskhub: use r5.xlarge, so that two smaller nodes aren't needed just because of the pod limit
  • GKE: use n2-highmem-2, or n2-highmem-4 if we for example believe that prometheus-server will need more memory than the 16 GB an n2-highmem-2 is specced for

@yuvipanda

This comment was marked as resolved.

@consideRatio

This comment was marked as resolved.

consideRatio added the tech:cloud-infra label Oct 12, 2023