
Core node pool revision to minimize cost - towards n2-highmem-2 and r5.large machines #2212

Open
consideRatio opened this issue Feb 16, 2023 · 4 comments
Labels
tech:cloud-infra Optimization of cloud infra to reduce costs etc.

Comments

@consideRatio
Member

We have a few clusters, and a few core node pool configurations. I suggest that we inspect what we have, and transition to 1:8 CPU:RAM ratio nodes like n2-highmem-2 and r5.large (2:16) or n2-highmem-4 and r5.xlarge (4:32).

I think we should make a decision for each cluster, and then start to learn what we need to consider so that we can establish a ruleset to follow.

Consider

  1. What is the minimal number of core nodes we can have?
  2. What is the minimal CPU:RAM we can have per node?
    • For practical purposes in 2i2c, 2:16 CPU:RAM should be the lower bound
      • The cost of our time is high compared to saving one more CPU per month, and we would have to invest time to optimize this and keep it reliably at 1 CPU while still providing reliable operation of services
    • As an exception, some clusters may need 4:32 CPU:RAM machines if prometheus-server memory usage peaks very high

Disruptive maintenance consideration

This will disrupt all pods on the core node, so ideally there are no active users on the cluster: restarting the hub and proxy pods will for example disrupt users, even if they can recover their user server sessions.

Related cloud operations

  1. EKS
  • To switch to a new core node pool type, you should adjust files in the eksctl folder (a command sketch follows after this list)
    • Update the <clustername>.jsonnet template to declare one new core node pool, and render the jsonnet template into an eksctl config
    • Use eksctl to create the new core node pool
    • Use kubectl to drain the old core node pool
    • Update the <clustername>.jsonnet template to remove the old core node pool, and render the jsonnet template into an eksctl config
    • Use eksctl to delete the old core node pool
  2. GKE
  • To switch to a new core node pool, you should adjust the files in the terraform folder (a command sketch follows after this list)
    • I'm not confident about this, but I think the steps involve terraform init, terraform workspace list, terraform workspace select, updating the terraform variable files, and terraform plan + terraform apply
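
As a rough sketch of the EKS commands, assuming a cluster config named <clustername> and placeholder node group names core-a (old) and core-b (new); none of these names are taken from our actual configs:

```bash
# Render the jsonnet template, with the new core node group declared, into an eksctl config
jsonnet <clustername>.jsonnet > <clustername>.eksctl.yaml

# Create only the new core node group
eksctl create nodegroup --config-file=<clustername>.eksctl.yaml --include=core-b

# Drain the old core node group's nodes so core pods reschedule onto the new one
# (as far as I know, eksctl labels nodes with alpha.eksctl.io/nodegroup-name)
kubectl drain --selector=alpha.eksctl.io/nodegroup-name=core-a \
    --ignore-daemonsets --delete-emptydir-data

# After removing the old core node group from the jsonnet template and re-rendering,
# delete node groups that exist in the cluster but are no longer declared in the config
eksctl delete nodegroup --config-file=<clustername>.eksctl.yaml --only-missing --approve
```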
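
And a corresponding sketch of the GKE side via terraform. The projects/<clustername>.tfvars path and the assumption that the core node machine type is set by a variable in that file are for illustration only, I haven't verified them against our terraform folder:

```bash
# Initialize and select the terraform workspace associated with the cluster
terraform init
terraform workspace list
terraform workspace select <clustername>

# After updating the core node machine type in the cluster's variables file
# (the variable name is an assumption, something like core_node_machine_type = "n2-highmem-2"),
# review and apply the change
terraform plan -var-file=projects/<clustername>.tfvars
terraform apply -var-file=projects/<clustername>.tfvars
```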

Related

@yuvipanda
Member

Ref #2005 as well

@consideRatio
Member Author

consideRatio commented Mar 19, 2023

I concluded that r5.large nodes on AWS only allow for 29 pods each, so having just one may cause pods to fail to schedule with an "out of pods" reason. We either need two of them, or one r5.xlarge with support for 58 pods.
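
For reference, as I understand it the default EKS pod limit follows from the instance's ENI capacity under the AWS VPC CNI: max pods = ENIs × (IPv4 addresses per ENI − 1) + 2. An r5.large has 3 ENIs with 10 addresses each, giving 3 × 9 + 2 = 29, while an r5.xlarge has 4 ENIs with 15 addresses each, giving 4 × 14 + 2 = 58.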

I recently ended up needing two nodes on a cluster using dask-gateway, which also takes up a few pods. Due to this, I suggest using r5.large only for basehub projects.

Summary of current opinion

  • EKS basehub: use r5.large
  • EKS daskhub: use r5.xlarge, so that two smaller nodes aren't needed just because of the pod limit
  • GKE: use n2-highmem-2, or n2-highmem-4 if we for example believe that prometheus-server will need more memory than the 16 GB an n2-highmem-2 is specced for

@yuvipanda

This comment was marked as resolved.

@consideRatio

This comment was marked as resolved.

consideRatio added the tech:cloud-infra label Oct 12, 2023