
QCL turnup add profile options based of linked earth #2332

Merged: 13 commits, Mar 14, 2023

Conversation

@pnasrat (Contributor) commented Mar 10, 2023

See #2287

@pnasrat self-assigned this Mar 10, 2023
@pnasrat changed the title from "QCL turnup add profile options based of leap" to "QCL turnup add profile options based of linked earth" on Mar 10, 2023
@consideRatio (Member) left a comment

Nice work Pris!

I hope it's okay that I reviewed this now, before a review was explicitly requested; I'll be quite busy on Monday with other maintenance efforts and wanted to offload myself from doing it then.

Here is what I see as the remaining work for this PR:

  • Small as the default profile list entry
  • Both highcpu-32 and highcpu-96 have profile list entries with requests ensuring user pods get dedicated nodes (see the sketch after this list)
  • display_name and description of the highcpu-32 and highcpu-96 entries
    I think it would be good if we manage to reserve the size adjectives for the standardized choices of small/medium/large corresponding to highmem-4/highmem-16/highmem-64. Here is an idea of what we could have as display_name and description without a size adjective:
          - display_name: "n2-highcpu-32: 32 CPU / 32 GB RAM"
            description: "Start a container on a dedicated node"
  • If you look to include image choices in this PR, I think the two choices mentioned in [Request deployment] New Hub: QCL aka QuantifiedCarbon #2118 (comment) should be included, and that we provide pangeo/pangeo-notebook:latest to represent the wish for a prebuilt "2i2c docker image". I opened "Intervention to have new hubs not couple to the 2i2c-hubs-image, but something more up to date" (#2336) to motivate why a new community should do that instead of using quay.io/2i2c/2i2c-hubs-image.
  • Because kubespawner's default sets imagePullPolicy to IfNotPresent, and we now reference a latest tag, I think we should make this Always via singleuser.image.pullPolicy. (Related upstream comment: Kubespawner image_pull_policy problem jupyterhub/kubespawner#679 (comment))
  • Update the requests etc. as done for linked-earth in "openscapes/linked-earth: fix node sharing to fit on nodes perfectly" (#2338). Note it should be linked-earth (GCP, GKE, n2-highmem) specifically and not openscapes (AWS, EKS, r5.), as the allocatable memory is a bit different.
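To make the first three bullets concrete, here is a rough sketch of what one such profile list entry could look like in config/clusters/qcl/common.values.yaml. It is only an illustration: the guarantee values and the node selector label are placeholders (not taken from this PR) to be replaced based on the allocatable-memory work referenced in the last bullet, and the exact nesting in the basehub values may differ.

jupyterhub:
  singleuser:
    profileList:
      - display_name: "n2-highcpu-32: 32 CPU / 32 GB RAM"
        description: "Start a container on a dedicated node"
        kubespawner_override:
          # Request enough memory that a second user pod can't fit on the
          # node, so each user gets a dedicated n2-highcpu-32 node.
          # Placeholder values, to be derived from the node's allocatable
          # memory as reported by k8s.
          mem_guarantee: 25G
          cpu_guarantee: 3.2
          # Placeholder label: use whatever label the qcl node pools
          # actually carry.
          node_selector:
            node.kubernetes.io/instance-type: n2-highcpu-32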

config/clusters/qcl/common.values.yaml: review thread (outdated, resolved)
@pnasrat (Contributor, Author) commented Mar 13, 2023

@consideRatio could you share the script? I can't quite follow the rounding and math you are using in #2338.

@consideRatio (Member) commented

@pnasrat I put it here: https://gist.github.com/consideRatio/071110916cb58220657398c61c14af7c

It's based on having kubectl get node report information under allocatable for each node type, as summarized in #2121 (comment).

I've also done some manual conversions from KiB to GiB and GB units, so the source for the logic is the k8s-reported allocatable memory, in KiB, for a node of each type.
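For reference, the unit conversions involved are: 1 GiB = 1048576 KiB and 1 GB = 10^9 bytes, so an allocatable value reported as X KiB corresponds to X / 1048576 GiB and X * 1024 / 10^9 GB.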

@pnasrat (Contributor, Author) commented Mar 13, 2023

@consideRatio re your point about changing the pull policy: that sounds like something that should be done outside of the single cluster, in basehub, is that correct? I don't see any other hubs setting this yet.

Update highcpu profile entries

Use pangeo image
@consideRatio (Member) commented

About singleuser.image.pullPolicy: it's not needed for any image that uses an immutable tag, and most hubs don't use mutable tags but instead pin a specific image. What would the downside be of using Always? More requests to the container registry, but if the registry lists hashes for all the image's layers and those layer hashes are already present on the node, the layers aren't downloaded again.

I have seen a BinderHub run into rate limits from hub.docker.com though, so it can be relevant to not have Always as the imagePullPolicy by default unless it's known to be needed for a mutable tag.

Here is an existing exception for carbonplan:

image:
  name: carbonplan/trace-python-notebook
  # pullPolicy set to "Always" because we use the changing over time tag
  # "latest".
  pullPolicy: Always
  tag: "latest"

I think it makes sense to opt in to this rather than rely on a default value change in the basehub helm chart. Ideally, in my mind, this should not be set at all in kubespawner, letting the k8s default behavior apply: act as if it's "Always" when the tag is specifically latest, and as if it's "IfNotPresent" for any other tag. Many helm charts, for example the boilerplate generated by helm create <new chart name>, come with explicit configuration of imagePullPolicy for some reason though.
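For QCL, that opt-in could look roughly like the sketch below in config/clusters/qcl/common.values.yaml, assuming the hub uses pangeo/pangeo-notebook:latest as discussed above (the exact nesting in the basehub values may differ):

jupyterhub:
  singleuser:
    image:
      name: pangeo/pangeo-notebook
      # "latest" is a mutable tag, so check the registry on every spawn
      pullPolicy: Always
      tag: "latest"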

@pnasrat (Contributor, Author) commented Mar 13, 2023

Not a downside, I am merely making sure that I would be addressing it in the correct place!

@pnasrat (Contributor, Author) commented Mar 13, 2023

Ah great, reading https://cloud.google.com/kubernetes-engine/docs/concepts/plan-node-sizes makes this clearer.

@consideRatio (Member) commented

I read that too!! Then I calculated based on it and couldn't arrive at the observed allocatable memory from the described 128GB of memory for the n2-highmem-16 node, for example. At that point I gave up and looked at the actual allocatable memory as reported by k8s for pods on the node.

@pnasrat (Contributor, Author) commented Mar 13, 2023

@consideRatio please take another look. I've done most things, but I will double-check the allocatable values once the profile list is up, to both validate my understanding and understand the process you used to get the numbers.

config/clusters/qcl/common.values.yaml: three review threads (outdated, resolved)
Comment on lines 194 to 195
mem_guarantee: 86.662G
cpu_guarantee: 9.6
@consideRatio (Member) commented

Ah, you are being consistent with the requests here also on nodes that users aren't sharing, sure!

Since we specify a nodeSelector to spawn on a certain node, and we wish to provide users with dedicated nodes (1:1 pod:node), we can make any request as long as one user, but not two, fits on the node.
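To make that concrete with the numbers quoted above (assuming these lines belong to the highcpu-96 entry, whose allocatable memory is reported as roughly 93.33 GB later in this thread): one pod requesting 86.662 GB can fit on the node, but two would need about 173.3 GB, so a second user pod forces a new node.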

config/clusters/qcl/common.values.yaml: further review threads (resolved)
@consideRatio (Member) commented

Btw if you get multiple code suggestions you want to commit, you have the choice to combine them into one via the GitHub UI, but only from the "Files changed" tab!

@consideRatio (Member) left a comment

There is still a todo note in the PR, but I figure you can resolve it however you like, either before or after merge!

My procedure here would be to deploy this PR manually to qcl staging, test startup of the highcpu-32 and 96 profiles, adjust the mem requests if needed, remove the todo note from the PR, and merge.

This approval still applies if the highcpu-32 / 96 mem requests need to be adjusted!

@pnasrat (Contributor, Author) commented Mar 13, 2023

Thanks for the review/approval. I will likely do this, starting with the manual staging deploy, first thing tomorrow my time!

@pnasrat (Contributor, Author) commented Mar 14, 2023

Manual deploy to QCL staging:

Running helm upgrade --install --create-namespace --wait --namespace=staging staging /home/pnasrat/workspace/src/github.com/2i2c-org/infrastructure/helm-charts/basehub --values=/tmp/tmp8yyyv0vu --values=/home/pnasrat/workspace/src/github.com/2i2c-org/infrastructure/config/clusters/qcl/common.values.yaml --values=/home/pnasrat/workspace/src/github.com/2i2c-org/infrastructure/config/clusters/qcl/staging.values.yaml --values=/tmp/tmptogaic_g
Release "staging" has been upgraded. Happy Helming!
NAME: staging
LAST DEPLOYED: Tue Mar 14 09:54:56 2023
NAMESPACE: staging
STATUS: deployed
REVISION: 15
TEST SUITE: None

@pnasrat (Contributor, Author) commented Mar 14, 2023

kubectl get node gke-qcl-cluster-nb-huge-highcpu-56ceaf12-5fnb   -o json | jq '.["status"]["allocatable"]'
{
  "attachable-volumes-gce-pd": "127",
  "cpu": "95690m",
  "ephemeral-storage": "47060071478",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "91143984Ki",
  "pods": "110"
}

Allocatable memory: 93.33 GB
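(That is 91143984 Ki × 1024 ≈ 93.33 × 10^9 bytes, i.e. about 86.92 GiB.)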

@pnasrat (Contributor, Author) commented Mar 14, 2023

kubectl get node gke-qcl-cluster-nb-large-highcpu-9a199cd3-75db    -o json | jq '.["status"]["allocatable"]'
{
  "attachable-volumes-gce-pd": "127",
  "cpu": "31850m",
  "ephemeral-storage": "47060071478",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "29130868Ki",
  "pods": "110"
}
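Applying the same conversion to this node: 29130868 Ki × 1024 ≈ 29.83 × 10^9 bytes, so roughly 29.83 GB (about 27.78 GiB) allocatable on the highcpu-32 node.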

pnasrat and others added 3 commits March 14, 2023 13:46
Add QCL allowed_teams to spawner config

Rightsize highcpu memory guarantee based on running nodes
@pnasrat (Contributor, Author) commented Mar 14, 2023

These numbers seem to work; it's possible I could get them down to MiB accuracy, but I want to get this merged. Tested by deploying and spawning both highcpu profiles.

@pnasrat merged commit 23a4f76 into master on Mar 14, 2023
@pnasrat deleted the qcl-staging-profilelist branch on March 14, 2023 at 18:44