
QCL turnup add profile options based of linked earth #2332

Merged: 13 commits, Mar 14, 2023

Conversation

@pnasrat (Contributor) commented Mar 10, 2023

See #2287

@pnasrat self-assigned this Mar 10, 2023
@pnasrat changed the title from "QCL turnup add profile options based of leap" to "QCL turnup add profile options based of linked earth" on Mar 10, 2023
@consideRatio (Member) left a comment

Nice work Pris!

I hope it's okay that I reviewed this now, before a review was explicitly requested; I'll be quite busy on Monday with other maintenance efforts and wanted to offload myself from doing it then.

Here is what I see as the remaining work for this PR:

  • Small as the default profile list entry
  • Both highcpu-32 and highcpu-96 have profile list entries with requests ensuring user pods get dedicated nodes (see the sketch after this list)
  • display_name and description of the highcpu-32 and highcpu-96 entries
    I think it would be good if we manage to reserve the size adjectives for the standardized choices of small/medium/large corresponding to highmem-4/highmem-16/highmem-64. Here is an idea of what we could have as display_name and description without a size adjective:
          - display_name: "n2-highcpu-32: 32 CPU / 32 GB RAM"
            description: "Start a container on a dedicated node"
  • If you look to include image choices in this PR, I think the two choices mentioned in [Request deployment] New Hub: QCL aka QuantifiedCarbon #2118 (comment) should be included, and that we provide pangeo/pangeo-notebook:latest to represent the wish for a prebuilt "2i2c docker image". I opened "Intervention to have new hubs not couple to the 2i2c-hubs-image, but something more up to date" (#2336) to motivate why a new community should do that instead of using quay.io/2i2c/2i2c-hubs-image.
  • Because kubespawner's default sets imagePullPolicy to IfNotPresent, and we now reference a latest tag, I think we should make this Always via singleuser.image.pullPolicy. (Related upstream comment: Kubespawner image_pull_policy problem jupyterhub/kubespawner#679 (comment))
  • Update the requests etc. as done for linked-earth in "openscapes/linked-earth: fix node sharing to fit on nodes perfectly" (#2338). Note it should be linked-earth (GCP, GKE, n2-highmem) specifically and not openscapes (AWS, EKS, r5.), as the allocatable memory is a bit different.
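To make the first three bullets concrete, here is a rough sketch of what one such profile list entry could look like in config/clusters/qcl/common.values.yaml. It is only an illustration: the guarantee values and the node selector label are placeholders (not taken from this PR) to be replaced based on the allocatable-memory work referenced in the last bullet, and the exact nesting in the basehub values may differ.

jupyterhub:
  singleuser:
    profileList:
      - display_name: "n2-highcpu-32: 32 CPU / 32 GB RAM"
        description: "Start a container on a dedicated node"
        kubespawner_override:
          # Request enough memory that a second user pod can't fit on the
          # node, so each user gets a dedicated n2-highcpu-32 node.
          # Placeholder values, to be derived from the node's allocatable
          # memory as reported by k8s.
          mem_guarantee: 25G
          cpu_guarantee: 3.2
          # Placeholder label: use whatever label the qcl node pools
          # actually carry.
          node_selector:
            node.kubernetes.io/instance-type: n2-highcpu-32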

config/clusters/qcl/common.values.yaml: review thread (outdated, resolved)
@pnasrat (Contributor, Author) commented Mar 13, 2023

@consideRatio could you share the script? I can't quite follow the rounding and math you are using in #2338.

@consideRatio (Member) commented

@pnasrat I put it here: https://gist.github.com/consideRatio/071110916cb58220657398c61c14af7c

It's based on having kubectl get node report information under allocatable for each node type, as summarized in #2121 (comment).

I've also done some manual conversions from KiB to GiB and GB units, so the source for the logic is the k8s-reported allocatable memory, in KiB, for a node of each type.
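For reference, the unit conversions involved are: 1 GiB = 1048576 KiB and 1 GB = 10^9 bytes, so an allocatable value reported as X KiB corresponds to X / 1048576 GiB and X * 1024 / 10^9 GB.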

@pnasrat (Contributor, Author) commented Mar 13, 2023

@consideRatio re your point about changing the pull policy: that sounds like something that should be done outside of the single cluster, in basehub, is that correct? I don't see any other hubs setting this yet.

Update highcpu profile entries

Use pangeo image
@consideRatio (Member) commented

About singleuser.image.pullPolicy: it's not needed for any image that uses an immutable tag, and most hubs don't use mutable tags but instead pin a specific image. What would the downside be of using Always? More requests to the container registry, but if the registry lists hashes for all the image's layers and those layer hashes are already present on the node, the layers aren't downloaded again.

I have seen a BinderHub run into rate limits from hub.docker.com though, so it can be relevant to not have Always as the imagePullPolicy by default unless it's known to be needed for a mutable tag.

Here is an existing exception for carbonplan:

image:
  name: carbonplan/trace-python-notebook
  # pullPolicy set to "Always" because we use the changing over time tag
  # "latest".
  pullPolicy: Always
  tag: "latest"

I think it makes sense to opt in to this rather than rely on a default value change in the basehub helm chart. Ideally, in my mind, this should not be set at all in kubespawner, letting the k8s default behavior apply: act as if it's "Always" when the tag is specifically latest, and as if it's "IfNotPresent" for any other tag. Many helm charts, for example the boilerplate generated by helm create <new chart name>, come with explicit configuration of imagePullPolicy for some reason though.
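For QCL, that opt-in could look roughly like the sketch below in config/clusters/qcl/common.values.yaml, assuming the hub uses pangeo/pangeo-notebook:latest as discussed above (the exact nesting in the basehub values may differ):

jupyterhub:
  singleuser:
    image:
      name: pangeo/pangeo-notebook
      # "latest" is a mutable tag, so check the registry on every spawn
      pullPolicy: Always
      tag: "latest"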

@pnasrat (Contributor, Author) commented Mar 13, 2023

Not a downside, I am merely making sure that I would be addressing it in the correct place!

@pnasrat (Contributor, Author) commented Mar 13, 2023

Ah great, reading https://cloud.google.com/kubernetes-engine/docs/concepts/plan-node-sizes makes this clearer.

@consideRatio (Member) commented

I read that too!! Then I calculated based on it and couldn't arrive at the observed allocatable memory from the described 128GB of memory for the n2-highmem-16 node, for example. At that point I gave up and looked at the actual allocatable memory as reported by k8s for pods on the node.

@pnasrat (Contributor, Author) commented Mar 13, 2023

@consideRatio please take another look. I've done most things, but I will double-check the allocatable values once the profile list is up, to both validate my understanding and understand the process you used to get the numbers.

config/clusters/qcl/common.values.yaml: three review threads (outdated, resolved)
Comment on lines 194 to 195
mem_guarantee: 86.662G
cpu_guarantee: 9.6
@consideRatio (Member) commented

Ah, you are being consistent with the requests here also on nodes that users aren't sharing, sure!

Since we specify a nodeSelector to spawn on a certain node, and we wish to provide users with dedicated nodes (1:1 pod:node), we can make any request as long as one user, but not two, fits on the node.
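To make that concrete with the numbers quoted above (assuming these lines belong to the highcpu-96 entry, whose allocatable memory is reported as roughly 93.33 GB later in this thread): one pod requesting 86.662 GB can fit on the node, but two would need about 173.3 GB, so a second user pod forces a new node.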

config/clusters/qcl/common.values.yaml: further review threads (resolved)
@consideRatio (Member) commented

Btw if you get multiple code suggestions you want to commit, you have the choice to combine them into one via the GitHub UI, but only from the "Files changed" tab!

@consideRatio (Member) left a comment

There is still a todo note in the PR, but I figure you can resolve it however you like, either before or after merge!

My procedure here would be to deploy this PR manually to qcl staging, test startup of the highcpu-32 and 96 profiles, adjust the mem requests if needed, remove the todo note from the PR, and merge.

This approval still applies if the highcpu-32 / 96 mem requests need to be adjusted!

@pnasrat (Contributor, Author) commented Mar 13, 2023

Thanks for the review/approval. I will likely do this, starting with the manual staging deploy, first thing tomorrow my time!

@pnasrat (Contributor, Author) commented Mar 14, 2023

Manual deploy to QCL staging:

Running helm upgrade --install --create-namespace --wait --namespace=staging staging /home/pnasrat/workspace/src/github.com/2i2c-org/infrastructure/helm-charts/basehub --values=/tmp/tmp8yyyv0vu --values=/home/pnasrat/workspace/src/github.com/2i2c-org/infrastructure/config/clusters/qcl/common.values.yaml --values=/home/pnasrat/workspace/src/github.com/2i2c-org/infrastructure/config/clusters/qcl/staging.values.yaml --values=/tmp/tmptogaic_g
Release "staging" has been upgraded. Happy Helming!
NAME: staging
LAST DEPLOYED: Tue Mar 14 09:54:56 2023
NAMESPACE: staging
STATUS: deployed
REVISION: 15
TEST SUITE: None

@pnasrat (Contributor, Author) commented Mar 14, 2023

kubectl get node gke-qcl-cluster-nb-huge-highcpu-56ceaf12-5fnb   -o json | jq '.["status"]["allocatable"]'
{
  "attachable-volumes-gce-pd": "127",
  "cpu": "95690m",
  "ephemeral-storage": "47060071478",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "91143984Ki",
  "pods": "110"
}

Allocatable memory: 93.33 GB
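(That is 91143984 Ki × 1024 ≈ 93.33 × 10^9 bytes, i.e. about 86.92 GiB.)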

@pnasrat (Contributor, Author) commented Mar 14, 2023

kubectl get node gke-qcl-cluster-nb-large-highcpu-9a199cd3-75db    -o json | jq '.["status"]["allocatable"]'
{
  "attachable-volumes-gce-pd": "127",
  "cpu": "31850m",
  "ephemeral-storage": "47060071478",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "29130868Ki",
  "pods": "110"
}
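Applying the same conversion to this node: 29130868 Ki × 1024 ≈ 29.83 × 10^9 bytes, so roughly 29.83 GB (about 27.78 GiB) allocatable on the highcpu-32 node.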

pnasrat and others added 3 commits March 14, 2023 13:46
Add QCL allowed_teams to spawner config

Rightsize highcpu memory guarantee based on running nodes
@pnasrat (Contributor, Author) commented Mar 14, 2023

These numbers seem to work; it's possible I could get them down to MiB accuracy, but I want to get this merged. Tested by deploying and spawning both highcpu profiles.

@pnasrat merged commit 23a4f76 into master on Mar 14, 2023
@pnasrat deleted the qcl-staging-profilelist branch on March 14, 2023 at 18:44