New default machine types and profile list options - sharing nodes is great! #2121
Comments
Could we discuss this at our next prod/eng meeting? Your experience tells me this will be a good default for many cases, but I am not completely clear on who is making the choice of machine type. For some hubs, giving this discretion to the users makes sense, but for others (such as educational settings or events) I am not as sure. Is it possible for a hub admin to have additional control over machine types that users wouldn't be able to see? Regarding the cost/value, I think there is lots of potential work to do in this area to translate from what the cloud providers offer to what our end users see. That's a different issue, but one I think we can take on on behalf of our community partners. With so many different machine types, I think we need to be able to provide advice and recommendations.
Thank you @jmunroe for reading and considering this. Damián has scheduled this for discussion on the next prod/eng meet!
Currently, we are making the choice of what machine types the cloud should be able to start on demand, and what server options end users are presented with, often mapped to separate machine types' capacity in CPU/RAM. We could get a request for a specific machine type, but we are not actively asking for input about machine types. I'd like our default machine types, and the options to start servers against them, to be so good that users don't find themselves constrained and need to request something extra, at least when it is only a matter of CPU and memory as opposed to a need for attached GPUs.
I'll provide a short response, but I want us to not linger on this topic as I think it's out of scope for this issue. Yes, it is possible to provide all kinds of logic to make certain users see certain options, as long as we can access state about this. The key issue is that it is tricky to provide this configuration in a standardized way across hubs using different authenticators etc. I've implemented this elsewhere, making the choice of GPU servers show up only for a few people, for example. How I did it is documented in this forum post.
I agree we should write about this to some degree. I think we should settle for clarifying why we made our choice. After that, if it's seen as pressing, we could also provide guidance for users to make their own choices, but I don't think we should start out with that. They are better off not needing to learn about this. If they already have learned and have opinions, we can still adjust to them!
I'm not technically knowledgeable enough to assess your technical suggestions, but they seem reasonable to me and I'll defer to the judgment of the @2i2c-org/engineering team 👍 For the decision in general, it sounds good to me as long as we can assume a baseline level of technical competence from users. This feels like a good feature for research-focused hubs, but not for educational hubs where users have no concept of a "node", "how much CPU they need", "how much RAM they need", etc. For those users, I think they just need to click "launch" and they get a user session without any choices at all. Is it possible to support both this more complex workflow @consideRatio and a simplified one for communities that have less-technical users?
Thank you so much for putting so much time and effort into this @consideRatio ❤️ Hope you don't mind that I will ask some questions to make sure I fully understand it.

Questions

1. @consideRatio, this 1:1 mapping is actually enforced by setting guarantees, right? If that is true, then technically, to achieve a 1:n relationship, we could relax the guarantees to fit more pods, right? But this wouldn't necessarily be useful in practice, because the current node sizes are small, or rather normal sized? And this is the motivation behind changing the machine types available to ones that are bigger and better suited to be shared in a high-memory-usage scenario.

2. This proposal is mainly for allowing/leveraging sharing nodes when using profile lists, right?

3. I fear this might be confusing for some, and maybe it would be better to replace it with a combination of what the user actually gets (e.g., CPU and memory)? Also, because you can get 4GB by requesting a 4th of a 16GB server, but also by requesting half of an 8GB server, maybe by not strictly linking a pod to a specific machine type we can achieve better packing of pods on nodes.
Thanks @GeorgianaElena for thinking about this!!

1. You got it! We would enforce how many pods should be put on the same node via CPU/memory requests. I tried to clarify the requests/limits I propose to accomplish this in point 5. Since memory is harder to share than CPU, I also proposed machine types with high memory in relation to the CPU.

2. Technically, pods are always scheduled to share nodes if possible, but they end up isolated on dedicated nodes if they request so much CPU or memory that they won't fit next to each other. So, whether or not a profile_list is used and only one machine type is assumed, node sharing will be dictated by whether one or more pods can fit on the same node given the node's capacity and the pods' requested CPU/memory. Practically, I'd argue that hubs presented with only one machine option should also default to a degree of node sharing on a highmem node. If only one option is provided, it should be tailored to the number of users etc.: the more users, the larger the nodes and the larger the degree of sharing, I think.

3. If I understand you correctly, I agree! We should avoid presenting options to users as "one fourth" or "25%", and instead present what they at least get in CPU and memory. We should still make sure they also know what machine type is used, because getting at least 4 out of 4 available CPUs is different from getting at least 4 out of 16 available CPUs.
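To make the packing mechanism discussed above concrete, here is a minimal sketch, assuming an n2-highmem-4 node and illustrative guarantee values (not the actual 2i2c config), of a z2jh/KubeSpawner profileList entry. The guarantees become Kubernetes requests, and they are what determines how many user pods fit on one node.

```yaml
# Hypothetical sketch of a single profileList entry; values are illustrative.
jupyterhub:
  singleuser:
    profileList:
      - display_name: "Shared n2-highmem-4: ~1/4 of the node"
        description: "At least ~0.85 CPU and ~6.5 GB RAM, burstable up to the full node's CPU"
        kubespawner_override:
          # Schedule onto the highmem machine type via the well-known instance-type label.
          node_selector:
            node.kubernetes.io/instance-type: n2-highmem-4
          # Guarantees become requests: four pods with these guarantees fit on one
          # n2-highmem-4 node, leaving headroom for the per-node system pods.
          cpu_guarantee: 0.85
          mem_guarantee: 6.5G
          # Limits cap runtime usage; a generous CPU limit lets idle CPU be shared.
          cpu_limit: 4
          mem_limit: 6.5G
```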
BIG +1 here, for everything that has a profileList. I think educational hubs should not have profileLists by default, as the users don't know what these are. But for everything that does have one, plus one on implementing this. For history, we stole the current default from https://github.com/pangeo-data/pangeo-cloud-federation, which was mostly just 'there'. I think the proposed setup here is a clear unconditional positive, and we should accompany it with documentation as well so users can understand why these are the way they are.
I want to pick out one specific item here that would be uncontroversial: switching to AMD nodes by default, especially on AWS.
Another would be to offer a small 'shared' option as the default for all our research hubs, with a small pod that lands on a smallish node.
For reference, I've written about this topic in our product and engineering meeting notes for the 2023-02-14 meeting.
Work in progress notes!
I wanted to get started writing down some preliminary notes on cpu/memory requests/limits for user pods that could make sense when using node sharing. I think it's a very complicated topic where we want to find a balance between many things to optimize for. The text below is a brief start on writing down ideas, but I don't want to prioritize writing more about this atm since there are many topics I'd like to work on.

Planning user pods' requests / limits
I suggest we provide a profile list with options to choose between pre-defined cpu and memory requests/limits for efficient node sharing, like piloted in openscapes and linked-earth currently. But what are good cpu/memory requests/limits to set?

Scheduling
When a pod is scheduled, requests are very important but limits aren't: if the combined requests from pods fit on a node, a pod can be scheduled there. We have our user pods schedule on user-dedicated nodes, which though also need to run some additional pods per node (see the sketch after this comment).

Parts of available capacity

Ideas on a fractional cpu request

Related
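To illustrate the scheduling note above, and the idea of a fractional CPU request: the scheduler only sums a pod's requests against a node's allocatable capacity, while limits merely cap runtime usage. A hypothetical user pod's resources block, with made-up numbers, could request CPU low but allow bursting, while requesting memory firmly.

```yaml
# Hypothetical pod resource spec illustrating requests vs. limits (numbers are made up).
resources:
  requests:
    cpu: "500m"    # the scheduler packs pods by summing requests against node allocatable
    memory: "6Gi"  # memory is requested firmly since it can't be reclaimed once used
  limits:
    cpu: "4"       # CPU above the request can be used opportunistically, throttled at the limit
    memory: "6Gi"  # exceeding the memory limit gets the process OOM-killed, so limit == request
```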
I've investigated the allocatable capacity of these machine types on GCP and AWS. My conclusion is that we must provide the memory request based on each individual machine type, because there seems to be no trustworthy formula to calculate it.
gcp: # as explored on GKE 1.25
# The e2-highmem options are 99.99% like the n2-highmem options
n2-highmem-4: # 29085928Ki, 27.738Gi, 29.783G
allocatable:
cpu: 3920m
ephemeral-storage: "47060071478"
memory: 29085928Ki
capacity:
cpu: "4"
ephemeral-storage: 98831908Ki
memory: 32880872Ki
n2-highmem-16: # 122210676Ki, 116.549Gi, 125.143G
allocatable:
cpu: 15890m
ephemeral-storage: "47060071478"
memory: 122210676Ki
capacity:
cpu: "16"
ephemeral-storage: 98831908Ki
memory: 131919220Ki
n2-highmem-64: # 510603916Ki, 486.949Gi, 522.858G
allocatable:
cpu: 63770m
ephemeral-storage: "47060071478"
memory: 510603916Ki
capacity:
cpu: "64"
ephemeral-storage: 98831908Ki
memory: 528365196Ki
pods_overhead: |
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-vbvg8 100m (0%) 0 (0%) 0 (0%) 0 (0%) 2m34s
kube-system fluentbit-gke-rb4n8 100m (0%) 0 (0%) 200Mi (0%) 500Mi (0%) 2m34s
kube-system gke-metadata-server-d8nnp 100m (0%) 100m (0%) 100Mi (0%) 100Mi (0%) 2m34s
kube-system gke-metrics-agent-7lnb2 6m (0%) 0 (0%) 100Mi (0%) 100Mi (0%) 2m34s
kube-system ip-masq-agent-6sk7j 10m (0%) 0 (0%) 16Mi (0%) 0 (0%) 2m34s
kube-system kube-proxy-gke-leap-cluster-nb-large-d1625cd8-0gc5 100m (0%) 0 (0%) 0 (0%) 0 (0%) 2m33s
kube-system netd-ntd7r 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m34s
kube-system pdcsi-node-x6lvw 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 2m34s
support support-cryptnono-sz7tp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m14s
support support-prometheus-node-exporter-8t74g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2m14s
Resource Requests Limits
-------- -------- ------
cpu 426m (0%) 100m (0%)
memory 436Mi (0%) 800Mi (0%)
aws: # as explored on EKS 1.24
r5.xlarge: # 31391968Ki, 29.937Gi, 32.145G
allocatable:
cpu: 3920m
ephemeral-storage: "76224326324"
memory: 31391968Ki
capacity:
cpu: "4"
ephemeral-storage: 83873772Ki
memory: 32408800Ki
r5.4xlarge: # 127415760Ki, 121.513Gi, 130.473G
allocatable:
cpu: 15890m
ephemeral-storage: "76224326324"
memory: 127415760Ki
capacity:
cpu: "16"
ephemeral-storage: 83873772Ki
memory: 130415056Ki
r5.16xlarge: # 513938668Ki, 490.130Gi, 526.273G
allocatable:
cpu: 63770m
ephemeral-storage: "76224326324"
memory: 513938668Ki
capacity:
cpu: "64"
ephemeral-storage: 83873772Ki
memory: 522603756Ki
pods_overhead: |
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
amazon-cloudwatch fluent-bit-w464b 500m (3%) 0 (0%) 100Mi (0%) 200Mi (0%) 7m29s
kube-system aws-node-cf989 25m (0%) 0 (0%) 0 (0%) 0 (0%) 7m29s
kube-system ebs-csi-node-csht5 30m (0%) 300m (1%) 120Mi (0%) 768Mi (0%) 7m28s
kube-system kube-proxy-2wlq9 100m (0%) 0 (0%) 0 (0%) 0 (0%) 7m28s
support support-cryptnono-crqrq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7m8s
support support-prometheus-node-exporter-6dlxw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7m8s
Resource Requests Limits
-------- -------- ------
cpu 655m (4%) 300m (1%)
memory 220Mi (0%) 968Mi (0%)
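As a sketch of how the measurements above could feed into per-machine-type memory guarantees, assuming four user pods per node: the numbers below are derived only from the GKE/EKS figures listed in this comment, and the key name is hypothetical, so this is illustrative rather than a settled recommendation.

```yaml
# Illustrative derivation for GCP n2-highmem-4 (from the GKE 1.25 figures above):
#   allocatable memory:              29085928Ki (~27.74Gi)
#   per-node system pods' requests:     436Mi   (~446464Ki)
#   left for user pods:             ~28639464Ki (~27.31Gi)
#   per pod at 4 pods per node:      ~7159866Ki (~6.83Gi)
n2-highmem-4:
  mem_guarantee_per_quarter_share: 6.8Gi  # rounded down to keep a safety margin
# The AWS r5.xlarge figures differ (31391968Ki allocatable, 220Mi overhead requests),
# which is why a single cross-cloud formula isn't trustworthy and each machine type
# needs its own measured value.
```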
In the product and engineering meeting on April 4, 2023, we agreed that I will try to document these ideas for other 2i2c engineers during Q2.
Related
With the move to node sharing, we probably want to update the first line of this documentation page, as I think it is no longer accurate: https://docs.2i2c.org/user/topics/data/filesystem/ EDIT: Erik added a note about this to #2041.
While this issue had the purpose of capturing the kinds of changes I think make sense, it was also a large issue with many things that are now hard to track. I've opened a lot of other issues that track parts of this instead, so I'm now closing this to help us focus on smaller pieces.
Our clusters ship with a default set of machine types, and we provide a KubeSpawner profile_list to start pods on each type, where there is a 1:1 relation between nodes and pods: for each node, there is only one user pod.
I think a 1:n nodes:pods relationship should be the default, not 1:1! With 1:n, I think we will provide a far better experience for users and for ourselves, since it will reduce the amount of support tickets we get. I think I can motivate this thoroughly, but to not write a long post, let me propose new defaults instead.

Proposed new defaults
1. UPDATE: Now tracked via "Disruptive maintenance? Consider transitioning to highmem nodes (core / user)" #2511 and "2i2c, terraform: transition user nodes & neurohackademy nodes from n1- to n2- nodes" #3210.
Let's provide high-memory machine types instead of normal machine types, as memory is the limitation when sharing nodes: n2-highmem-4 / e2-highmem-4 on GCP and r5.xlarge / r5a.xlarge on AWS as the smallest node types. n1-highmem machines don't have a 1:8 CPU:memory ratio and shouldn't be considered, and the n2 machines also have more performant CPUs. Only when GPUs are needed must we use n1 instead of n2.

2. UPDATE: we stick with Intel.
The difference between n2/e2 on GCP and r5/r5a on AWS is between Intel and AMD processors, where the AMD processors are ~30% and ~10% cheaper on GCP and AWS respectively. I suggest we default to AMD unless we foresee common issues. (EDIT: It was GCP that was 30% cheaper, and AWS 10% cheaper.)
3. UPDATE: Tracked via "Let all clusters' cloud infra support the instance types 4, 16, and 64 CPU highmem nodes" #3256.
Provide machine types that increment 4x in size to simplify a lot: start at 4 CPU, then 16 CPU, then 64 CPU.
4. UPDATE: tracked via "Add script to generate resource allocation (nodeshare) choices" #3030.
We default to providing a profile list that includes three choices representing the three machine types, but for each, we allow the user to specify the share they want. Do they want a dedicated server, a 4th of a server, or a 16th of a server?
5. UPDATE: tracked via "Add script to generate resource allocation (nodeshare) choices" #3030.
The CPU requests/limits and memory requests/limits should be derived per machine type roughly as sketched below, where n is 1 for a dedicated server, 4 for a 4th of a server, etc. When calculating this, we should account for some important pods that need to run on each node, and save a small amount of space for them.
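A rough sketch of how points 4 and 5 could be expressed with KubeSpawner profile_options, assuming hypothetical guarantee values for an n2-highmem-4 node, memory limits equal to guarantees, and CPU limits that allow bursting to the whole node; real values would be computed per machine type after subtracting the per-node system pods' requests, as discussed in the comments above.

```yaml
# Hypothetical sketch, not the exact values proposed; n = 1, 4, 16 shares of one machine type.
jupyterhub:
  singleuser:
    profileList:
      - display_name: "n2-highmem-4: 4 CPU / ~27 GB node"
        kubespawner_override:
          node_selector:
            node.kubernetes.io/instance-type: n2-highmem-4
        profile_options:
          share:
            display_name: "Share of the node"
            choices:
              dedicated:   # n = 1
                display_name: "Whole node (~3.4 CPU, ~27 GB guaranteed)"
                kubespawner_override:
                  cpu_guarantee: 3.4
                  mem_guarantee: 27G
                  cpu_limit: 4
                  mem_limit: 27G
              quarter:     # n = 4
                display_name: "1/4 of the node (~0.85 CPU, ~6.5 GB guaranteed)"
                kubespawner_override:
                  cpu_guarantee: 0.85
                  mem_guarantee: 6.5G
                  cpu_limit: 4
                  mem_limit: 6.5G
              sixteenth:   # n = 16
                display_name: "1/16 of the node (~0.2 CPU, ~1.6 GB guaranteed)"
                kubespawner_override:
                  cpu_guarantee: 0.2
                  mem_guarantee: 1.6G
                  cpu_limit: 4
                  mem_limit: 1.6G
```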
Please review with 👍 or comment
I'd love for us to act on this as I think it's a super important change, but before acting I'd like to see agreement on the direction. @2i2c-org/engineering, could you give a 👍 to this post if you are positive about the suggested changes, or leave a comment describing what you think?
If possible, please also opine on whether we should make use of AMD servers by default on GCP and AWS respectively. I'm not sure at all; Intel is more tested, but ~30% savings (on GCP) is a big deal.
References
AWS: r5a.xlarge / r5a.4xlarge / r5a.16xlarge
GCP: e2-highmem-4 / e2-highmem-16 / n2-highmem-64
Azure: Standard_E4a_v4 / Standard_E16a_v4 / Standard_E64a_v4
Motivation
In this ticket, a user ended up wanting to try using a 64 CPU node, but we only had 16 CPU nodes set up by default. If we had 4 / 16 / 64 CPU nodes by default instead of 2 / 4 / 8 / 16, hub users would have a bit more flexibility.