WIP: Add a POC of an alternate partitioning scheme #35
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: klueska. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
The branch was force-pushed from e2d0d09 to dea2041.
I think I understand.
- Create common attribute groups
- Create common sharedCapacityTemplates that are used to represent what a "namespace" should contain for shared capacities
- Create deviceTemplates that describe the shape of the main GPU as well as each partition
- Create device instances that reference those templates and overlay a name, attributes, and capacities specific to that instance, as well as specify the "namespace" (physical card in this example) from which the capacities are drawn
I think this can work. It is sort of like Option 4, except it flattens the "DeviceShape contains partitions" into "one shape per partition", then lists all the devices and partitions explicitly, referencing the shape (template) to reduce repetition.
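A minimal sketch of that structure as I understand it (purely illustrative -- apart from the mig-capable attribute and the sharedCapacityTemplateName reference visible in the diff excerpt further down, the field names, values, and layout here are my own reconstruction, not copied from the actual poc.yaml):

```yaml
# Illustrative skeleton of the four building blocks described above.
attributeGroups:
- name: common-gpu-attributes
  attributes:
  - name: mig-capable
    bool: true
sharedCapacityTemplates:
- name: gpu-shared-resources        # what one physical card ("namespace") provides
  sharedCapacities:
  - name: memory
    quantity: 40Gi
deviceTemplates:
- name: a100-1g-5gb                 # the shape of one partition type
  attributeGroups:
  - common-gpu-attributes
  sharedCapacitiesConsumed:
  - sharedCapacityTemplateName: gpu-shared-resources
    capacities:
    - name: memory
      quantity: 5Gi
devices:
- name: gpu-0-mig-1g-5gb-0
  deviceTemplateName: a100-1g-5gb
  sharedCapacityInstances:
    gpu-shared-resources: gpu-0     # the physical card the capacities are drawn from
```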
I don't like the word "namespace"; it already has too strong a meaning and is pretty confusing to see here.
Since you are explicitly listing every device, I think this will not achieve an order-of-magnitude reduction in size. Quick calculation: each partition takes ~13 lines of YAML, so the size of the YAML grows as X + 13dp, where d = number of physical devices, p = partitions per device, and X is some constant for the template. Since p is 28(?), we are looking at X + 364d total lines.
Option 4 encodes the shape in roughly the same X, but the device list grows not as O(dp) but as O(d), and with a smaller constant - something like X + 7d.
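For a rough sense of scale (my numbers, just to make the comparison concrete): with d = 8 GPUs on a node and p = 28, option 6 lands around X + 13·28·8 = X + 2912 lines for the node's slices, while option 4 stays around X + 7·8 = X + 56.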
But we can see once you finish the prototype.
dra-evolution/pkg/api/poc.yaml (Outdated)
- name: mig-capable
  bool: true
sharedCapacitiesConsumed:
- sharedCapacityTemplateName: gpu-shared-resources
I don't understand what the sharedCapacityTemplate is for. What information does the template provide that you are not repeating below? How do we use the named template, and what is its relationship to the capacities below?
Yes, it is repeated here, but only because a full GPU happens to consume all of the shared capacity that this example contains. It doesn't need to be true in general though.
In fact, I haven't brought it up much, but on A100, we will not actually be advertising full GPUs at the same time as MIG devices. If an A100 has MIG enabled we will only advertise its MIG devices; if it has MIG disabled we will only advertise it as a full GPU.
This is because putting a GPU into and out of MIG mode on Ampere is very difficult (it requires all GPU workloads on all GPUs to be drained, and a GPU reset to be performed).
However, on Hopper+ we will advertise both the full GPU and its MIG devices (because a GPU reset is no longer required to flip in and out of MIG mode on these newer-generation GPUs).
I have updated the example to match reality better.
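To make the "full GPU consumes everything" point concrete, here is a hedged fragment in the same illustrative style as the sketch above (names are mine, not from the PR): the full-GPU template lists the entire shared pool under sharedCapacitiesConsumed, which is why its capacities look repeated next to the sharedCapacityTemplate it references, whereas a partition template lists only its slice.

```yaml
# Illustrative only: a full-GPU template that happens to consume the whole
# shared pool defined by gpu-shared-resources.
deviceTemplates:
- name: a100-40gb-full
  sharedCapacitiesConsumed:
  - sharedCapacityTemplateName: gpu-shared-resources
    capacities:
    - name: memory
      quantity: 40Gi              # == everything the shared capacity template defines
```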
The branch was force-pushed from 0408229 to 51cef6d.
Signed-off-by: Kevin Klues <[email protected]>
In putting this together, it's become obvious that a lot of what is being "templated" would be repeated in each and every slice. Would it be possible to create a separate API server object to hold the "template" objects for a given driver that can then be referenced by its resource slices? Possibly even leveraging a ConfigMap to do it instead of defining a new type.
Certainly it's possible; the question is whether it is worth the complexity, since you then have another independent object that can change or be missing, etc. This would actually help for any of options 2 and 4+. One thing we may want to think about is which factors drive scale and which are likely to grow fastest over time:
We can characterize each suggestion then based on which of these scaling factors are relevant:
The suggestion above would change these (on a per node basis):
Setting that aside, going back to the options without that suggestion, it would be possible to merge options 4 and 6 (option "10"...no, better stick with 7), such that: 1) We capture each partition shape once like in option 6; 2) Implicitly generate partitions like in option 4. If we did that, we would have:
which seems like the best we can do while keeping the repeated items in the slice.
Thinking more, I really do think that the things that will likely increase the most in the next 3-5 years are:
This means that factoring out things that are duplicated per slice is a good idea, since the number of slices will increase with those factors. In other words, let's try to keep growth from being multiplicative in those factors. This makes me think our best bet is going to be:
I hadn't put the numbers together, but your conclusion at the end is where my head was when suggesting this. There will still need to be some per-slice "template" data (e.g. the …).
I picture one "front matter" object per GPU type which defines everything that is non-node-specific. And then each device in a resource slice has fields that point to a specific "front matter" object and then pull bits and pieces from it as appropriate.
Simple devices can still be just a named list of attributes, but if you want anything more sophisticated you have to start using this more complex structure.
Yes, that's what I am thinking too. Basically push the invariant stuff across nodes into a separate object, and then refer to it. Those "front matter" pieces are probably constant for a given combination of hardware, firmware and driver versions.
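A rough sketch of what that could look like (the kind and the reference field below are hypothetical, invented here for illustration; nothing in this PR or in the resource.k8s.io API defines them):

```yaml
# Hypothetical "front matter" object, one per GPU type: everything that is
# invariant across nodes for a given hardware/firmware/driver combination.
kind: DeviceFrontMatter             # made-up kind name
metadata:
  name: nvidia-a100-sxm4-40gb
spec:
  attributeGroups: []               # elided; same kind of content as the sketches above
  sharedCapacityTemplates: []       # elided
  deviceTemplates: []               # elided
---
# A device in a ResourceSlice then points at the front matter object and only
# overlays the node/instance-specific bits.
devices:
- name: gpu-0
  frontMatterName: nvidia-a100-sxm4-40gb   # made-up reference field
  deviceTemplateName: a100-40gb-full
  sharedCapacityInstances:
    gpu-shared-resources: gpu-0
```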
FYI I added this as "Option 6" as well as "Option 7" here: #20 (comment)
In relation to what came up in the call tonight ... Instead of having a single centralized object with all of the "front matter", we could have one "front matter" object per node that all of the slices for that node refer to. It would likely have redundant information compared to most other nodes, but then we at least keep the front matter separate from the resource slices that consume it (and if a driver does want to go through the headache of centralizing it, they still can).
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I haven't yet written this up properly (or added any code for it), but I wanted to push something out there with my thoughts around how to support partitioning in a more compact way.
Below is the (incomplete) YAML for what one A100 GPU with MIG disabled, one A100 GPU with MIG enabled, and one H100 GPU (regardless of MIG mode) would look like. I am currently only showing the full GPUs and the 1g.*gb devices (because I wrote this by hand), but you can imagine how it would be expanded with the rest.
Most of it is self-explanatory, except for one thing -- what the new sharedCapacityInstances field on a device implies. It is a way to define a "boundary" for any shared capacity referenced in a device template, meaning that all devices that provide the same mappings for a given sharedCapacityInstance will pull from the same SharedCapacity.
I will add more details soon (as well as a full prototype), but I wanted to get this out for initial comments before then.
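To illustrate the "boundary" idea (again a hedged sketch in the style of the earlier ones, not the actual poc.yaml): devices that map gpu-shared-resources to the same instance draw down one common pool, while a device on another card maps it to a different instance and is accounted separately.

```yaml
# Illustrative only. Both 1g devices on card 0 map the shared capacity to the
# same instance ("gpu-0"), so they deduct from the same SharedCapacity; the
# device on card 1 maps it to "gpu-1" and has its own pool.
devices:
- name: gpu-0-mig-1g-5gb-0
  deviceTemplateName: a100-1g-5gb
  sharedCapacityInstances:
    gpu-shared-resources: gpu-0
- name: gpu-0-mig-1g-5gb-1
  deviceTemplateName: a100-1g-5gb
  sharedCapacityInstances:
    gpu-shared-resources: gpu-0
- name: gpu-1-mig-1g-5gb-0
  deviceTemplateName: a100-1g-5gb
  sharedCapacityInstances:
    gpu-shared-resources: gpu-1
```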