
Partitionable with common attributes #30

Closed


johnbelamaric

This is an evolution of the partitionable model defined in #27, which moves common attributes up to the pool level to reduce the size of the objects.
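The override semantics implied by this change ("pool-level attributes apply to every device unless the device overrides them") can be sketched as follows. The types and names here are hypothetical simplifications for illustration, not the PR's actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical simplified types; the PR's real ResourcePool/Device API
// is larger than this.
type Attribute struct {
	Name  string
	Value string
}

// effectiveAttributes flattens pool-level common attributes onto one
// device: every pool attribute applies unless the device declares an
// attribute with the same name, in which case the device's value wins.
func effectiveAttributes(pool, device []Attribute) []Attribute {
	merged := map[string]string{}
	for _, a := range pool {
		merged[a.Name] = a.Value
	}
	for _, a := range device {
		merged[a.Name] = a.Value // device-level override
	}
	out := make([]Attribute, 0, len(merged))
	for name, value := range merged {
		out = append(out, Attribute{Name: name, Value: value})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	pool := []Attribute{{"vendor", "acme"}, {"driver-version", "1.0"}}
	device := []Attribute{{"driver-version", "1.1"}}
	// driver-version resolves to the device's value; vendor is inherited.
	fmt.Println(effectiveAttributes(pool, device))
}
```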

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from klueska and pohly June 14, 2024 15:51
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 14, 2024
@@ -68,6 +68,11 @@ type ResourcePoolSpec struct {
// +optional
SharedCapacity []SharedCapacity `json:"sharedCapacity,omitempty"`

// Attributes contains common device attributes that are the same
// for all devices in the pool, unless a device specifically over

"the pool" or "the slice"?

The prototype is not up-to-date here 😢 We should update it to match the current KEP before doing further prototyping.

@johnbelamaric (author)
The reason will be displayed to describe this comment to others. Learn more.

Yeah. I didn't want to do that yet, I figured it could be done and I could rebase...

@pohly

pohly commented Jun 14, 2024

> to reduce the size of the objects

That helps reduce the size on average, but for the worst-case analysis which determines the limits of the slices we have to assume that the new Attributes is fully-populated and all devices have the maximum number of attributes.

@johnbelamaric

> to reduce the size of the objects
>
> That helps reduce the size on average, but for the worst-case analysis which determines the limits of the slices we have to assume that the new Attributes is fully-populated and all devices have the maximum number of attributes.

Yes. True.
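The worst-case point above can be made concrete with a toy calculation. All constants below are placeholders, not the real API limits: because validation must assume the pool-level Attributes list is fully populated while every device still carries its own maximum, pool-level attributes only add to the worst-case bound, they never shrink it:

```go
package main

import "fmt"

// Illustrative worst-case accounting only; the real limits (bytes per
// attribute, attributes per device, devices per slice) are placeholders.
const (
	maxAttrBytes       = 256 // hypothetical encoded size of one attribute
	maxAttrsPerDevice  = 32  // hypothetical per-device attribute limit
	maxDevicesPerSlice = 128 // hypothetical devices per slice
)

// worstCaseBytes assumes the pool-level Attributes list is fully populated
// AND every device still carries its own maximum number of attributes, so
// pool-level attributes strictly increase the bound.
func worstCaseBytes(poolAttrs int) int {
	perDevice := maxAttrsPerDevice * maxAttrBytes
	return poolAttrs*maxAttrBytes + maxDevicesPerSlice*perDevice
}

func main() {
	fmt.Println(worstCaseBytes(0))  // flat-model bound
	fmt.Println(worstCaseBytes(32)) // pool attributes make it larger
}
```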

@johnbelamaric

I am putting together several options; see the comment in #20

- bool: true
  name: mig-capable
- bool: false
  name: mig-enabled

NOTE: we should establish core guidelines - are attributes expected to be mutable or immutable? Do we encourage drivers to embed status here or are attributes only intrinsic things about the device? This one feels like status.

@klueska klueska Jun 15, 2024


This should not be an attribute. MIG capable yes, enabled no. This is leftover from the structure I used to represent all of this in classic DRA. It is very compact, but also not generalizable.

string: 1g.5gb+me
name: gpu-0-mig-1g.5gb-me-1
sharedCapacityConsumed:
- capacity: "14"

"capacity" is a producer term, not a consumer term. This should be "quantity" or "amount" or "request" or something.


It reuses the same struct as that used to declare the capacity. We would need a new struct then (which seems fine).

name: gpu-0-ofa-engines
- capacity: 4864Mi
name: gpu-0-memory
- capacity: "1"

MAYBE (really not sure) we should say that if capacity is omitted here, it means "all"? Then it is sort of a fancy mutex for "partitions" and a quantity for fungibles.

Maybe not worthwhile
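One way to read the "omitted means all" idea, sketched with hypothetical names (the PR's actual consumed-capacity entries always carry an explicit quantity):

```go
package main

import "fmt"

// consumedAmount interprets a shared-capacity consumption entry under the
// proposed rule: a nil (omitted) amount means "all of it", so the entry
// behaves like a mutex for whole partitions, while an explicit amount
// behaves like an ordinary quantity for fungible resources. Hypothetical
// sketch, not the PR's actual semantics.
func consumedAmount(requested *int64, total int64) int64 {
	if requested == nil {
		return total // omitted => claim the whole shared capacity
	}
	return *requested
}

func main() {
	var whole *int64 // omitted in the YAML
	fourteen := int64(14)
	fmt.Println(consumedAmount(whole, 40))     // 40: exclusive use
	fmt.Println(consumedAmount(&fourteen, 40)) // 14: ordinary quantity
}
```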

@thockin thockin left a comment


I think it is clear that shared attributes is an appropriate optimization to include.

I think this further emphasizes that some form of nesting would (IMO) make other optimizations possible. There are attributes that apply to all "cards" (e.g. vendor). There are some attributes that apply to all GPUs (MIG or not) on a single card (e.g. the card UUID, for matching). There are some attributes that apply to each leaf device. Why are we shoehorning that into 2 levels?

@johnbelamaric

Option 4 has some nesting. That's #31. It is much more efficient than this one.

@johnbelamaric

We could do more levels. Not clear the payoff is there.

@johnbelamaric

Option 3 adds common attributes. Option 4 adds common attributes AND a common partition "map".

@klueska

klueska commented Jun 15, 2024

I 100% agree we should have shared attributes. My first prototype had it, but at the time you said "let's not prematurely optimize for size" so we dropped it.

I'm still not sold on nesting though. I also had it originally (in my "recursive" device model), but you all (rightly) convinced me to drop it. And now after working with flat devices and updating both the example driver and the NVIDIA GPU driver to adhere to them, I'm really happy with the flexibility that a flat model brings us.

I really don't think it buys us much to have nesting, and I have a strong feeling it will come back to bite us fairly quickly.

@thockin

thockin commented Jun 15, 2024

I was continuing on the KEP PR, but detoured to these options, so I will copy a comment:

I just find it super weird to have gpu1's shared items consumable from gpu0. This is the thing that is setting me on edge. That implies to me a level of fungibility which doesn't exist. There is a grouping that is smaller than a ResourceSlice but bigger than a Device, and we are not modelling it. Call it a "card" for the moment. A slice doesn't have shared resources, a "card" does. Now you'll probably tell me "actually, one card can borrow resources on another card". In fact, I can already see the (hypothetical) use-case for a channelized <something> which can be effectively RAID'ed into a larger logical device. But that's not this, and (TTBOMK) that doesn't exist yet.

I really can see both sides, and I don't mean to be dogmatic. It just smells funny. Let's keep the conversation overall moving forward, and if this is all that's left, we can hash it out.

> My first prototype had it, but at the time you said "let's not prematurely optimize for size" so we dropped it.

Yeah, we dropped a LOT to get the baseline, and bringing partitions back makes it clear to me that this is one piece that really does make sense.

- capacity: 40Gi
name: gpu-0-memory
- attributes:
- name: mig-profile

We should include in this YAML the requirement that all MIG partitions on a "card" have a "parent UUID" attribute with the same value (for matching). I think that would highlight the "hidden" nesting level, even if we ultimately choose not to model it.
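The suggested convention could be consumed by a scheduler-side helper along these lines; "parent-uuid" and the types here are illustrative, not part of the actual API:

```go
package main

import "fmt"

// Illustrative types only; the real Device type in this PR differs.
type Device struct {
	Name       string
	Attributes map[string]string
}

// groupByParent buckets devices by their "parent-uuid" attribute, the
// proposed convention for tying all MIG partitions of one physical card
// together even though the model itself stays flat.
func groupByParent(devices []Device) map[string][]string {
	groups := map[string][]string{}
	for _, d := range devices {
		parent := d.Attributes["parent-uuid"]
		groups[parent] = append(groups[parent], d.Name)
	}
	return groups
}

func main() {
	devices := []Device{
		{Name: "gpu-0-mig-1g.5gb-0", Attributes: map[string]string{"parent-uuid": "GPU-aaaa"}},
		{Name: "gpu-0-mig-1g.5gb-1", Attributes: map[string]string{"parent-uuid": "GPU-aaaa"}},
		{Name: "gpu-1-mig-2g.10gb-0", Attributes: map[string]string{"parent-uuid": "GPU-bbbb"}},
	}
	// Both gpu-0 partitions land in the GPU-aaaa group; gpu-1's is separate.
	fmt.Println(groupByParent(devices))
}
```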

pohly added commits to pohly/kubernetes that referenced this pull request Jul 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
