
KEP-4381: DRA: PRR and API for beta of structured parameters in 1.32 #4869

Merged (1 commit, Oct 8, 2024)

Conversation

@pohly (Contributor) commented Sep 24, 2024:

  • One-line PR description: PRR and API for beta of structured parameters in 1.32
  • Other comments:

Much of the PRR text that was originally written for "classic DRA" also applies to "structured parameters". It gets moved from #3063 to #4381, with some minor adaptations. The placeholder comments get restored in #3063 because further work on that KEP would be needed to move it forward, if it gets moved forward at all instead of being abandoned.

The v1beta1 API will be almost identical to the v1alpha3 API, with just some minor tweaks to fix oversights.

The kubelet gRPC API gets bumped with no changes. Nonetheless, drivers should be updated, which can be done by updating the Go dependencies and optionally changing the API import.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 24, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 24, 2024
@pohly (Contributor Author) commented Sep 24, 2024:

/assign @johnbelamaric @klueska

@pohly (Contributor Author) commented Sep 24, 2024:

/cc @mrunalp

@@ -2166,6 +2215,10 @@ k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
with “volume” replaced by “resource” and volume specific parts removed.

Versions v1alpha4 and v1beta1 are supported by kubelet. Both are identical.
DRA drivers should implement both because support for v1alpha4 might get
removed.
pohly (Contributor Author):

If you are curious about the technical aspect of implementing this, see the "DRA kubelet: add v1beta1 gRPC API" commit.
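Concretely, the "optionally changing the API import" part of a driver update might look like the before/after below. The exact package paths are assumptions based on the KEP text (the generated packages live under k8s.io/kubelet/pkg/apis/dra); since both gRPC versions are identical, the version segment of the import is the only source change:

```go
// Before (driver built against the 1.31 kubelet API; path assumed):
import drapb "k8s.io/kubelet/pkg/apis/dra/v1alpha4"

// After (driver built against the 1.32 kubelet API; path assumed):
// everything else comes from updating the Go dependencies.
import drapb "k8s.io/kubelet/pkg/apis/dra/v1beta1"
```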

# - my_feature_metric
- resource_controller_create_total
- resource_controller_create_failures_total
- resource controller workqueue with name="resource_claim"
Contributor:

I tried to search for these metrics but I did not find them in the code base.

Maybe resourceclaim_controller_create_attempts_total
and resourceclaim_controller_create_failures_total.

I don't know what the last metric corresponds to.

pohly (Contributor Author):

Good catch, will update the KEP.

pohly (Contributor Author):

About "resource controller workqueue": Those are the workqueue_* metrics from https://kubernetes.io/docs/reference/instrumentation/metrics inside the kube-controller-manager.

// the value from there if this field is not set.
//
// +required
AdminAccess *bool

// of logical steps. Validation against those limits happens only when
// setting an expression for the first time or when changing
// it. Therefore it is possible to change these limits without
// affecting stored expressions. Those remain valid.
pohly (Contributor Author):

@jpbetz, @cici37: can you help review this proposal for a CEL cost limit in DRA? See also below for more comments on CELSelectorExpressionMaxCost.

The corresponding code is currently in kubernetes/kubernetes#127511.
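A minimal sketch of the ratcheting validation the comment above describes: the cost limit is enforced only for new or changed expressions, so raising or lowering the limit later cannot invalidate stored objects. estimateCELCost is a hypothetical helper standing in for the real estimation in kubernetes/kubernetes#127511:

```go
package validation

import "fmt"

// estimateCELCost is a hypothetical stand-in for the real static CEL cost
// estimation wired up in kubernetes/kubernetes#127511.
func estimateCELCost(expr string) (uint64, error) {
	return 0, nil // placeholder
}

// validateSelectorCost enforces the cost limit only when an expression is
// set for the first time or changed. Stored expressions are never
// re-validated, so changing the limit leaves them valid.
func validateSelectorCost(newExpr, oldExpr string, maxCost uint64) error {
	if newExpr == oldExpr {
		return nil // unchanged stored expression remains valid
	}
	cost, err := estimateCELCost(newExpr)
	if err != nil {
		return fmt.Errorf("compiling CEL expression: %w", err)
	}
	if cost > maxCost {
		return fmt.Errorf("estimated CEL cost %d exceeds limit %d", cost, maxCost)
	}
	return nil
}
```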

pohly (Contributor Author):

IMHO this is not a blocker for merging the KEP. We can still review the details in the code PR.

Member:

agreed


Do we need to enforce such a policy by default? If so, with which API?

<<[/UNRESOLVED]>>
pohly (Contributor Author):

This was raised by David Eads when reviewing the VAP that we used in 1.31 (see above).

cc @ritazh

pohly (Contributor Author):

Options:

  • Keep the field as alpha with a separate feature gate (DRAAdminAccess) while promoting the rest of the KEP to beta. Defer the decision on how to do permission checks.
  • Same as before, but describe the field in a separate KEP.
  • Implement access control via an admin-access.kubernetes.io namespace label (name to be decided) with an in-tree admission controller in 1.32, and promote it to beta together with this KEP.
  • Create a separate AdminAccessResourceClaim top-level type and use RBAC permissions to decide who's allowed to create them.

We discussed this in today's WG Device Management meeting and lean towards the first option. A second KEP seemed overkill. But we are aware that this stretches the KEP process a bit. The third option is doable, but it might not be the right API for this kind of access control. Perhaps some future mechanism in Kubernetes will allow doing this in a better way. The last option doesn't scale when the number of claim aspects that need special permission grows.
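The first option follows the usual pattern for an alpha field inside a beta API: the apiserver strategy drops the field on create unless the gate is on, but preserves it on update when it is already in use. A rough sketch of that pattern (the helper names and exact wiring are assumptions; the approach is the standard dropDisabledFields one):

```go
import (
	resourceapi "k8s.io/api/resource/v1beta1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// dropDisabledDRAAdminAccessFields clears adminAccess on create (oldClaim ==
// nil) or on update when the field was not already in use, unless the
// DRAAdminAccess feature gate is enabled.
func dropDisabledDRAAdminAccessFields(claim, oldClaim *resourceapi.ResourceClaim) {
	if utilfeature.DefaultFeatureGate.Enabled(features.DRAAdminAccess) {
		return // gate on: keep whatever the user set
	}
	if oldClaim != nil && adminAccessInUse(oldClaim) {
		return // preserve data written while the gate was enabled
	}
	for i := range claim.Spec.Devices.Requests {
		claim.Spec.Devices.Requests[i].AdminAccess = nil
	}
}

func adminAccessInUse(claim *resourceapi.ResourceClaim) bool {
	for _, request := range claim.Spec.Devices.Requests {
		if request.AdminAccess != nil {
			return true
		}
	}
	return false
}
```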

Member:

The consensus was to leave the AdminAccess field in alpha. This will mean either a separate KEP, or combining that with another set of small enhancements under a single feature gate. Not for any technical reason, but just because our release process doesn't have a way to move KEPs partially to another stage. At least, I think that's true - @jrrickard does that sound right? The question is - could we have two feature gates in one KEP, and have one feature gate advance to beta and the other stay in alpha? I am assuming this will confuse our processes that track features.

pohly (Contributor Author):

> combining that with another set of small enhancements under a single feature gate

That feels wrong to me. Each KEP should have its own use cases and proceed according to its own graduation criteria. We don't want to hold back one "sub-feature" because its sibling isn't ready and they share the same feature gate.

pohly (Contributor Author), Oct 2, 2024:

"AdminAccess" probably deserves its own KEP. I have a slight tendency towards keeping it here with a separate feature gate for 1.32 to avoid causing more confusion and work shortly before the KEP freeze, and then split it out after the 1.32 release for "proper" tracking in >= 1.33.

Member:

SGTM

pohly (Contributor Author):

Added text and feature gate.

gate explicitly gets enabled. The feature gate remains off by default because
DRA depends on a new API group which, following the
[convention](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/3136-beta-apis-off-by-default),
is off by default.
Member:

Shouldn't the feature gate be on by default, but just the API off by default? This is a significant distinction for managed k8s which may allow different controls for feature gates versus API enablement.

pohly (Contributor Author), Oct 2, 2024:

The problem is that the code which gets enabled because of the feature gate then tries to use the API, which causes failures if the API is disabled. For example, kube-scheduler fails to start because the ResourceSlice informer fails to sync the cache.

Member:

Hmph. That's what discovery is for...

pohly (Contributor Author):

But what's the advantage of enabling the feature gate when the API is off?

I can see how it might simplify the cluster administration, but that's all. It adds more code and thus additional complexity, too.

pohly (Contributor Author):

And a need for more testing. "Feature gate on, API off" becomes another supported configuration that must work.

pohly (Contributor Author):

> 1. validate user doesn't enable the gate and not the API

I'm not aware of any code which implements that. @aojea, you already had to deal with this in 1.31 for MultiCIDRServiceAllocator. The comment was about that feature. Did you implement such validation?

Member:

I believe this is the last remaining PRR item; once it's resolved I can approve. If you can add this validation as the plan-of-record, we can merge it now. If it turns out to be too difficult, we can skip it and update the KEP to reflect that.

pohly (Contributor Author):

I've added one paragraph about this to the scheduler section, ending with "How to implement such a check reliably still needs to be determined."

I'm surprised that this is now required because I've not seen something like that before and a bit worried that this additional check itself will be a source of failures. I'll start a new discussion in #sig-architecture about this.

@aojea (Member), Oct 9, 2024:

> 1. validate user doesn't enable the gate and not the API

> I'm not aware of any code which implements that. @aojea, you already had to deal with this in 1.31 for MultiCIDRServiceAllocator. The comment was about that feature. Did you implement such validation?

@pohly no, I didn't. Enabling the APIs in the multicidr feature without the feature gate is a no-op, and enabling the feature without the APIs does not start the cluster at all. It seems that now is too late, but if I have to do it again I will implement the cross validation; it seems the best option, and we should probably document it for future cases if we find a standard, reliable way of doing it.
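A sketch of such a cross-check, assuming a discovery-based probe at component startup: if the feature gate is on but the required group version is not served, fail fast with an actionable error instead of crash-looping on informer sync.

```go
import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// checkDRAAPIAvailable verifies that the API group version required by the
// DynamicResourceAllocation feature gate is actually served. Intended to be
// called during component startup, before informers are constructed.
func checkDRAAPIAvailable(dc discovery.DiscoveryInterface) error {
	const gv = "resource.k8s.io/v1beta1"
	if _, err := dc.ServerResourcesForGroupVersion(gv); err != nil {
		return fmt.Errorf("feature gate DynamicResourceAllocation is enabled, but API %s is not available: %w", gv, err)
	}
	return nil
}
```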


how much hardware is installed in the cluster. Typically, each driver will
publish one ResourceSlice per node where it manages hardware.

Kubernetes itself will not impose specific limitations for the number of these
Contributor:

Any etcd scale limits that we need to worry about, especially if the ResourceSlices get bigger with partitionable devices support?

pohly (Contributor Author):

I'm not aware of any besides the usual maximum size of a single object. The limits imposed on slices and string lengths must ensure that the object can be stored when encoded, which is why we did those back-of-the-envelope calculations about the maximum size.
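For a sense of scale (illustrative numbers, not the KEP's exact limits): 128 devices per slice × 32 attributes per device × roughly 128 bytes per encoded attribute ≈ 512 KiB, which stays below etcd's default request size limit of about 1.5 MiB. The KEP's actual limits were chosen with this kind of calculation.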

@pohly (Contributor Author) commented Oct 4, 2024:

> Can we capture some of the points re: scheduling/allocation perf and scale testing as GA criteria (as discussed in the wg meeting today) or a note so we don't forget when we go to GA :)

I added:

  • Scalability and performance analysis done and published

@pohly (Contributor Author) commented Oct 4, 2024:

I think I have addressed all review comments and there's nothing unresolved left.

@mrunalp (Contributor) commented Oct 4, 2024:

@johnbelamaric I have added the sig-node approval. Leaving the PRR approval/lgtm to you.

@johnbelamaric (Member) left a review comment:

I don't think the proposed handling of the alpha field in the beta API is correct.

# because it remains in alpha. It controls whether the "AdminAccess"
# field can be set. If set, other components support it
# because the field might have been set when enabled.
# Therefore other components don't need to and shouldn't
Member:

So, this field should be omitted from the beta1 API, unless the feature gate is set, shouldn't it?

Also, clients should also have the feature gate (kube-scheduler in particular), and only take action if the feature gate is set. It doesn't matter if the value was set while the gate was enabled; the feature should not have an effect when the gate is disabled.

pohly (Contributor Author):

A while back @thockin started a discussion about how we handle feature gates. I can't find it again now, but if I remember correctly, it boiled down to the observation that we handle feature gates inconsistently. We either:

  • Put all code related to it behind an if check.
  • Check at the boundary, typically the apiserver.

I am using the second approach here because I think the first one leads to additional error scenarios. For example:

  • A user was able to request admin access (either by using 1.31, or 1.32 with feature gate enabled) when deploying a monitoring solution.
  • Feature gate gets disabled before pod scheduling.
  • Scheduler treats those claims like normal claims.
  • They now prevent using hardware that was meant for production workloads. To the user it is not obvious that this happened because they see the monitoring solution as deployed with admin access in the request.

pohly (Contributor Author), Oct 7, 2024:

Found it: https://docs.google.com/document/d/1roVAHyF7eWZAccmCKw7MXYUNgx4BCDSXF2IMS8h10oY/edit?resourcekey=0-x6Tw2qz1SpCIPhbec6Qa2A. From that document:

> It would be very bad if their stored object started failing validation or worse, their cluster started acting badly.

It's this "acting badly" that happens when ignoring the admin access field because of a feature gate check in code which is supposed to honor that field.

> We do not have good guidelines about whether implementation logic should itself be gate-controlled or not.

In particular, "only take action if the feature gate is set" is not a general guideline.

As explained in the doc, not putting code behind an if check comes with risk that the code is faulty and then gets invoked despite the feature gate being off. This is a real problem which has no good answer in the general case.

Coming back to admin access: the code which would have to be put behind an if check is pretty short. I think the risk that it's faulty is lower than the problems that arise when not executing it, so I stand by my proposal to always honor the field if it was set.

Member:

So, if the code in kube-scheduler that handles admin access is causing it to crash loop, then doing it "only at the boundary" means that turning off the feature gate won't fix the crashing. You have to turn off the feature gate, and then delete or update the claim with admin access. I don't like that.

pohly (Contributor Author):

Having said that, I don't care enough to fight over this. If you want the if checks, then I can add them.

pohly (Contributor Author):

The alternative to "ignore the field" is to explicitly cause allocation to fail when it is set and the feature gate is off.

Member:

@pohly let's do that - your last suggestion. Refuse to allocate if the field is set. In particular, this is the lowest risk option from a security standpoint, while still avoiding an "incorrect" (allocated NOT in admin mode) outcome.

For posterity: https://kubernetes.slack.com/archives/CPNHUMN74/p1728317533313819

pohly (Contributor Author):

Done.
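The agreed behavior, sketched: during allocation, a claim that requests admin access is rejected outright when the DRAAdminAccess gate is off, instead of being allocated as if it were a normal claim. Field names follow the v1beta1 API; the surrounding allocator wiring is assumed.

```go
import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
)

// checkAdminAccess refuses to allocate a claim that requests admin access
// while the DRAAdminAccess feature gate is disabled. Silently allocating it
// as a normal claim would be the "incorrect outcome" discussed above.
func checkAdminAccess(claim *resourceapi.ResourceClaim, adminAccessEnabled bool) error {
	for _, request := range claim.Spec.Devices.Requests {
		if request.AdminAccess != nil && *request.AdminAccess && !adminAccessEnabled {
			return fmt.Errorf("claim %s: admin access requested, but the DRAAdminAccess feature is disabled", claim.Name)
		}
	}
	return nil
}
```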

gate explicitly gets enabled. The feature gate remains off by default because
DRA depends on a new API group which, following the
[convention](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/3136-beta-apis-off-by-default),
is off by default.
Member:

Hmm. @deads2k a little help here?

<!--
This section must be completed when targeting beta to a release.
-->
The container runtime must support CDI.
Member:

What errors should we expect if this is not the case? Where would they appear? Would this show up as: pod schedules, but errors out?

pohly (Contributor Author):

Added some text for this.

need to watch additional resources:
- kube-controller-manager: ResourceClaimTemplate
- kube-scheduler: ResourceClaim, DeviceClass, ResourceSlice
- kubelet: ResourceClaim
Member:

presumably only the ones relevant for that node?

pohly (Contributor Author):

Actually, "kubelet: ResourceClaim" was wrong. We replaced that with doing a GET precisely because a watch on all claims by all kubelets would have been a scalability problem. The GET adds latency, but it scales linearly with the number of nodes and claims.

Removed.
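The replacement pattern, roughly: instead of a cluster-wide watch in every kubelet, each kubelet fetches exactly the claims referenced by the pods it runs. The v1beta1 clientset method name below follows the usual client-go generation pattern and is an assumption; error handling is trimmed.

```go
import (
	"context"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getResourceClaim fetches a single claim referenced by a pod on this node.
// A targeted GET adds a little latency per claim, but unlike a watch on all
// claims by all kubelets it scales linearly with nodes and claims.
func getResourceClaim(ctx context.Context, client kubernetes.Interface, namespace, name string) (*resourceapi.ResourceClaim, error) {
	return client.ResourceV1beta1().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
}
```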


- Testing: An E2E test covers the expected retry mechanism in kubelet when
`NodePrepareResources` fails intermittently.


###### What steps should be taken if SLOs are not being met to determine the problem?
Member:

Once the SLOs are clearer, you should fill this out.


Workloads not using ResourceClaims should not be impacted because the new code
Member:

Why was this removed?

pohly (Contributor Author):

This is classic DRA. Some of the answers there became outdated when pivoting the two KEPs, so in this PR I moved those answers to where they belong now (the structured parameters KEP) and restored the placeholder KEP comments in the classic DRA KEP for future editing of that KEP.

But classic DRA is dead. I suppose I can just leave that KEP in whatever state it was and focus exclusively on structured parameters here: I'll remove the changes to keps/sig-node/3063-dynamic-resource-allocation/README.md.


###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Merely enabling the feature is not expected to increase resource usage much.
Member:

Why is this removed? Shouldn't this be (at least) "no"?


###### What are other known failure modes?

- DRA driver does not or cannot allocate a resource claim.
- DRA driver controller does not or cannot allocate a resource claim.
Member:

Any metrics we can refer to?

removed.

The following metric mirrors the `csi_operations_seconds`:
- Metric name: `dra_operations_seconds`
Contributor:

Would it make sense to name it so that it's clear that this metric is about gRPC, e.g. dra_grpc_operations_seconds or something similar?
We may decide to add other metrics for DRA operations, so it would be nice if they don't conflict.

pohly (Contributor Author):

That's then not consistent with other metrics. Let's ask our SIG Instrumentation reviewers during the implementation review what they prefer.

Contributor:

Well, even dra_operations_seconds is not consistent with similar metrics in pkg/kubelet/metrics/metrics.go. We should probably decide which naming convention we're going to follow: CSI or kubelet.
As we're going to add new metrics to the kubelet, I tend to prefer the kubelet naming.

Member:

We don't need to resolve that here, we can update the KEP based on what SIG instrumentation recommends during the implementation phase.

@bart0sh (Contributor), Oct 8, 2024:

So for Kubelet naming convention the metrics could be named this way:

  • dra_operations_duration_seconds
  • dra_operations_errors_total
  • dra_grpc_operations_duration_seconds
  • dra_grpc_operations_errors_total

And following the CSI convention they would mirror csi_operations_seconds (e.g. dra_operations_seconds and dra_grpc_operations_seconds, as proposed above).

I'd propose to use the kubelet naming scheme as it's much less confusing than CSI from my point of view.

Member:

I thought though that we didn't need the separate errors_total because we have it in the duration_seconds histograms? Or is that an aggregation (not sure we need that?)?

Contributor:

> We don't need to resolve that here, we can update the KEP based on what SIG instrumentation recommends during the implementation phase.

absolutely

> I thought though that we didn't need the separate errors_total because we have it in the duration_seconds histograms? Or is that an aggregation (not sure we need that?)?

That's what kubelet uses for its metrics, e.g. runtime_operations_duration_seconds and runtime_operations_errors_total
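If the kubelet convention wins, the histogram/counter pair might look roughly like this with k8s.io/component-base/metrics (names as proposed above; buckets, label names, and Help strings are illustrative):

```go
import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// Mirrors runtime_operations_duration_seconds, keyed by gRPC method.
	draOperationsDuration = metrics.NewHistogramVec(&metrics.HistogramOpts{
		Subsystem:      "dra",
		Name:           "operations_duration_seconds",
		Help:           "Duration in seconds of DRA driver gRPC operations",
		Buckets:        metrics.DefBuckets, // illustrative
		StabilityLevel: metrics.ALPHA,
	}, []string{"operation_type"})

	// Mirrors runtime_operations_errors_total.
	draOperationsErrors = metrics.NewCounterVec(&metrics.CounterOpts{
		Subsystem:      "dra",
		Name:           "operations_errors_total",
		Help:           "Number of failed DRA driver gRPC operations",
		StabilityLevel: metrics.ALPHA,
	}, []string{"operation_type"})
)

func init() {
	legacyregistry.MustRegister(draOperationsDuration, draOperationsErrors)
}
```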

pohly (Contributor Author):

> I'd propose to use the kubelet naming scheme as it's much less confusing than CSI from my point of view.

csi_operations_seconds is also a kubelet metric.

Let's continue the discussion in your PR or in #wg-device-management. You can present the alternatives and perhaps some future users of these metrics will have an opinion.

Contributor:

ok, let's continue there

@johnbelamaric (Member) left a review comment:

Almost there :)

gate explicitly gets enabled. The feature gate remains off by default because
DRA depends on a new API group which, following the
[convention](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/3136-beta-apis-off-by-default),
is off by default.
Member:

In Jordan's comment you linked above, Jordan suggests validating this. Perhaps there is an example in the code where this is done?

// of logical steps. Validation against those limits happens only when
// setting an expression for the first time or when changing
// it. Therefore it is possible to change these limits without
// affecting stored expressions. Those remain valid.
Member:

agreed

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Applications that were already deployed and are running will continue to
work, but they will stop working when containers get restarted because those
restarted containers won't have the additional resources.

DRAAdminAccess: No. Workloads which were deployed with admin access will continue
Member:

But if it's in a ResourceClaimTemplate, then Pod recreation will fail, right?

pohly (Contributor Author):

If the claims still need to be generated, then we have a different failure scenario: the template has admin access set, but the generated claim wouldn't.

This calls for yet another feature gate check, this time in the claim controller: if a template has admin access set and the feature gate is off, refuse to generate the claim (see the sketch below). Added.

So the answer here becomes "it depends". Updated.
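A sketch of that claim controller check (helper shape and names assumed; field paths follow the v1beta1 API): when a template requests admin access and the gate is off, the controller refuses to generate the claim rather than silently generating one with weaker semantics than the template asked for.

```go
import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// claimFromTemplate refuses to generate a ResourceClaim when the template
// requests admin access while the DRAAdminAccess feature gate is disabled.
func claimFromTemplate(template *resourceapi.ResourceClaimTemplate, adminAccessEnabled bool) (*resourceapi.ResourceClaim, error) {
	for _, request := range template.Spec.Spec.Devices.Requests {
		if request.AdminAccess != nil && *request.AdminAccess && !adminAccessEnabled {
			return nil, fmt.Errorf("template %s requests admin access, but the DRAAdminAccess feature is disabled", template.Name)
		}
	}
	// Normal generation: copy the spec from the template into a new claim.
	return &resourceapi.ResourceClaim{
		ObjectMeta: metav1.ObjectMeta{GenerateName: template.Name + "-"},
		Spec:       template.Spec.Spec,
	}, nil
}
```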

Much of the PRR text that was originally written for "classic DRA" applies also
to "structured parameters". It gets copied from kubernetes#3063 to kubernetes#4381, with some
adaptations.

The v1beta1 API will be almost identical to the v1alpha3 API, with just some
minor tweaks to fix oversights.

The kubelet gRPC gets bumped with no changes. Nonetheless, drivers should get
updated, which can be done by updating the Go dependencies and optionally
changing the API import.
@alculquicondor (Member):
/cc

@johnbelamaric (Member):

/approve

But we missed sig-scheduling, so leaving the LGTM to @alculquicondor

/assign @alculquicondor

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, mrunalp, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2024
@@ -2518,6 +2648,20 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
https://storage.googleapis.com/k8s-triage/index.html
-->

The scheduler plugin and resource claim controller are covered by the workloads
Member:

Please update the coverage percentages for all involved packages (new and old)

IIRC, the coverage was pretty low for alpha.

Let's make it a goal for beta graduation to reach 80%

pohly (Contributor Author):

Several PRs have already gone in to increase coverage. I'll create a KEP update with the latest numbers that we can review before merging the beta promotion PR.

@thockin (Member) left a review comment:

/lgtm
/approve

@@ -1234,8 +1258,22 @@ type DeviceAttribute struct {

// DeviceAttributeMaxValueLength is the maximum length of a string or version attribute value.
const DeviceAttributeMaxValueLength = 64

// DeviceCapacity describes a quantity associated with a device.
type DeviceCapacity struct {
Member:

I'm going to OK this, since I am late to the conversation, but I will be looking to understand this change more :)
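For context on the change flagged here: per the diff above, capacity values move from a bare quantity into a small wrapper struct. Roughly (comment wording assumed, not quoted from the merged API):

```go
// DeviceCapacity wraps a single quantity today. Using a struct as the map
// value type, instead of a bare resource.Quantity, leaves room to add fields
// later without changing the map's value type again.
type DeviceCapacity struct {
	// Value defines how much of a certain device capacity is available.
	Value resource.Quantity
}

// Capacities are then published as map[QualifiedName]DeviceCapacity
// rather than map[QualifiedName]resource.Quantity.
```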

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit 9a23114 into kubernetes:master Oct 8, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 8, 2024
@thockin (Member) commented Oct 8, 2024:

Oops! Sorry, Aldo. @pohly will follow up, I am confident.

@alculquicondor (Member):

@pohly please update the coverage numbers in the KEP ASAP.

@pohly (Contributor Author) commented Oct 9, 2024:

See #4914
