Add a design proposal for dynamic volume limits #2051

Closed

gnufied
Member

@gnufied gnufied commented Apr 18, 2018

Add a proposal for dynamic volume limits.

cc @liggitt @saad-ali @derekwaynecarr

xref - kubernetes/enhancements#554

/sig storage

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. labels Apr 18, 2018
@gnufied
Member Author

gnufied commented Apr 18, 2018

cc @childsb


### Changes to scheduler

When a node is being considered for scheduling a pod with volumes, the scheduler will count the volumes used by the pods currently running on it against the value of node.Status.AttachedVolumeLimit.

Member

For in-tree volumes, how do you map the volume type with the node.status.attachedvolumelimit key? Is it just a hardcoded mapping?

Member Author

yeah - all volume types eventually resolve to something like - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/aws_ebs/aws_ebs.go#L54 and hence for in-tree volume plugins this will be a very small change in scheduler. The key in attachedVolumeLimit will be same as what is being returned in GetPluginName for in-tree plugins.
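
To illustrate the mapping described in this reply, here is a minimal Go sketch, assuming the per-node limits live in a map keyed by the plugin name. The `volumePlugin` interface, the `awsEBSPlugin` type, the `limitFor` helper and the limit value are hypothetical stand-ins, not the scheduler's actual code; the `kubernetes.io/aws-ebs` name is what the linked `aws_ebs.go` is understood to return from `GetPluginName`.

```go
package main

import "fmt"

// volumePlugin is a hypothetical, trimmed-down stand-in for the in-tree
// volume plugin interface; only the name accessor matters for this sketch.
type volumePlugin interface {
	GetPluginName() string
}

// awsEBSPlugin is an illustrative plugin type, not the real implementation.
type awsEBSPlugin struct{}

func (awsEBSPlugin) GetPluginName() string { return "kubernetes.io/aws-ebs" }

// limitFor returns the attach limit recorded for the plugin, using the
// plugin name as the map key. The boolean is false when no limit is set.
func limitFor(limits map[string]int64, p volumePlugin) (int64, bool) {
	limit, ok := limits[p.GetPluginName()]
	return limit, ok
}

func main() {
	// The value 39 is illustrative only.
	nodeLimits := map[string]int64{"kubernetes.io/aws-ebs": 39}
	if limit, ok := limitFor(nodeLimits, awsEBSPlugin{}); ok {
		fmt.Printf("limit for %s: %d\n", awsEBSPlugin{}.GetPluginName(), limit)
	}
}
```

The lookup itself is trivial; the open question raised in the next comment is how the scheduler obtains the plugin name without linking the volume plugin packages.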

Member

Right now the scheduler does not plumb in the volumeplugin interface


If the given cloud provider implements this interface, kubelet will query the interface and set (via merge) the returned values in

Member

I believe there's a push towards moving cloud provider dependencies out of kubelet.

Could this be something the external cloud controller calls and updates for each node?

@msau42
Member

msau42 commented Apr 18, 2018

/cc @yujuhong

@k8s-ci-robot k8s-ci-robot requested a review from yujuhong April 18, 2018 21:25

### API Changes

We propose the following changes to the node's status field:

Contributor

Shouldn't this go under node.capacity with the other resource constraints?

Member

+1

Member

Drawback is that you have to request a resource type. But, according to @thockin, device plugin people are pushing a model where admission controller detects what type of volume you are requesting and it decorates the correct type of resource request. Agreed that we should try to repurpose this design.

Member Author

@gnufied gnufied Apr 24, 2018

We had a discussion today in the sig-node call and there were valid concerns raised against using node.Capacity for scheduling. The main problem is that there is no corresponding pod-level field: memory, CPU and GPU devices are all specified at the container level, but volumes are counted at the pod level.

At this point we either heard some support for using node.Status.AttachedVolumeLimits or no strong opinion about it. cc @derekwaynecarr

If we choose to use node and pod capacity, then we will have to create a new field in the pod object spec, and there was concern that storage might be the only subsystem using that field. We expect more comments as follow-up.

Member

We discussed this in a follow-on:

  • This feels closest to the "pods" per-node value.
  • node.status.capacity reports "pods" as a resource.
  • A container is not able to request a "pods" resource, as it's not a standard compute resource name.
  • The scheduler special-cases "pods" as a special resource type and counts each pod against that value.

Member

additional clarification:

"pods" never appears in a container resource requirement

if it did, it would be a problem.


For CSI the name used in key will be - `kubernetes.io/csi/<driver_name>`

### Changes to CSI

Contributor

@childsb childsb Apr 19, 2018

Remove CSI call outs for now. I would like to move/remove 'defaults' from the volume plugins / CSI layer.

Member Author

Can you clarify why? How else will CSI volumes be configured with these limits, which are essential for the scheduler?

Contributor

The design should move the 'default values' from the volume plugin to an admin modifiable location (such as CLI args). The admin should set them per node as part of configuration.

We should not conflate CSI spec with this design. Once we have the design correct for kubernetes, then we look at changes to CSI.

@gnufied
Member Author

gnufied commented Apr 20, 2018

We had a sig-storage call today and the following items were discussed:

  1. We will need to perform dynamic look up of plugin type in scheduler. That means linking volume packages into scheduler. @gnufied is going to request some scheduler folks to review this proposal. cc @aveshagarwal
  2. We might end up pushing CSI out of 1.11 release for this proposal - if CSI spec proposal isn't accepted in time. cc @saad-ali
  3. There is an effort underway to move cloudproviders out of tree and hence making a change to cloud interface(GetVolumeLimits) could be problematic. We need to document how external cloud providers can set this limit. For in-tree cloudproviders, @msau42 is going to talk to GCE folks about proposed interface. In case - we can't reach a consensus, we will drop proposed change to cloud provider interface and stick to existing defaults for GCE/AWS and just make necessary change so as it is possible for k8s admin to set newer limits via CLI.
  4. @gnufied is going to update the proposal with flexvolume bits.

@saad-ali
Member

I would support modifying the CloudProvider interface to add support for this.

> In case - we can't reach a consensus, we will drop proposed change to cloud provider interface and stick to existing defaults for GCE/AWS and just make necessary change so as it is possible for k8s admin to set newer limits via CLI.

Can you clarify what this means? Does it mean modifying the in-tree plugins without modifying the cloud provider to figure out what the limit should be (the plugin could maintain an internal map of limits, but the challenge I see with that is figuring what type of machine it is running on without relying on cloud provider code).

@msau42
Member

msau42 commented Apr 20, 2018

I believe the fallback plan is to have override flags into kubelet, and it would be up to the admin or the deployment manager to set those limits.

@liggitt
Member

liggitt commented Apr 20, 2018

Adding to the interface we're trying to remove doesn't seem like a great approach. Making new features work from the outside in will drive actually figuring out how to configure the kubelet in various cloud envs, instead of just leaning on the existing in-tree interface

@childsb
Contributor

childsb commented Apr 23, 2018

To add to @liggitt and summarize a design discussion we had last week.... The cloud provider could just as easily set the attach limits dynamically by updating the Node object or some other mechanism.

I feel the best first-step design for this feature is to create knobs that set attach limits by volume type, without trying to populate or determine the limit 'dynamically'. The 'environment' (cloud provider or otherwise) can set the values as it determines, but the storage layer should not create a new API or tighter coupling with the cloud provider. The onus would be on each cloud provider (or, in its absence, some startup script) to set appropriate values based on the environment.


For CSI the name used in key will be - `kubernetes.io/csi/<driver_name>`

Member

This is a format unlike any other in the system. I don't think this is correct use of resource names.


### Setting of limit for existing in-tree volume plugins

The volume limit for existing volume plugins will be set by querying the cloudprovider. Following interface

Member

I don't think this belongs in cloud provider - it's a facet of the storage driver.

Member

+1

Member Author

How do we know which plugins are available on a given node? The plugin initialization code gets initialized on all node types. For example - we initialize EBS plugin on a GCE cluster. So populating limits by querying the volume plugin will result in populating limits for all available in-tree volume types.

Do we want to populate node.Status.Capacity with a relatively large dict? I do not know whether there is a way to determine if a certain plugin can be initialized on a certain node type.

Member

> How do we know which plugins are available on a given node?

Plugins register themselves with the node on startup.

> The plugin initialization code gets initialized on all node types. For example - we initialize EBS plugin on a GCE cluster. So populating limits by querying the volume plugin will result in populating limits for all available in-tree volume types.

In-tree plugin can have an additional check that verifies that it is running in cloud provider.

@msau42
Member

msau42 commented Apr 23, 2018

We had a discussion with @saad-ali @thockin and @cheftako and concluded that:

  • Kubelet command line option to report plugin limits is not a great user experience. The user could get it wrong, and it would be a pain to do on large clusters with varying node sizes.
  • It should be up to the volume plugin, not cloud provider to report its limits
  • CSI can support by adding an API for the volume plugin to report limits
  • However, we cannot make in-tree volume plugins that rely on a cloud-provider actually call out to the cloud provider library. As a temporary workaround until in-tree plugins are routed through CSI, the plugin can directly encode the limit table.
  • The capacity reporting API could follow the device plugin design: reuse the same capacity fields, and have a plugin-specific admission webhook examine the Pod spec, count up the volumes of its type, and add it to the Pod capacity requests. Then we can reuse the existing scheduler capacity code. This also eliminates the need for the scheduler to have volume plugin callouts.
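
The "directly encode the limit table" workaround from the list above could look roughly like the following Go sketch. It assumes the plugin can learn its machine type as a string; the machine type names, the numbers, and the `attachLimit` helper are all made up for illustration and are not taken from any cloud provider's documentation.

```go
package main

import "fmt"

// fallbackLimit is used when the machine type is not in the table.
// All numbers here are made up for illustration.
const fallbackLimit int64 = 16

// limitTable maps a (hypothetical) machine type to its attach limit.
var limitTable = map[string]int64{
	"example-small": 8,
	"example-large": 64,
}

// attachLimit returns the encoded limit for machineType, or the fallback
// when the type is unknown to this plugin version.
func attachLimit(machineType string) int64 {
	if limit, ok := limitTable[machineType]; ok {
		return limit
	}
	return fallbackLimit
}

func main() {
	for _, mt := range []string{"example-small", "example-large", "unknown"} {
		fmt.Printf("%s -> %d\n", mt, attachLimit(mt))
	}
}
```

A table like this has to be updated by the plugin maintainers whenever the documented limits change, which is the maintenance cost discussed in the following comments.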

@childsb
Contributor

childsb commented Apr 23, 2018

I do not like 'discovering' this at the Volume/CSI layer...

It's impossible for the volume layer / CSI driver to have an accurate MAX that applies to most users / environments. Kube can place conservative MAX default values, but then admins will want to tweak them for performance. Kube can use the published performant MAX as the default, but users may have other agreements for unpublished values, or find that the performant values overload the cluster. I don't really see a "set it and forget it" value for many storage systems.

As a further counterpoint to adding this to the volume layer, I don't know of any storage API which supports this discovery. It's all discoverable via a documentation matrix.

Is the conclusion of the discussion @msau42 that it would be better UX for the admin to 'tweak' the value through the volume layer / CSI config?

@msau42
Member

msau42 commented Apr 23, 2018

For the scenario where the admin still wants to override the plugin, I can see two ways for override:

  • modify the Node object directly
  • the CSI driver could expose some configurable override

I agree that there's an issue that these limits for cloud volume types are usually only encoded in documentation, so it would be up to the CSI driver maintainers to update the limits table whenever the number changes. I think that's a problem we can't avoid either way. At least with CSI, since it's out of tree, we can push new versions faster.

@gnufied
Member Author

gnufied commented Apr 24, 2018

@msau42 thank you for summing the discussion.

@thockin @saad-ali I would have still preferred that CLI provides an option to override the limits. For example - even on EBS the performance of disk and the limit slightly depends on network bandwidth. The default documented limit is 40 but if one attaches 40 volumes to a m1.medium instance type - likely the instance is not going to perform optimally. In that case, an admin may want to override the limit to a value that he/she knows best. So my vote is - have some defaults but allow admin to override them. I do not buy the argument that it is clumsy or error prone. We have feature-gates flag as prior art which uses similar format and requires enabling on all nodes.

@msau42
Member

msau42 commented Apr 24, 2018

@gnufied the admin can still override the limit by directly setting the node.status. Do you see issues with that method vs. having to log in to the node, change and persist new kubelet arguments, and restart kubelet?

@gnufied
Member Author

gnufied commented Apr 24, 2018

@msau42 Modifying node.Status is definitely worse. For one thing - a node restart could replace the value; for another - manually editing node.Status.Capacity is problematic because Status of most objects is populated by k8s. I do not think we encourage or support users directly editing Status field of api objects - so this will be an exception.

Lastly - how does a configuration management tool like puppet/ansible keep this value in sync? Does it need to watch the node object? Does it need to poll the node object and make sure the value is always set? It arguably seems so much worse. :(

@msau42
Member

msau42 commented Apr 24, 2018

I believe kubelet only does Patch on the Node object, so a restart would not cause the value to be overridden or lost. I agree that in the case of managed instances, if the cloud provider deletes and creates a new instance, then those values would be lost. So perhaps, for this admin override case, a place to persistently store the override is needed. Could something like dynamic kubelet config help here?

@gnufied
Member Author

gnufied commented Apr 24, 2018

> I believe kubelet only does Patch on the Node object, so a restart would not cause the value to be overridden or lost.

A node restart where the node was down for a sufficiently long time will cause the node object to be recreated. On AWS etc., a node object is removed from the api-server when the node is shut down.

@msau42
Member

msau42 commented Apr 24, 2018

I had another quick discussion with @saad-ali. In the long term, we can provide overrides through the CSI plugin. In the short term, this is still an alpha feature, and the admin could disable it if they run into issues with the new limits. I think it's better to not define a new flag/config parameter for something that we're going to replace soon.

@gnufied
Member Author

gnufied commented Apr 25, 2018

@msau42 @saad-ali @childsb PTAL. I pushed the following changes to the proposal:

  1. Query volume plugin rather than Cloudprovider for limits.
  2. Drop the CLI options to kubelet.
  3. Small change in naming convention for CSI to bring it in line with flexvolume.

k8s-publishing-bot added a commit to kubernetes/api that referenced this pull request Jun 8, 2018
Automatic merge from submit-queue (batch tested with PRs 64613, 64596, 64573, 64154, 64639). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Implement dynamic volume limits

Implement dynamic volume limits depending on node type.

xref kubernetes/community#2051

```release-note
Add Alpha support for dynamic volume limits based on node type
```

Kubernetes-commit: e5686a366815cbb82ef91503151ef6b2e531e6f3
wenjiaswe pushed a commit to wenjiaswe/kubernetes that referenced this pull request Jun 19, 2018
This PR implements the function to return attachable volume limit based
on machineType for GCE PD. This is part of the design in kubernetes/community#2051/
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: childsb

If they are not already assigned, you can assign the PR to them by writing /assign @childsb in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gnufied
Member Author

gnufied commented Aug 10, 2018

@saad-ali @msau42 @liggitt I have pushed an update to the proposal that covers CSI and moves the feature to beta in 1.12. It is pretty minimal change - PTAL.

@gnufied gnufied force-pushed the dynamic-volume-attach-limit branch from 849a4d0 to d3edde3 Compare August 10, 2018 21:05
csiPrefixLength := len(CSIAttachLimitPrefix)
totalkeyLength := csiPrefixLength + len(driverName)
if totalkeyLength >= ResourceNameLengthLimit {
// compute SHA1 of driverName and get first 36 chars

Member

I think that keeping some prefix with the original name would help users deciphering what the key is, e.g. reallyreallylongcsidrivername -> reallyreallylongcs-19e793491b1c6.

Member Author

fixed.
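
The suggestion in this thread (keep a readable piece of the driver name and append a hash) can be sketched in Go as follows. This is an illustration under stated assumptions, not the code that was pushed: the prefix value `attachable-volumes-csi-`, the 63-character limit, the 16-character hash suffix, the `-` separator and the `csiAttachLimitKey` name are all assumptions chosen for the example.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

const (
	csiAttachLimitPrefix    = "attachable-volumes-csi-" // assumed prefix
	resourceNameLengthLimit = 63                        // assumed name length limit
	hashChars               = 16                        // hypothetical suffix length
)

// csiAttachLimitKey builds the limit key for a CSI driver. Short names are
// used verbatim. Long names keep a readable leading part of the driver name
// followed by a truncated SHA-1 of the full name, so the result never
// exceeds resourceNameLengthLimit. Driver names are assumed to be ASCII.
func csiAttachLimitKey(driverName string) string {
	if len(csiAttachLimitPrefix)+len(driverName) < resourceNameLengthLimit {
		return csiAttachLimitPrefix + driverName
	}
	sum := sha1.Sum([]byte(driverName))
	suffix := hex.EncodeToString(sum[:])[:hashChars]
	// Characters of the original name that still fit next to "-" + suffix.
	keep := resourceNameLengthLimit - len(csiAttachLimitPrefix) - len("-") - hashChars
	return csiAttachLimitPrefix + driverName[:keep] + "-" + suffix
}

func main() {
	fmt.Println(csiAttachLimitKey("ebs.example.com"))
	fmt.Println(csiAttachLimitKey("reallyreallylongcsidrivername.example.com"))
}
```

Because the hash covers the full driver name, two long names that share the same leading characters still map to different keys, while the readable part helps users recognise which driver a key belongs to.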

@gnufied gnufied force-pushed the dynamic-volume-attach-limit branch from d3edde3 to e846f61 Compare August 13, 2018 23:40
format restrictions applied to Kubernetes Resource names. Volume limit key for existing plugins might look like:


* `storage-attach-limits-aws-ebs`
* `storage-attach-limits-gce-pd`
* `attachable-volumes-aws-ebs`

Member

Given that we also want to add a max capacity limit, does "attachable-volumes" still make sense, or do we want to clarify that it is a count?

Member Author

I was thinking to use separate resource name for max attachable capacity. Something like - attachable-capacity-xxx and keep attachable-volumes-xx for representing count. I think capacity vs volumes word is "good enough" indication.

@gnufied gnufied force-pushed the dynamic-volume-attach-limit branch from e846f61 to 7a71128 Compare August 15, 2018 19:15
@gnufied gnufied force-pushed the dynamic-volume-attach-limit branch from 7a71128 to e1cb861 Compare August 15, 2018 19:17
@gnufied
Member Author

gnufied commented Aug 15, 2018

@msau42 @jsafrane fixed and updated the proposal. Also added a section about using #2514 as a possibility but we can always migrate to it, when it becomes available.

@gnufied gnufied force-pushed the dynamic-volume-attach-limit branch from 2249bc5 to 1ca2366 Compare August 15, 2018 19:30
We can also use kubernetes#2514
when it becomes available.
@gnufied gnufied force-pushed the dynamic-volume-attach-limit branch from 1ca2366 to eb28ce7 Compare August 15, 2018 19:32
Alternately we also considered storing attach limit resource name in `CSIDriver` introduced as part
of https://github.com/kubernetes/community/pull/2514 proposal.

This will work but depends on acceptance of proposal. We can always migrate attach limit resource names to

Member

CSI doesn't have a way to report some alternative key though right?

Member Author

I think there is a proposal to allow CSI drivers to register themselves rather than use the auto-registration mechanism. In that case, it will be possible for a plugin to report an alternative key as part of the registration process.


This function will be used both on node and scheduler for determining CSI attach limit key. The value of the
limit will be retrieved using `GetNodeInfo` CSI RPC call and set if non-zero.

Member

Are we going to handle the case of override and changing the limit on the node after the initial configuration?

Member Author

No, not in this release. We will tackle this problem in 1.13.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Sep 5, 2018
Automatic merge from submit-queue (batch tested with PRs 68161, 68023, 67909, 67955, 67731). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Fix csi attach limit

Add support for volume limits for CSI.

xref: kubernetes/community#2051

```release-note
Add support for volume attach limits for CSI volumes
```

The prefix `storage-attach-limits-*` can not be used as a resource in pods, because it does not adhere to specs defined in following function:

Member

Is this supposed to be the same prefix as above?

format restrictions applied to Kubernetes Resource names. Volume limit key for existing plugins might look like:


* `attachable-volumes-aws-ebs`

Member

I really dislike this sort of ad hoc syntax embedded in a string, and aside from that, I don't buy that this should be magical like this. Why can't we do something less stringy?

Just thinking out loud...

e.g. Some volumes have per-node attachment limits. For those volumes, you install an admission controller that examines the volumes in play and decorates the pod as requiring certain resources, and then the scheduler simply schedules?

I feel like we have explored this, but if so, I don't recall why it was rejected.

Member Author

There are two parts to it, right?

  1. The plugin must be able to specify a maximum attachment limit on per node basis. We chose to use node's allocatable/capacity field for that. And when we want to express this limit in node's allocatable regardless of how we do counting via admission controller or something else - the limit must have a key and a value. The key MUST be a string that adheres to resourceName naming conventions(allocatable is of type map[v1.ResourceName]resource.Quantity). So isn't the attachable-volumes-aws-ebs prefix chosen for expressing attach limits in node's allocatable/capacity orthogonal to how we do counting?

  2. We could still use an admission controller to do the actual counting, but I think it was agreed that the scheduler is a better (or maybe easier) place for counting the actual volumes in use. Because:
    a. It can consider all pods on the node. Not just the one being scheduled.
    b. I think the way volume limits interacts with late binding and even distribution of pods with volume workloads(Balanced resource allocation priority to include volume count on nodes. kubernetes#60525) makes it better place to do in scheduler.

And obviously an admission plugin may be disabled or not enabled (but the same goes for a scheduler predicate, I guess). I am not particularly opposed to using an admission controller for counting, but I think we did not see the benefits we thought we would see.

Member

I am going to reformat for comments:

> We chose to use node's allocatable/capacity field for that [to specify a maximum attachment limit] ... The key MUST be a string that adheres to resourceName naming conventions ... So isn't the attachable-volumes-aws-ebs prefix chosen for expressing attach limits in node's allocatable/capacity orthogonal to how we do counting

Yes it is orthogonal. Sorry for putting them in one message.

My first grump here is that we have invented non-obvious syntax without any specification for it. I detest string-embedded syntaxes (stringly type APIs) and even if we DO go that way, we can't tolerate under-specified ones. Where do these magical names come from? How do they not conflict? Jordan had some comments on that, too. At the very very least this needs a strong spec, and perhaps we actually need something different.

Regarding scheduler, what I really don't want to do is build up a body of special cases that we dump on the scheduler, when the problem is actually a pretty general problem. How do we opaquely allocate and schedule resources WITHOUT changing the scheduler? We may need to make other changes to the system to accommodate that, but I think it should be a preferred choice. @bsalamat am I over-blowing this?

Late-binding is from a PVC to a PV via a class. Presumably a single class can't use multiple drivers, so it's still feasible to reify the PVC -> class -> driver somewhere? The fact that classes could, in theory, change is a problem that maybe we should think harder about, and maybe we need a different kind of plugin (e.g. maybe we need explicit scheduler pre-processor plugins). "Let's just hack the scheduler" seems like a very slippery slope to me.

And, BTW, I want this capability (and have for years), so please don't think I am poo-pooing the idea :)

Member

I second what @thockin said about dumping a lot of special case logic in the scheduler. That said, not everything can be checked at admission time. For example, dynamic volume binding happens after scheduler finds a node to run a pod. I guess there is no choice other than checking the number of attached volumes to the node at that point.

Member Author

> Where do these magical names come from? How do they not conflict? Jordan had some comments on that, too. At the very very least this needs a strong spec, and perhaps we actually need something different.

In current iteration any resource name(in node's allocatable) that uses attachable-volumes- prefix refers to attachable limits. We have added some tests to ensure that containers themselves aren't able to directly request this resource. It is however possible that someone can set a new attachable-volumes-xxx value that does not apply to volume limits(via patch ).

I am not entirely sure about some external component inadvertently overriding any of existing limits though. Because - when kubelet starts, it sets those limits and it periodically syncs the value from same source. So if an external component does override those values it will be wiped out.

I still see the need of speccing them better and perhaps there is a better way. I need to think and will bring this up on sig-storage.
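
The prefix convention and the "containers cannot request it" rule described in this reply can be sketched in Go as below. This is a minimal illustration, assuming resource names are plain strings and requests are a simple map; `isAttachableVolumeResourceName` and `validateContainerRequests` are hypothetical helper names for this example, not the validation code referred to in the comment.

```go
package main

import (
	"fmt"
	"strings"
)

// attachableVolumesPrefix marks node-level resource names that express
// volume attach limits, per the convention described in the thread.
const attachableVolumesPrefix = "attachable-volumes-"

// isAttachableVolumeResourceName reports whether name follows the
// attach-limit naming convention.
func isAttachableVolumeResourceName(name string) bool {
	return strings.HasPrefix(name, attachableVolumesPrefix)
}

// validateContainerRequests rejects container resource requests that use an
// attach-limit name, since those limits are counted per node by the
// scheduler rather than requested by containers.
func validateContainerRequests(requests map[string]int64) error {
	for name := range requests {
		if isAttachableVolumeResourceName(name) {
			return fmt.Errorf("resource %q is a node-level volume limit and cannot be requested by a container", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateContainerRequests(map[string]int64{"cpu": 1}))
	fmt.Println(validateContainerRequests(map[string]int64{"attachable-volumes-aws-ebs": 1}))
}
```

A prefix check like this says nothing about what may follow the prefix, which is the under-specification the reviewers are objecting to.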

Member

@msau42 msau42 Sep 26, 2018

One of the earlier design proposals was to leverage the existing opaque resource counting feature that already exists in the scheduler. However there were some hurdles:

  1. Opaque resources are specified at the container level. We would need to extend the pod api to add pod-level resources.
  2. Users don't specify volume counts via resources. I think this is where the admission controller idea came in: have an admission controller that parses pod.volumes and converts that into a pod-level volume count resource. I remember discussing this option in a sig-node meeting and there were concerns about users being confused by this field they didn't set, and also users potentially manually setting the fields themselves.

That being said, it may be worth revisiting this. It turns out we also want to support limiting volume attached capacity per node. If we follow the current design, that will require adding more special names/logic for this new scenario.
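
For readers unfamiliar with the admission-controller idea revisited here, the following Go sketch shows the counting step only. It assumes a pod-level request field exists, which the thread notes it does not; the `pod` and `volume` types, the `VolumeRequests` field and `decorateWithVolumeCounts` are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// volume is a simplified stand-in for a pod volume; PluginName is a
// hypothetical field naming the volume type, e.g. "aws-ebs".
type volume struct {
	Name       string
	PluginName string
}

// pod is a simplified stand-in for a pod spec. VolumeRequests is the
// hypothetical pod-level resource field that the thread says would have
// to be added to the API.
type pod struct {
	Name           string
	Volumes        []volume
	VolumeRequests map[string]int64
}

// decorateWithVolumeCounts counts the pod's volumes per plugin type and
// records them as pod-level requests, as an admission step might.
func decorateWithVolumeCounts(p *pod) {
	counts := make(map[string]int64)
	for _, v := range p.Volumes {
		counts["attachable-volumes-"+v.PluginName]++
	}
	p.VolumeRequests = counts
}

func main() {
	p := &pod{
		Name: "db",
		Volumes: []volume{
			{Name: "data", PluginName: "aws-ebs"},
			{Name: "logs", PluginName: "aws-ebs"},
		},
	}
	decorateWithVolumeCounts(p)
	fmt.Println(p.VolumeRequests)
}
```

The sketch also makes the stated concern visible: the resulting field is one the user never set, and nothing stops a user from setting it by hand.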

Member

Previous discussion here

Member Author

Just had a discussion with @jsafrane as well about this and we discussed two problematic things about moving the count outside the scheduler:

  1. For local volumes with delayed binding, storageClass could be dummy or unknown and hence counting unbound PVCs is going to be tricky outside the scheduler.
  2. Currently the scheduler counts unique volumes. If we specify volumes as a pod level countable resource and scheduler counts volumes the way it counts memory or other resources, it can't accurately count unique volumes.
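
The second point above, unique counting versus additive counting, can be made concrete with a small Go sketch. It assumes each pod lists the IDs of the volumes it uses; the `pod` type and the helper names are hypothetical, and this is not the scheduler's predicate code.

```go
package main

import "fmt"

// pod is a simplified stand-in that lists the IDs of the volumes it uses.
type pod struct {
	Name      string
	VolumeIDs []string
}

// uniqueVolumeCount returns the number of distinct volumes used by pods.
// A volume shared by several pods is attached once and counted once.
func uniqueVolumeCount(pods []pod) int {
	seen := make(map[string]struct{})
	for _, p := range pods {
		for _, id := range p.VolumeIDs {
			seen[id] = struct{}{}
		}
	}
	return len(seen)
}

// additiveVolumeCount sums per-pod volume counts, the way an ordinary
// countable resource such as memory would be summed.
func additiveVolumeCount(pods []pod) int {
	total := 0
	for _, p := range pods {
		total += len(p.VolumeIDs)
	}
	return total
}

// fits reports whether candidate can join the existing pods on a node
// without the number of distinct volumes exceeding limit.
func fits(existing []pod, candidate pod, limit int) bool {
	all := append(append([]pod{}, existing...), candidate)
	return uniqueVolumeCount(all) <= limit
}

func main() {
	existing := []pod{
		{Name: "a", VolumeIDs: []string{"vol-1", "vol-2"}},
		{Name: "b", VolumeIDs: []string{"vol-2"}},
	}
	candidate := pod{Name: "c", VolumeIDs: []string{"vol-1"}}
	all := append(append([]pod{}, existing...), candidate)
	fmt.Println("unique:", uniqueVolumeCount(all), "additive:", additiveVolumeCount(all))
	fmt.Println("fits with limit 2:", fits(existing, candidate, 2))
}
```

With additive counting the same node would appear to use four volumes and the candidate would be rejected under a limit of two, even though only two distinct volumes are attached.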

Member

> It is however possible that someone can set a new attachable-volumes-xxx value that does not apply to volume limits(via patch ).

What is the specification for xxx here? Is it the CSI driver name? Some other source? Does it get some auto-mutation? Are characters changed?

@thockin
Member

thockin commented Oct 2, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.