
Allow ephemeral-storage capacity overrides for instance types (per node template or provisioner) #2723

Closed
wkaczynski opened this issue Oct 24, 2022 · 22 comments
Labels
feature New feature or request v1.x Issues prioritized for post-1.0

Comments

@wkaczynski
Contributor

wkaczynski commented Oct 24, 2022

Tell us about your request

Currently there is no way to let karpenter know that, during bootstrap of a node with nvme instance volumes, the kubelet root is re-mounted onto an array created from the nvme instance volumes, which effectively changes the node's ephemeral-storage capacity.
Possible solutions would be:

  • expose a setting in the provisioner / node template that allows specifying karpenter.k8s.aws/instance-local-nvme as the source of the ephemeral-storage capacity value (instead of relying on the defaults or blockDeviceMappings) -> this would probably cover the most common case where a single raid0 array is created out of all the available instance volumes (as in this recommendation: Use ephemeral disks to store container images bottlerocket-os/bottlerocket#1991 (comment)); a hypothetical sketch follows this list
  • expose a setting in the provisioner / node template that allows specifying selected node resource overrides per instance type - this would be a more general solution and could also account for:
    • different array setups in different node templates for the same instance types
    • overrides for potentially other resources too
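
For illustration, a rough sketch of what the first option could look like. The ephemeralStorageSource field below is hypothetical (it is not part of the AWSNodeTemplate API); it only shows the intent of deriving the capacity from the instance's local NVMe storage:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: nvme-raid0
spec:
  # hypothetical field: derive the node's ephemeral-storage capacity from
  # karpenter.k8s.aws/instance-local-nvme instead of defaults / blockDeviceMappings
  ephemeralStorageSource: instance-local-nvme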

#2390 seems to offer some interesting options as well

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Without an ephemeral-storage override, karpenter is unable to select instances whose ephemeral-storage is provided by an instance-volume-backed array for pods with ephemeral-storage requirements, unless an EBS volume matching the array size is added to blockDeviceMappings (a volume that will effectively go unused).

Are you currently working around this issue?

There is currently no good workaround. In our case we need to add an additional requirement on "karpenter.k8s.aws/instance-local-nvme" to the provisioner (or pods) so that karpenter does not provision instances whose nvme instance storage is smaller than our EBS configuration (otherwise the pods would never be scheduled on the bootstrapped nodes). The other issue is that if karpenter chooses bigger instances, it is very likely to overprovision nodes (the pods will eventually schedule on a smaller number of nodes because ephemeral-storage >> EBS size) and then remove the empty nodes.
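
As a sketch only, the requirement-based workaround described above could look roughly like this (the 100 GB threshold is illustrative, standing in for whatever size matches the EBS configuration):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: nvme-backed
spec:
  requirements:
    # only provision instance types whose local NVMe storage is larger than
    # the EBS size configured in blockDeviceMappings (value in GB, illustrative)
    - key: karpenter.k8s.aws/instance-local-nvme
      operator: Gt
      values: ["100"]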

Another workaround could be to match the EBS size in blockDeviceMappings to the total nvme instance storage size, which would generate additional cost (and the EBS volume would be effectively unused).
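
A sketch of that second workaround, sizing the EBS root volume to roughly match the nvme raid0 capacity so the ephemeral-storage accounting lines up (device name and size are illustrative):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: nvme-sized-ebs
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        # sized to roughly match the total local NVMe capacity of the target
        # instance types; the EBS volume itself ends up effectively unused
        volumeSize: 900Gi
        volumeType: gp3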

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@wkaczynski wkaczynski added the feature New feature or request label Oct 24, 2022
@jonathan-innis
Contributor

We have an open PR #2554 that's working on surfacing instance store volumes through the AWSNodeTemplate. We are discussing taking that PR a step further where, if you specify this virtualName in the NodeTemplate, we would make some assumption about your intention to use the instance store volume for your ephemeral storage and then use that value as the ephemeral-storage size.
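
To illustrate the shape under discussion, a rough sketch assuming the virtualName support proposed in #2554 (not part of the released AWSNodeTemplate API at the time of this comment):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: instance-store
spec:
  blockDeviceMappings:
    # mapping an instance-store volume by virtualName would signal the intent
    # to use it for ephemeral storage, per the discussion above
    - deviceName: /dev/xvdb
      virtualName: ephemeral0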

I think eventually, this work extends into Instance Type Settings and #2390. Ideally, we could discover the instance store for a given instance type and always assume that it is being used for ephemeral-storage so that no user-based configuration is needed.

@wkaczynski
Contributor Author

wkaczynski commented Nov 2, 2022

Would this further step for #2554 also address the case of multiple nvme instance volumes? Would we need to explicitly map each instance volume separately in the blockDeviceMappings section? (meaning we would potentially need separate aws node templates for each instance type in the same family)

#2390 could be a good option too, but it wouldn't support different nvme array setups for the same instance types (ideally we'd allow this per pair of aws node template (or provisioner) & instance type).
(Not that it wouldn't solve the issue we're facing, as we're always creating a raid0 composed of all available nvme instance volumes.)

@jonathan-innis
Contributor

also address the case of multiple nvme instance volumes

Yes, this should address the case of multiple nvme instance volumes; however, you are correct that without #2390, you would have to create a separate Provisioner for each different array setup.

could be a good option too but it wouldn't support different nvme array setups for the same instance types

There are some extensions of #2390 that we have thought about where you could proxy instance type setups to create your own "custom" instance type, but that seems a bit further down the line.

@jonathan-innis jonathan-innis self-assigned this Nov 8, 2022
@wkaczynski
Contributor Author

That sounds great, proxy instance type setups could add a lot of flexibility.

I guess we can live with either the extension to #2554 (assuming it does not break the bottlerocket bootstrap image nvme setup from bottlerocket-os/bottlerocket#1991 (comment)) or with the basic functionality of #2390.

Please give us an update once the approximate timeline for availability of any of these options is known.

@jonathan-innis
Contributor

Sure @wkaczynski, I think @bwagner5, as the assignee on #2544, should be able to give you a good timeline on that PR for the initial NVMe functionality.

For instance types and #2390, this was put on the back burner in favor of some other work, but it should be picked back up soon. Once the RFC goes in, that should be a good indicator of when the work is about to start.

@wkaczynski
Contributor Author

wkaczynski commented Nov 17, 2022

also address the case of multiple nvme instance volumes

Yes, this should address the case of multiple nvme instance volumes; however, you are correct that without #2390, you would have to create a separate Provisioner for each different array setup.

just a thought - with separate provisioners (since the provisioners for different array setups, e.g. just different nvme disk counts, would be selected at random), we wouldn't necessarily see the instances with the optimal cost selected, right? (so we could end up getting bigger and more expensive instances than needed)

@bwagner5
Contributor

I don't think #2554 will take care of this use-case. Even if instance stores can be mapped as block devices, that doesn't indicate the configuration the volumes will be used in (i.e. if 2 volumes are mapped, does that mean they'll be in a RAID-0, RAID-1, etc.?).

I'm wondering if it would make more sense to configure the instance-store volumes within the Karpenter AMI Family itself. We could, by default within the AL2 amiFamily, RAID the volumes and remount the directories where the kubernetes components keep their storage.

@jonathan-innis jonathan-innis removed their assignment Jan 25, 2023
@cep21

cep21 commented Feb 3, 2023

you would have to create a separate Provisioner for each different array setup

Are there examples for this? Would this happen inside spec.userData or spec.blockDeviceMappings?

@bwagner5
Contributor

There is ongoing work in the EKS optimized AL2 AMI to set up a RAID-0 out of instance storage disks and remount containerd and kubelet onto it. Once that PR is merged into the EKS Optimized AMI, we can set the bootstrap flag within Karpenter to enable the new functionality and adjust the node's ephemeral-storage capacity, assuming a RAID-0 setup for instance types with NVMe instance storage. awslabs/amazon-eks-ami#1171

@bwagner5 bwagner5 self-assigned this Feb 17, 2023
@dschaaff
Contributor

We'll want that option for bottlerocket too. We already use a startup container to format and mount a raid array of the disks; we just need a way to properly account for the node's ephemeral storage in the kubelet.

@cep21

cep21 commented Jul 13, 2023

Our current karpenter instances are using EBS volumes since that's what's currently supported by karpenter. We don't need EBS volumes and would rather use the instance storage for ephemeral storage. This would save thousands of dollars a month on our AWS bill. Very excited to see this ticket make progress.

@ryanschneider

In the meantime, now that the new EKS AMI has setup-local-disks, I think we can add this to our AWSNodeTemplate to get the SSDs set up:

userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
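    # RAID-0 the local NVMe instance-store disks and move container/pod storage
    # onto the array (setup-local-disks ships with the EKS optimized AMI)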
    /bin/setup-local-disks raid0

    --BOUNDARY--

@taylorturner

@ryanschneider any update on whether or not your suggestion is working?

@ryanschneider

@taylorturner we ended up using a custom script since we wanted more control than setup-local-disks provided, but I did test it once and it seemed to work.

@billrayburn billrayburn added v1.x Issues prioritized for post-1.0 and removed v1 Issues requiring resolution by the v1 milestone labels Sep 20, 2023
@purnasanyal

Yes, this feature is important.

@alec-rabold
Contributor

This would be a very useful feature for us too; took a stab at a possible solution: #4735

@armenr

armenr commented Oct 21, 2023

This is becoming a cost-prohibitive issue for our company as well.

@armenr

armenr commented Nov 23, 2023

Following up -- is there any intention to listen to the customers here, and let us save on thousands of dollars of wasted spend, monthly?

@jonathan-innis
Contributor

is there any intention to listen to the customers here, and let us save on thousands of dollars of wasted spend, monthly

Apologies for the missed response here. There's a small number of maintainers trying to keep up across a number of requests on the project. We were working hard to push out the beta and some other high-priority features, and now that we are unblocked on those, we will start to burn down the list of open PRs that are out there.

Looking at #4735 at a high level, it sounds like a fairly reasonable approach to me. It allows users to specify which way they want to go with their NVMe storage and then configures the volumes accordingly 🎉

@armenr

armenr commented Dec 5, 2023

@jonathan-innis - Thank you for the prompt and informative follow-up. We all appreciate the enormous amount of work you're all putting in.

And thank you for putting eyes on this specific issue, and the accompanying PR. It's really exciting to know it's getting attention, and seems to be coming down the pipeline soon 😎

Thanks again! 🙌🏼

@cep21

cep21 commented Dec 11, 2023

For some numbers, we've noticed a consistent 18% of our "EC2-other + EC2 Instances" bill is spent on these non-ephemeral disks, due to the large container images we deploy.

This ticket will have a real, material, and noticeable impact on the cost to run services in AWS.

@jonathan-innis
Contributor

I think since #4735 got merged, we can consider this one closed. #2394 should cover the other case where we want to scale EBS volumes dynamically based on pod resource requests.
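
For anyone landing here later, a minimal sketch of the configuration this resolves to after #4735, assuming the instanceStorePolicy field that PR introduces (check the current Karpenter docs for the exact API version):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvme-raid0
spec:
  amiFamily: AL2
  # RAID-0 all local NVMe instance-store volumes and use them for ephemeral
  # storage; karpenter adjusts the node's reported ephemeral-storage capacity
  instanceStorePolicy: RAID0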
