Support for spot instances #1876
I think we should start with machines on a per-provider basis. I'm not sure we need a common abstraction at this point. Here's what I came up with for AWS: https://github.com/openshift/cluster-api-provider-aws/pull/255/files#diff-d1ff143ff12272410f75d9d545d45ec1R144-R152

Each provider is going to be handled differently. For instance, on AWS you're creating a spot instance request that may or may not be filled depending on current usage and your max bid. This might be desirable to the end user: I have some work I need done, and it can wait until the price is X. GCP is different. If the machine is deleted before the request is fulfilled, we need to delete the request rather than an instance (this is trivial, but hasn't been implemented yet). The other consideration here is that the IAM policies for spot instances require additional privileges.

Once we have spot instance support in a particular cloud, a user could create machinesets to easily add and remove spot instances/requests. I think regular machinesets are fine for now. Users might want to taint those nodes to ensure only interruptible workloads land on those machines. AWS will notify a machine that it's about to go down, and there's already some existing code to drain the node: https://github.com/kube-aws/kube-spot-termination-notice-handler

Obviously, you shouldn't deploy your entire cluster on spot instances (in the 80% use case, I'd argue), so we just need to document what the current behavior is, and people can opt in to using them if that's useful. From there, I think we should build provider-specific machine-spot-set controllers, or some other type of abstraction that can balance machines across AZs and instance types, as per the recommended practices from AWS to minimize disruption. Or, another abstraction would be to give the machineset the ability to delegate: leave the template empty, use just the replica count, and let something else respond to the desired number of instances by watching that machineset.
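To make the AWS mechanics above concrete, here is a minimal sketch of how a spot request with a max bid is attached to a RunInstances call. This is not the linked CAPA change, just an illustration using the aws-sdk-go EC2 API as I understand it, with placeholder AMI and price values:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// RunInstances with InstanceMarketOptions asks EC2 for spot capacity.
	// If the max price is below the current spot price, the request may
	// simply not be fulfilled -- which is the "it can wait until the price
	// is X" behaviour described above.
	out, err := svc.RunInstances(&ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder AMI
		InstanceType: aws.String("m5.large"),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		InstanceMarketOptions: &ec2.InstanceMarketOptionsRequest{
			MarketType: aws.String(ec2.MarketTypeSpot),
			SpotOptions: &ec2.SpotMarketOptions{
				MaxPrice:         aws.String("0.10"), // max bid in USD/hour, placeholder
				SpotInstanceType: aws.String(ec2.SpotInstanceTypeOneTime),
			},
		},
	})
	if err != nil {
		log.Fatalf("spot request failed: %v", err)
	}
	for _, i := range out.Instances {
		fmt.Println("requested spot instance:", aws.StringValue(i.InstanceId))
	}
}
```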
What I'd like is to consolidate and enforce behaviour for the machine resource across providers at the machine controller level for the case when a given cloud provider decides to interrupt and terminate a spot instance. That way the behaviour is consistent and predictable, and upper-level controllers can react appropriately, i.e. I'd probably expect the machine to go Failed in such a case.
From the machine controller's POV, there shouldn't be anything new. The instance goes away for one reason or another and we react accordingly; in the default case, do nothing, and leave it to the remediation layer if desired. The day-1 use case I'm thinking of is: I scale up, say, 20 spot instances and they process through my batch jobs. After 24 hours, I might scale them down. During that time, if a handful of instances go away, I'm not really concerned about it. After those 24 hours, I scale the machineset down to zero, and I might repeat the same thing next week.
I think one potential issue here is that we'd need to react to the signal to shut down the spot instance in order to fast-track the cordon/drain of the Node, faster than we would otherwise need to for other types of remediation.
@detiber I've linked a project that does just that in my first comment. Someone could optionally deploy that thing (I believe it runs as a daemonset) if they so chose. Or we can develop a different daemonset to notify some other system. In any case, I don't view this as a day-1 feature; it's perfectly fine to have workers that might disappear quickly. I would suggest taints as a starting point.
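As a sketch of the taint suggestion, spot-backed nodes could carry a taint that only interruption-tolerant workloads tolerate. The taint key below is hypothetical, not an agreed convention:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Taint applied to spot-backed Nodes, e.g. via the kubelet's
	// --register-with-taints flag at node registration time.
	// The key is a made-up example, not an established convention.
	spotTaint := corev1.Taint{
		Key:    "node.example.io/interruptible",
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Matching toleration that an interruptible batch workload would
	// carry in its pod spec to opt in to running on those nodes.
	toleration := corev1.Toleration{
		Key:      spotTaint.Key,
		Operator: corev1.TolerationOpEqual,
		Value:    spotTaint.Value,
		Effect:   spotTaint.Effect,
	}

	fmt.Printf("taint: %+v\ntoleration: %+v\n", spotTaint, toleration)
}
```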
/assign @JoelSpeed
I've put together a proposal which I discussed a bit with @vincepri last week. I've added a large section towards the bottom of this about the state of Spot instances across AWS, Azure and GCP, so if anyone is unfamiliar with them or would like a refresher on how they work, please read the additional details I've added. The proposal TL;DR is basically that:
Vince had some thoughts around this (correct me if I'm wrong Vince):
I'd like to gather feedback from the owners of the Azure, GCP and AWS providers on the proposal to add fields to their MachineSpecs, and their thoughts on the stability and implications of adding these fields, as Vince has raised. I don't know who to poke to gather this feedback, however, so if anyone knows who I should be poking, please let me know 😃
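For anyone skimming, a rough sketch of what "adding fields to their MachineSpecs" could look like is below. The names and types are illustrative only and would be settled by each provider's own API review, not by this sketch:

```go
// Package sketch: illustrative shapes only, not any merged provider API.
package sketch

// AWS (AWSMachineSpec): the presence of the block requests a spot instance;
// a nil MaxPrice means "bid up to the on-demand price".
type SpotMarketOptions struct {
	MaxPrice *string `json:"maxPrice,omitempty"`
}

// Azure (AzureMachineSpec): the same idea for Azure Spot VMs.
type SpotVMOptions struct {
	MaxPrice *string `json:"maxPrice,omitempty"`
}

// GCP (GCPMachineSpec): preemptible VMs have no bidding, so a single
// boolean field on the spec would be enough, e.g.
//
//	Preemptible bool `json:"preemptible,omitempty"`
```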
Propose AWS spot instances support kubernetes-sigs#1876
Propose spot instances support kubernetes-sigs#1876
To give an update on this, I synced up with the CAPA maintainers at their call on Monday (March 9th), and with @CecileRobertMichon representing CAPZ today (if anyone else seeing this has questions and wants a call to sync, please ping me). Both seemed to support the initial proposal and the direction of trying to get consistent-looking approaches to the feature across AWS/GCP/Azure; the current direction I've been agreeing with people is:
From these calls I have a couple of questions to answer:
If there are more questions, please add them to the doc and I will try to address them.
Does this have any impact on Cluster API controllers and types?
No, it shouldn't. In terms of implementation, it should basically be adding a field to each of the infrastructure providers' MachineSpecs. The only changes I expect within this repo are within the docs.
Got it. It sounds like we might need to follow up in each infrastructure provider with some issues, and keep this one open for the next cycle once we look at MachinePool.
Linked the capz issue ^
From my POV, the provider fields are less rigid. They shouldn't be forced to support things they don't want to support. It's up to each provider to specify an API migration story. "We don't think anyone is using this and it's broken anyway" is reason enough for me to drop it from a particular provider.
I can see that, but within SIG-sponsored sub-projects we're trying to adhere to the usual Kubernetes API guarantees. In this case, after adding a field, we should wait until the next cycle to drop / deprecate it, which is probably ok-ish.
The providers should adhere to API contracts, so a change to remove a field should be handled accordingly.
I agree with deprecating and removing over a release, or just removing it in the next release if it's clear that the field is not GA. I'm not advocating we just remove it mid-release.
What the providers do as far as their fields and implementations go should be transparent to cluster-api. We've got some standards, but it's outside of our domain to dictate what is best for each provider, IMO. In the case of a provider that lives in the k8s-sigs org, yeah, we should probably do things the k8s way. But I view providers as things you replace and/or modify to your own needs.
To clarify, my comment was more about how to expose the field while the feature is still in preview. I'm not too concerned about the feature not making it out of preview.
This is why I originally asked to make all of this an experiment and have it only be under MachinePool; it's way easier to make changes to something that hasn't been published yet.
@JoelSpeed what do you think about that? It does make sense to use MachinePool, since the Spot feature would also be experimental at first. Also, it might be a forcing function for all the infra providers to adopt MachinePools :)
For just Azure, using MachinePools is fine. For the other providers, I don't think we should gate on how MachinePools work, because they are the least mature.
@CecileRobertMichon @vincepri I'd expect this to be exposed at the provider spec level. Personally, I don't see an issue with Azure Spot Virtual Machines being in preview (https://azure.microsoft.com/en-us/pricing/spot/); I see that as orthogonal to the maturity of our API. We would just refer CAPI users to the cloud provider docs for that particular feature's expectations, while still guaranteeing our API contracts if Azure Cloud ever drops support for this.
I agree; I don't think we want to tie this too heavily into MachinePools, since not all users will want the MachinePool semantics.
I'm in the camp with Michael, Alberto and Jason on this one. We know the APIs for AWS and GCP are stable as they haven't changed in a long time, so this problem of being uncertain about the cloud API, or having to signal to users that this is a preview feature, really only applies to Azure. That begs the question of whether we should just delay implementing this in Azure, but that also isn't ideal, as I wanted to make sure we are consistent with this feature across the providers that do support it.

Honestly, I don't think the API field changing is likely to be a problem. Azure Spot is their second iteration of this (they're stopping low-priority, which was the first iteration), so I would imagine this will be more stable and will probably make it out of preview without many user-facing changes.

I don't really want to hold up support on AWS and GCP to wait for MachinePool, as we currently have no timeline for when those are due. And as Jason says, I think we will likely have users who want to use this feature with individual machines, and probably machinepools in the future too.
Apologies, I may have misunderstood that when I was typing this up; my note on it from the call wasn't particularly great. In terms of signalling it, I think we just make a note in the docs telling users this is a preview feature and not to use it in production, and link to the Azure docs which say the same.
I've just gone through and updated the doc based on the comments given in it, and moved the termination handling stuff to a future-work section, so that we can focus the first pass of spot support on just the API fields and on patching the options through to the create-instances calls.
I'm fine with either approach. The potential issue I'd like to avoid is to have changes made to the CAPI controllers (especially Machine) to take into account spot semantics. This doesn't seem to be the case today given the conversations above, so 👍 if the infrastructure providers are fine supporting it.
I've opened a PR to add the proposal formally: #2817
https://azure.microsoft.com/en-us/updates/azure-spot-virtual-machines-are-now-generally-available/ Looks like we don't need to worry about that anymore 🙂 cc @JoelSpeed
Thanks for the heads up @CecileRobertMichon, will update the proposal 🙂
@JoelSpeed can we close this one now that the proposal has been merged?
We should probably follow up with a documentation task to write up in the book what infrastructure providers are expected to follow as a convention.
Yep, next steps I would say are:
/kind documentation
/milestone v0.3.x
@JoelSpeed @enxebre can this issue be closed?
I think we still need to put some docs into the cluster-api book for this
/help
@vincepri: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Given that the work here has been done and the proposal has been merged, I'll go ahead and close this issue for now. Let's follow up on provider implementer docs separately.
/close
@vincepri: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
User Story
As a user, I would like to leverage Cluster API, and particularly Machines, to run "spot" instances so I can run specific workloads on cheaper instances.
Detailed Description
This is to start a discussion about possible paths forward for supporting spot instances.
Spot instances (aka preemptible VMs in GCP, or low-priority VMs in Azure) have different API exposure and different behavioural peculiarities on each cloud provider.
We'd need to come up with an API/behavioural cross-provider abstraction that we want to expose, and see if that fits within the existing machine/machineSet/machineDeployment abstractions or whether a new resource abstraction should emerge in addition. E.g.:
Do we want to support the concept of fleets, as in AWS? Do we only support individual requests?
What do we do when a spot instance is terminated? Do we let the spot request keep trying?
As a very rough guess, we could expose a gated list of provider-specific parameters at each infra providerSpec level and standardise/enforce the behaviour at the machine controller level as follows:
For groups of spot instances this might be relevant: 📖 CAEP: MachinePool API (provider agnostic scale types) #1703
Alternatively, a new abstraction might be created, the likes of spotMachine or spotPool.
Pods running on a spot instance won't be shut down gracefully. This needs a particular solution for each provider, e.g. https://github.com/aws/aws-node-termination-handler/blob/master/README.md
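As an illustration for the AWS case, handlers like the one linked above poll the EC2 instance metadata service for a spot interruption notice. A minimal sketch of that loop (with the cordon/drain step left as a stub, and assuming IMDSv1-style access) could look like:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// EC2 instance metadata endpoint that starts returning 200 roughly two
// minutes before a spot instance is interrupted.
const spotActionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(5 * time.Second) {
		resp, err := client.Get(spotActionURL)
		if err != nil {
			log.Printf("metadata query failed: %v", err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			// Interruption notice received: cordon/drain this node now,
			// rather than waiting for generic remediation to notice the
			// instance is gone.
			log.Println("spot interruption notice received, draining node")
			drainNode() // stub: e.g. shell out to `kubectl drain` or use the eviction API
			return
		}
	}
}

func drainNode() { /* placeholder for cordon + drain logic */ }
```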
Anything else you would like to add:
/kind feature