-
Notifications
You must be signed in to change notification settings - Fork 989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor cloud provider, add packing package #221
Conversation
} | ||
} | ||
return subnetIds, nil | ||
return c.vpc.GetTopologyDomains(ctx, key, c.spec.Cluster.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This direct passthrough feels a bit awkward to me. I'd probably leave the topology domain function at the top level, but retrieve the specific subnets, zones in the vpcprovider. For example, imagine if there was a topology domain that doesn't belong inside vpc.
ec2: ec2, | ||
securityGroupCache: cache.New(CacheTTL, CacheCleanupInterval), | ||
}, | ||
vpcProvider := &VPCProvider{launchTemplateProvider: &LaunchTemplateProvider{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: put launchtemplateprovider on a new line
pkg/cloudprovider/aws/fleet/vpc.go
Outdated
} | ||
} | ||
|
||
func (p *VPCProvider) getLaunchTemplate(ctx context.Context, clusterSpec *v1alpha1.ClusterSpec) (*ec2.LaunchTemplate, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think about the public and private methods of each of these components. IMO, GetLaunchTemplate should be a public method of VPCProvider. getConstrainedZonalSubnets is probably a private one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also -- launchTemplate seems a bit poorly encapsulated w/ vpc
instanceTypeOptions []string, | ||
zonalSubnetOptions map[string][]*ec2.Subnet, | ||
) (*string, error) { | ||
constraints *cloudprovider.CapacityConstraints, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, leaving constraints at the top level (i.e. "capacity.go") is by design. It creates a fairly stringy, poorly encapsualted codebase to be passing too many of these configuration objects (i.e. constraints/provisioner) to all of these subcomponents. I'd prefer to compute these zonal/subnet constraints at the top level and then pass them into the instance provider.
capacity.go: translates constraints into ec2 concepts (i.e. zone)
foo_provider.go: constructs ec2 resources from ec2 concepts (i.e. (instancetype, launchtemplate) -> instance)
"github.com/awslabs/karpenter/pkg/cloudprovider" | ||
) | ||
|
||
type VPCProvider struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like merging subnet/zones into the same provider and calling that "vpc". I could imagine this extending to security groups if we ever need to.
I'd expect the public interface of this object to be something like
GetZones() []string
GetZonalSubnets() [string]Subnet
If we go this route, I'd probably just merge the subnet provider with this file.
pkg/cloudprovider/types.go
Outdated
@@ -85,6 +85,28 @@ type CapacityConstraints struct { | |||
// CapacityPacking is a solution to packing pods onto nodes given constraints. | |||
type CapacityPacking map[*v1.Node][]*v1.Pod | |||
|
|||
// Packings contains the result of packing decision made by a CapacityPacking, an | |||
// instance ID as key and value is the node object and the pods packed in the instance | |||
type Packings map[string]*NodePacking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'm sold on exposing this at the cloud provider level. InstanceTypeOptions, etc seems like it's leaking our implementation for how we're computing the packing. At the cloud provider level, I think the original CapacityPacking struct is the correct abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could imagine that a packing is a
struct {
Node
Pods
}
But I think Id and InstanceTypeOptions are both awkward.
) | ||
|
||
// Factory returns a Packer to calculate the pod packing based of PackingMethod passed. | ||
func Factory(ec2 ec2iface.EC2API, method PackingMethod) cloudprovider.Packer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably hold off on premature abstraction of packing implementation (though I'm occasionally guilty of this, too). This pattern makes sense elsewhere in the codebase since our old APIs were so flexible. I would just use a packer directly.
|
||
// Get returns the packings for the provided pods. Computes an ordered set of viable instance | ||
// types for each packing of pods. Instance variety enables EC2 to make better cost and availability decisions. | ||
func (p *binPacker) Get(ctx context.Context, constraints *cloudprovider.CapacityConstraints) (cloudprovider.Packings, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The right verb is probably "Pack"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure we want to pass "constraints" through this interface.
if err != nil { | ||
// TODO Aggregate errors and continue | ||
return nil, fmt.Errorf("creating capacity %w", err) | ||
} | ||
instanceIds = append(instanceIds, instanceId) | ||
instancePackings[*ec2InstanceID] = instancePacking | ||
delete(instancePackings, tempID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever I see statements like "delete", "break", "continue", my overly-complex code detector starts going off. In this case, you're doing an in place mutation of a map and removing all of the dummy keys and replacing them with real ones. This is pretty tricky, especially if it ever needs to change. For example, it means that anyone solving "TODO Aggregate errors and continue" now needs to be really careful about making sure all of the right keys are added and delete when they change their logic.
A simpler "immutable" approach would simply construct a new map and only add the new packings to it. The old map will be garbage collected. However, given previous comments about where or not we even need "id", I think this code might change anyways.
@@ -49,8 +49,8 @@ func (a *GreedyAllocator) Allocate(provisioner *v1alpha1.Provisioner, pods []*v1 | |||
if err != nil { | |||
return fmt.Errorf("while creating capacity, %w", err) | |||
} | |||
for node, pods := range packing { | |||
if err := a.bind(ctx, node, pods); err != nil { | |||
for _, pack := range packing { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that it's a list, I'd probably use the terms for _, packing := range packings.
subnetProvider *SubnetProvider | ||
instanceProvider *InstanceProvider | ||
ec2 ec2iface.EC2API | ||
vpc *VPCProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd call this vpcProvider for parity with the others.
spec: spec, | ||
nodeFactory: f.nodeFactory, | ||
instanceProvider: f.instanceProvider, | ||
vpc: f.vpc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same naming here
instanceTypeOptions []string, | ||
launchTemplate *ec2.LaunchTemplate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for the reordering here? Logically, launchTemplate makes more sense to go first since it's the primary definition of the node, where the other fields are ancillary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this got jumbled up with changes back and forth. I should have checked the final diff here in Github
zonalSubnetOptions map[string][]*ec2.Subnet, | ||
) (*string, error) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why the newline here?
pkg/cloudprovider/types.go
Outdated
// PodPackings is a solution to packing pods onto nodes given constraints. | ||
type PodPackings map[*v1.Node][]*v1.Pod | ||
|
||
type Packings struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this struct is used internally by the AWS cloud provider implementation and probably belongs in aws/fleet/packer.go.
This cloudprovider types file should be as simple as possible and represent the minimal structs and interfaces necessary to implement a cloud provider.
pkg/cloudprovider/types.go
Outdated
|
||
// Packer is a method that takes contraints and calculates the pods that can be | ||
// efficiently placed on the instances. | ||
type Packer interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise -- Unless I'm misunderstanding something, this interface doesn't belong here.
I could imagine that we replace our Capacity.Create() interface with something similar to what you're going for here (e.g. Packer.Pack), but then we'd need to include constraints as well. I actually quite like the idea of separating pods from the constraints struct, so the cloud provider's implementation is effectively: packer.Pack(ctx, pods, constraints)
instanceId, err := c.instanceProvider.Create(ctx, launchTemplate, instancePacking.InstanceTypeOptions, zonalSubnets) | ||
podsMapped := make(map[string][]*v1.Pod) | ||
for _, packing := range packings { | ||
ec2InstanceID, err := c.instanceProvider.Create(ctx, packing.InstanceTypeOptions, launchTemplate, zonalSubnetOptions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd just call this instanceId for consistency elsewhere. EC2 doesn't add much value to the name here.
capacityPacking := cloudprovider.CapacityPacking{} | ||
for i, node := range nodes { | ||
capacityPacking[node] = instancePackings[i].Pods | ||
nodePackings := make(cloudprovider.PodPackings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This naming seems odd to me. I typically try to just name my variables according to the type. In this case, it's not clear if the concept we're dealing with here is a "pod packing" or a "node packing". Whatever we call it, let's be consistent.
I also see that you're a little cornered since you have two packing concepts in this function (not to mention the misnamed "podpackings"). Off the top of my head, I might call the one above an "InstancePacking" since it's EC2 Instance specific and the second one a "NodePacking" since its a kubernetes node.
subnetProvider *SubnetProvider | ||
} | ||
|
||
func (p *VPCProvider) GetLaunchTemplate(ctx context.Context, clusterSpec *v1alpha1.ClusterSpec) (*ec2.LaunchTemplate, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value of wrapping the launchTemplateProvider inside the vpc concept?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted a single abstraction over the things we need to configure/check like launchTemplate, subnet for creating an instance. Later may be we will add things like security groups or SSM they can all go under this abstraction and we can just pass the vpcProvider
around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok -- this approach makes sense to me. I'm not sure VPC is the right name though -- sorry to nitpick so much about names. In some sense, it is more or less the role of this entire package (i.e. dealing with subnets, security groups, zones, launch templates, etc). The size of that scope is what originally led me to break these apart into separate files and tie them together with capacity.go.
I expect all of this will evolve over time, and don't want to spend too much time bike shedding. I think it's best to just go with what you think is best (unless I see something obviously wrong/dangerous), and we can expect to keep reshaping the code as we grow.
* Switches presubmits to track main. (#139) * Switches presubmits to track main. * Address https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/ * Added WORKING_GROUP (#140) * Readme edits + organization docs (#142) * readme and admin docs * faq edits * doc edits * edits and organization * readme edits * readme edit * Remove Queue reference from NodeGroup overprovisioning example (#145) * correct a typo in README.md (#143) * readme typo (#146) * Capitalise TERMS.md link (#147) * Update OWNERS (#150) * Update README.md (#149) * Fix installation to target `main` branch instead of `master` (#153) * Cleaned up FAQs, Working Group Invite, and other typos (#148) * Update OWNERS (#155) add tabern * Switches to Allocatable from Capacity and exposes more human readable (#154) values in capacity reservation status. * Update FAQs.md (#156) * Update DESIGN.md (#158) Fixed typos * Update README.md (#160) * Update README.md * Update README.md * Update README.md * Initial github banner image (#162) * Initial github banner image * Small image fix and add to readme * Remove repo name from top of readme * Smaller readme banner. Add repo preview * HA Bottoms out at 1 DesiredReplica for Utilization and Value metrics (#161) * Update README.md (#163) * changed consumer context from customer to user (#168) * changed consumer context from customer to user (#167) * Reorganized OWNERS (#169) * grammar & syntax updates in docs/DESIGN.md (#170) * Add unit test for scalable node group controller (#157) * Add unit test for scalable node group controller * Add some more tests * Add error checks for SetReplicas call * Fix Unit tests * rename fakeError * update tests * rename a few things * rename fake error object * Fixed pointers comparison * Simplify tests a little * Updated release image to use public ecr (#171) * Improved release automation. Cut pre-release for v0.1.1 (#174) * Changed desiredReplicas to be a pointer to allow for displaying scale to (#172) * Changed desiredReplicas to be a pointer to allow for displaying scale to zero. * Updated tests * Created a minimal Helm Chart (#175) * Implemented a minimal helm chart with release automation using Github Pages * Removed stray files * Updated chart * Updated install instructions (#176) * Updated install instructions * Fixed a typo * Improved calendar invite for Working Group (#177) * Add working group meeting notes (#180) * Add working group meeting notes (#182) * Create NOTICE (#178) * Fix errors in hack/quick-install.sh (#185) Fix issues: 1/ Command helm upgrade --install karpenter charts/karpenter fails. 2/ Uninstall: Current order of operation leaves deployment: karpenter and statefulset: prometheus-karpenter-monitor remaining: * Fixed a few typos in developer guide (#187) The developer guide hadn't been updated after some of the make commands changed name. Also corrected an environment variable name. * Add notes from working group meeting (#192) * ScheduledCapacity API (#194) * fixed a small typo in metricsproducer (#195) * Add notes from working group meeting (#196) * Implemented Scheduled Capacity Metrics Producer (#200) * fixing merge problems * Implemented Scheduled Capacity Metrics Producer * Design Doc for ScheduledCapacity (#197) * Delete karpenter.ics (#204) Deleted in ics invite favor of google calendar. * enable build on provisioner branch too (#206) * Validations for MetricsProducers (#211) * MetricsProducer Schedule Validation * cleaned up validation efforts for non-schedules * Added Validation checks in all resource reconcile loops * Add working group notes (#217) * add working group notes * add some more notes to WG notes * Rebase provisioner-work into main. (#219) * Provisioner api (#198) * First commit for provisioner spec * remove comented code * Added a simple controller skeleton (#199) * Adding capacity API client (#202) * Adding fleet API client * Add capacity provisioner and refine the interface definitions * Provisioner skeleton code (#203) * Wired up controller to main and trivial test (#205) * More tweaks to test wiring (#207) * Add end to end functionality for pending pods (#209) * add end to end functionality * simplify functions after PR feedback * simplify create fleet method * fix comments, remove new line char and remove private create method * Implemented cloud provider initialization (#208) * Implemented Cloud Provider initialization and discovery logic. * Implemented cloud provider initialization * Cloudprovider improvements: Lazy Initialization, IAM Separation, and Topology Support (#213) * Create node objects and bind pods to nodes (#215) * Create node objects and bind pods to nodes * fix error check * fix error check * Reoriented Capacity CloudProvider around BinPacking (#214) * Setup packing interface and refactored instance provider for launching constrained instances (#216) * Pausing on packing * Setup packing interface and refactored instance provider for launching constrained instances Co-authored-by: Prateek Gogia <[email protected]> * Defaults for Metrics Producers (#222) * Refactor cloud provider, add packing package (#221) * refactor cloud provider, add packing package * Refactored allocator and improve logging (#218) * Skeleton code for reallocator (#235) * Generated v0.1.2 release Co-authored-by: Nate Taber <[email protected]> Co-authored-by: Guy Templeton <[email protected]> Co-authored-by: dherman <[email protected]> Co-authored-by: Guy Templeton <[email protected]> Co-authored-by: Tom Kerkhove <[email protected]> Co-authored-by: alethatarn <[email protected]> Co-authored-by: Justin Garrison <[email protected]> Co-authored-by: Cameron Senese <[email protected]> Co-authored-by: Prateek Gogia <[email protected]> Co-authored-by: Prateek Gogia <[email protected]> Co-authored-by: Henri Yandell <[email protected]> Co-authored-by: Jacob Gabrielson <[email protected]> Co-authored-by: Nick Tran <[email protected]>
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.