Make cluster creation and deletion idempotent, repeatable and resumable from partial state #86
Conversation
The problem: under ClusterProvider.CreateCluster, there are four tasks that need to complete. The first two run in parallel; the next two run serially, but are not gated on the success of the first two. As a result, if more than one task fails, the cluster creation process can get stuck attempting to write to the unbuffered taskErr channel.

The operational fix: buffer the channel with five slots for errors, so a failed write can never block. eksctl is not a high-performance app, so the buffer costs nothing and removes surprises.

The systematic fix: when writing to the channel, provide a default escape clause so that a write that cannot complete is noticed and warned about instead of hanging silently.
When an EKS cluster is deleted, if the control plane does not exist, the entire delete operation is aborted. This can leave many AWS artifacts behind: nodes, security groups, subnets, VPCs, and so on. There is a real material cost to leaving these resources around. This change makes delete a repeatable, idempotent operation: a paused, aborted, or failed delete can be re-run until all resources are eventually deleted.
This PR makes two major changes:

1. No action at a distance when opening/closing channels. Channels are opened and defer-closed in the same scope, and only receive-only channels are returned from functions. All functions are non-blocking and return immediately. Callers are now forced to read a single channel to get any returned errors, and because only read-only channels escape, nobody can accidentally post to one.

2. No more return-nil protocol. By closing channels when their goroutines complete, any blocked readers are, by definition, unblocked instantly and receive the default zero value, which is nil. This removes another class of problems: return paths no longer need individual validation, as long as the channel close is assured (which is easier since it happens in the same scope).
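Both points above can be shown in one small sketch (createStack is a hypothetical name, not the PR's actual function): the channel is created and closed in the owning goroutine's scope, only a receive-only channel escapes, and closing it makes success observable as a nil receive with no explicit "send nil" needed.

```go
package main

import "fmt"

// createStack illustrates the channel discipline described above: the
// channel is created here, closed by the goroutine that owns it, and
// only a receive-only view is returned. The function itself never blocks.
func createStack(fail bool) <-chan error {
	errs := make(chan error, 1)
	go func() {
		defer close(errs) // closing guarantees readers are unblocked
		if fail {
			errs <- fmt.Errorf("stack creation failed")
		}
		// No explicit "return nil" send: on success, the closed
		// channel yields the zero value, nil, to any reader.
	}()
	return errs
}

func main() {
	if err := <-createStack(false); err != nil {
		fmt.Println("error:", err)
	} else {
		fmt.Println("stack created")
	}
}
```

The caller's contract is a single receive: a non-nil value is the error, and nil (from the closed channel) means success.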
When re-running cluster create, the first step is importing a public SSH key from the default location. However, the fallback was counter-intuitive: if the file didn't exist, an existing key (based on that filename) was searched for; but if the file did exist and the key had already been imported, the create command failed. And it didn't fail in any helpful way, which made scripting difficult. This change flips the flow:

1. First, the existence of the key is checked in EC2. If it exists, it is used.
2. If it does not exist, an import is attempted.

This means that if the create command is executed from a different host, the caller has to ensure THEIR key is used correctly.
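The flipped flow can be sketched against a small interface (keyStore, ensureSSHKey, and fakeStore are hypothetical names for illustration; the real code would sit on top of the AWS SDK's EC2 key-pair calls):

```go
package main

import (
	"errors"
	"fmt"
)

// keyStore abstracts the two EC2 operations involved (a hypothetical
// interface for this sketch, not eksctl's actual API).
type keyStore interface {
	KeyExists(name string) (bool, error)
	ImportKey(name string, material []byte) error
}

// ensureSSHKey implements the flipped flow: check EC2 first, and only
// import the local key material if no key with that name exists yet.
func ensureSSHKey(store keyStore, name string, material []byte) error {
	exists, err := store.KeyExists(name)
	if err != nil {
		return fmt.Errorf("checking key %q: %w", name, err)
	}
	if exists {
		return nil // reuse the existing key: re-runs are a no-op
	}
	if material == nil {
		return errors.New("no existing key and no local key to import")
	}
	return store.ImportKey(name, material)
}

// fakeStore is an in-memory stand-in for EC2, for demonstration.
type fakeStore struct{ keys map[string][]byte }

func (f *fakeStore) KeyExists(name string) (bool, error) { _, ok := f.keys[name]; return ok, nil }
func (f *fakeStore) ImportKey(name string, m []byte) error { f.keys[name] = m; return nil }

func main() {
	s := &fakeStore{keys: map[string][]byte{}}
	_ = ensureSSHKey(s, "EKS-mycluster", []byte("ssh-rsa AAA..."))
	// A second run is a no-op: the key already exists in EC2.
	_ = ensureSSHKey(s, "EKS-mycluster", nil)
	fmt.Println("key present:", len(s.keys) == 1)
}
```

Note the trade-off called out above: because the remote key wins, a re-run from a different host silently reuses whatever key is already registered under that name.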
1. Reuse SSH key: if an EKS-<clustername> key already exists, use it. More idempotency.

2. InstantTicker: since resources that may already exist are reused during stack creation, it makes no sense to wait 20 seconds for the first poll. Go's time.Ticker, unfortunately, doesn't support sending the initial tick instantly, so I built a small InstantTicker under utils. Canonically I should use a library, but taking on another external dependency for such a small utility felt unnecessary.
Moving out of WIP status; this is now ready for review. I was able to interrupt the script partway with Ctrl+C, and on re-run it successfully picks up and continues. This is crucial once the timeouts kick in, because an EKS cluster can take well over 5 minutes to become ready.
With PRs like this, we start to see the impact of #46: having no tests makes it really hard to see whether things are still working. What do you think of starting to add some with this PR? I'm not the project maintainer, so @errordeveloper's opinion would probably be better here.
Yes, that makes sense. I'll re-submit with proper tests once a testing structure/framework/guidelines are posted. I agree that someone else has likely also wanted to test against AWS SDK stubs.
Depends on: #84 and #85
I made some extensive changes to make eksctl able to pick up partially-created or existing resources in AWS. This is intended to allow resumption of a partially executed create run, and it also sets the stage for launching clusters in existing VPCs.