Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make cluster creation and deletion idempotent, repeatable and resumable from partial state #86

Closed
wants to merge 17 commits into from
Closed

Conversation

archisgore
Copy link
Contributor

Depends on: #84 and #85

I made some extensive changes to make eksctl be able to pick up partially-created or existing resources in AWS whenever they are partially created.

This is intended to allow resumption of a partially executed create run, and it also sets the stage for launching clusters in existing VPCs.

The Problem:

Specifically under ClusterProvider.CreateCluster, there are 4
tasks that need to complete.

The first two tasks run in parallel. The next 2 run serially,
but are not gated against success of the first two.

This means that if more than one task fails, the cluster creation
process can get stuck attempting to write to the unbuffered
taskErr channel.

The operational fix:

Buffer the channel with 5 slots for errors, ensuring there
are no surprises.

eksctl is not some high-performance app, so having a buffer
is useful.

The systematic fix:

When writing to the channel, provide a default escape clause
that allows a failed write to be noticed, and warned of. This
will prevent silent hangings.
When EKS cluster is deleted, if the control plane does not
exist, the entire delete operation is aborted. This can
leave a lot of AWS artifacts lying around - Nodes, security groups,
subnets, VPCs, etc.

There is a real material cost to leaving these resources around.

This change makes delete a repeatable and idempotent operation.

It allows a paused/aborted/failed delete to be repeated ensuring
all resources are deleted eventually.
@archisgore archisgore changed the title Make cluster creation idempotent and restartable WIP: Make cluster creation and deletion idempotent, repeatable and resumable from partial state Jun 27, 2018
This change makes 2 major changes:
1. No action-at-a-distance when it comes to opening/closing channels

Channels are opened and defer-closed in the same scope, and only readable
channels are returned from functions.

All functions are non-blocking, returning immediately. All callers
are now forced to read only one channel to get any returned errors.

This also means channels are not accidentally abused, meaning
nobody can accidentally post to a read-only channel.

2. No more return-nil protocol.

By closing channels when their goroutines are complete,
any blocked readers on the channels will, by definition,
be unblocked instantly and they'll get the default
zero-value which is nil.

This removes another class of problems where all return paths
no longer need to be validated, so long as the channel-closing is
assured (which is easier since it's within the same scope.)
When re-running cluster create, the first step is importing a Public SSH Key
from the default location.

However, the fallback was counter-intuitive - if the file doesn't exist,
then an existing key (based on that filename) is searched for.

But if the file does exist and the key has been imported already,
the create command fails. But it doesn't really fail
in any helpful way, and makes scripting difficult.

This change flips the flow:
1. First, existence of the key is checked in EC2. If it exists, it is used.
2. If it does not exist, it is attempted to be imported.

This means if the create command is executed from a different host, the caller
has to ensure THEIR key is used correctly.
1. Reuse SSH Key:
If EKS-<clustername> key already exists, use it. More idempotency.

2. InstantTicker:
Since resources that may already exist are reused, in create stack,
it makes no sense to wait 20 seconds for the first poll.

The time.Ticker in golang, unfortunately, doesn't support the initial
tick to be sent instantly. So built a small scrappy InstantTicker under utils.

I know canonically I should use a library, but taking another external dependency
for extreme normalization felt unnecessary.
@archisgore archisgore changed the title WIP: Make cluster creation and deletion idempotent, repeatable and resumable from partial state Make cluster creation and deletion idempotent, repeatable and resumable from partial state Jun 28, 2018
@archisgore
Copy link
Contributor Author

Moving out of WIP status. This is now ready for review. I was able to ctrl+c the script partially, and it successfully picks up and continues. This is crucial once the timeouts kick in, because an EKS cluster can take a lot more than 5 minutes to become ready.

@Raffo
Copy link

Raffo commented Jun 30, 2018

With PRs like this, we start to see the impact of #46 : having no tests makes really hard to see if things are still working. What do you think of start adding some with this PR? I'm not the project maintainer, so @errordeveloper opinion would probably be better here.

@archisgore
Copy link
Contributor Author

Yes, that makes sense. I'll re-submit once a testing structure/framework/guidelines are posted, with proper tests. I agree that there's likely to be someone else who's wanted to test against AWS SDK stubs.

@archisgore archisgore closed this Jul 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants