Make cluster creation and deletion idempotent, repeatable and resumable from partial state #86
Conversation
The problem: under ClusterProvider.CreateCluster, there are four tasks that need to complete. The first two run in parallel; the next two run serially, but are not gated on the success of the first two. As a result, if more than one task fails, the cluster creation process can get stuck attempting to write to the unbuffered taskErr channel.

The operational fix: buffer the channel with five slots for errors, so a failed write can never block. eksctl is not a high-performance app, so the buffer costs nothing and removes surprises.

The systematic fix: when writing to the channel, provide a default escape clause so that a write that cannot complete is noticed and warned about instead of hanging silently.
When an EKS cluster is deleted, if the control plane does not exist, the entire delete operation is aborted. This can leave many AWS artifacts behind: nodes, security groups, subnets, VPCs, and so on. There is a real material cost to leaving these resources around. This change makes delete a repeatable, idempotent operation: a paused, aborted, or failed delete can be re-run until all resources are eventually deleted.
This PR makes two major changes:

1. No action at a distance when opening/closing channels. Channels are opened and defer-closed in the same scope, and only receive-only channels are returned from functions. All functions are non-blocking and return immediately. Callers are now forced to read a single channel to get any returned errors, and because only read-only channels escape, nobody can accidentally post to one.

2. No more return-nil protocol. By closing channels when their goroutines complete, any blocked readers are, by definition, unblocked instantly and receive the default zero value, which is nil. This removes another class of problems: return paths no longer need individual validation, as long as the channel close is assured (which is easier since it happens in the same scope).
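Both points above can be shown in one small sketch (createStack is a hypothetical name, not the PR's actual function): the channel is created and closed in the owning goroutine's scope, only a receive-only channel escapes, and closing it makes success observable as a nil receive with no explicit "send nil" needed.

```go
package main

import "fmt"

// createStack illustrates the channel discipline described above: the
// channel is created here, closed by the goroutine that owns it, and
// only a receive-only view is returned. The function itself never blocks.
func createStack(fail bool) <-chan error {
	errs := make(chan error, 1)
	go func() {
		defer close(errs) // closing guarantees readers are unblocked
		if fail {
			errs <- fmt.Errorf("stack creation failed")
		}
		// No explicit "return nil" send: on success, the closed
		// channel yields the zero value, nil, to any reader.
	}()
	return errs
}

func main() {
	if err := <-createStack(false); err != nil {
		fmt.Println("error:", err)
	} else {
		fmt.Println("stack created")
	}
}
```

The caller's contract is a single receive: a non-nil value is the error, and nil (from the closed channel) means success.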
When re-running cluster create, the first step is importing a public SSH key from the default location. However, the fallback was counter-intuitive: if the file didn't exist, an existing key (based on that filename) was searched for; but if the file did exist and the key had already been imported, the create command failed. And it didn't fail in any helpful way, which made scripting difficult. This change flips the flow:

1. First, the existence of the key is checked in EC2. If it exists, it is used.
2. If it does not exist, an import is attempted.

This means that if the create command is executed from a different host, the caller has to ensure THEIR key is used correctly.
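The flipped flow can be sketched against a small interface (keyStore, ensureSSHKey, and fakeStore are hypothetical names for illustration; the real code would sit on top of the AWS SDK's EC2 key-pair calls):

```go
package main

import (
	"errors"
	"fmt"
)

// keyStore abstracts the two EC2 operations involved (a hypothetical
// interface for this sketch, not eksctl's actual API).
type keyStore interface {
	KeyExists(name string) (bool, error)
	ImportKey(name string, material []byte) error
}

// ensureSSHKey implements the flipped flow: check EC2 first, and only
// import the local key material if no key with that name exists yet.
func ensureSSHKey(store keyStore, name string, material []byte) error {
	exists, err := store.KeyExists(name)
	if err != nil {
		return fmt.Errorf("checking key %q: %w", name, err)
	}
	if exists {
		return nil // reuse the existing key: re-runs are a no-op
	}
	if material == nil {
		return errors.New("no existing key and no local key to import")
	}
	return store.ImportKey(name, material)
}

// fakeStore is an in-memory stand-in for EC2, for demonstration.
type fakeStore struct{ keys map[string][]byte }

func (f *fakeStore) KeyExists(name string) (bool, error) { _, ok := f.keys[name]; return ok, nil }
func (f *fakeStore) ImportKey(name string, m []byte) error { f.keys[name] = m; return nil }

func main() {
	s := &fakeStore{keys: map[string][]byte{}}
	_ = ensureSSHKey(s, "EKS-mycluster", []byte("ssh-rsa AAA..."))
	// A second run is a no-op: the key already exists in EC2.
	_ = ensureSSHKey(s, "EKS-mycluster", nil)
	fmt.Println("key present:", len(s.keys) == 1)
}
```

Note the trade-off called out above: because the remote key wins, a re-run from a different host silently reuses whatever key is already registered under that name.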
1. Reuse SSH key: if an EKS-<clustername> key already exists, use it. More idempotency.

2. InstantTicker: since resources that may already exist are reused during stack creation, it makes no sense to wait 20 seconds for the first poll. Go's time.Ticker, unfortunately, doesn't support sending the initial tick instantly, so I built a small InstantTicker under utils. Canonically I should use a library, but taking on another external dependency for such a small utility felt unnecessary.
Moving out of WIP status; this is now ready for review. I was able to interrupt the script partway with Ctrl+C, and on re-run it successfully picks up and continues. This is crucial once the timeouts kick in, because an EKS cluster can take well over 5 minutes to become ready.
With PRs like this, we start to see the impact of #46: having no tests makes it really hard to see whether things are still working. What do you think of starting to add some with this PR? I'm not the project maintainer, so @errordeveloper's opinion would probably be better here.
Yes, that makes sense. I'll re-submit with proper tests once a testing structure/framework/guidelines are posted. I agree that someone else has likely also wanted to test against AWS SDK stubs.
Depends on: #84 and #85
I made some extensive changes to make eksctl able to pick up partially-created or existing resources in AWS. This is intended to allow resumption of a partially executed create run, and it also sets the stage for launching clusters in existing VPCs.