This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

Fallback behavior when IAM is down #96

Closed
zxlin opened this issue Oct 26, 2017 · 20 comments
@zxlin

zxlin commented Oct 26, 2017

I'm sure many people noticed a very brief IAM outage earlier this week. During the outage, IAM was not responsive, and as a result this script went and deleted all of the local users synced from IAM, because IAM did not return a list of users.

I was hoping to discuss the options for some fallback behavior in the event of an IAM outage, or just a plain network connectivity outage.

@michaelwittig
Contributor

If the IAM API is down, the CLI will fail. As a result, the script will fail. It should not delete all users.
Why do you think that this will happen?
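For context, the intended failure mode is roughly this shape (a sketch only, assuming the sync script runs with `set -e`; the variable name and query are illustrative, not the actual script):

```bash
#!/bin/bash
# Sketch: with `set -e`, a non-zero exit from the AWS CLI aborts the script
# right here, so the deletion logic further down never runs during an outage.
set -euo pipefail

# If this call fails (network error, 5xx, throttling), the script exits.
iam_users=$(aws iam list-users --query 'Users[].UserName' --output text)

# ...user sync/deletion logic would only be reached on success...
```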

@zxlin
Author

zxlin commented Oct 27, 2017

It happened with the IAM outage earlier this week; /var/log/auth.log shows aws-ec2-ssh deleting all the users during the outage. Perhaps IAM returned a 200 and an empty list while it was still recovering?

Nonetheless, this is still a case we have to consider.

@michaelwittig
Contributor

That's kind of unexpected... the problem is that we can't really decide whether an empty list means there are no IAM users or that something is broken...

@zxlin
Author

zxlin commented Oct 30, 2017

It would be kind of annoying to implement, but a configurable number n of "confirms" that a user has been removed before actually deleting the user would be nice.

Keep some state: the list of users plus a counter of how many times each user has been missing from the returned list; on the n-th sync it's missing, delete the user.
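A rough sketch of what I mean (all names hypothetical: `$iam_users` would be the freshly fetched list, `$local_iam_users` the users previously synced, and the state directory is just an example):

```bash
#!/bin/bash
# Sketch of the "n confirms" idea: keep a per-user miss counter in a state
# file and only delete a user after N consecutive syncs without them.
N=3                                  # org-specific fault-tolerance setting
STATE_DIR=/var/lib/aws-ec2-ssh       # assumed location for counter state
mkdir -p "$STATE_DIR"

for user in $local_iam_users; do
  if echo "$iam_users" | grep -qw "$user"; then
    rm -f "$STATE_DIR/$user.misses"          # seen again: reset the counter
  else
    misses=$(( $(cat "$STATE_DIR/$user.misses" 2>/dev/null || echo 0) + 1 ))
    if [ "$misses" -ge "$N" ]; then
      /usr/sbin/userdel "$user"              # missing N syncs in a row
      rm -f "$STATE_DIR/$user.misses"
    else
      echo "$misses" > "$STATE_DIR/$user.misses"
    fi
  fi
done
```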

@michaelwittig
Contributor

I don't see how this would solve the problem. Depending on the length of the outage, the script would still delete all the users?

@richard-scott

richard-scott commented Oct 31, 2017

curl has lots of exit codes, so you can see exactly why the download of data failed and act accordingly.
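For illustration (a generic sketch, not taken from this project, which as the next reply notes doesn't actually use curl; the same pattern applies to any CLI's exit status):

```bash
# curl distinguishes failure modes by exit code, so a caller can tell
# "server returned an empty list" apart from "request never succeeded".
curl --fail --silent --show-error "https://example.com/users" > users.json
case $? in
  0)  echo "download ok" ;;
  6)  echo "DNS resolution failed - keep existing users" ;;
  7)  echo "connection failed - keep existing users" ;;
  22) echo "HTTP error (--fail) - keep existing users" ;;
  28) echo "timeout - keep existing users" ;;
  *)  echo "other curl failure - keep existing users" ;;
esac
```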

@zxlin
Author

zxlin commented Oct 31, 2017

@michaelwittig n can be set to an org's fault-tolerance preference, and if the outage does last longer than the n confirms allow, it at least gives the org time to react to system failures.

This is not the only way to fix it; it's just what I thought of first. I'm completely open to other ways to introduce some fault tolerance into this.

@michaelwittig
Contributor

@richard-scott there is no curl involved here at all...

@assertnotnull

I saw this issue, and maybe it's the problem I am facing here.
Today, in the same afternoon, import_users deleted the users twice.

@michaelwittig
Contributor

@assertnotnull we log to the system log (priority auth.info, tag aws-ec2-ssh) when users are created or deleted by aws-ec2-ssh. Depending on the configuration of your OS, the logs end up in different files; usually it is /var/log/messages. Could you provide the relevant log lines?
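For example, to find the relevant entries (the `logger` line below shows how such tagged entries get written in general; it's the standard util-linux tool, and the message text is illustrative):

```bash
# Search both common syslog destinations for the script's tagged entries:
grep 'aws-ec2-ssh' /var/log/auth.log /var/log/messages 2>/dev/null

# Entries with that priority/tag are produced along these lines:
logger -p auth.info -t aws-ec2-ssh "Deleting user exampleuser"
```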

@assertnotnull

Here's what I found in the logs
https://defuse.ca/b/ToCgVcUDGNTQj4gwv0YhS3

@michaelwittig
Contributor

@assertnotnull OK. And at what time was the user added again after Nov 1 20:40:20?

@MassimoSporchia

MassimoSporchia commented May 7, 2018

The same is happening to me during an outage on IAM: https://twitter.com/gbcis/status/993502731762655232
Even if the users are kept during an outage like this, they won't be able to log in to your system. Is there a way to cache public keys?

@packetfairy

After having this code shoot us in the face on three discrete occasions, here are the changes I made to better protect ourselves:

  • removed the -r from the userdel instruction, so that home directories are left in place during an outage that results in users being deleted
  • added extra bailouts to import_users (sketched after this list), because, at least for our environment, YES WE CAN conclude that an empty user list means that something is horribly, horribly broken
  • added generation of authorized_keys files to import_users and stripped the sshd configuration back out
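The empty-list bailout is roughly this shape (a simplified sketch of the second bullet, not my exact code):

```bash
# Sketch: treat "IAM returned no users at all" as an outage, not as
# "please delete everyone", and abort before any userdel runs.
iam_users=$(aws iam list-users --query 'Users[].UserName' --output text)
rc=$?
if [ "$rc" -ne 0 ] || [ -z "$iam_users" ]; then
  logger -p auth.err -t aws-ec2-ssh "IAM returned no users (rc=$rc); refusing to sync"
  exit 1
fi
```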

@zxlin
Author

zxlin commented May 7, 2018

@packetfairy Would you mind linking your repo? I'd love to take a look!

@packetfairy

I am managing the code as part of a larger repo, probably in a lamer way than I should be (i.e., without using submodules 😝), but here's a fork with my changes incorporated: https://github.com/packetfairy/aws-ec2-ssh/tree/condoms

@zxlin
Author

zxlin commented May 7, 2018

@packetfairy lol branch name. But I like this a lot actually!

Maybe the fallback logic could be enabled via some IAM_OUTAGE_PROTECTION flag in aws-ec2-ssh.conf, so that organizations can choose their own fault-tolerance level (rough sketch below).

Thoughts @michaelwittig?
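Something like this (IAM_OUTAGE_PROTECTION is a proposed name, not an existing setting; assuming the conf lives at /etc/aws-ec2-ssh.conf and `$iam_users` comes from the sync step):

```bash
# Proposed gating (sketch): read the hypothetical flag from the conf file
# alongside the existing settings, and bail out only when it is enabled.
[ -f /etc/aws-ec2-ssh.conf ] && . /etc/aws-ec2-ssh.conf

if [ "${IAM_OUTAGE_PROTECTION:-false}" = "true" ] && [ -z "$iam_users" ]; then
  logger -p auth.err -t aws-ec2-ssh "empty IAM user list; outage protection active, aborting"
  exit 1
fi
```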

@MassimoSporchia

MassimoSporchia commented May 7, 2018

@packetfairy is it just me, or are you still configuring sshd to run the script to authorize logins?

This way, users will still be able to access the instance even if the IAM APIs are down; the only thing that won't work is updating the SSH keys.

@packetfairy

@MassimoSporchia I removed the AuthorizedKeysCommand and AuthorizedKeysCommandUser configuration options from sshd_config on my own, but I did not update the installation/setup scripts to fit with my changes, as I manage my sshd_config separately with ansible. (Is that what you meant?)

Because, yes, as you observe: using this code, existing users should be able to access the instance even if the IAM API is down.

The bail-outs on empty iam_users and sudo_users prevent good users from being deleted from the system during an outage. Having a local authorized_keys file for each user prevents authentication timeouts/errors during an outage. And given that I now had local authorized_keys files generated with current data from the API, it felt foolish to also make an API call every time a user authenticated, so I just stripped that bit out.

As an unintended consequence, the first authentication pass is now lightning fast again!
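For reference, the key-caching half is roughly this (a condensed sketch, not the exact code from my branch; `$iam_users` comes from the earlier sync step, and the chown assumes a per-user group of the same name):

```bash
# Write each synced user's active IAM SSH keys into a local
# authorized_keys file, so logins keep working through an API outage.
for user in $iam_users; do
  home=$(getent passwd "$user" | cut -d: -f6)
  mkdir -p "$home/.ssh"
  key_ids=$(aws iam list-ssh-public-keys --user-name "$user" \
    --query 'SSHPublicKeys[?Status==`Active`].SSHPublicKeyId' --output text)
  : > "$home/.ssh/authorized_keys.tmp"
  for key_id in $key_ids; do
    aws iam get-ssh-public-key --user-name "$user" \
      --ssh-public-key-id "$key_id" --encoding SSH \
      --query 'SSHPublicKey.SSHPublicKeyBody' --output text \
      >> "$home/.ssh/authorized_keys.tmp"
  done
  mv "$home/.ssh/authorized_keys.tmp" "$home/.ssh/authorized_keys"
  chown -R "$user:$user" "$home/.ssh"
  chmod 700 "$home/.ssh" && chmod 600 "$home/.ssh/authorized_keys"
done
```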

@michaelwittig
Contributor

I merged the empty IAM user list detection from @packetfairy (thanks!).

The ability to configure userdel (e.g. homedir removal) is tracked here #112

The caching of local users is also discussed here #114

labkey-ians pushed a commit to LabKey/aws-ec2-ssh that referenced this issue Feb 22, 2019
* if IAM returns no users at all, it is likely down (implementation merged from https://github.com/packetfairy/aws-ec2-ssh/blob/condoms/import_users.sh#L285) fixes widdix#96

* Fix typo (widdix#125)

* Fix typo in rpm install output text
* Fix file name

* fix

* fix

* improved check

* re added creation policy

* changing aws command line detection to make it silent (widdix#127)

* fix RHEL

* increase timeout

* allow parallel tests

* fix RHEL showcase

* re-add creation policy

* Update README.md

bump version

* added license to templates

* Install from latest release instead of master (widdix#133)

Add argument to install script to specify release

* document ##ALL##

* fix tag enabled groups in multi account setup (widdix#136)

* added hint to AWS Systems Manager Session Manager

* Changed URI for RPM to latest release version (widdix#140)

* fix IAM SSH access

* fix

* Update README.md