Feature Request: Vault Agent Auto-Auth AWS ec2 method should support storing or providing the nonce #5312

Closed
stephansnyt opened this issue Sep 10, 2018 · 10 comments

Comments

@stephansnyt
Contributor

Is your feature request related to a problem? Please describe.
Our current AWS ec2 auth flows involve generating our own nonce before doing the vault write where we provide the signed PKCS7 to get a token. This lets these long-lived EC2 instances reauthenticate as needed. We can't switch to the Vault Agent Auto-Auth aws method because it does not support providing or storing the nonce.
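
Roughly, the flow looks like this (a simplified sketch using the Go API client, github.com/hashicorp/vault/api; the role name is a placeholder and the uuid package is just one convenient way to generate a nonce):

// Manual ec2 login with a self-managed nonce: generate it once, persist it,
// and send the same value on every reauthentication so the TOFU check passes.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"strings"

	"github.com/google/uuid"
	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Fetch the PKCS7-signed instance identity document from instance metadata.
	resp, err := http.Get("http://169.254.169.254/latest/dynamic/instance-identity/pkcs7")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	raw, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	pkcs7 := strings.ReplaceAll(string(raw), "\n", "")

	// Our own nonce, generated up front; we persist and reuse this for reauth.
	nonce := uuid.New().String()

	client, err := vault.NewClient(vault.DefaultConfig()) // VAULT_ADDR from env
	if err != nil {
		log.Fatal(err)
	}
	secret, err := client.Logical().Write("auth/aws/login", map[string]interface{}{
		"role":  "my-role", // placeholder role name
		"pkcs7": pkcs7,
		"nonce": nonce, // same value on every subsequent login
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("token:", secret.Auth.ClientToken)
}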

Describe the solution you'd like
Vault Agent Auto-Auth's AWS ec2 method should be as flexible as doing the authentication yourself, which means supporting either providing your own nonce or storing the one returned by Vault. For providing your own, it would be easy to add a new optional field to the aws auto-auth configuration, like this:

master...stephansnyt:patch-1

Storing a nonce provided by Vault could be handled by creating a new sink or extending the existing one.

Describe alternatives you've considered
I think I understand that this was left out for security reasons (due to "the complexity of the Trust On First Use (TOFU) model used in the ec2 method..."), but I think adding nonce support to aws ec2 auto-auth isn't any less secure than what we're already doing.

@jefferai
Member

The precursor to Agent (which was never open sourced) stored the nonce, but as we all sort of understand, storing the nonce is not only just as good as storing the token itself, it's even better since it can be used to get new tokens with fresh lifetimes. So we left it out of here for the moment until we could figure out some kind of security strategy.

We'd strongly prefer/suggest people use IAM instead, with ec2 inferencing, as it doesn't have this problem.

@jefferai
Member

(Responding to some out-of-band questions:)

The main difference without the nonce is liveness -- as long as the box is still running, the identity doc can be used from any server. That's why the whole TOFU model exists, but it's also quite expensive operationally since we have to track all of those nonces and operators have to deal with them. Because the GetCallerIdentity queries are only valid for a certain period of time, we don't know who on the box is making the call, but we know it's coming from that box, and it's recent.

Of course, in some situations you might rather do a bootstrap and root protect the nonce because you absolutely want that TOFU model. That's why the docs are (probably too) detailed trying to explain the pros and cons.

As I said earlier, the agent doesn't support the nonce not because it's opinionated, but because we were just trying to see if we could figure out a better way of protecting it. One possibility, for instance, would be to response-wrap the nonce after it's received. That essentially keeps the TOFU mechanic going -- whatever process unwraps that token can get the nonce, but then if that was done illegitimately, the legitimate user can start throwing errors. Of course, this requires some permissions on the token being returned which the user may not have, so likely a plain-nonce fallback method would still be required. Other options might be to protect it via KMS encryption, which isn't necessarily better in any real process way but at least keeps the value from sitting on disk in plaintext.
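
As a rough sketch of the response-wrapping idea (plain Go API client calls rather than agent code; the 5m TTL and the "nonce" field name are arbitrary choices for the example):

// Response-wrap the nonce via sys/wrapping/wrap so it never sits on disk in
// plaintext: hand out only the single-use wrapping token and unwrap exactly
// once. A second unwrap attempt fails, which signals someone else got there
// first, so the TOFU mechanic is preserved.
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func wrapNonce(client *vault.Client, nonce string) (string, error) {
	// Ask Vault to wrap the response to this particular write for 5 minutes.
	client.SetWrappingLookupFunc(func(operation, path string) string {
		if path == "sys/wrapping/wrap" {
			return "5m"
		}
		return ""
	})
	secret, err := client.Logical().Write("sys/wrapping/wrap", map[string]interface{}{
		"nonce": nonce,
	})
	if err != nil {
		return "", err
	}
	return secret.WrapInfo.Token, nil
}

func unwrapNonce(client *vault.Client, wrappingToken string) (string, error) {
	secret, err := client.Logical().Unwrap(wrappingToken)
	if err != nil {
		return "", err // already unwrapped or expired: possible tampering
	}
	nonce, ok := secret.Data["nonce"].(string)
	if !ok {
		return "", fmt.Errorf("unexpected response shape")
	}
	return nonce, nil
}

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	token, err := wrapNonce(client, "example-nonce-value")
	if err != nil {
		log.Fatal(err)
	}
	nonce, err := unwrapNonce(client, token)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("recovered nonce:", nonce)
}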

The main thing is that these options may influence the design of the provider in the agent -- config, internal API, etc -- and we just didn't want to commit to something until we had a decent idea of how it should work. Even things like writing the nonce to a file -- should we put in a bunch of chmod/chown options? Should we just write it and let any provisioning scripts that started up agent take care of any permissions? We didn't have ideal answers at the time and didn't have the time to spend figuring them out.

That all said, the simplest strategy is of course just writing it somewhere without chmod and chown options. It wouldn't be hard to add, although I myself won't have time anytime super soon. I'll run it by the EM/PM and see if they want to prioritize it. If someone wants to PR it I'm happy to discuss design.
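
For illustration, that simplest strategy amounts to something like this (the path is hypothetical):

// Just write the nonce to a file with restrictive permissions and let
// whatever provisioned the agent take care of ownership.
package main

import (
	"io/ioutil"
	"log"
)

func main() {
	// 0600 limits reads to the agent's own user; no chown handling here.
	err := ioutil.WriteFile("/var/run/vault-agent/ec2-nonce", []byte("example-nonce-value"), 0600)
	if err != nil {
		log.Fatal(err)
	}
}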

@stephansnyt
Contributor Author

Thanks for taking the time to explain. I think I understand the problem now. You don't want to release and be stuck supporting a bad solution, and want to give the problem due consideration and design effort.

I've tested out the iam method and I think that will work for me.

@jefferai
Member

Hi Stephan,

I'm going to keep this open for now because I do actually want us to have an answer -- keeping the ticket open and milestoning it will make sure that it doesn't fall off our radar.

@jefferai jefferai reopened this Sep 11, 2018
@jefferai jefferai added this to the near-term milestone Sep 11, 2018
@lcgkm

lcgkm commented Jan 14, 2019

I really need this.
Because I can't use the iam method:
We want to enable role tags, but that constraint is valid only with the ec2 auth method and is not allowed when auth_type is iam.
https://www.vaultproject.io/api/auth/aws/index.html#create-role

@jefferai
Member

Role tags aren't needed (or really, are less useful) for IAM. The reason they exist for EC2 is the long-lived nature of the credentials. You're much better off using constantly rotated IAM credentials than relying on the EC2 identity doc and all the baggage that comes with it.

@lcgkm

lcgkm commented Jan 15, 2019

I know constantly rotated IAM credentials are much better, but sometimes we have no choice...
Currently, we have to use the AWS ec2 method.

Or could the AWS ec2 method support nonce configuration and let us handle it ourselves?
For example:

pid_file = "./pidfile"

#exit_after_auth = true

auto_auth {
  method "aws" {
    mount_path = "auth/aws_ec2"

    config = {
      type = "ec2"
      role = "test"
      nonce = "xxx-xxx-xxx"
    }
  }

  sink "file" {
    wrap_ttl = "5m"

    config = {
      path = "/tmp/vault.token"
    }
  }
}

https://www.vaultproject.io/docs/agent/autoauth/methods/aws.html#configuration

For reference:
We can already set the Consul ACL token ourselves in the Vault configuration.
https://www.vaultproject.io/docs/configuration/storage/consul.html

@cove

cove commented Jan 22, 2019

My current use case is distributing Teleport join tokens to instances on startup, so it would be nice to use TOFU, since if an instance is compromised the attacker could simply grab the current join token and add a trusted node to the cluster.

@jefferai if I'm following, the core issue is storing lots of nonces in a database and expiring them, correct?

Perhaps these have already been discussed and ruled out for some reason I'm not seeing, but here are a couple of approaches that come to mind, in case they're helpful at all.

Instead of a traditional nonce, use a cryptographically signed token that contains an expiry time to limit its validity period. In fact, use a JWT with an expiry time and a unique ID encoded in it, since JWTs are compact and easy to support these days.
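
Something along these lines, for example (a dependency-free sketch that signs an expiry plus a unique ID with HMAC; a real JWT library would give the same shape):

// A self-validating "nonce": an HMAC-signed payload carrying an expiry and a
// unique ID, so the server can verify it without a lookup and only has to
// remember IDs until they expire.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
	"time"

	"github.com/google/uuid"
)

var key = []byte("server-side secret") // placeholder signing key

func sign(payload string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// Issue creates a token of the form "<expiry-unix>.<id>.<signature>".
func Issue(ttl time.Duration) string {
	payload := fmt.Sprintf("%d.%s", time.Now().Add(ttl).Unix(), uuid.New().String())
	return payload + "." + sign(payload)
}

// Verify checks the signature and the expiry; replay detection against the
// unique ID would happen in the partition/bloom-filter store described below.
func Verify(token string) (id string, err error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return "", errors.New("malformed token")
	}
	payload := parts[0] + "." + parts[1]
	if !hmac.Equal([]byte(sign(payload)), []byte(parts[2])) {
		return "", errors.New("bad signature")
	}
	var exp int64
	if _, err := fmt.Sscanf(parts[0], "%d", &exp); err != nil || time.Now().Unix() > exp {
		return "", errors.New("expired")
	}
	return parts[1], nil
}

func main() {
	tok := Issue(15 * time.Minute)
	id, err := Verify(tok)
	fmt.Println(id, err)
}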

Partition dropping method: used nonces are stored in a path based on the 15-minute window containing their expiry time (e.g. /secret/2019-01-21-00:00-00:15/guid), and every 15 minutes the previous partitions are dropped from the tree efficiently. To test for a replay, you validate the nonce's signature, get the expiry time, and check whether the nonce exists in its expiry window in the tree. If it doesn't exist you create it; if it does, you reject the transaction.
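
A sketch of that check, with an in-memory map standing in for whatever storage backend would actually hold the partitions:

// A nonce is recorded under a path derived from its 15-minute expiry window,
// so expired windows can be dropped wholesale instead of entry by entry.
package main

import (
	"fmt"
	"time"
)

var partitions = map[string]map[string]bool{} // window path -> set of seen IDs

func window(expiry time.Time) string {
	start := expiry.UTC().Truncate(15 * time.Minute)
	end := start.Add(15 * time.Minute)
	return fmt.Sprintf("secret/replay/%s-%s", start.Format("2006-01-02-15:04"), end.Format("15:04"))
}

// CheckAndRecord returns false if the ID was already seen in its window (replay).
func CheckAndRecord(id string, expiry time.Time) bool {
	p := window(expiry)
	if partitions[p] == nil {
		partitions[p] = map[string]bool{}
	}
	if partitions[p][id] {
		return false // replay: reject the transaction
	}
	partitions[p][id] = true
	return true
}

// DropExpired deletes whole windows that ended in the past.
func DropExpired(now time.Time) {
	cutoff := window(now)
	for p := range partitions {
		if p < cutoff { // zero-padded timestamps: lexical order matches time order
			delete(partitions, p)
		}
	}
}

func main() {
	exp := time.Now().Add(10 * time.Minute)
	fmt.Println(CheckAndRecord("nonce-id-1", exp)) // true: first use
	fmt.Println(CheckAndRecord("nonce-id-1", exp)) // false: replay
	DropExpired(time.Now())
}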

Bloom filter method: building on the partitioning system above for quick culling, you can reduce the number of keys stored in each partition by keeping a bloom filter instead of storing them directly (e.g. /secret/2019-01-21-00:00-00:15/bloomfilter). You make m (the key space) a large fixed size so that the chance of a false positive is acceptably low, e.g. 1 in 150 million for 1 million nonces (so a failed EC2 instance boot is more likely than a nonce-replay false positive). That requires m to be about 10 million bits, but since you only need to store the nonces actually seen, and there are some hash collisions, the amount of data that really needs to be stored in the database can be in the tens of KB range, based on systems I've built in the past. It also requires k (the number of hashes) to be about 8 according to the calculator linked below, but using murmur3 as the hash is exceptionally fast. Here's what the curves look like: https://hur.st/bloomfilter/?n=1000000&p=.00000001&m=10MB&k=
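
And a sketch of the per-partition filter itself, using standard-library FNV with double hashing in place of murmur3 just to keep the example dependency-free; the sizing follows the numbers above:

// Each window keeps an m-bit filter with k hashes instead of every nonce ID.
package main

import (
	"fmt"
	"hash/fnv"
)

type Bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// Two base hashes, combined as h1 + i*h2 (the standard double-hashing trick).
func hashes(id string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(id))
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // perturb to derive a second, roughly independent hash
	return h1, h.Sum64() | 1
}

func (b *Bloom) Add(id string) {
	h1, h2 := hashes(id)
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % b.m
		b.bits[bit/64] |= 1 << (bit % 64)
	}
}

// Test returns true if the ID is possibly present (false positives possible,
// false negatives not).
func (b *Bloom) Test(id string) bool {
	h1, h2 := hashes(id)
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % b.m
		if b.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(10000000, 8) // ~10 million bits, 8 hashes, per the numbers above
	f.Add("nonce-id-1")
	fmt.Println(f.Test("nonce-id-1")) // true
	fmt.Println(f.Test("nonce-id-2")) // almost certainly false
}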

Also happy to chat oob to share more if it's helpful at all.

@lcgkm

lcgkm commented Jun 28, 2019

@stephansnyt
Reference:
#6953

@jefferai
Copy link
Member

Closing as #6953 handles this.

@pbernal pbernal removed this from the near-term milestone May 15, 2020