Skip to content
This repository has been archived by the owner on Sep 21, 2020. It is now read-only.

[GPII-3326]: Apply audit config in gcp-project module #246

Merged
merged 15 commits into from
Jan 3, 2019

Conversation

natarajaya
Copy link
Contributor

This adds audit config with READ/WRITE logs for every resource type that we currently use.
Since AuditLogConfigs is still not supported by Terraform, this also adds custom script to modify freshly applied IAM policy.
And another change is disabling audit config that is being applied as part of gcp-secret-mgmt module.

@stepanstipl
Copy link
Contributor

stepanstipl commented Dec 19, 2018

I'll need to review this further, and I'm aware there's not support atm. in Terrafrom, but I don't like the run always approach (triggers = { nonce = "${var.nonce}"}) and always overwrite. To open a discussion in the meantime: There's a PR adding support for audit logs - hashicorp/terraform-provider-google#1531. Instead of implementing custom code, that does not really fit in Terrafrom model, what about building google provider with this PR. This would imho have several benefits:

  • Supporting upstream project we heavily use and depend on
  • We won't have to rework our approach (or not much at least) once the PR lands
  • We can potentially influence how the audit logs support will look like
  • We would use native TF approach (reconcile state only when needed, terraform plan functionality)
  • We use code shared and tested by many, that we don't have to maintain (as opposed to proprietary code developed and only tested [or not :)] by us)

Obviously there are some cons, main ones I see:

  • Necessity to setup build pipeline for the provider
  • We'll need to maintain this when we want to update provider (we don't do that very often and the PR will hopefully land in the 2.0.0)
  • The PR as of now does not seem to have tests (that sucks, but good opportunity to help)

@natarajaya
Copy link
Contributor Author

@stepanstipl I don't see any serious downsides of my implementation – changes to IAM policy is being overwritten anyway with google_project_iam_policy resource, I don't think forcefully applying it on every run may cause any unwanted side effects.

I would prefer not to complicate things with maintaining our own Terraform Google provider fork, and update this to pure Terraform implementation as soon as AuditLogConfigs are supported, this seems like a pretty straightforward change.

@mrtyler
Copy link
Contributor

mrtyler commented Dec 20, 2018

LGTM

Stepan's points warrant more discussion -- I am also worried about more custom code, and getting better at terraform provider development may be valuable since we are likely to remain heavy terraform users -- but I'm not sure this is the right time or the right feature to adopt that approach.

Copy link
Contributor

@stepanstipl stepanstipl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, in that case main point - we should not overwrite policies if they don't need to be overwritten. This has badly bitten us with the alerting and in general is not a good approach - it only increases chances of errors and various side-effects -> do not make changes when there's no need for them.

google_project_iam_policy resource does NOT overwrite policy if there are no changes to be made.

Also this warning comes up:

null_resource.add_audit_config (local-exec): Replace existing policy (Y/n)?
null_resource.add_audit_config (local-exec): The specified policy does not contain an "etag" field identifying a
null_resource.add_audit_config (local-exec): specific version to replace. Changing a policy without an "etag" can
null_resource.add_audit_config (local-exec): overwrite concurrent policy changes.

Now error handling, or the lack of it:

  • What if bindings=$(gcloud projects get-iam-policy ${google_project.project.project_id} --format json | jq -c -r .bindings fails for various reasons (jq seems to happily take empty input in this case echo "" | jq -c -r .bindings)?
    The bindings end up empty -> we overwrite policy with empty one with no bindings? Obviously if we do the overwriting every single time this is executed, chances of this happening are higher ;).

  • What if we get some unexpected output from the gcloud that jq can't parse? -> We never set the $bindings, therefore again end up overwriting policy with empty one.

  • What about timeouts - is this somehow handled (as TF doesn't seem to do so with local-exec)? Recently we had issue with destryong PVCs call that would happily run for hours.

  • What if the gcloud projects set-iam-policy call fails? This error is never propagated back to TF level, so everything will seem just fine. although there was an error.

@stepanstipl
Copy link
Contributor

stepanstipl commented Dec 20, 2018

I forgot to mention the usability/security aspect of this: imagine we actually want to find out when something changed (maybe an issue with CI to debug, maybe there was an incident and we discovered that someone gained extra IAM permissions they should not have).

If we modify things every thime, we would have to go through many (easily hundreds) setIamPolicy log events and compare if anything has been actually modified.

@natarajaya
Copy link
Contributor Author

@mrtyler @stepanstipl Thanks for reviews!

Now error handling, or the lack of it

If anything happens during current IAM policy retrieval, policy bindings variable will not be populated with valid json (jq returns empty string in case required attribute is missing in the input). And when we generate IAM policy json file, jq expects valid json with --argjson parameter, so it will fail. Same scenario if we mess something up in auditConfigs.json file. Even if we have valid json everywhere, but gcloud is not happy with policy definition, it will fail and nothing will get updated.

What about timeouts

There are only two gcloud calls that may get stuck due to network. I believe default network timeout in alpine is 60 seconds.

What if the gcloud projects set-iam-policy call fails? This error is never propagated back to TF level

This is not true. If anything fails inside local-exec block, failure propagates to null_resource and Terraform stops execution (this is how we get failed builds when Locust is not happy with test results, etc).

I do not understand how this case is similar to alerting policies case, but good point about unnecessary policy update events in logs – let me implement some measures to prevent that.

@stepanstipl
Copy link
Contributor

stepanstipl commented Dec 21, 2018

thanks @natarajaya, definitely much better not changing the policy when not needed.

jq expects valid json with ` parameter

Thanks for explaining the --argjson, that makes sense and does help with some of the error cases.

This is not true. If anything fails inside local-exec block, failure propagates to null_resource and Terraform stops execution

This is not the case. Please try this simple manifest:

resource "null_resource" "add_audit_config" {
  provisioner "local-exec" {
    command = <<EOF
      false
      true
    EOF
  }
}

Running terraform apply will end up with:

...


null_resource.add_audit_config: Creating...
null_resource.add_audit_config: Provisioning with 'local-exec'...
null_resource.add_audit_config (local-exec): Executing: ["/bin/sh" "-c" "      false\n      true\n    "]
null_resource.add_audit_config: Creation complete after 0s (ID: 2586165858741763649)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

So Terrafrom is happy, while there is obviously a command with non-zero return code (you can check that Terraform uses /bin/sh -c as default interpreter - https://github.com/hashicorp/terraform/blob/master/builtin/provisioners/local-exec/resource_provisioner.go#L82). So this is still an issue - if the set-iam-policy fails it will never get propagated to the TF level.

Simple solution might be to tell TF to use /bin/sh -ce - note the -e flag, a bit nicer would be to handle the error and still clean up after the scipt.

There are only two gcloud calls that may get stuck due to network. I believe default network timeout in alpine is 60 seconds.

I'm not familiar with all the implementation details, and I guess this would depend on inner implementation of gcloud and python2 resp., but I was easily able to make gcloud command get stuck for more than 60s by simulating network loss:

# normal
$ time gcloud projects get-iam-policy gpii-gcp-dev-stepan > /dev/null
real    0m 0.72s
user    0m 0.31s
sys     0m 0.04s

# slow & lossy network
$ tc qdisc change dev eth0 root netem loss 80%
$ time gcloud projects get-iam-policy gpii-gcp-dev-stepan > /dev/null
^C

Command killed by keyboard interrupt

Command terminated by signal 2
real    7m 45.56s
user    0m 0.31s
sys     0m 0.03s

I've terminated the command after 7m, but it seemed to be stuck for good. Given TF's local-exec does not handle this, and recent issue with destroying PVCs where we saw the pipeline keep running for 4+h till cancelled, I think we should handle this explicitly.

@stepanstipl
Copy link
Contributor

On another note - audit config support should be in the master of TF google provider... - hashicorp/terraform-provider-google#1531 (comment)

@natarajaya
Copy link
Contributor Author

@stepanstipl Thanks for the review!

I was easily able to make gcloud command get stuck for more than 60s by simulating network loss

Is this simulation realistic? Do we really expect the code to run on network with 80% packet loss? Again, I don't see how this is similar to situation with PVCs (where issue was some kind of race condition in K8s). But, I agree this makes a point, it sucks that Terraform does not support timeout configuration for null_resource. I can think of some simple forced timeout implementation on shell level, but it will look pretty ugly and I don't think this is really necessary here.

So Terrafrom is happy, while there is obviously a command with non-zero return code

Looks like I was not very accurate in my previous statement, what I meant is "if local-exec ends with non-zero exit code, failure will be escalated and Terraform will stop with error", so there is clearly an easy way to deal with this:

null_resource.add_audit_config (local-exec): Applying audit configs...
null_resource.add_audit_config (local-exec): ERROR: (gcloud.projects.set-iam-policy) INVALID_ARGUMENT: Invalid value at 'policy.audit_configs[0].audit_log_configs[0].log_type' (TYPE_ENUM), "DATA_DROOL"
null_resource.add_audit_config (local-exec): - '@type': type.googleapis.com/google.rpc.BadRequest
null_resource.add_audit_config (local-exec):   fieldViolations:
null_resource.add_audit_config (local-exec):   - description: Invalid value at 'policy.audit_configs[0].audit_log_configs[0].log_type'
null_resource.add_audit_config (local-exec):       (TYPE_ENUM), "DATA_DROOL"
null_resource.add_audit_config (local-exec):     field: policy.audit_configs[0].audit_log_configs[0].log_type
google_project_services.project: Still modifying... (ID: gpii-gcp-dev-natarajaya, 10s elapsed)
null_resource.add_audit_config: Still creating... (10s elapsed)
google_project_services.project: Modifications complete after 12s (ID: gpii-gcp-dev-natarajaya)

Error: Error applying plan:

1 error(s) occurred:

* null_resource.add_audit_config: Error running command

@mrtyler
Copy link
Contributor

mrtyler commented Dec 26, 2018

Do we really expect the code to run on network with 80% packet loss?

Fallacies of Distributed Software, #1: The network is reliable.

This adds audit config with READ/WRITE logs for every resource type that we currently use.

How will we know when this list of audit logs drifts from the list of resources that we currently use?

@natarajaya
Copy link
Contributor Author

Fallacies of Distributed Software, #1: The network is reliable.

@mrtyler Already agreed that it makes a point. Unfortunately, there is no straightforward way to set timeouts.

How will we know when this list of audit logs drifts from the list of resources that we currently use?

Do you want me to start a README section on that?

@mrtyler
Copy link
Contributor

mrtyler commented Dec 27, 2018

Do you want me to start a README section on that?

What I would like is a single source of truth, so there is only one thing to update or so that this list of resources can be calculated automatically. Next best would be some kind of alert if there is drift between resources in use and resources for which we have audit logs.

However, I'm not sure how to do those things now so I suppose documentation is the next best thing.

@mrtyler
Copy link
Contributor

mrtyler commented Dec 27, 2018

Looking again, can we leverage the list of APIs in common/modules/gcp-project/main.tf somehow?

@stepanstipl
Copy link
Contributor

Is this simulation realistic? Do we really expect the code to run on network with 80% packet loss?

Just a quick example to see how does gcloud handle this. Point being that we should consider how to handle timeouts to prevent never-ending local-execs since TF does not do this. That was also the parallel with the PVCs - the local-exec there would keep running for hours till terminated and block the CI pipeline.

I can think of some simple forced timeout implementation on shell level, but it will look pretty ugly and I don't think this is really necessary here.

If we continue to use local-exec (which is already "ugly" way to use TF), imho we should think about how to handle this problem. I was thinking how does TF handle timeouts block for null_resource, but does not: null_resource.test: [ERR] Error decoding timeout: Timeout Key (delete) is not supported :(.

@natarajaya
Copy link
Contributor Author

What I would like is a single source of truth

Looking again, can we leverage the list of APIs in common/modules/gcp-project/main.tf somehow?

@mrtyler Unfortunately, this is not possible. There are 21 APIs that we use and enable in common/modules/gcp-project/main.tf, but Audit Logs are available only for 10 of them.

Next best would be some kind of alert if there is drift between resources in use and resources for which we have audit logs

I spent some time thinking how we could possibly implement that and failed to come up with any reasonable solution. Without single source of truth, any check would also need to be updated when we add / remove active APIs, which basically makes it useless... So, let me start a README section on Audit Logs.

@natarajaya
Copy link
Contributor Author

If we continue to use local-exec (which is already "ugly" way to use TF),

@stepanstipl While I generally agree with your argument about "ugliness", I think we should accept reality here, and reality is such that with current state of things we can not completely get rid of local-exec resources in our infra code due to number of reasons. I think we should focus on getting to FERPA now, and revisit this question later, at better times.

imho we should think about how to handle this problem

Do you think that current implementation is not good enough and I should add some timeouts mechanic on shell level (it would look ugly, but it will work)?

@mrtyler
Copy link
Contributor

mrtyler commented Dec 28, 2018

There are 21 APIs that we use and enable in common/modules/gcp-project/main.tf, but Audit Logs are available only for 10 of them.

How about:

apis_with_audit_logs = [
  ...
]
apis_without_audit_logs = [
  ...
]
apis = concat(${apis_with_audit_logs}, ${apis_without_audit_logs})
...

?

@natarajaya
Copy link
Contributor Author

natarajaya commented Dec 28, 2018

apis_without_audit_logs

@mrtyler We would need to generate audit config on the fly. Let me think on that.

@stepanstipl
Copy link
Contributor

stepanstipl commented Dec 31, 2018

thanks @natarajaya

Do you think that current implementation is not good enough and I should add some timeouts mechanic on shell level (it would look ugly, but it will work)?

Do you think it's good to introduce more code (whether local-exec style or not) that is less robust than the majority of our code (and that does not fit the core pattern - TF plan/apply)?

I think it's ok (resp. not worth more effort) in the context that we will replace this pretty much as soon as TF Google provider is ready with TF native code.

@natarajaya
Copy link
Contributor Author

@mrtyler Please review new approach to generate audit config on the fly from TF API lists. I think it looks pretty good, but there is one unfortunate glitch: storage API that is required by the projects has storage-api.googleapis.com name, but the same API in audit configuration is just storage.googleapis.com. The reason for this is probably that there are 3 different storage APIs (storage-component, storage-api, storagetransfer) all having same audit configuration. I had to introduce another list variable because of that.

@stepanstipl I am not sure what are you proposing here, but I found another somewhat cleaner solution – added timeout to all gcloud calls. Seems like it does the job perfectly.

Copy link
Contributor

@mrtyler mrtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I suggested some different variable names for clarity, but I like the new approach.

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
gcp/modules/gcp-secret-mgmt/main.tf Show resolved Hide resolved
common/modules/gcp-project/main.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/main.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/main.tf Show resolved Hide resolved
common/modules/gcp-project/main.tf Outdated Show resolved Hide resolved
Copy link
Contributor

@stepanstipl stepanstipl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the timeout, definitely better, thanks.

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved
@mrtyler
Copy link
Contributor

mrtyler commented Jan 3, 2019

LGTM

@natarajaya natarajaya merged commit 6759fa6 into gpii-ops:master Jan 3, 2019
@natarajaya natarajaya deleted the audit_config branch January 8, 2019 01:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants