[GPII-3326]: Apply audit config in gcp-project module #246

natarajaya · 2018-12-19T03:56:57Z

This adds audit config with READ/WRITE logs for every resource type that we currently use.
Since AuditLogConfigs is still not supported by Terraform, this also adds custom script to modify freshly applied IAM policy.
And another change is disabling audit config that is being applied as part of gcp-secret-mgmt module.

stepanstipl · 2018-12-19T14:33:22Z

I'll need to review this further, and I'm aware there's not support atm. in Terrafrom, but I don't like the run always approach (triggers = { nonce = "${var.nonce}"}) and always overwrite. To open a discussion in the meantime: There's a PR adding support for audit logs - hashicorp/terraform-provider-google#1531. Instead of implementing custom code, that does not really fit in Terrafrom model, what about building google provider with this PR. This would imho have several benefits:

Supporting upstream project we heavily use and depend on
We won't have to rework our approach (or not much at least) once the PR lands
We can potentially influence how the audit logs support will look like
We would use native TF approach (reconcile state only when needed, terraform plan functionality)
We use code shared and tested by many, that we don't have to maintain (as opposed to proprietary code developed and only tested [or not :)] by us)

Obviously there are some cons, main ones I see:

Necessity to setup build pipeline for the provider
We'll need to maintain this when we want to update provider (we don't do that very often and the PR will hopefully land in the 2.0.0)
The PR as of now does not seem to have tests (that sucks, but good opportunity to help)

…nfig

natarajaya · 2018-12-19T23:55:42Z

@stepanstipl I don't see any serious downsides of my implementation – changes to IAM policy is being overwritten anyway with google_project_iam_policy resource, I don't think forcefully applying it on every run may cause any unwanted side effects.

I would prefer not to complicate things with maintaining our own Terraform Google provider fork, and update this to pure Terraform implementation as soon as AuditLogConfigs are supported, this seems like a pretty straightforward change.

mrtyler · 2018-12-20T05:50:41Z

LGTM

Stepan's points warrant more discussion -- I am also worried about more custom code, and getting better at terraform provider development may be valuable since we are likely to remain heavy terraform users -- but I'm not sure this is the right time or the right feature to adopt that approach.

common/modules/gcp-project/audit.tf

stepanstipl

Ok, in that case main point - we should not overwrite policies if they don't need to be overwritten. This has badly bitten us with the alerting and in general is not a good approach - it only increases chances of errors and various side-effects -> do not make changes when there's no need for them.

google_project_iam_policy resource does NOT overwrite policy if there are no changes to be made.

Also this warning comes up:

null_resource.add_audit_config (local-exec): Replace existing policy (Y/n)?
null_resource.add_audit_config (local-exec): The specified policy does not contain an "etag" field identifying a
null_resource.add_audit_config (local-exec): specific version to replace. Changing a policy without an "etag" can
null_resource.add_audit_config (local-exec): overwrite concurrent policy changes.

Now error handling, or the lack of it:

What if bindings=$(gcloud projects get-iam-policy ${google_project.project.project_id} --format json | jq -c -r .bindings fails for various reasons (jq seems to happily take empty input in this case echo "" | jq -c -r .bindings)?
The bindings end up empty -> we overwrite policy with empty one with no bindings? Obviously if we do the overwriting every single time this is executed, chances of this happening are higher ;).
What if we get some unexpected output from the gcloud that jq can't parse? -> We never set the $bindings, therefore again end up overwriting policy with empty one.
What about timeouts - is this somehow handled (as TF doesn't seem to do so with local-exec)? Recently we had issue with destryong PVCs call that would happily run for hours.
What if the gcloud projects set-iam-policy call fails? This error is never propagated back to TF level, so everything will seem just fine. although there was an error.

stepanstipl · 2018-12-20T13:09:57Z

I forgot to mention the usability/security aspect of this: imagine we actually want to find out when something changed (maybe an issue with CI to debug, maybe there was an incident and we discovered that someone gained extra IAM permissions they should not have).

If we modify things every thime, we would have to go through many (easily hundreds) setIamPolicy log events and compare if anything has been actually modified.

natarajaya · 2018-12-21T00:10:23Z

@mrtyler @stepanstipl Thanks for reviews!

Now error handling, or the lack of it

If anything happens during current IAM policy retrieval, policy bindings variable will not be populated with valid json (jq returns empty string in case required attribute is missing in the input). And when we generate IAM policy json file, jq expects valid json with --argjson parameter, so it will fail. Same scenario if we mess something up in auditConfigs.json file. Even if we have valid json everywhere, but gcloud is not happy with policy definition, it will fail and nothing will get updated.

What about timeouts

There are only two gcloud calls that may get stuck due to network. I believe default network timeout in alpine is 60 seconds.

What if the gcloud projects set-iam-policy call fails? This error is never propagated back to TF level

This is not true. If anything fails inside local-exec block, failure propagates to null_resource and Terraform stops execution (this is how we get failed builds when Locust is not happy with test results, etc).

I do not understand how this case is similar to alerting policies case, but good point about unnecessary policy update events in logs – let me implement some measures to prevent that.

stepanstipl · 2018-12-21T10:24:32Z

thanks @natarajaya, definitely much better not changing the policy when not needed.

jq expects valid json with ` parameter

Thanks for explaining the --argjson, that makes sense and does help with some of the error cases.

This is not true. If anything fails inside local-exec block, failure propagates to null_resource and Terraform stops execution

This is not the case. Please try this simple manifest:

resource "null_resource" "add_audit_config" {
  provisioner "local-exec" {
    command = <<EOF
      false
      true
    EOF
  }
}

Running terraform apply will end up with:

...


null_resource.add_audit_config: Creating...
null_resource.add_audit_config: Provisioning with 'local-exec'...
null_resource.add_audit_config (local-exec): Executing: ["/bin/sh" "-c" "      false\n      true\n    "]
null_resource.add_audit_config: Creation complete after 0s (ID: 2586165858741763649)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

So Terrafrom is happy, while there is obviously a command with non-zero return code (you can check that Terraform uses /bin/sh -c as default interpreter - https://github.com/hashicorp/terraform/blob/master/builtin/provisioners/local-exec/resource_provisioner.go#L82). So this is still an issue - if the set-iam-policy fails it will never get propagated to the TF level.

Simple solution might be to tell TF to use /bin/sh -ce - note the -e flag, a bit nicer would be to handle the error and still clean up after the scipt.

There are only two gcloud calls that may get stuck due to network. I believe default network timeout in alpine is 60 seconds.

I'm not familiar with all the implementation details, and I guess this would depend on inner implementation of gcloud and python2 resp., but I was easily able to make gcloud command get stuck for more than 60s by simulating network loss:

# normal
$ time gcloud projects get-iam-policy gpii-gcp-dev-stepan > /dev/null
real    0m 0.72s
user    0m 0.31s
sys     0m 0.04s

# slow & lossy network
$ tc qdisc change dev eth0 root netem loss 80%
$ time gcloud projects get-iam-policy gpii-gcp-dev-stepan > /dev/null
^C

Command killed by keyboard interrupt

Command terminated by signal 2
real    7m 45.56s
user    0m 0.31s
sys     0m 0.03s

I've terminated the command after 7m, but it seemed to be stuck for good. Given TF's local-exec does not handle this, and recent issue with destroying PVCs where we saw the pipeline keep running for 4+h till cancelled, I think we should handle this explicitly.

stepanstipl · 2018-12-21T10:57:38Z

On another note - audit config support should be in the master of TF google provider... - hashicorp/terraform-provider-google#1531 (comment)

…nfig

natarajaya · 2018-12-24T22:54:58Z

@stepanstipl Thanks for the review!

I was easily able to make gcloud command get stuck for more than 60s by simulating network loss

Is this simulation realistic? Do we really expect the code to run on network with 80% packet loss? Again, I don't see how this is similar to situation with PVCs (where issue was some kind of race condition in K8s). But, I agree this makes a point, it sucks that Terraform does not support timeout configuration for null_resource. I can think of some simple forced timeout implementation on shell level, but it will look pretty ugly and I don't think this is really necessary here.

So Terrafrom is happy, while there is obviously a command with non-zero return code

Looks like I was not very accurate in my previous statement, what I meant is "if local-exec ends with non-zero exit code, failure will be escalated and Terraform will stop with error", so there is clearly an easy way to deal with this:

null_resource.add_audit_config (local-exec): Applying audit configs...
null_resource.add_audit_config (local-exec): ERROR: (gcloud.projects.set-iam-policy) INVALID_ARGUMENT: Invalid value at 'policy.audit_configs[0].audit_log_configs[0].log_type' (TYPE_ENUM), "DATA_DROOL"
null_resource.add_audit_config (local-exec): - '@type': type.googleapis.com/google.rpc.BadRequest
null_resource.add_audit_config (local-exec):   fieldViolations:
null_resource.add_audit_config (local-exec):   - description: Invalid value at 'policy.audit_configs[0].audit_log_configs[0].log_type'
null_resource.add_audit_config (local-exec):       (TYPE_ENUM), "DATA_DROOL"
null_resource.add_audit_config (local-exec):     field: policy.audit_configs[0].audit_log_configs[0].log_type
google_project_services.project: Still modifying... (ID: gpii-gcp-dev-natarajaya, 10s elapsed)
null_resource.add_audit_config: Still creating... (10s elapsed)
google_project_services.project: Modifications complete after 12s (ID: gpii-gcp-dev-natarajaya)

Error: Error applying plan:

1 error(s) occurred:

* null_resource.add_audit_config: Error running command

common/modules/gcp-project/audit.tf

mrtyler · 2018-12-26T18:07:33Z

Do we really expect the code to run on network with 80% packet loss?

Fallacies of Distributed Software, #1: The network is reliable.

This adds audit config with READ/WRITE logs for every resource type that we currently use.

How will we know when this list of audit logs drifts from the list of resources that we currently use?

natarajaya · 2018-12-26T22:52:40Z

Fallacies of Distributed Software, #1: The network is reliable.

@mrtyler Already agreed that it makes a point. Unfortunately, there is no straightforward way to set timeouts.

How will we know when this list of audit logs drifts from the list of resources that we currently use?

Do you want me to start a README section on that?

mrtyler · 2018-12-27T03:45:42Z

Do you want me to start a README section on that?

What I would like is a single source of truth, so there is only one thing to update or so that this list of resources can be calculated automatically. Next best would be some kind of alert if there is drift between resources in use and resources for which we have audit logs.

However, I'm not sure how to do those things now so I suppose documentation is the next best thing.

mrtyler · 2018-12-27T03:49:39Z

Looking again, can we leverage the list of APIs in common/modules/gcp-project/main.tf somehow?

common/modules/gcp-project/audit.tf

stepanstipl · 2018-12-27T15:42:33Z

Is this simulation realistic? Do we really expect the code to run on network with 80% packet loss?

Just a quick example to see how does gcloud handle this. Point being that we should consider how to handle timeouts to prevent never-ending local-execs since TF does not do this. That was also the parallel with the PVCs - the local-exec there would keep running for hours till terminated and block the CI pipeline.

I can think of some simple forced timeout implementation on shell level, but it will look pretty ugly and I don't think this is really necessary here.

If we continue to use local-exec (which is already "ugly" way to use TF), imho we should think about how to handle this problem. I was thinking how does TF handle timeouts block for null_resource, but does not: null_resource.test: [ERR] Error decoding timeout: Timeout Key (delete) is not supported :(.

natarajaya · 2018-12-28T05:21:01Z

What I would like is a single source of truth

Looking again, can we leverage the list of APIs in common/modules/gcp-project/main.tf somehow?

@mrtyler Unfortunately, this is not possible. There are 21 APIs that we use and enable in common/modules/gcp-project/main.tf, but Audit Logs are available only for 10 of them.

Next best would be some kind of alert if there is drift between resources in use and resources for which we have audit logs

I spent some time thinking how we could possibly implement that and failed to come up with any reasonable solution. Without single source of truth, any check would also need to be updated when we add / remove active APIs, which basically makes it useless... So, let me start a README section on Audit Logs.

natarajaya · 2018-12-28T05:30:14Z

If we continue to use local-exec (which is already "ugly" way to use TF),

@stepanstipl While I generally agree with your argument about "ugliness", I think we should accept reality here, and reality is such that with current state of things we can not completely get rid of local-exec resources in our infra code due to number of reasons. I think we should focus on getting to FERPA now, and revisit this question later, at better times.

imho we should think about how to handle this problem

Do you think that current implementation is not good enough and I should add some timeouts mechanic on shell level (it would look ugly, but it will work)?

mrtyler · 2018-12-28T14:21:25Z

There are 21 APIs that we use and enable in common/modules/gcp-project/main.tf, but Audit Logs are available only for 10 of them.

How about:

apis_with_audit_logs = [
  ...
]
apis_without_audit_logs = [
  ...
]
apis = concat(${apis_with_audit_logs}, ${apis_without_audit_logs})
...

?

natarajaya · 2018-12-28T22:06:42Z

apis_without_audit_logs

@mrtyler We would need to generate audit config on the fly. Let me think on that.

stepanstipl · 2018-12-31T18:17:49Z

thanks @natarajaya

Do you think that current implementation is not good enough and I should add some timeouts mechanic on shell level (it would look ugly, but it will work)?

Do you think it's good to introduce more code (whether local-exec style or not) that is less robust than the majority of our code (and that does not fit the core pattern - TF plan/apply)?

I think it's ok (resp. not worth more effort) in the context that we will replace this pretty much as soon as TF Google provider is ready with TF native code.

…nfig

natarajaya · 2019-01-03T02:08:43Z

@mrtyler Please review new approach to generate audit config on the fly from TF API lists. I think it looks pretty good, but there is one unfortunate glitch: storage API that is required by the projects has storage-api.googleapis.com name, but the same API in audit configuration is just storage.googleapis.com. The reason for this is probably that there are 3 different storage APIs (storage-component, storage-api, storagetransfer) all having same audit configuration. I had to introduce another list variable because of that.

@stepanstipl I am not sure what are you proposing here, but I found another somewhat cleaner solution – added timeout to all gcloud calls. Seems like it does the job perfectly.

mrtyler

LGTM

I suggested some different variable names for clarity, but I like the new approach.

common/modules/gcp-project/audit.tf

gcp/modules/gcp-secret-mgmt/main.tf

common/modules/gcp-project/main.tf

stepanstipl

+1 for the timeout, definitely better, thanks.

common/modules/gcp-project/audit.tf

mrtyler · 2019-01-03T21:37:15Z

LGTM

…nfig

natarajaya added 2 commits December 18, 2018 19:49

Add audit config to gcp-project module

41b13e0

Set apply_audit_config to false for gcp-secret-mgmt module

3dddd41

natarajaya self-assigned this Dec 19, 2018

natarajaya mentioned this pull request Dec 19, 2018

[GPII-3326]: Make adding audit config in gcp-secret-mgmt module configurable gpii-ops/exekube#32

Merged

Merge branch 'master' of github.com:gpii-ops/gpii-infra into audit_co…

1dd9110

…nfig

stepanstipl reviewed Dec 20, 2018

View reviewed changes

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved

stepanstipl reviewed Dec 20, 2018

View reviewed changes

Do not set-iam-policy in case there is no change in audit configs

6be1ac1

Merge branch 'master' of github.com:gpii-ops/gpii-infra into audit_co…

f646433

…nfig

Propagate gcloud exit status to null resource

40823cc

mrtyler reviewed Dec 26, 2018

View reviewed changes

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved

Use set -e instead of setting exit code explicitly

626856b

stepanstipl reviewed Dec 27, 2018

View reviewed changes

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved

natarajaya added 4 commits January 2, 2019 16:35

Merge branch 'master' of github.com:gpii-ops/gpii-infra into audit_co…

8c30c38

…nfig

Generate audit config on the fly

465516a

Improve comments

f57652a

Use gcloud calls with timeout

fdd8ed1

mrtyler reviewed Jan 3, 2019

View reviewed changes

stepanstipl reviewed Jan 3, 2019

View reviewed changes

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved

common/modules/gcp-project/audit.tf Outdated Show resolved Hide resolved

natarajaya added 3 commits January 3, 2019 13:08

Screen shell vars with quotes

768083f

Use premium, improved clarity vars

63b275b

Add comment

267223c

Merge branch 'master' of github.com:gpii-ops/gpii-infra into audit_co…

321e355

…nfig

natarajaya merged commit 6759fa6 into gpii-ops:master Jan 3, 2019

natarajaya deleted the audit_config branch January 8, 2019 01:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GPII-3326]: Apply audit config in gcp-project module #246

[GPII-3326]: Apply audit config in gcp-project module #246

natarajaya commented Dec 19, 2018

stepanstipl commented Dec 19, 2018 •

edited

Loading

natarajaya commented Dec 19, 2018

mrtyler commented Dec 20, 2018

stepanstipl left a comment •

edited

Loading

stepanstipl commented Dec 20, 2018 •

edited

Loading

natarajaya commented Dec 21, 2018

stepanstipl commented Dec 21, 2018 •

edited

Loading

stepanstipl commented Dec 21, 2018

natarajaya commented Dec 24, 2018

mrtyler commented Dec 26, 2018

natarajaya commented Dec 26, 2018

mrtyler commented Dec 27, 2018

mrtyler commented Dec 27, 2018

stepanstipl commented Dec 27, 2018

natarajaya commented Dec 28, 2018

natarajaya commented Dec 28, 2018

mrtyler commented Dec 28, 2018 •

edited

Loading

natarajaya commented Dec 28, 2018 •

edited

Loading

stepanstipl commented Dec 31, 2018 •

edited

Loading

natarajaya commented Jan 3, 2019

mrtyler left a comment

stepanstipl left a comment

mrtyler commented Jan 3, 2019

[GPII-3326]: Apply audit config in gcp-project module #246

[GPII-3326]: Apply audit config in gcp-project module #246

Conversation

natarajaya commented Dec 19, 2018

stepanstipl commented Dec 19, 2018 • edited Loading

natarajaya commented Dec 19, 2018

mrtyler commented Dec 20, 2018

stepanstipl left a comment • edited Loading

Choose a reason for hiding this comment

stepanstipl commented Dec 20, 2018 • edited Loading

natarajaya commented Dec 21, 2018

stepanstipl commented Dec 21, 2018 • edited Loading

stepanstipl commented Dec 21, 2018

natarajaya commented Dec 24, 2018

mrtyler commented Dec 26, 2018

natarajaya commented Dec 26, 2018

mrtyler commented Dec 27, 2018

mrtyler commented Dec 27, 2018

stepanstipl commented Dec 27, 2018

natarajaya commented Dec 28, 2018

natarajaya commented Dec 28, 2018

mrtyler commented Dec 28, 2018 • edited Loading

natarajaya commented Dec 28, 2018 • edited Loading

stepanstipl commented Dec 31, 2018 • edited Loading

natarajaya commented Jan 3, 2019

mrtyler left a comment

Choose a reason for hiding this comment

stepanstipl left a comment

Choose a reason for hiding this comment

mrtyler commented Jan 3, 2019

stepanstipl commented Dec 19, 2018 •

edited

Loading

stepanstipl left a comment •

edited

Loading

stepanstipl commented Dec 20, 2018 •

edited

Loading

stepanstipl commented Dec 21, 2018 •

edited

Loading

mrtyler commented Dec 28, 2018 •

edited

Loading

natarajaya commented Dec 28, 2018 •

edited

Loading

stepanstipl commented Dec 31, 2018 •

edited

Loading