[GPII-3326]: Apply audit config in gcp-project module #246
I'll need to review this further, and I'm aware there's no support at the moment in Terraform, but I don't like the run-always approach (
Obviously there are some cons, main ones I see:
|
@stepanstipl I don't see any serious downsides of my implementation – changes to IAM policy is being overwritten anyway with I would prefer not to complicate things with maintaining our own Terraform Google provider fork, and update this to pure Terraform implementation as soon as AuditLogConfigs are supported, this seems like a pretty straightforward change. |
LGTM Stepan's points warrant more discussion -- I am also worried about more custom code, and getting better at terraform provider development may be valuable since we are likely to remain heavy terraform users -- but I'm not sure this is the right time or the right feature to adopt that approach. |
Ok, in that case main point - we should not overwrite policies if they don't need to be overwritten. This has badly bitten us with the alerting and in general is not a good approach - it only increases chances of errors and various side-effects -> do not make changes when there's no need for them.
The google_project_iam_policy resource does NOT overwrite the policy if there are no changes to be made.
Also this warning comes up:
null_resource.add_audit_config (local-exec): Replace existing policy (Y/n)?
null_resource.add_audit_config (local-exec): The specified policy does not contain an "etag" field identifying a
null_resource.add_audit_config (local-exec): specific version to replace. Changing a policy without an "etag" can
null_resource.add_audit_config (local-exec): overwrite concurrent policy changes.
Now error handling, or the lack of it:
- What if bindings=$(gcloud projects get-iam-policy ${google_project.project.project_id} --format json | jq -c -r .bindings fails for various reasons (jq seems to happily take empty input in this case: echo "" | jq -c -r .bindings)? The bindings end up empty -> we overwrite the policy with an empty one with no bindings? Obviously, if we do the overwriting every single time this is executed, the chances of this happening are higher ;).
- What if we get some unexpected output from gcloud that jq can't parse? -> We never set $bindings, therefore we again end up overwriting the policy with an empty one.
- What about timeouts - is this somehow handled (as TF doesn't seem to do so with local-exec)? Recently we had an issue with a call destroying PVCs that would happily run for hours.
- What if the gcloud projects set-iam-policy call fails? This error is never propagated back to the TF level, so everything will seem just fine, although there was an error.
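The first two failure modes in the list above can be guarded against with an explicit check before anything is written back. The following is only a minimal sketch, not the code from this PR: require_bindings and fetch_bindings are hypothetical helper names, and it assumes gcloud and jq are on the PATH.

```shell
# require_bindings BINDINGS_JSON   (hypothetical helper)
# Prints the bindings unchanged, or fails when they are empty or the literal
# string "null" (which is what jq prints when the .bindings key is missing).
require_bindings() {
  if [ -z "$1" ] || [ "$1" = "null" ]; then
    echo "refusing to continue: IAM bindings are empty or null" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# fetch_bindings PROJECT_ID   (hypothetical helper)
# Each step is checked separately so a gcloud or jq failure stops the
# function instead of leaving an empty value behind.
fetch_bindings() {
  local policy bindings
  policy=$(gcloud projects get-iam-policy "$1" --format json) || return 1
  bindings=$(printf '%s' "$policy" | jq -c -r .bindings) || return 1
  require_bindings "$bindings"
}
```

A caller would then only run gcloud projects set-iam-policy when fetch_bindings returns successfully.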
I forgot to mention the usability/security aspect of this: imagine we actually want to find out when something changed (maybe an issue with CI to debug, maybe there was an incident and we discovered that someone gained extra IAM permissions they should not have). If we modify things every time, we would have to go through many (easily hundreds) |
@mrtyler @stepanstipl Thanks for reviews!
If anything happens during current IAM policy retrieval, policy bindings variable will not be populated with valid json (
There are only two
This is not true. If anything fails inside I do not understand how this case is similar to alerting policies case, but good point about unnecessary policy update events in logs – let me implement some measures to prevent that. |
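One way to avoid unnecessary policy update events is to compare the current and desired configuration and skip the write when they match. This is a rough sketch rather than what the PR ended up doing: policy_changed and apply_if_changed are hypothetical names, and it assumes both JSON strings were normalized the same way beforehand (for example with jq -S -c .) so that a plain string comparison is meaningful.

```shell
# policy_changed CURRENT DESIRED   (hypothetical helper)
# Succeeds when the two already-normalized JSON strings differ.
policy_changed() {
  [ "$1" != "$2" ]
}

# apply_if_changed CURRENT DESIRED APPLY_CMD [ARGS...]   (hypothetical helper)
# Runs the apply command only when there is an actual difference.
apply_if_changed() {
  local current="$1" desired="$2"
  shift 2
  if policy_changed "$current" "$desired"; then
    "$@"
  else
    echo "audit config unchanged, skipping update"
  fi
}
```

With this shape the apply command would be the gcloud projects set-iam-policy call, so the audit log only records a change when something really changed.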
thanks @natarajaya, definitely much better not changing the policy when not needed.
Thanks for explaining the
This is not the case. Please try this simple manifest:
Running
So Terraform is happy, while there is obviously a command with non-zero return code (you can check that Terraform uses Simple solution might be to tell TF to use
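The masking effect described here can be reproduced without Terraform. The sketch below assumes the whole command string is handed to sh -c, so the overall exit status is that of the last command unless the script opts into set -e; the function names are made up for illustration.

```shell
# First step fails, second succeeds: the overall status is 0, so the caller
# never learns about the failure.
run_without_errexit() {
  sh -c 'false; echo "second step ran"'
}

# Same script with set -e: the shell stops at the first failing command and
# the non-zero status reaches the caller.
run_with_errexit() {
  sh -c 'set -e; false; echo "second step ran"'
}
```

Calling run_without_errexit succeeds and prints the second step's message, while run_with_errexit fails before reaching it.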
I'm not familiar with all the implementation details, and I guess this would depend on inner implementation of
# normal
$ time gcloud projects get-iam-policy gpii-gcp-dev-stepan > /dev/null
real 0m 0.72s
user 0m 0.31s
sys 0m 0.04s
# slow & lossy network
$ tc qdisc change dev eth0 root netem loss 80%
$ time gcloud projects get-iam-policy gpii-gcp-dev-stepan > /dev/null
^C
Command killed by keyboard interrupt
Command terminated by signal 2
real 7m 45.56s
user 0m 0.31s
sys 0m 0.03s
I've terminated the command after 7m, but it seemed to be stuck for good. Given TF's |
On another note - audit config support should be in the master of TF google provider... - hashicorp/terraform-provider-google#1531 (comment) |
@stepanstipl Thanks for the review!
Is this simulation realistic? Do we really expect the code to run on a network with 80% packet loss? Again, I don't see how this is similar to the situation with PVCs (where the issue was some kind of race condition in K8s). But, I agree this makes a point, it sucks that Terraform does not support timeout configuration for
Looks like I was not very accurate in my previous statement, what I meant is "if
|
Fallacies of Distributed Software, #1: The network is reliable.
How will we know when this list of audit logs drifts from the list of resources that we currently use? |
@mrtyler Already agreed that it makes a point. Unfortunately, there is no straightforward way to set timeouts.
Do you want me to start a README section on that? |
What I would like is a single source of truth, so there is only one thing to update or so that this list of resources can be calculated automatically. Next best would be some kind of alert if there is drift between resources in use and resources for which we have audit logs. However, I'm not sure how to do those things now so I suppose documentation is the next best thing. |
Looking again, can we leverage the list of APIs in |
Just a quick example to see how does
If we continue to use |
@mrtyler Unfortunately, this is not possible. There are 21 APIs that we use and enable in
I spent some time thinking how we could possibly implement that and failed to come up with any reasonable solution. Without single source of truth, any check would also need to be updated when we add / remove active APIs, which basically makes it useless... So, let me start a README section on Audit Logs. |
@stepanstipl While I generally agree with your argument about "ugliness", I think we should accept reality here, and reality is such that with current state of things we can not completely get rid of
Do you think that the current implementation is not good enough and I should add some timeout mechanism at the shell level (it would look ugly, but it will work)? |
How about:
? |
@mrtyler We would need to generate audit config on the fly. Let me think on that. |
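Generating the audit config from a list of API names could look roughly like the sketch below. It is not the implementation from this PR: generate_audit_configs is a hypothetical name, the service names would come from the existing API lists, and the DATA_READ / DATA_WRITE log types are an assumption based on the "READ/WRITE logs" wording in the description.

```shell
# generate_audit_configs SERVICE [SERVICE...]   (hypothetical helper)
# Prints a JSON array with one audit config entry per service name.
generate_audit_configs() {
  local first=1 svc
  printf '['
  for svc in "$@"; do
    # Separate entries with commas, but not before the first one.
    [ "$first" -eq 1 ] || printf ','
    first=0
    printf '{"service":"%s","auditLogConfigs":[{"logType":"DATA_READ"},{"logType":"DATA_WRITE"}]}' "$svc"
  done
  printf ']\n'
}
```

Because the list of services is passed in, the audit config follows the enabled APIs instead of being maintained as a second list.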
thanks @natarajaya
Do you think it's good to introduce more code (whether I think it's ok (resp. not worth more effort) in the context that we will replace this pretty much as soon as TF Google provider is ready with TF native code. |
@mrtyler Please review the new approach to generate audit config on the fly from TF API lists. I think it looks pretty good, but there is one unfortunate glitch: storage API that is required by the projects has
@stepanstipl I am not sure what you are proposing here, but I found another somewhat cleaner solution – added timeout to all |
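A shell-level timeout of the kind mentioned above can be sketched with the timeout utility. This assumes the GNU coreutils version, which exits with status 124 when the time limit is hit; run_with_timeout is a hypothetical wrapper name and not necessarily how the PR applies its timeouts.

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs the command under GNU timeout and reports when the limit was hit.
run_with_timeout() {
  local secs="$1" rc=0
  shift
  timeout "$secs" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "command timed out after ${secs}s: $*" >&2
  fi
  # Pass the original status through so the caller can fail the run.
  return "$rc"
}
```

Wrapping each gcloud call this way means a stuck call ends with a non-zero status instead of hanging the run indefinitely.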
LGTM
I suggested some different variable names for clarity, but I like the new approach.
+1 for the timeout, definitely better, thanks.
LGTM |
This adds audit config with READ/WRITE logs for every resource type that we currently use.
Since AuditLogConfigs is still not supported by Terraform, this also adds a custom script to modify the freshly applied IAM policy.
Another change is disabling the audit config that is being applied as part of the gcp-secret-mgmt module.