Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option to clear rules from default VPC security group and ACL #516

Merged
merged 25 commits into from
May 5, 2023

Conversation

toddgiguere
Copy link
Contributor

Description

Add new option, in the form of boolean input variables, to clear (remove) all rules from the default Security Group and ACL that are automatically created by the VPC provider. These default network objects are very permissive, and can affect security requirements and compliance.

IBM Cloud does not allow the removal of the default Security Group or ACL resources, so these changes will instead remove all rules inside those resources so that they will deny all by default.

Some notes about these changes:

  • If set, the clear options will remove ALL RULES from the resource
  • Removal of rules accomplished by two new Python3 scripts
  • Python scripts call REST APIs only, to keep pip requirements minimal
  • Python scripts authenticate using the Terraform provider refresh_token
  • The two new boolean inputs to trigger the clear routines are default = "false", due to an existing input variable to allow consumers to specify rules to be added to the default security group and ACL, so setting default to "false" is safest way to make sure a consumer does not accidentally clear any rules they have just added
  • Scripts will only be triggered upon first deployment, and will only trigger again if the IDs of either default resource are changed (most likely due to redeploy)

Types of changes in this PR

No release required

  • Examples or tests (addition or updates of examples or tests)
  • Documentation update
  • CI-related update (pipeline, etc.)
  • Other changes that don't affect Terraform code

Release required

  • Bug fix (patch release (x.x.X): Change that fixes an issue and is compatible with earlier versions)
  • New feature (minor release (x.X.x): Change that adds functionality and is compatible with earlier versions)
  • Breaking change (major release (X.x.x): Change that is likely incompatible with previous versions)
Release notes content

Added new option, in the form of boolean input variables, to clear (remove) all rules from the default Security Group and ACL that are automatically created by the VPC provider. These default network objects are very permissive, and can affect security requirements and compliance. The defaults cannot be directly removed, so this change will instead remove all rules contained inside of them so that they deny all traffic by default. This will allow a consumer to then add their own least permissive network Security Groups and ACLs to their VPC.


Checklist for reviewers

  • If relevant, a test for the change is included or updated with this PR.
  • If relevant, documentation for the change is included or updated with this PR.

Merge actions for mergers

  • Merge by using "Squash and merge".

  • Use a relevant conventional commit message that is based on the PR contents and any release notes provided by the PR author.

    The commit message determines whether a new version of the module is needed, and if so, which semver increment to use (major, minor, or patch).

import logging
import os

import requests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not come out of the box with python. Is there anything else we can use? Otherwise we will have to do a pip install (not ideal)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest revision has removed the "requests" external dependency, and uses the native "urllib" instead. Now both python scripts are only using native imports.

Tested these new scripts in schematics, everything is working there as well.

@vburckhardt
Copy link
Member

I'm pondering whether we'd not be better off doing this using the ibmcloud cli in bash to simplify. This may introduce a dependency on jq, but we already have it in other modules anyways.

@toddgiguere
Copy link
Contributor Author

@vburckhardt I now have it working with pure python3 scripts, with no external dependencies other than python3 itself. SLZ had already had a dependency on python3 (uses scripts to do json conversions) so I figured this is safest.

IBM CLI with bash introduces some other developer machine dependencies, such as:

  • Version of bash (we've been bitten by this before, as MacOS has ancient version)
  • CLI would need to be installed on client
  • any additional CLI plugins needed for these resources (not sure if these in particular need plugins?)
  • jq is needed for json processing in bash
  • bash scripting with complex objects and loops (for instance looping through all rules to call restapi to delete) is a little tougher than in Python
  • CLI would require us to create an pass the ibmcloud_api_key into this module from the calling scaffold, I was trying to avoid that (these python scripts are using refresh_token from terraform provider)

Copy link
Member

@ocofaigh ocofaigh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we have some code duplication like get_http_response_str and get_bearer_token. Perhaps place those functions into a utils.py script and import into the 2 python scripts you have so that we remove the code duplication?

@vburckhardt
Copy link
Member

Hi @toddgiguere - you're correct on the dependencies, but those pre-reqs already exists for other modules, so I am not that worried about this. Schematics has got all of those installed as well. That said, my main worry here here is the complexity of the code, I think it's going to be a serious obstacle for maintenance going forward. It is also potentially a greater source of bugs.

def get_bearer_token(
refresh_token: str, cloud_base_domain: str = "cloud.ibm.com"
) -> str:
url = f"https://iam.{cloud_base_domain}/identity/token"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to break for customers going through the CSE endpoints or IAM VPE. The domain is private.iam.cloud.ibm.com in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the private endpoints for IAM and endpoints for IaaS resources (your other comment) related? If we have a single boolean that can be supplied to "use_private_endpoints" cover both cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest version now has support for private endpoints, enabled by setting new python parameter to "true" (default of "false"). Intent is for that input variable to come from the terraform call.

def get_security_group_rules(
sg_id: str, region: str, access_token: str, cloud_base_domain: str = "cloud.ibm.com"
) -> dict:
url = f"https://{region}.iaas.{cloud_base_domain}/v1/security_groups/{sg_id}/rules?version=2023-04-11&generation=2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to break for VPC VPE. Eg: the domain is eu-de.private.iaas.cloud.ibm.com for the VPC VPE in Frankfurt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest version now has support for private endpoints, enabled by setting new python parameter to "true" (default of "false"). Intent is for that input variable to come from the terraform call.

@toddgiguere
Copy link
Contributor Author

@vburckhardt Would you like me to convert this to BASH/CLI/JQ then? If so I will need to add the ibmcloud_api_key as a new input to the module. Technically the api key is not always required, only if you need to clean the scripts, so if it is not required var I'll need to verify using validation, and if we do make it required it will be a breaking change.

Also, will the CLI method take care of the private/public endpoints you commented about?

@ocofaigh
Copy link
Member

@toddgiguere CLI can be set up to use private. See https://cloud.ibm.com/docs/cli?topic=cli-service-connection

@toddgiguere
Copy link
Contributor Author

Latest changes:

  • consolidated common python functions into one module, referenced from the other two scripts, simplifying the code
  • added python input to support using private endpoints for API calls
  • added terraform input variable to support private endpoint usage, called ibmcloud_api_visibility, which expects the same values as the IBM provider visibility value. Even though python is looking for a simple true/false, this will make it easier if we switch to CLI by using these values.

@toddgiguere
Copy link
Contributor Author

I have added two shell scripts that implement the same removal of rules as the python scripts, written in BASH/CLI/JQ, for comparison.

THESE ARE NOT CALLED FROM TERRAFORM YET! I am including them for a comparison, and we can then pick the version we would like to go with.

I tested both scripts manually, including in bash 3.2 (mac native), and both scripts work, replacing the python in terraform will be trivial.

ONE FINAL NOTE: I have the CLI login configured to do optional "private" endpoint, but I am unclear how this works with the ibmcloud is operations? Looking at the docs and the CLI help it is unclear how I should set those operations to use private endpoints, I'll need some guidance there.

variables.tf Outdated
}

variable "ibmcloud_api_visibility" {
description = "IBM Cloud API visibility used for internal scripts. Must be 'public', 'private', or 'public-and-private'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "internal" might be confusing here (since this is a public module). Perhaps say "used for scripts run by this module" ?

@vburckhardt
Copy link
Member

vburckhardt commented Apr 25, 2023

+1 on bash scripts. Thanks for putting this together. Much of the python logic is low level and essentially trying to re-implement functions (eg: token exchange) that already exists in cli, so worried about carrying that baggage going forwards. This will help from an evolutionary perspective in term of maintaining same level of support in future - eg: things like authenticating with trusted profile, etc

I think at some stage we should consolidate the common scripts across module, but this is outside the scope of this PR - eg: the login logic exists here https://github.com/terraform-ibm-modules/terraform-ibm-base-ocp-vpc/blob/main/scripts/reset_iks_api_key.sh#L27 , and internally as well (csutils module) https://github.ibm.com/GoldenEye/csutils-module/blob/master/scripts/ibm_cloud_login.sh . There is an existing issue in our internal board.

@toddgiguere
Copy link
Contributor Author

+1 on bash scripts.

I'll need to add ibmcloud_api_key as an input variable to this module for bash scripts. In this case, it could be "optional" and I can do one of our regex verifications that it needs to be included if the "clear" booleans are true, or we can make it required and have a potential breaking change. Thoughts?

Also I'm still not sure how to use private endpoints through CLI for the is operations.

@vburckhardt
Copy link
Member

vburckhardt commented Apr 25, 2023

I'd go with optional api key for now as input, with a check in the module that error out if clean_default_acl / sg is set.

On private endpoint - this may help:
https://cloud.ibm.com/docs/vpc?topic=vpc-end-to-end-private-connectivity-vpe&interface=cli#vpe-api-private-endpoint-setup-cli

@toddgiguere
Copy link
Contributor Author

Right now I'm checking for API Key != null inside of scripts. If the "clean" booleans are true, but API key variable is null in terraform, the following terraform error will appear. We can adjust the wording of the error.

│ Error: local-exec provisioner error
│
│   with module.workload_vpc.module.workload_vpc.null_resource.clean_default_acl[0],
│   on ../../main.tf line 162, in resource "null_resource" "clean_default_acl":
│  162:   provisioner "local-exec" {
│
│ Error running command '../../scripts/fix_access_control_list.sh -k 'IBMCLOUD_API_KEY' -a 'r014-ca08c4b0-f4f6-4f96-81cf-5ebcfc9dd0f2' -r 'us-east'
│ -v 'public'': exit status 1. Output: getting input...
│ ERROR: ibmcloud_api_key is not set! Please provide a valid key in terraform input variable for this feature.

converted the python scripts used for default acl and sg rule clean into bash/ibmcloud-cli/jq style of scripting.

This also means I needed to add optional ibmcloud_api_key inputs in all modules.
@toddgiguere
Copy link
Contributor Author

Latest version I've removed the python scripts and replaced with shell scripts written in BASH, utilizing IBM Cloud CLI and JQ to do the rule clearing.

I've split the IBM Cloud login to its own file, if we like how it functions we might be able to use it as a common script later. I tried to design it so it would be flexible, complete with optional parameters like region and resource group.

The bash scripts should properly catch all errors and fail the terraform block, with no "silent fails".

I've also added a short section in the main README to alert users that they will need the CLI installed to run this terraform.

to see if this fixes random "no region targeted" errors in further scripts
@toddgiguere
Copy link
Contributor Author

Getting a strange error in the automated tests that I do not see when testing locally or in schematics, have not figured it out yet.

In the security group clean, only for the management VPC, the CLI is returning "No region targeted". Is this something to do with the way the login to CLI works? Again, is not happening in either a local run or in a schematic run.

│ Error: local-exec provisioner error
│ 
│   with module.management_vpc.module.management_vpc.null_resource.clean_default_security_group[0],
│   on ../../main.tf line 149, in resource "null_resource" "clean_default_security_group":
│  149:   provisioner "local-exec" {
│ 
│ Error running command '../../scripts/fix_security_group.sh
│ 'r038-5e4fa538-1208-4f34-8d96-bb6dbeedd9ca'': exit status 1. Output: No
│ region targeted.

@toddgiguere
Copy link
Contributor Author

toddgiguere commented Apr 27, 2023

I think I've solved the IBMCloud CLI issues.

The problem:
All IBMCloud CLI commands use a configuration file/directory that is stored on the host machine. Whenever ibmcloud login is invoked with any parameters, all commands run after that will reference that one central config file on the machine for login and configuration info. This was causing problems when multiple null_resource blocks were calling the ibmcloud login command at the same time, as they all fight for the central config file.

My solution:
SUMMARY: each null_resource block will have its own unique CLI config (not shared)
DETAIL:
There is an option for CLI to use an alternate home directory for configuration by setting the IBMCLOUD_HOME environment variable. I've altered the scripts to use this feature by doing the following:

  1. created two new functions, for "login" and "create temp config home", and put them in central shell file
  2. the "create temp home" function will do the following:
    a. use the mktemp -d command to create a new random named directory in the OS temp directory
    b. copy any plugins installed in the main home into the new temp location
    c. set this new location as the IBMCLOUD_HOME for that particular null_resource invocation shell
  3. both CLI scripts now call the "create temp" and "login" functions before executing their commands, and each invocation will be using their own unique config location

SPECIAL NOTE: this also solves another issue that will happen when we include this new VPC module version into SLZ parent. We will now be able to set the "resource_group" in the ibmcloud login process, since the configuration is not shared anymore. When it was a shared config, we would not have been able to target a resource group because SLZ would have been calling the VPC module three times using different resource groups, and a shared config file would have had conflicts.

variables.tf Show resolved Hide resolved
variables.tf Outdated
}

variable "ibmcloud_api_key" {
description = "IBM Cloud API Key that will be used for authentication in scripts run in this module."
Copy link
Member

@ocofaigh ocofaigh Apr 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps say only required if setting clean_default_security_group or clean_default_acl to true ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a more verbose description for api key variable.


# Create a temporary home for CLI config, used by both login and further commands.
# This ensures config separation between various terraform provision blocks on the same machine.
ibmcloud_config_home=$(mktemp -d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably be cleaning up the temp directory after? But I guess it would have to be done after the scripts have completed, which is outside the scope of this function. Perhaps there is a cli command to get current home, and delete it (if its not the default home)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I forgot the cleanup. It should be easy since I have it in an ENV already I can just remove the whole directory at the end of the script. However, if the script errors, it won't remove. I could look into doing a bash-style error handling block to remove it there, if that is important.

One thing I will do though is ignore an error in the removal, I don't want that to trigger error in terraform.

Another note is that mktemp will always create the folder in the underlying OS official $TMPDIR location, so theoretically it will get cleaned up eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest version I've added a removal of the temp directory that will occur with normal script termination, and also if an error occurs anywhere (via bash trap ERR mechanic). I've tested error scenarios at every level and it seems to be getting caught and removed.

Copy link
Member

@ocofaigh ocofaigh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left some final comments

@ocofaigh
Copy link
Member

ocofaigh commented May 2, 2023

LGTM - @toddgiguere let me know once happy to merge

@toddgiguere
Copy link
Contributor Author

@ocofaigh I have tested this branch with landing zone patterns and everything is working as expected, I think this is good to release and I can then finish on landing zone side.

@ocofaigh ocofaigh merged commit 0a575f4 into main May 5, 2023
@ocofaigh ocofaigh deleted the fix-default-acl-sg branch May 5, 2023 08:15
@terraform-ibm-modules-ops
Copy link
Contributor

🎉 This PR is included in version 7.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants