diff --git a/examples/README.md b/examples/README.md index da15a85ce2..9fd52977c8 100644 --- a/examples/README.md +++ b/examples/README.md @@ -4,7 +4,7 @@ This section contains **[foundational examples](./foundations/)** that bootstrap Currently available examples: -- **cloud operations** - [Resource tracking and remediation via Cloud Asset feeds](./cloud-operations/asset-inventory-feed-remediation), [Granular Cloud DNS IAM via Service Directory](./cloud-operations/dns-fine-grained-iam), [Granular Cloud DNS IAM for Shared VPC](./cloud-operations/dns-shared-vpc), [Compute Engine quota monitoring](./cloud-operations/quota-monitoring), [Scheduled Cloud Asset Inventory Export to Bigquery](./cloud-operations/scheduled-asset-inventory-export-bq), [Packer image builder](./cloud-operations/packer-image-builder), [On-prem SA key management](./cloud-operations/onprem-sa-key-management), [TCP healthcheck for unmanaged GCE instances](./cloud-operations/unmanaged-instances-healthcheck) +- **cloud operations** - [Resource tracking and remediation via Cloud Asset feeds](./cloud-operations/asset-inventory-feed-remediation), [Granular Cloud DNS IAM via Service Directory](./cloud-operations/dns-fine-grained-iam), [Granular Cloud DNS IAM for Shared VPC](./cloud-operations/dns-shared-vpc), [Compute Engine quota monitoring](./cloud-operations/quota-monitoring), [Scheduled Cloud Asset Inventory Export to Bigquery](./cloud-operations/scheduled-asset-inventory-export-bq), [Packer image builder](./cloud-operations/packer-image-builder), [On-prem SA key management](./cloud-operations/onprem-sa-key-management), [TCP healthcheck for unmanaged GCE instances](./cloud-operations/unmanaged-instances-healthcheck), [HTTP Load Balancer with Cloud Armor](./cloud-operations/glb_and_armor) - **data solutions** - [GCE/GCS CMEK via centralized Cloud KMS](./data-solutions/gcs-to-bq-with-least-privileges/), [Cloud Storage to Bigquery with Cloud Dataflow with least 
privileges](./data-solutions/gcs-to-bq-with-least-privileges/), [Data Platform Foundations](./data-solutions/data-platform-foundations/), [SQL Server AlwaysOn availability groups example](./data-solutions/sqlserver-alwayson), [Cloud SQL instance with multi-region read replicas](./data-solutions/cloudsql-multiregion/) - **factories** - [The why and the how of resource factories](./factories/README.md) - **foundations** - [single level hierarchy](./foundations/environments/) (environments), [multiple level hierarchy](./foundations/business-units/) (business units + environments) diff --git a/examples/cloud-operations/glb_and_armor/README.md b/examples/cloud-operations/glb_and_armor/README.md index a8b73af8c7..f106ae558d 100644 --- a/examples/cloud-operations/glb_and_armor/README.md +++ b/examples/cloud-operations/glb_and_armor/README.md @@ -1,48 +1,124 @@ # HTTP Load Balancer with Cloud Armor -Google Cloud HTTP(S) load balancing is implemented at the edge of Google's network in Google's points of presence (POP) around the world. User traffic directed to an HTTP(S) load balancer enters the POP closest to the user and is then load balanced over Google's global network to the closest backend that has sufficient capacity available. +## Introduction -Cloud Armor IP allowlist/denylist enable you to restrict or allow access to your HTTP(S) load balancer at the edge of the Google Cloud, as close as possible to the user and to malicious traffic. This prevents malicious users or traffic from consuming resources or entering your virtual private cloud (VPC) networks. +This repository contains all necessary Terraform modules to build a multi-regional infrastructure with horizontally scalable managed instance group backends, HTTP load balancing and Google’s advanced WAF security tool (Cloud Armor) on top to securely deploy an application at global scale. -In this lab, you configure an HTTP Load Balancer with global backends, as shown in the diagram below. 
Then, you stress test the Load Balancer and denylist the stress test IP with Cloud Armor. +This tutorial is general enough to fit in a variety of use-cases, from hosting a mobile app's backend to deploy proprietary workloads at scale. -![Architecture](architecture.png) +## Use cases -## Running the example +Even though there are many ways to implement an architecture, some workloads require high compute power or specific licenses while making sure the services are secured by a managed service and highly available across multiple regions. An architecture consisting of Managed Instance Groups in multiple regions available through an HTTP Load Balancer with Cloud Armor enabled is suitable for such use-cases. -Clone this repository or [open it in cloud shell](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2Fterraform-google-modules%2Fcloud-foundation-fabric&cloudshell_print=cloud-shell-readme.txt&cloudshell_working_dir=examples%2Fcloud-operations%2Fglb-and-armor), then go through the following steps to create resources: +This architecture caters to multiple workloads ranging from the ones requiring compliance with specific data access restrictions to compute-specific proprietary applications with specific licensing and OS requirements. Descriptions of some possible use-cases are as follows: -* `terraform init` -* `terraform apply -var project_id=my-project-id` +* __Proprietary OS workloads__: Some applications require specific Operating systems (enterprise grade Linux distributions for example) with specific licensing requirements or low-level access to the kernel. In such cases, since the applications cannot be containerised and horizontal scaling is required, multi-region Managed Instance Group (MIG) with custom instance images are the ideal implementation. +* __Industry-specific applications__: Other applications may require high compute power alongside a sophisticated layer of networking security. 
This architecture satisfies both these requirements by promising configurable compute power on the instances backed by various features offered by Cloud Armor such as traffic restriction, DDoS protection etc. +* __Workloads requiring GDPR compliance__: Most applications require restricting data access and usage from outside a certain region (mostly to comply with data residency requirements). This architecture caters to such workloads as Cloud Armor allows you to lock access to your workloads from various fine-grained identifiers. +* __Medical Queuing systems__: Another great example usage for this architecture will be applications requiring high compute power, availability and limited memory access requirements such as a medical queuing system. +* __DDoS Protection and WAF__: Applications and workloads exposed to the internet expose themselves to the risk of DDoS attacks. While L3/L4 and protocol based attacks are handled at Google’s edge, L7 attacks can still be effective with botnets. A setup of an external Cloud Load Balancer with Cloud Armor and appropriate WAF rules can mitigate such attacks. +* __Geofencing__: If you want to restrict content served on your application due to licensing restrictions (similar to OTT content in the US), Geofencing allows you to create a virtual perimeter to stop the service from being accessed outside the region. The architecture of using a Cloud Load Balancer with Cloud Armor enables you to implement geofencing around your applications and services. -The following outputs will be available once everything is deployed: +## Architecture -* `glb_ip_address`, containing the IPv4 address of the HTTP Load Balancer -* `vm_siege_external_ip`, containing the external IPv4 address of the siege VM. +
-Once done testing, you can clean up resources by running `terraform destroy`. +The main components that we will set up are listed below (to learn more about these products, click on the hyperlinks): -## Testing the example +* [Cloud Armor](https://cloud.google.com/armor) - Google Cloud Armor is the web-application firewall (WAF) and DDoS mitigation service that helps users defend their web apps and services at Google scale at the edge of Google’s network. +* [Cloud Load Balancer](https://cloud.google.com/load-balancing) - When your app usage spikes, it is important to scale, optimize and secure the app. Cloud Load Balancing is a fully distributed solution that balances user traffic to multiple backends to avoid congestion, reduce latency and increase security. Some important features it offers that we use here are: + * Single global anycast IP and autoscaling - CLB acts as a frontend to all your backend instances across all regions. It provides cross-region load balancing and automatic multi-region failover, and scales to support an increase in resources. + * Global Forwarding Rule - To route traffic to different regions, global load balancers use global forwarding rules, which bind the global IP address and a single target proxy. + * Target Proxy - For external HTTP(S) load balancers, proxies route incoming requests to a URL map. This is essentially how you can handle the connections. + * URL Map - URL Maps are used to route requests to a backend service based on the rules that you define for the host and path of an incoming URL. + * Backend Service - A Backend Service defines how CLB distributes traffic. The backend service configuration consists of a set of values - protocols to connect to backends, session settings, health checks and timeouts. + * Health Check - A health check is a method provided to determine if the corresponding backends respond to traffic. Health checks connect to backends on a configurable, periodic basis. Each connection attempt is called a probe.
Google Cloud records the success or failure of each probe. +* [Firewall Rules](https://cloud.google.com/vpc/docs/firewalls) - Firewall rules let you allow or deny connections to or from your VM instances based on a configuration you specify. +* [Managed Instance Groups (MIG)](https://cloud.google.com/compute/docs/instance-groups) - Instance group is a collection of VM instances that you can manage as a single entity. MIGs allow you to operate apps and workloads on multiple identical VMs. You can also leverage the various features like autoscaling, autohealing, regional / multi-zone deployments. -1. Connect to the siege VM and run the following command +## Costs - siege -c 250 -t150s http://$LB_IP`ß +Pricing Estimates - We have created a sample estimate based on some usage we see from new startups looking to scale. This estimate would give you an idea of how much this deployment would essentially cost per month at this scale and you extend it to the scale you further prefer. Here's the [link](https://cloud.google.com/products/calculator/#id=3105bbf2-4ee0-4289-978e-9ab6855d37ed). -2. In the Cloud Console, on the Navigation menu, click Network Services > Load balancing. -3. Click Backends. -4. Click http-backend. -5. Navigate to http-lb. -6. Click on the Monitoring tab. -7. Monitor the Frontend Location (Total inbound traffic) between North America and the two backends for 2 to 3 minutes. At first, traffic should just be directed to us-east1-mig but as the RPS increases, traffic is also directed to europe-west1-mig. This demonstrates that by default traffic is forwarded to the closest backend but if the load is very high, traffic can be distributed across the backends. -8. Re-run terraform as follows: +## Setup + +This solution assumes you already have a project created and set up where you wish to host these resources. 
If not, and you would like for the project to create a new project as well, please refer to the [github repository](https://github.com/GoogleCloudPlatform/cloud-foundation-fabric/tree/master/examples/data-solutions/gcs-to-bq-with-least-privileges) for instructions. + +### Prerequisites + +* Have an [organization](https://cloud.google.com/resource-manager/docs/creating-managing-organization) set up in Google cloud. +* Have a [billing account](https://cloud.google.com/billing/docs/how-to/manage-billing-account) set up. +* Have an existing [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects) with [billing enabled](https://cloud.google.com/billing/docs/how-to/modify-project). + +### Roles & Permissions + +In order to spin up this architecture, you will need to be a user with the “__Project owner__” [IAM](https://cloud.google.com/iam) role on the existing project: + +Note: To grant a user a role, take a look at the [Granting and Revoking Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#grant-single-role) documentation. + +### Spinning up the architecture + +#### Step 1: Cloning the repository + +Click on the button below, sign in if required and when the prompt appears, click on “confirm”. + +[
](https://goo.gle/GoCloudArmor) + +This will clone the repository to your cloud shell and a screen like this one will appear: + +![cloud_shell](cloud_shell.png) + +Before we deploy the architecture, you will need the following information: + +* The __project ID__. + +#### Step 2: Deploying the resources + +1. After cloning the repo and going through the prerequisites, head back to the cloud shell editor. +2. Make sure you’re in the following directory. If not, you can change your directory to it via the ‘cd’ command: + + cloudshell_open/cloud-foundation-fabric/examples/cloud-operations/glb_and_armor + +3. Run the following command to initialize the terraform working directory: + + terraform init + +4. Copy the following command into a console and replace __[my-project-id]__ with your project’s ID. Then run it to execute the terraform script and create all relevant resources for this architecture: + + terraform apply -var project_id=[my-project-id] + +The resource creation will take a few minutes. When it’s complete, you should see an output stating the command completed successfully with a list of the created resources. + +__Congratulations__! You have successfully deployed an HTTP Load Balancer with two Managed Instance Group backends and Cloud Armor security. + +## Testing your architecture + +1. Connect to the siege VM using SSH (from Cloud Console or CLI) and run the following command: + + siege -c 250 -t150s http://$LB_IP + +2. In the Cloud Console, on the Navigation menu, click __Network Services > Load balancing__. +3. Click __Backends__, then click __http-backend__ and navigate to __http-lb__. +4. Click on the __Monitoring__ tab. +5. Monitor the Frontend Location (Total inbound traffic) between North America and the two backends for 2 to 3 minutes. At first, traffic should just be directed to __us-east1-mig__ but as the RPS increases, traffic is also directed to __europe-west1-mig__.
This demonstrates that by default traffic is forwarded to the closest backend but if the load is very high, traffic can be distributed across the backends. +6. Now, to test the IP deny-listing, rerun terraform as follows: terraform apply -var project_id=my-project-id -var enforce_security_policy=true - Like this we have applied a security policy to denylist the IP address of the siege VM +This applies a security policy that denylists the IP address of the siege VM. -9. From the siege VM run the following command and verify that you get a 403 Forbidden error code back. +7. To test this, from the siege VM run the following command and verify that you get a __403 Forbidden__ error code back. curl http://$LB_IP + +## Cleaning up your environment + +The easiest way to remove all the deployed resources is to run the following command in Cloud Shell: + + terraform destroy + +The above command will delete the associated resources so there will be no billable charges made afterwards. + ## Variables diff --git a/examples/cloud-operations/glb_and_armor/architecture.png b/examples/cloud-operations/glb_and_armor/architecture.png index 4a2b5b376d..64b1e186d5 100644 Binary files a/examples/cloud-operations/glb_and_armor/architecture.png and b/examples/cloud-operations/glb_and_armor/architecture.png differ diff --git a/examples/cloud-operations/glb_and_armor/cloud_shell.png b/examples/cloud-operations/glb_and_armor/cloud_shell.png new file mode 100644 index 0000000000..21bb72e018 Binary files /dev/null and b/examples/cloud-operations/glb_and_armor/cloud_shell.png differ diff --git a/examples/cloud-operations/glb_and_armor/shell_button.png b/examples/cloud-operations/glb_and_armor/shell_button.png new file mode 100644 index 0000000000..21a3f3de9d Binary files /dev/null and b/examples/cloud-operations/glb_and_armor/shell_button.png differ diff --git a/examples/data-solutions/cloudsql-multiregion/images/button.png b/examples/data-solutions/cloudsql-multiregion/images/button.png index
5ad257d225..21a3f3de9d 100644 Binary files a/examples/data-solutions/cloudsql-multiregion/images/button.png and b/examples/data-solutions/cloudsql-multiregion/images/button.png differ diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/README.md b/examples/data-solutions/gcs-to-bq-with-least-privileges/README.md index 1043b1bf0f..11a1e3141c 100644 --- a/examples/data-solutions/gcs-to-bq-with-least-privileges/README.md +++ b/examples/data-solutions/gcs-to-bq-with-least-privileges/README.md @@ -1,140 +1,192 @@ -# Cloud Storage to Bigquery with Cloud Dataflow with least privileges +# Spinning up a foundation data pipeline on Google Cloud using Cloud Storage, Dataflow and BigQuery -This example creates the infrastructure needed to run a [Cloud Dataflow](https://cloud.google.com/dataflow) pipeline to import data from [GCS](https://cloud.google.com/storage) to [Bigquery](https://cloud.google.com/bigquery). The example will create different service accounts with least privileges on resources. To run the pipeline, users listed in `data_eng_principals` can impersonate all those service accounts. +## Introduction -The solution will use: -- internal IPs for GCE and Cloud Dataflow instances -- Cloud NAT to let resources egress to the Internet, to run system updates and install packages -- rely on [Service Account Impersonation](https://cloud.google.com/iam/docs/impersonating-service-accounts) to avoid the use of service account keys -- Service Accounts with least privilege on each resource -- (Optional) CMEK encription for GCS bucket, DataFlow instances and BigQuery tables +This repository contains the necessary Terraform modules to securely deploy a basic ETL pipeline that will dump data from a Google Cloud Storage (GCS) bucket to tables in BigQuery. -The example is designed to match real-world use cases with a minimum amount of resources and some compromises listed below. It can be used as a starting point for more complex scenarios. 
+An ETL pipeline is defined in three steps: -This is the high level diagram: +* Extraction: retrieving data from sources. +* Transformation: cleaning the data, putting it into a common format, calculating other fields, taking out duplicates or erroneous records so it can be stored into a target. +* Loading: inserting the formatted data into the target database, data store, data warehouse or data lake. + +You can learn more about cloud-based ETL [here](https://cloud.google.com/learn/what-is-etl). + +## Use cases + +Whether you’re transferring from another Cloud Service Provider or you’re taking your first steps into the cloud with Google Cloud, building a data pipeline sets a good foundation to begin deriving insights for your business. + +* __Anomaly Detection__: building data pipelines to identify cyber security threats or fraudulent transactions using machine learning (ML) models. +* __Interactive Data Analysis__: carry out interactive data analysis with BigQuery BI Engine that enables you to analyze large and complex datasets interactively with sub-second query response time and high concurrency. +* __Predictive Forecasting__: building solid pipelines to capture real-time data for ML modeling and using it as a forecasting engine for situations ranging from weather predictions to market forecasting. +* __Create Machine Learning models__: using BigQueryML you can create and execute machine learning models in BigQuery using standard SQL queries. Create a variety of models pre-built into BigQuery that you train with your data. + +## Architecture ![GCS to Biquery High-level diagram](diagram.png "GCS to Biquery High-level diagram") -## Move to real use case consideration -In the example we implemented some compromise to keep the example minimal and easy to read. 
On a real word use case, you may evaluate the option to: - - Configure a Shared-VPC - - Use only Identity Groups to assigne roles - - Use Authorative IAM role assignement - - Split resources in different project: Data Landing, Data Transformation, Data Lake, ... - - Use VPC-SC to mitigate data exfiltration +The main components that we would be setting up are (to learn more about these products, click on the hyperlinks): + +* [Cloud Storage (GCS) bucket](https://cloud.google.com/storage/): data lake solution to store extracted raw data that must undergo some kind of transformation. +* [Cloud Dataflow pipeline](https://cloud.google.com/dataflow): to build fully managed batch and streaming pipelines to transform data stored in GCS buckets ready for processing in the Data Warehouse using Apache Beam. +* [BigQuery datasets and tables](https://cloud.google.com/bigquery): to store the transformed data in and query it using SQL, use it to make reports or begin training [machine learning](https://cloud.google.com/bigquery-ml/docs/introduction) models without having to take your data out. +* [Service accounts](https://cloud.google.com/iam/docs/service-accounts) (__created with least privilege on each resource__): one for uploading data into the GCS bucket, one for Orchestration, one for Dataflow instances and one for the BigQuery tables. You can also configure users or groups of users to assign them a viewer role on the created resources and the ability to impersonate service accounts to test the Dataflow pipelines before automating them with a tool like [Cloud Composer](https://cloud.google.com/composer). + +For a full list of the resources that will be created, please refer to the [github repository](https://github.com/GoogleCloudPlatform/cloud-foundation-fabric/tree/master/examples/data-solutions/gcs-to-bq-with-least-privileges) for this project. 
If you're migrating from another Cloud Provider, refer to [this](https://cloud.google.com/free/docs/aws-azure-gcp-service-comparison) documentation to see equivalent services and comparisons in Microsoft Azure and Amazon Web Services + +## Costs + +Pricing Estimates - We have created a sample estimate based on some usage we see from new startups looking to scale. This estimate would give you an idea of how much this deployment would essentially cost per month at this scale and you extend it to the scale you further prefer. Here's the [link](https://cloud.google.com/products/calculator#id=44710202-c9d4-49d5-a378-99d7dd34f5e2). + +## Setup + +This solution assumes you already have a project created and set up where you wish to host these resources. If not, and you would like for the project to create a new project as well, please refer to the [github repository](https://github.com/GoogleCloudPlatform/cloud-foundation-fabric/tree/master/examples/data-solutions/gcs-to-bq-with-least-privileges) for instructions. + +### Prerequisites + +* Have an [organization](https://cloud.google.com/resource-manager/docs/creating-managing-organization) set up in Google cloud. +* Have a [billing account](https://cloud.google.com/billing/docs/how-to/manage-billing-account) set up. +* Have an existing [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects) with [billing enabled](https://cloud.google.com/billing/docs/how-to/modify-project), we’ll call this the __service project__. + +### Roles & Permissions + +In order to spin up this architecture, you will need to be a user with the “__Project owner__” [IAM](https://cloud.google.com/iam) role on the existing project: + +__Note__: To grant a user a role, take a look at the [Granting and Revoking Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#grant-single-role) documentation. 
+ +### Spinning up the architecture + +#### Step 1: Cloning the repository + +Click on the button below, sign in if required and when the prompt appears, click on “confirm”. + +[
](https://goo.gle/GoDataPipe) + +This will clone the repository to your cloud shell and a screen like this one will appear: + +![cloud_shell](cloud_shell.png) + +Before you deploy the architecture, make sure you run the following command to move your cloudshell session into your service project: -## Managed resources and services + gcloud config set project [SERVICE_PROJECT_ID] -This sample creates several distinct groups of resources: +Once you can see your service project id in the yellow parenthesis, you’re ready to start. -- projects - - Service Project configured for GCS buckets, Dataflow instances and BigQuery tables and orchestration -- networking - - VPC network - - One subnet - - Firewall rules for [SSH access via IAP](https://cloud.google.com/iap/docs/using-tcp-forwarding) and open communication within the VPC -- IAM - - One service account for uploading data into the GCS landing bucket - - One service account for Orchestration - - One service account for Dataflow instances - - One service account for Bigquery tables -- GCS - - One bucket -- BQ - - One dataset - - One table. Tables are defined in Terraform for the porpuse of the example. Probably, in real scenario, would handle Tables creation in a separate Terraform State or using a different tool/pipeline (for example: Dataform). +Before we deploy the architecture, you will need the following information: -In this example you can also configure users or group of user to assign them viewer role on the resources created and the ability to imprsonate service accounts to test dataflow pipelines before autometing them with Composer or any other orchestration systems. +* The __service project ID__. +* A __unique prefix__ that you want all the deployed resources to have (for example: awesomestartup). This must be a string with no spaces or tabs. +* A __list of Groups or Users__ with Service Account Token creator role on Service Accounts in IAM format, eg 'group:group@domain.com'. 
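
Before editing the tfvars file, it can help to sanity-check the three values listed above. The following is a minimal Python sketch, not part of the repository: the helper names `check_inputs` and `render_tfvars` are hypothetical, the variable names `data_eng_principals`, `project_id` and `prefix` are the ones this README uses, and the IAM member pattern only covers the `user:`, `group:` and `serviceAccount:` forms, which is an assumption about what you are likely to pass.

```python
import re

# Assumption: only user/group/serviceAccount members are checked here.
# IAM accepts other member types that this sketch would flag.
IAM_MEMBER = re.compile(r"^(user|group|serviceAccount):[^\s@]+@[^\s@]+$")


def check_inputs(project_id, prefix, principals):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if not project_id:
        problems.append("project_id is empty")
    # The README asks for a prefix with no spaces or tabs.
    if not prefix or re.search(r"\s", prefix):
        problems.append("prefix must be non-empty with no spaces or tabs")
    for principal in principals:
        if not IAM_MEMBER.match(principal):
            problems.append(f"not in IAM member format: {principal!r}")
    return problems


def render_tfvars(project_id, prefix, principals):
    """Render the three variables in tfvars syntax (hypothetical helper)."""
    members = ", ".join(f'"{p}"' for p in principals)
    return (
        f"data_eng_principals = [{members}]\n"
        f'project_id = "{project_id}"\n'
        f'prefix = "{prefix}"\n'
    )


if __name__ == "__main__":
    values = ("datalake-001", "awesomestartup", ["group:group@domain.com"])
    issues = check_inputs(*values)
    print(issues if issues else render_tfvars(*values))
```

The rendered text can be pasted into the tfvars file in the next step; the check is deliberately strict so that a missing `group:` or `user:` type prefix is caught before `terraform apply` runs.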
-## Deploy your enviroment +#### Step 2: Deploying the resources -We assume the identiy running the following steps has the following role: - - `resourcemanager.projectCreator` in case a new project will be created. - - `owner` on the project in case you use an existing project. +1. Once you have the required information, head back to the cloud shell editor. Make sure you’re in the following directory: -Run Terraform init: + cloudshell_open/cloud-foundation-fabric/examples/data-solutions/gcs-to-bq-with-least-privileges -``` -$ terraform init -``` +2. In the editor, edit the terraform.tfvars.sample file with the variables you gathered in the step above. -Configure the Terraform variable in your `terraform.tfvars` file. You need to spefify at least the following variables: +![editor](editor.png) -``` -data_eng_principals = ["user:data-eng@domain.com"] -project_id = "datalake-001" -prefix = "prefix" -``` +* a. Fill in __data_eng_principals__ with the list of Users or Groups to impersonate service accounts. -You can run now: +* b. Fill in __project_id__ with the service project ID. -``` -$ terraform apply -``` +* c. Fill in the prefix with your chosen unique prefix for resources. -You should see the output of the Terraform script with resources created and some command pre-created for you to run the example following steps below. +* d. Save the file with __Ctrl(or ⌘)+S__ or by going to __File → Save__. -### Virtual Private Cloud (VPC) design +3. Then, run the following commands: -As is often the case in real-world configurations, this example accepts as input an existing [Shared-VPC](https://cloud.google.com/vpc/docs/shared-vpc) via the `network_config` variable. + terraform init + + terraform apply -var-file=terraform.tfvars.sample -auto-approve -If the `network_config` variable is not provided, one VPC will be created in each project that supports network resources (load, transformation and orchestration). 
+The resource creation will take a few minutes. On successful completion you should see output like the following, along with a list of the created resources: -When `network_config` variable is configured, the identity running the Terraform script need to have the following roles: - - `roles/compute.xpnAdmin` on the host project folder or org - - `roles/resourcemanager.projectIamAdmin` on the host project, either with no conditions or with a condition allowing [delegated role grants](https://medium.com/google-cloud/managing-gcp-service-usage-through-delegated-role-grants-a843610f2226#:~:text=Delegated%20role%20grants%20is%20a,setIamPolicy%20permission%20on%20a%20resource.) for `roles/compute.networkUser`, `roles/composer.sharedVpcAgent`, `roles/container.hostServiceAgentUser` +![output](output.png) -## Test your environment with Cloud Dataflow +__Congratulations!__ You have successfully deployed the foundation for running your first ETL pipeline on Google Cloud. -We assume all those steps are run using a user listed on `data_eng_principals`. You can authenticate as the user using the following command: +### Testing your architecture -``` -$ gcloud init -$ gcloud auth application-default login -``` +To demonstrate how the ETL pipeline flow works, we’ve set up an example pipeline for you to run. We assume all the steps are run by a user listed in the __data_eng_principals__ variable (or a user that belongs to one of the groups you specified). Authenticate the user using the following command and make sure your active cloudshell session is set to the __service project__: + + gcloud auth application-default login + +Follow the instructions in the cloudshell to authenticate the user.
+ +To make the next steps easier, create two environment variables with the service project id and the prefix: + + export SERVICE_PROJECT_ID=[SERVICE_PROJECT_ID] + export PREFIX=[PREFIX] + +Again, make sure you’re in the following directory: + + cloudshell_open/cloud-foundation-fabric/examples/data-solutions/gcs-to-bq-with-least-privileges For the purpose of the example we will import from GCS to Bigquery a CSV file with the following structure: -``` -name,surname,timestam -``` + name,surname,timestamp -We need to create 3 file: - - A `person.csv` file containing your data in the form `name,surname,timestam`. Here an example line `Lorenzo,Caggioni,1637771951'. - - A `person_udf.js` containing the UDF javascript file used by the Dataflow template. - - A `person_schema.json` file containing the table schema used to import the CSV. - -You can find an example of those file in the folder `./data-demo`. You can copy the example files in the GCS bucket using the command returned in the terraform output as `command_01_gcs`. Below an example: +We need to create 3 files: -```bash -gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com cp data-demo/* gs://LANDING_BUCKET -``` +* A person.csv file containing your data in the form name,surname,timestamp. For example: `Eva,Rivarola,1637771951`. +* A person_udf.js containing the [UDF javascript file](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions) used by the Dataflow template. +* A person_schema.json file containing the table schema used to import the CSV. -We can now run the Dataflow pipeline using the `gcloud` returned in the terraform output as `command_02_dataflow`. Below an example: +Examples of these files can be found in the folder ./data-demo, inside the same repository where you ran the terraform commands.
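
If you would rather generate your own person.csv than use the sample, the sketch below shows one way to produce rows in the name,surname,timestamp form. It is illustrative only: the function names `person_rows` and `to_csv` are hypothetical, the timestamp is assumed to be Unix epoch seconds (which is what the example value 1637771951 looks like), and no header row is written, which you should check against the files in ./data-demo.

```python
import csv
import io
import time


def person_rows(people, now=None):
    """Build (name, surname, timestamp) tuples.

    Assumption: the timestamp column holds Unix epoch seconds.
    """
    ts = int(time.time() if now is None else now)
    return [(name, surname, ts) for name, surname in people]


def to_csv(rows):
    """Serialize rows as CSV text, one record per line, no header row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    rows = person_rows([("Eva", "Rivarola")], now=1637771951)
    # Writes person.csv in the current directory.
    with open("person.csv", "w", newline="") as handle:
        handle.write(to_csv(rows))
```

With the fixed `now` value used above, the file contains the single line `Eva,Rivarola,1637771951`, matching the example in the list.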
-```bash
-gcloud --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
+You can copy the example files into the GCS bucket by running:
+
+    gsutil -i gcs-landing@$SERVICE_PROJECT_ID.iam.gserviceaccount.com cp data-demo/* gs://$PREFIX-data
+
+Once this is done, the 3 files necessary to run the Dataflow job will have been copied to the GCS bucket that was created along with the resources.
+
+Run the following command to start the Dataflow job:
+
+    gcloud --impersonate-service-account=orchestrator@$SERVICE_PROJECT_ID.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
     --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
-    --project PROJECT \
-    --region REGION \
+    --project $SERVICE_PROJECT_ID \
+    --region europe-west1 \
     --disable-public-ips \
-    --subnetwork https://www.googleapis.com/compute/v1/projects/PROJECT/regions/REGION/subnetworks/subnet \
-    --staging-location gs://PREFIX-df-tmp \
-    --service-account-email df-loading@PROJECT.iam.gserviceaccount.com \
+    --subnetwork https://www.googleapis.com/compute/v1/projects/$SERVICE_PROJECT_ID/regions/europe-west1/subnetworks/subnet \
+    --staging-location gs://$PREFIX-df-tmp \
+    --service-account-email df-loading@$SERVICE_PROJECT_ID.iam.gserviceaccount.com \
     --parameters \
-javascriptTextTransformFunctionName=transform,\
-JSONPath=gs://PREFIX-data/person_schema.json,\
-javascriptTextTransformGcsPath=gs://PREFIX-data/person_udf.js,\
-inputFilePattern=gs://PREFIX-data/person.csv,\
-outputTable=PROJECT:datalake.person,\
-bigQueryLoadingTemporaryDirectory=gs://PREFIX-df-tmp
-```
-
-You can check data imported into Google BigQuery using the command returned in the terraform output as `command_03_bq`. Below an example:
-
-```
-bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1000'
-```
+    javascriptTextTransformFunctionName=transform,\
+    JSONPath=gs://$PREFIX-data/person_schema.json,\
+    javascriptTextTransformGcsPath=gs://$PREFIX-data/person_udf.js,\
+    inputFilePattern=gs://$PREFIX-data/person.csv,\
+    outputTable=$SERVICE_PROJECT_ID:datalake.person,\
+    bigQueryLoadingTemporaryDirectory=gs://$PREFIX-df-tmp
+
+This command will start a Dataflow job called test_batch_01 that uses a Dataflow template stored in the public GCS bucket:
+
+    gs://dataflow-templates/latest/GCS_Text_to_BigQuery
+
+The expected output is the following:
+
+![second_output](second_output.png)
+
+Then, if you navigate to Dataflow on the console, you will see the following:
+
+![dataflow_console](dataflow_console.png)
+
+This shows that the job you started from Cloud Shell is currently running in Dataflow.
+If you click on the job name, you can see the job graph created and how every step of the Dataflow pipeline is moving along:
+
+![dataflow_execution](dataflow_execution.png)
+
+Once the job completes, you can navigate to BigQuery in the console and under __SERVICE_PROJECT_ID__ → datalake → person, you can see the data that was successfully imported into BigQuery through the Dataflow job.
+
+## Cleaning up your environment
+
+The easiest way to remove all the deployed resources is to run the following command in Cloud Shell:
+
+    terraform destroy -var-file=terraform.tfvars.sample -auto-approve
+
+The above command will delete the associated resources so there will be no billable charges made afterwards.
 ## Variables
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/cloud_shell.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/cloud_shell.png
new file mode 100644
index 0000000000..21bb72e018
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/cloud_shell.png differ
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/dataflow_console.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/dataflow_console.png
new file mode 100644
index 0000000000..526aa05785
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/dataflow_console.png differ
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/dataflow_execution.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/dataflow_execution.png
new file mode 100644
index 0000000000..690e498e0a
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/dataflow_execution.png differ
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/editor.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/editor.png
new file mode 100644
index 0000000000..b6626aa1d7
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/editor.png differ
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/output.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/output.png
new file mode 100644
index 0000000000..3758c3145b
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/output.png differ
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/second_output.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/second_output.png
new file mode 100644
index 0000000000..618c341551
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/second_output.png differ
diff --git a/examples/data-solutions/gcs-to-bq-with-least-privileges/shell_button.png b/examples/data-solutions/gcs-to-bq-with-least-privileges/shell_button.png
new file mode 100644
index 0000000000..21a3f3de9d
Binary files /dev/null and b/examples/data-solutions/gcs-to-bq-with-least-privileges/shell_button.png differ