This demo serves as a simple example of building and launching a Flex Template Dataflow pipeline.

![Dataflow pipeline overview](../../images/df_demo_pipeline.png "Dataflow pipeline overview")

## Local development run

For local development, the pipeline can be launched from your local machine for testing, using different runners depending on the scope of the test.

### Using the Beam DirectRunner

The example below uses the Beam DirectRunner. This runner is mainly suited for quick local tests in the development environment with a low volume of data.

```
CSV_FILE=gs://[TEST-BUCKET]/customers.csv
JSON_SCHEMA=gs://[TEST-BUCKET]/customers_schema.json
OUTPUT_TABLE=[TEST-PROJ].[TEST-DATASET].customers
PIPELINE_STAGING_PATH="gs://[TEST-STAGING-BUCKET]"

python src/csv2bq.py \
  --runner="DirectRunner" \
  --csv_file=$CSV_FILE \
  --json_schema=$JSON_SCHEMA \
  --output_table=$OUTPUT_TABLE \
  --temp_location=$PIPELINE_STAGING_PATH/tmp
```

**Note:** All paths mentioned can be local paths or GCS paths. For the cloud resources referenced (GCS and BigQuery), make sure that the user launching the command is authenticated to GCP via `gcloud auth application-default login` and has the required access privileges to those resources.
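
As a quick sanity check before launching, the following read-only commands confirm both authentication and access; the bucket and dataset names are the same placeholders used in the examples above:

```
# Authenticate the local environment for application-default credentials.
gcloud auth application-default login

# Spot-check access to the referenced resources; both commands should
# succeed without permission errors.
gsutil ls gs://[TEST-BUCKET]/
bq show [TEST-PROJ]:[TEST-DATASET]
```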

### Using the DataflowRunner with a local CLI launch

The example below triggers the pipeline on Dataflow from your local development environment. This is useful for running tests on larger volumes of data and verifying that the pipeline runs well on Dataflow before building it into a template.

```
PROJECT_ID=[TEST-PROJECT]
REGION=[REGION]
SUBNET=[SUBNET-NAME]
DEV_SERVICE_ACCOUNT=[DEV-SA]
PIPELINE_STAGING_PATH="gs://[TEST-STAGING-BUCKET]"
CSV_FILE=gs://[TEST-BUCKET]/customers.csv
JSON_SCHEMA=gs://[TEST-BUCKET]/customers_schema.json
OUTPUT_TABLE=[TEST-PROJ].[TEST-DATASET].customers

python src/csv2bq.py \
  --runner="DataflowRunner" \
  --project=$PROJECT_ID \
  --region=$REGION \
  --csv_file=$CSV_FILE \
  --json_schema=$JSON_SCHEMA \
  --output_table=$OUTPUT_TABLE \
  --temp_location=$PIPELINE_STAGING_PATH/tmp \
  --staging_location=$PIPELINE_STAGING_PATH/stage \
  --subnetwork="regions/$REGION/subnetworks/$SUBNET" \
  --impersonate_service_account=$DEV_SERVICE_ACCOUNT \
  --no_use_public_ips
```

In terms of resource access privileges, you can choose to impersonate another service account, for example one defined for development resource access. The authenticated user launching this pipeline will need the `roles/iam.serviceAccountTokenCreator` role on that service account. If you launch the pipeline without service account impersonation, it will use the default Compute Engine service account of the target project.
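
As a sketch, granting that role directly on the development service account could look like the following; the account and user names are placeholders, not resources defined in this example:

```
# Placeholder names: replace with the actual developer identity and dev SA.
gcloud iam service-accounts add-iam-policy-binding \
  [DEV-SA]@[TEST-PROJECT].iam.gserviceaccount.com \
  --member="user:[DEVELOPER-EMAIL]" \
  --role="roles/iam.serviceAccountTokenCreator"
```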

## Dataflow Flex Template run

For production, and as outlined in the Data Platform demo, we build and launch the pipeline as a Flex Template, making it available for other cloud services (such as Apache Airflow) and users to launch instances of it on demand.

### Build launch

Below is an example of triggering the Dataflow Flex Template build pipeline defined in `cloudbuild.yaml`. The Terraform output also provides an example, pre-filled with parameter values based on the resources generated in the data platform.

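A minimal sketch of such a submission is shown below; the substitution variable names (`_TEMPLATE_IMAGE`, `_TEMPLATE_PATH`) and bracketed values are placeholders and do not necessarily match the variables actually defined in `cloudbuild.yaml`:

```
# Placeholder sketch: substitution names are illustrative, check
# cloudbuild.yaml for the variables it actually expects.
gcloud builds submit \
  --project=[SERVICE-PROJECT] \
  --region=[REGION] \
  --config=cloudbuild.yaml \
  --substitutions=_TEMPLATE_IMAGE=[IMAGE-URI],_TEMPLATE_PATH=gs://[TEMPLATES-BUCKET]/csv2bq.json
```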

**Note:** For the scope of the demo, the launch of this build is manual, but in production this build would be launched via a configured Cloud Build trigger when new changes are merged into the code branch of the Dataflow template.
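
As an illustration of that production setup, such a trigger could be created with `gcloud`; the trigger name, repository, and branch values below are placeholders rather than configuration taken from this repository:

```
# Placeholder sketch: repository and branch values are illustrative only.
gcloud builds triggers create github \
  --name="dataflow-template-build" \
  --repo-owner=[REPO-OWNER] \
  --repo-name=[REPO-NAME] \
  --branch-pattern="^main$" \
  --build-config="cloudbuild.yaml" \
  --project=[SERVICE-PROJECT]
```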

### Template launch

After the build step succeeds, you can launch the Dataflow pipeline from the CLI (as outlined in this example) or through the API, for example via Airflow's operator. For the use case of the data platform, the Dataflow pipeline is launched via the orchestration service account, which is what the Airflow DAG also uses in the scope of this demo.

**Note:** In the data platform demo, the launch of this Dataflow pipeline is handled by the Airflow operator (`DataflowStartFlexTemplateOperator`).

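For a manual launch outside of Airflow, a minimal sketch with `gcloud` is shown below. It assumes the build step wrote the template spec to a path such as `gs://[TEMPLATES-BUCKET]/csv2bq.json` (a placeholder), reuses the pipeline parameters from the local examples above, and impersonates a placeholder orchestration service account:

```
# Placeholder values throughout; parameter names mirror the csv2bq options
# used in the local runs above.
gcloud dataflow flex-template run "csv2bq-manual-run" \
  --project=[TEST-PROJECT] \
  --region=[REGION] \
  --template-file-gcs-location="gs://[TEMPLATES-BUCKET]/csv2bq.json" \
  --impersonate-service-account=[ORCHESTRATION-SA] \
  --parameters csv_file="gs://[TEST-BUCKET]/customers.csv" \
  --parameters json_schema="gs://[TEST-BUCKET]/customers_schema.json" \
  --parameters output_table="[TEST-PROJ].[TEST-DATASET].customers"
```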
