Documentation for custom transformation (#875)
* Documentation for custom transformation

* minor tweaks

* added complete documentation

* minor change

* reverting change

* minor changes

* addressing comments

* minor fixes

* minor fix

* addressing comments
shreyakhajanchi authored Aug 1, 2024
1 parent 102fb44 commit 7cb785b
Showing 9 changed files with 390 additions and 77 deletions.
4 changes: 4 additions & 0 deletions docs/index.md
@@ -48,6 +48,10 @@ To launch reverse replication, refer details [here](./reverse-replication/Revers

To find out how to monitor your migration, refer [here](./monitoring/MonitoringUserGuide.md).

### Custom transformation

To find out how to configure custom transformations, refer [here](./transformations/CustomTransformation.md).

## Supported Sources and Targets

- **Schema Migrations**: SMT supports schema migrations for MySQL, PostgreSQL, SQLServer and Oracle.
10 changes: 8 additions & 2 deletions docs/reverse-replication/ReverseReplicationUserGuide.md
Expand Up @@ -113,6 +113,9 @@ The Dataflow job that writes to source database exposes the following per shard
| metadata_file_create_lag_retry_\<logical shard name\> | Count of file lookup retries done when the job that writes to GCS is lagging |
| mySQL_retry_\<logical shard name\> | Number of retries done when MySQL is not reachable|
| shard_failed_\<logical shard name\> | Published when there is a failure while processing the shard |
| custom_transformation_exception | Number of exceptions encountered in the custom transformation jar |
| filtered_events_\<logical shard name\> | Number of events filtered via custom transformation per shard |
| apply_custom_transformation_impl_latency_ms | Time taken for the execution of custom transformation logic. |

These can be used to track the pipeline progress.
However, there is a limit of 100 on the total number of metrics per project. So if this limit is exhausted, the Dataflow job will give a message like so:
@@ -378,10 +381,13 @@ In order to make it easier for users to customize the shard routing logic, the [
Steps to perform customization:
1. Write custom shard id fetcher logic [CustomShardIdFetcher.java](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomShardIdFetcher.java). Details of the ShardIdRequest class can be found [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/ShardIdRequest.java).
2. Build the [JAR](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/spanner-custom-shard) and upload the jar to GCS
3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-shard-identification).
4. If any custom parameters are needed in the custom shard identification logic, they can be passed via the *readerShardingCustomParameters* input to the runner. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup.
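As an illustration of the steps above, a minimal fetcher might look like the sketch below. The `ShardIdRequest` stand-in and the method names are simplified assumptions for readability, not the actual SDK signatures linked above:

```java
import java.util.Map;

// Simplified stand-in for the SDK's ShardIdRequest; the real class lives
// in the spanner-migrations-sdk linked above.
class ShardIdRequest {
    private final String tableName;
    private final Map<String, Object> spannerRecord;

    ShardIdRequest(String tableName, Map<String, Object> spannerRecord) {
        this.tableName = tableName;
        this.spannerRecord = spannerRecord;
    }

    String getTableName() { return tableName; }
    Map<String, Object> getSpannerRecord() { return spannerRecord; }
}

class CustomShardIdFetcher {
    // Called once per worker; the readerShardingCustomParameters value
    // arrives here.
    public void init(String customParameters) {
        // e.g. parse "a=b,c=d" into a map if the routing logic needs it
    }

    // Example routing: shard Customers rows by the first character of the
    // id column, and send everything else to a default shard.
    public String getShardId(ShardIdRequest request) {
        if ("Customers".equals(request.getTableName())) {
            Object id = request.getSpannerRecord().get("id");
            if (id != null) {
                return "shard-" + id.toString().charAt(0);
            }
        }
        return "shard-default";
    }
}
```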


### Custom transformations
For cases where a user wants to apply custom transformation logic, they need to specify the following parameters in the [GCS to Sourcedb](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/gcs-to-sourcedb) template: a GCS path that points to a custom jar, the fully qualified name of the class containing the custom transformation logic, and any custom parameters the jar might use when invoking that logic.

For details on how to implement and use custom transformations, please refer to the [Custom Transformation](../transformations/CustomTransformation.md) section.
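As a rough illustration of the shape such a class can take, the sketch below uses simplified stand-in request/response types; the real interface and types are described in the linked Custom Transformation guide, and the field names here are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the SDK's request/response types.
class TransformationRequest {
    private final String tableName;
    private final Map<String, Object> row;

    TransformationRequest(String tableName, Map<String, Object> row) {
        this.tableName = tableName;
        this.row = row;
    }

    String getTableName() { return tableName; }
    Map<String, Object> getRow() { return row; }
}

class TransformationResponse {
    private final Map<String, Object> row;
    private final boolean filtered;

    TransformationResponse(Map<String, Object> row, boolean filtered) {
        this.row = row;
        this.filtered = filtered;
    }

    Map<String, Object> getRow() { return row; }
    boolean isFiltered() { return filtered; }
}

class CustomTransformer {
    // Receives the writerTransformationCustomParameters value.
    public void init(String customParameters) {}

    // Example: uppercase the name column of Customers rows and filter out
    // test rows; filtered events are what the filtered_events_<shard>
    // metric counts.
    public TransformationResponse toSource(TransformationRequest request) {
        Map<String, Object> row = new HashMap<>(request.getRow());
        if ("Customers".equals(request.getTableName())) {
            Object name = row.get("name");
            if (name != null && name.toString().startsWith("test_")) {
                return new TransformationResponse(null, true); // drop event
            }
            if (name != null) {
                row.put("name", name.toString().toUpperCase());
            }
        }
        return new TransformationResponse(row, false);
    }
}
```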

## Cost

28 changes: 26 additions & 2 deletions docs/reverse-replication/RunnigReverseReplication.md
Expand Up @@ -69,6 +69,10 @@ The script takes in multiple arguments to orchestrate the pipeline. They are:
- `vpcHostProjectId`: project ID hosting the subnetwork. If unspecified, the 'projectId' parameter value will be used for subnetwork.
- `vpcNetwork`: name of the VPC network to be used for the dataflow jobs
- `vpcSubnetwork`: name of the VPC subnetwork to be used for the dataflow jobs. Subnet should exist in the same region as the 'dataflowRegion' parameter.
- `writeFilteredEventsToGcs`: Whether to write filtered events to GCS. Default is false.
- `writerTransformationCustomJarPath`: The GCS path to the custom jar containing custom transformation logic.
- `writerTransformationCustomClassName`: The fully qualified name of the custom class containing the transformation logic.
- `writerTransformationCustomParameters`: Any custom parameters to be supplied to the custom transformation class.
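The `*CustomParameters` values above are passed as a single comma-separated string of key-value pairs (e.g. `a=b,c=d`). A small helper like the following illustrative sketch (not part of the SDK) could parse that string inside the custom class's init method:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative helper: split a "k=v,k2=v2" string, the format used by the
// readerSharding/writerTransformation CustomParameters arguments.
class CustomParams {
    static Map<String, String> parse(String params) {
        Map<String, String> out = new HashMap<>();
        if (params == null || params.isEmpty()) {
            return out;
        }
        for (String pair : params.split(",")) {
            String[] kv = pair.split("=", 2);
            out.put(kv[0].trim(), kv.length > 1 ? kv[1].trim() : "");
        }
        return out;
    }
}
```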


## Pre-requisites
@@ -139,9 +143,9 @@ Launched writer job: id:"<>" project_id:"<>" name:"<>" current_state_time:{} cr
```

### Custom shard identification

In order to specify a custom shard identification function, the custom jar and class names need to be given. The command to do that is below:

```
go run reverse-replication-runner.go -projectId=<project-id> -dataflowRegion=<region> -instanceId=<spanner-instance> -dbName=<spanner-database> -sourceShardsFilePath=gs://bucket-name/shards.json -sessionFilePath=gs://bucket-name/session.json -gcsPath=gs://bucket-name/<directory> -readerShardingCustomJarPath=gs://bucket-name/custom.jar -readerShardingCustomClassName=com.custom.classname -readerShardingCustomParameters='a=b,c=d'
@@ -154,6 +158,26 @@ gcloud dataflow flex-template run smt-reverse-replication-reader-2024-01-05t10-3
```


### Custom transformation

In order to specify a custom transformation, the custom jar and class names need to be given. The command to do that is below:

```
go run reverse-replication-runner.go -projectId=<project-id> -dataflowRegion=<region> -instanceId=<spanner-instance> -dbName=<spanner-database> -sourceShardsFilePath=gs://bucket-name/shards.json -sessionFilePath=gs://bucket-name/session.json -gcsPath=gs://bucket-name/<directory> -writerTransformationCustomJarPath=gs://bucket-name/custom.jar -writerTransformationCustomClassName=com.custom.classname -writerTransformationCustomParameters='a=b,c=d'
```

If a user wants to write the records filtered via custom transformation to GCS, they can use the command below:
```
go run reverse-replication-runner.go -projectId=<project-id> -dataflowRegion=<region> -instanceId=<spanner-instance> -dbName=<spanner-database> -sourceShardsFilePath=gs://bucket-name/shards.json -sessionFilePath=gs://bucket-name/session.json -gcsPath=gs://bucket-name/<directory> -writerTransformationCustomJarPath=gs://bucket-name/custom.jar -writerTransformationCustomClassName=com.custom.classname -writerTransformationCustomParameters='a=b,c=d' -writeFilteredEventsToGcs
```

The sample writer job gcloud command for the same:

```
gcloud dataflow flex-template run smt-reverse-replication-writer-2024-04-17t08-00-37z-8dac-d95f --project=<project> --region=<region> --template-file-gcs-location=<template location> --parameters sessionFilePath=<session path>,sourceDbTimezoneOffset=+00:00,spannerProjectId=span-cloud-testing,runMode=regular,sourceShardsFilePath=<shard file path>,metadataTableSuffix=,GCSInputDirectoryPath=<gcs path>,metadataInstance=<spanner instance>,metadataDatabase=<spanner metadata database>,runIdentifier=2024-04-17t08-00-37z,transformationJarPath=<jar path>,transformationClassName=<custom class name>,transformationCustomParameters=a=b,c=d,writeFilteredEventsToGcs=true --num-workers=5 --worker-machine-type=n2-standard-4 --additional-experiments=use_runner_v2
```


### Network and Subnetwork specification

If the dataflow workers need to be run on a different network and subnetwork with custom network tags, sample command is below.
