From 76175c22cf14f59ba4dd9f21c5b9eee52ce73617 Mon Sep 17 00:00:00 2001 From: Shreya Khajanchi Date: Fri, 19 Jul 2024 11:11:13 +0530 Subject: [PATCH 01/10] Documentation for custom transformation --- .../ReverseReplicationUserGuide.md | 8 ++++++++ docs/ui/prepare-migration/dataflow-tuning.md | 18 ++++++++++++++++++ 2 files changed, 26 insertions(+) diff --git a/docs/reverse-replication/ReverseReplicationUserGuide.md b/docs/reverse-replication/ReverseReplicationUserGuide.md index dbf0a3a3a..8e8e16201 100644 --- a/docs/reverse-replication/ReverseReplicationUserGuide.md +++ b/docs/reverse-replication/ReverseReplicationUserGuide.md @@ -381,6 +381,14 @@ Steps to perfrom customization: 3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-jar). 4. If any custom parameters are needed in the custom shard identification logic, they can be passed via the *readerShardingCustomParameters* input to the runner. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup. +### Custom transformations +For cases where a user wants to handle a custom transformation logic, they need to specify the following parameters in the [GCS to Sourcedb](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/gcs-to-sourcedb) - a GCS path that points to a custom jar, fully classified custom class name of the class containing custom transformation logic and custom parameters which might be used by the jar to invoke custom logic to perform transformation. + +Steps to perfrom customization: +1. Write custom transformation logic in [CustomShardIdFetcher.java](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomShardIdFetcher.java). Details of the ShardIdRequest class can be found [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/ShardIdRequest.java). +2. Build the [JAR](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/spanner-custom-shard) and upload the jar to GCS +3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-jar). +4. If any custom parameters are needed in the custom shard identification logic, they can be passed via the *readerShardingCustomParameters* input to the runner. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup. ## Cost diff --git a/docs/ui/prepare-migration/dataflow-tuning.md b/docs/ui/prepare-migration/dataflow-tuning.md index 29c46338d..2cfcab748 100644 --- a/docs/ui/prepare-migration/dataflow-tuning.md +++ b/docs/ui/prepare-migration/dataflow-tuning.md @@ -44,6 +44,24 @@ To tune dataflow, first specify the target database in the 'Configure Spanner Da SMT exposes the most frequently changed dataflow configurations to the user. Please reach out to us if you have a use-case that is not satisfied by the provided configurations. +### Custom JAR GCS Path +Specify the GCS path of the jar containing custom transformation logic. For custom transformation, specify both custom jar GCS path and fully classified class name. If no custom jar and class name are provided, only the default transformations will be used. + +Present under the Custom Transformations section of the form. + +### Custom Class Name +Specify the fully classified class name of the class containing custom transformation logic. For custom transformation, specify both custom jar GCS path and fully classified class name. If no custom jar and class name are provided, only the default transformations will be used. + +Present under the Custom Transformations section of the form. + +{: .highlight } +Specify both the custom class name and custom jar GCS path, or specify neither. + +### Custom Parameter +Specify the custom parameters to be passed to the custom transformation logic implementation. + +Present under the Custom Transformations section of the form. + ### VPC Host ProjectId Specify the project id of the VPC that you want to use. This is required in order to use private connectivity. By default, this is assumed to be the same as Spanner project. Ensure this is specified if also specifying a network and subnetwork. From 0cbd565603788ae5ae74b1cab03f15a1a366ad08 Mon Sep 17 00:00:00 2001 From: Shreya Khajanchi Date: Fri, 19 Jul 2024 17:42:46 +0530 Subject: [PATCH 02/10] minor tweaks --- docs/ui/prepare-migration/dataflow-tuning.md | 6 +++++- docs/ui/schema-conv/spanner-draft.md | 8 ++++++++ 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/docs/ui/prepare-migration/dataflow-tuning.md b/docs/ui/prepare-migration/dataflow-tuning.md index 2cfcab748..7a1e4564a 100644 --- a/docs/ui/prepare-migration/dataflow-tuning.md +++ b/docs/ui/prepare-migration/dataflow-tuning.md @@ -29,7 +29,7 @@ Some use cases when the user would want to tweak the jobs are: To tune dataflow, first specify the target database in the 'Configure Spanner Database' step. This enables the configure button for the remaining steps. -![](https://services.google.com/fh/gumdrop/preview/misc/dataflow-form.png) +![](https://services.google.com/fh/files/misc/dataflow-tuning.png)
@@ -49,6 +49,10 @@ Specify the GCS path of the jar containing custom transformation logic. For cust Present under the Custom Transformations section of the form. +{: .highlight } + +Please make sure that the custom JAR code is idempotent to manage transaction retries effectively. + ### Custom Class Name Specify the fully classified class name of the class containing custom transformation logic. For custom transformation, specify both custom jar GCS path and fully classified class name. If no custom jar and class name are provided, only the default transformations will be used. diff --git a/docs/ui/schema-conv/spanner-draft.md b/docs/ui/schema-conv/spanner-draft.md index f47f1ef3e..232f3f108 100644 --- a/docs/ui/schema-conv/spanner-draft.md +++ b/docs/ui/schema-conv/spanner-draft.md @@ -30,6 +30,14 @@ Column tab provides information on the columns that are a part of the selected t ![](https://services.google.com/fh/files/misc/column-info-edit.png) +#### Add Column + +In addition to editing the existing columns in the Spanner draft mapped from the source database, users can also add new columns to the selected table. + +![](https://services.google.com/fh/files/misc/add_column.png) +![](https://services.google.com/fh/files/misc/add_column_form.png) +![](https://services.google.com/fh/files/misc/new_column.png) + ### Primary Key Users can view and edit the primary key of a table from the primary key tab. They can remove/add a column from the primary key or change the order of columns in the primary key. Once these changes are made, the session file is updated and they can also be verified from the [SQL tab](#sql). From 28caed40160f667d43586cf4fa6f74bae6d0f054 Mon Sep 17 00:00:00 2001 From: Shreya Khajanchi Date: Mon, 22 Jul 2024 19:56:58 +0530 Subject: [PATCH 03/10] added complete documentation --- docs/minimal/transformations.md | 53 ++++++++ .../ReverseReplicationUserGuide.md | 50 ++++++- .../RunnigReverseReplication.md | 28 +++- docs/troubleshoot/minimal.md | 47 ++++--- docs/ui/prepare-migration/dataflow-tuning.md | 2 + .../reverse-replication-runner.go | 123 ++++++++++-------- 6 files changed, 223 insertions(+), 80 deletions(-) create mode 100644 docs/minimal/transformations.md diff --git a/docs/minimal/transformations.md b/docs/minimal/transformations.md new file mode 100644 index 000000000..7698772fc --- /dev/null +++ b/docs/minimal/transformations.md @@ -0,0 +1,53 @@ +--- +layout: default +title: Custom Transformation +parent: Minimal downtime migrations +nav_order: 4 +--- + +# Minimal downtime migrations for MySQL +{: .no_toc } + +For cases where a user wants to handle a custom transformation logic, they need to specify the following parameters in the [Datastream To Spanner](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/datastream-to-spanner) template - a GCS path that points to a custom jar, fully classified custom class name of the class containing custom transformation logic and custom parameters which might be used by the jar to invoke custom logic to perform transformation. + +Steps to perfrom customization: +1. Implement custom transformation logic for forward migration in the [toSpannerRow](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomTransformationFetcher.java#L42) method of the **CustomTransformationFetcher.java**. Details of the MigrationTransformationRequest class can be found [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/MigrationTransformationRequest.java). +2. Build the [JAR](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/spanner-custom-shard) and upload the jar to GCS +3. Invoke the datastream-to-spanner template by passing the custom jar path and custom class path. +4. If any custom parameters are needed in the custom transformation logic, they can be passed via the *customParameters* input to the template. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup. + +Implementation details for custom transformation: +1. [MigrationTransformationRequest](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/MigrationTransformationRequest.java) contains the following information - + - tableName - Name of the source table to which the event belongs to. + - shardId - Logical shard id of the record. + - eventType - The event type can either be INSERT, UPDATE-INSERT, UPDATE, UPDATE_DELETE or DELETE. Please refer to the [datastream documentation](https://cloud.google.com/datastream/docs/events-and-streams) for more details. + - requestRow - It is a map where key is the source column name and value is source column value. +2. [MigrationTransformationResponse](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/MigrationTransformationResponse.java) contains the following information - + - responseRow - It is a map where key is the spanner column name and value is spanner column value. + - isEventFiltered - If set to true, event will be skipped and not written to spanner. +3. Values in the response row should be of same datatype as the spanner schema. +4. Please throw **InvalidTransformationException** in case of any error while processing a particular event in custom jar. + +Here is a table that details the source data type for **MySQL**, its corresponding request row object type, spanner datatype and the expected response format: +| Source datatype | Request object type | Spanner datatype | Response format | +|-------------------------|-------------------------------------|-----------------------|----------------------------------------------------------------------------------------| +| TINYINT | Long | INT64 | Long | +| INT | Long | INT64 | Long | +| BIGINT | Long | INT64 | Long | +| TIME | Long ([time-micros](https://avro.apache.org/docs/current/specification/_print/#time-microsecond-precision) ex: 45296000000 for 12:34:56) | STRING | Long| +| YEAR | Long | STRING | Long | +| FLOAT | Double | FLOAT32 | Double | +| DOUBLE | Double | FLOAT64 | Double | +| DECIMAL | String | NUMERIC | String | +| BOOLEAN | Long | BOOLEAN | Long | +| TEXT | String | STRING | String | +| ENUM | String | STRING | String | +| BLOB | String (hex encoded) | BYTES | Binary String | +| BINARY | String (hex encoded) | BYTES | Binary String | +| BIT | Long | BYTES | Long | +| DATE | String (Format: yyyy-MM-dd)) | DATE | String( Format: yyyy-MM-dd) | +| DATETIME | String (ex: 2024-01-01T12:34:56Z) | TIMESTAMP | String | +| TIMESTAMP | String (ex: 2024-01-01T12:34:56Z) | TIMESTAMP | String | + + +Please refer to the sample implementation of **toSpannerRow** for all MySQL datatype columns [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomTransformationWithShardForIT.java#L44). \ No newline at end of file diff --git a/docs/reverse-replication/ReverseReplicationUserGuide.md b/docs/reverse-replication/ReverseReplicationUserGuide.md index 8e8e16201..b676f293e 100644 --- a/docs/reverse-replication/ReverseReplicationUserGuide.md +++ b/docs/reverse-replication/ReverseReplicationUserGuide.md @@ -113,6 +113,9 @@ The Dataflow job that writes to source database exposes the following per shard | metadata_file_create_lag_retry_\ | Count of file lookup retries done when the job that writes to GCS is lagging | | mySQL_retry_\ | Number of retries done when MySQL is not reachable| | shard_failed_\ | Published when there is a failure while processing the shard | +| custom_transformation_exception | Number of exception encountered in the custom transformation jar | +| filtered_events_\ | Number of events filtered via custom transformation per shard | +| apply_custom_transformation_impl_latency_ms | Time taken for the execution of custom transformation logic. | These can be used to track the pipeline progress. However, there is a limit of 100 on the total number of metrics per project. So if this limit is exhausted, the Dataflow job will give a message like so: @@ -378,18 +381,53 @@ In order to make it easier for users to customize the shard routing logic, the [ Steps to perfrom customization: 1. Write custom shard id fetcher logic [CustomShardIdFetcher.java](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomShardIdFetcher.java). Details of the ShardIdRequest class can be found [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/ShardIdRequest.java). 2. Build the [JAR](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/spanner-custom-shard) and upload the jar to GCS -3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-jar). +3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-shard-identification). 4. If any custom parameters are needed in the custom shard identification logic, they can be passed via the *readerShardingCustomParameters* input to the runner. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup. ### Custom transformations -For cases where a user wants to handle a custom transformation logic, they need to specify the following parameters in the [GCS to Sourcedb](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/gcs-to-sourcedb) - a GCS path that points to a custom jar, fully classified custom class name of the class containing custom transformation logic and custom parameters which might be used by the jar to invoke custom logic to perform transformation. +For cases where a user wants to handle a custom transformation logic, they need to specify the following parameters in the [GCS to Sourcedb](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/gcs-to-sourcedb) template - a GCS path that points to a custom jar, fully classified custom class name of the class containing custom transformation logic and custom parameters which might be used by the jar to invoke custom logic to perform transformation. Steps to perfrom customization: -1. Write custom transformation logic in [CustomShardIdFetcher.java](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomShardIdFetcher.java). Details of the ShardIdRequest class can be found [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/ShardIdRequest.java). +1. Implement custom transformation logic for reverse replication in the [toSourceRow](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-custom-shard/src/main/java/com/custom/CustomTransformationFetcher.java#L59) method of the **CustomTransformationFetcher.java**. Details of the MigrationTransformationRequest class can be found [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/MigrationTransformationRequest.java). 2. Build the [JAR](https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/spanner-custom-shard) and upload the jar to GCS -3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-jar). -4. If any custom parameters are needed in the custom shard identification logic, they can be passed via the *readerShardingCustomParameters* input to the runner. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup. - +3. Invoke the reverse replication flow by passing the [custom jar path and custom class path](RunnigReverseReplication.md#custom-transformation). +4. If any custom parameters are needed in the custom transformation logic, they can be passed via the *writerTransformationCustomParameters* input to the runner. These parameters will be passed to the *init* method of the custom class. The *init* method is invoked once per worker setup. + +Implementation details for custom transformation: +1. [MigrationTransformationRequest](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/MigrationTransformationRequest.java) contains the following information - + - tableName - Name of the spanner table to which the event belongs to. + - shardId - Logical shard id of the record. + - eventType - The event type can either be INSERT, UPDATE or DELETE + - requestRow - It is a map where key is the spanner column name and value is spanner column value. +2. [MigrationTransformationResponse](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/spanner-migrations-sdk/src/main/java/com/google/cloud/teleport/v2/spanner/utils/MigrationTransformationResponse.java) contains the following information - + - responseRow - It is a map where key is the source column name and value is source column value. + - isEventFiltered - If set to true, event will be skipped and not written to source. +3. Values in the response row should be exactly in the format compatible with source schema, which means users also need to enclose string values in **single quotes** as they would normally do in an INSERT statement. +4. Please throw **InvalidTransformationException** in case of any error while processing a particular event in custom jar. + +Here is a table that details the source data type for **MySQL**, its corresponding request row object type, spanner datatype and the expected response format: +|Spanner datatype | Source datatype | Request object type | Response format | +|-----------------|-------------------------|-------------------------------------------|----------------------------------------------------------------------------------------| +| INT64 | TINYINT | String | String | +| INT64 | INT | String | String | +| INT64 | BIGINT | String | String | +| STRING | TIME | String ([time-micros](https://avro.apache.org/docs/current/specification/_print/#time-microsecond-precision) ex: 45296000000 for 12:34:56) | String(Format: Time value **enclosed in single quotes**, ex: '14:30:00')| +| STRING | YEAR | String | String | +| FLOAT32 | FLOAT | BigDecimal | String | +| FLOAT64 | DOUBLE | BigDecimal | String | +| NUMERIC | DECIMAL | String | String | +| BOOL | BOOLEAN | String( ex: "false") | String | +| STRING | TEXT | String | String( **enclosed in single quotes**, ex: 'Transformed text') | +| STRING | ENUM | String | String( **enclosed in single quotes**, ex: 'Enum value') | +| BYTES | BLOB | String (Base64 encoded) | Binary String | +| BYTES | BINARY | String (Base64 encoded) | Binary String | +| BYTES | BIT | String (Base64 encoded) | Binary String | +| DATE | DATE | String (Format: yyyy-MM-dd)) | String( Format: yyyy-MM-dd **enclosed in single quotes**, ex: '1995-01-13') | +| TIMESTAMP | DATETIME | String (ex: 2024-01-01T12:34:56Z) | String | +| TIMESTAMP | TIMESTAMP | String (ex: 2024-01-01T12:34:56Z) | String | + + +Please refer to the sample implementation of **toSourceRow** for all MySQL datatype columns [here](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/d7db191b49ba0e5ecfde6d15f60d2801b05f8cc2/v2/spanner-custom-shard/src/main/java/com/custom/CustomTransformationWithShardForIT.java#L145). ## Cost diff --git a/docs/reverse-replication/RunnigReverseReplication.md b/docs/reverse-replication/RunnigReverseReplication.md index 4174b6637..2dedb936b 100644 --- a/docs/reverse-replication/RunnigReverseReplication.md +++ b/docs/reverse-replication/RunnigReverseReplication.md @@ -69,6 +69,10 @@ The script takes in multiple arguments to orchestrate the pipeline. They are: - `vpcHostProjectId`: project ID hosting the subnetwork. If unspecified, the 'projectId' parameter value will be used for subnetwork. - `vpcNetwork`: name of the VPC network to be used for the dataflow jobs - `vpcSubnetwork`: name of the VPC subnetwork to be used for the dataflow jobs. Subnet should exist in the same region as the 'dataflowRegion' parameter. +- `writeFilteredEventsToGcs`: Whether to write filtered events to GCS. Default is false. +- `writerTransformationCustomJarPath`: The GCS path to custom jar for custom transformation logic. +- `writerTransformationCustomClassName`: The fully qualified custom class name for custom transformation logic. +- `writerTransformationCustomParameters`: Any custom parameters to be supplied to custom transformation class. ## Pre-requisites @@ -139,9 +143,9 @@ Launched writer job: id:"<>" project_id:"<>" name:"<>" current_state_time:{} cr ``` -### Custom Jar +### Custom shard identification -In order to specify custom shard identification function, custom jar and class names need to give. The command to do that is below: +In order to specify custom shard identification function, custom jar and class names need to be given. The command to do that is below: ``` go run reverse-replication-runner.go -projectId= -dataflowRegion= -instanceId= -dbName= -sourceShardsFilePath=gs://bucket-name/shards.json -sessionFilePath=gs://bucket-name/session.json -gcsPath=gs://bucket-name/ -readerShardingCustomJarPath=gs://bucket-name/custom.jar -readerShardingCustomClassName=com.custom.classname -readerShardingCustomParameters='a=b,c=d' @@ -154,6 +158,26 @@ gcloud dataflow flex-template run smt-reverse-replication-reader-2024-01-05t10-3 ``` +### Custom transformation + +In order to specify custom transformation, custom jar and class names need to be given. The command to do that is below: + +``` +go run reverse-replication-runner.go -projectId= -dataflowRegion= -instanceId= -dbName= -sourceShardsFilePath=gs://bucket-name/shards.json -sessionFilePath=gs://bucket-name/session.json -gcsPath=gs://bucket-name/ -writerTransformationCustomJarPath=gs://bucket-name/custom.jar -writerTransformationCustomClassName=com.custom.classname -writerTransformationCustomParameters='a=b,c=d' +``` + +If a user wants to write the records filtered via custom transformation to GCS, they can use the command below: +``` +go run reverse-replication-runner.go -projectId= -dataflowRegion= -instanceId= -dbName= -sourceShardsFilePath=gs://bucket-name/shards.json -sessionFilePath=gs://bucket-name/session.json -gcsPath=gs://bucket-name/ -writerTransformationCustomJarPath=gs://bucket-name/custom.jar -writerTransformationCustomClassName=com.custom.classname -writerTransformationCustomParameters='a=b,c=d' -writeFilteredEventsToGcs +``` + +The sample reader job gcloud command for the same + +``` +gcloud dataflow flex-template run smt-reverse-replication-writer-2024-04-17t08-00-37z-8dac-d95f --project= --region= --template-file-gcs-location=