
Support DynamoDB as source #2932

Closed
daixba opened this issue Jun 26, 2023 · 2 comments · Fixed by #3349
daixba commented Jun 26, 2023

Is your feature request related to a problem? Please describe.
Customers would like to use Data Prepper to sync the data in Amazon DynamoDB (as source) to destinations such as OpenSearch. The sync will include a full historical data dump and/or incremental change capture.

Describe the solution you'd like
Using DynamoDB as a source will need to support:

  1. Data Export: Use DynamoDB point-in-time export for historical data.
  2. Change Data Capture: This includes, but is not limited to, the use of DynamoDB Streams.

Also, it would be nice to have flexible configuration to support different run types:

  1. Data Export only
  2. Streams only
  3. Data Export (historical) + Streams (CDC)

For DynamoDB data export, the data will be stored in S3, hence it would be nice if this source also triggered the S3 scan job to run.

Note that DynamoDB via Kinesis Data Streams will not be in scope; the Kinesis Data Source should be used for that instead.
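Since the two stream types are easy to confuse, a source plugin could reject Kinesis ARNs up front. Below is a hypothetical check (not part of the proposal) that distinguishes a DynamoDB Streams ARN from a Kinesis Data Streams ARN by its service segment:

```python
# Sketch: DynamoDB Streams ARNs carry the "dynamodb" service segment and a
# "/stream/" path (e.g. arn:aws:dynamodb:...:table/Name/stream/2023-...),
# while Kinesis Data Streams ARNs use the "kinesis" service. The helper name
# is hypothetical, for illustration only.

def is_dynamodb_stream_arn(arn: str) -> bool:
    # Split only on the first five colons so the resource part
    # (which may contain a timestamp with colons) stays intact.
    parts = arn.split(":", 5)
    return len(parts) == 6 and parts[2] == "dynamodb" and "/stream/" in parts[5]

print(is_dynamodb_stream_arn(
    "arn:aws:dynamodb:us-west-2:123456789012:table/table-a/stream/2023-06-26T00:00:00.000"))
# → True
print(is_dynamodb_stream_arn(
    "arn:aws:kinesis:us-west-2:123456789012:stream/my-stream"))
# → False
```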

@asifsmohammed asifsmohammed added plugin - source A plugin to receive data from a service or location. and removed untriaged labels Jun 29, 2023
@daixba daixba mentioned this issue Sep 18, 2023
daixba commented Sep 18, 2023

Below is the proposed usage of this source plugin.

cdc-pipeline:
  source:
    dynamodb:
      tables:
        - table_arn: "arn:aws:dynamodb:us-west-2:123456789012:table/table-a"
          export:
            s3_bucket: "test-bucket"
            s3_prefix: "xxx/"
          stream:
            start_position:  
        - table_arn: "arn:aws:dynamodb:us-west-2:123456789012:table/table-b"
          export:
            s3_bucket: "test-bucket"
            s3_prefix: "xxx/"
        - table_arn: "arn:aws:dynamodb:us-west-2:123456789012:table/table-c"
          stream:
            start_position: "BEGINNING"  
      aws:
        region: "us-west-2"
      coordinator:
        dynamodb:
          table_name: "coordinator-table"
          region: "us-west-2"

In this example, table-a requires both export and stream, table-b only needs export, while table-c only needs change capture.
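The per-table run type implied by this config could be derived with a small helper (a sketch for illustration, not the plugin's actual classes; names are hypothetical):

```python
# Sketch: derive each table's run type (EXPORT_ONLY, STREAM_ONLY, or
# EXPORT_AND_STREAM) from the parsed `tables` section of the pipeline YAML.
# The presence of the `export` / `stream` keys selects the behavior,
# mirroring the three tables in the example above.

def run_type(table_config: dict) -> str:
    has_export = "export" in table_config
    has_stream = "stream" in table_config
    if has_export and has_stream:
        return "EXPORT_AND_STREAM"
    if has_export:
        return "EXPORT_ONLY"
    if has_stream:
        return "STREAM_ONLY"
    raise ValueError("table must configure at least one of export/stream")

tables = [
    {"table_arn": "arn:aws:dynamodb:us-west-2:123456789012:table/table-a",
     "export": {"s3_bucket": "test-bucket"}, "stream": {}},
    {"table_arn": "arn:aws:dynamodb:us-west-2:123456789012:table/table-b",
     "export": {"s3_bucket": "test-bucket"}},
    {"table_arn": "arn:aws:dynamodb:us-west-2:123456789012:table/table-c",
     "stream": {"start_position": "BEGINNING"}},
]
print([run_type(t) for t in tables])
# → ['EXPORT_AND_STREAM', 'EXPORT_ONLY', 'STREAM_ONLY']
```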

daixba commented Sep 18, 2023

Some of the key design points:

  1. Multiple tables are supported in a single pipeline. Customers can then use conditional routing (https://opensearch.org/docs/latest/data-prepper/pipelines/pipelines/#conditional-routing) to route the data to different destinations if needed, such as different indices in OpenSearch.
  2. For tables that require both export and stream, the export must finish before the stream can start, so that data is loaded in the right order. The proposed approach is to start the export job using the time the pipeline was first started, start reading the stream's active shards from the beginning, and send a change event to the buffer only when its timestamp is greater than the export time; otherwise, wait until the export is done.
  3. A custom coordinator should be used to support the different types of coordination tasks, but the backend store can reuse the existing one.
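The timestamp gating in point 2 can be sketched as follows. The function name, record shape, and `export_done` flag are simplified assumptions for illustration, not the plugin's actual implementation:

```python
from datetime import datetime, timezone

# Sketch of design point 2: while the export is still running, stream records
# are held back; once the export has finished, only records created after the
# export time are forwarded, since older changes are already covered by the
# export snapshot.

def gate_stream_records(records, export_time, export_done):
    """Return the records that may be sent to the buffer right now."""
    if not export_done:
        return []  # wait until the export has finished
    return [r for r in records
            if r["approximate_creation_time"] > export_time]

export_time = datetime(2023, 9, 18, 12, 0, tzinfo=timezone.utc)
records = [
    {"key": "a",  # created before the export time: covered by the export
     "approximate_creation_time": datetime(2023, 9, 18, 11, 59, tzinfo=timezone.utc)},
    {"key": "b",  # created after the export time: must come from the stream
     "approximate_creation_time": datetime(2023, 9, 18, 12, 1, tzinfo=timezone.utc)},
]
print([r["key"] for r in gate_stream_records(records, export_time, export_done=True)])
# → ['b']
```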
