Support AWS Kinesis Data Streams as a Source #1082

Closed
dlvenable opened this issue Feb 22, 2022 · 4 comments · Fixed by #4836, #5029, #5034, #5036 or #5046
Labels
enhancement New feature or request plugin - source A plugin to receive data from a service or location. Roadmap:Observability/Log Analytics Project-wide roadmap label

@dlvenable
Member

dlvenable commented Feb 22, 2022

Is your feature request related to a problem? Please describe.

Some pipeline authors want to retrieve events from Amazon Kinesis Data Streams.

Describe the solution you'd like

Create a kinesis_data_streams source plugin. The Kinesis Client Library (KCL) can handle most of the client-side work, so I propose that the Data Prepper source use KCL to read from Kinesis.

KCL uses DynamoDB to coordinate consumers. Because KCL uses DynamoDB and Kinesis presumes an AWS account anyway, I propose that Data Prepper uses DynamoDB for consumer coordination.
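To illustrate what "coordinating consumers" through a table means, here is a minimal sketch in Python. It is a toy model only: `LeaseTable` and `claim_round` are hypothetical names, the in-memory dict stands in for DynamoDB, and the real KCL lease schema and balancing logic are more involved. The idea it shows is that each shard has one lease, a worker claims a lease with a conditional write, and only the owner may checkpoint.

```python
class LeaseTable:
    """In-memory stand-in for the coordination table (not the real KCL schema)."""

    def __init__(self, shard_ids):
        # One lease per shard: current owner and last checkpointed sequence number.
        self.leases = {s: {"owner": None, "checkpoint": None} for s in shard_ids}

    def try_claim(self, shard_id, worker_id):
        # Mimics a conditional write: succeeds only if the lease is unowned.
        lease = self.leases[shard_id]
        if lease["owner"] is None:
            lease["owner"] = worker_id
            return True
        return False

    def checkpoint(self, shard_id, worker_id, sequence_number):
        # Only the lease owner may advance the checkpoint.
        lease = self.leases[shard_id]
        if lease["owner"] != worker_id:
            raise PermissionError(f"{worker_id} does not own {shard_id}")
        lease["checkpoint"] = sequence_number


def claim_round(table, worker_ids, max_per_worker):
    """Each worker tries to claim leases until it holds max_per_worker shards."""
    owned = {w: [] for w in worker_ids}
    for worker in worker_ids:
        for shard_id in table.leases:
            if len(owned[worker]) >= max_per_worker:
                break
            if table.try_claim(shard_id, worker):
                owned[worker].append(shard_id)
    return owned
```

With four shards and two Data Prepper nodes, `claim_round(table, ["node-a", "node-b"], 2)` gives each node two shards, and no shard is read by both.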

Data Prepper should support configuring the AWS resources that KCL needs and access to those resources, as well as the Kinesis stream name.

Example configuration:

source:
  kinesis_data_streams:
    stream_name: MyStream
    coordination_table_name: MyDynamoDbTable

Additional context

https://javadoc.io/doc/software.amazon.kinesis/amazon-kinesis-client/latest/index.html

@dlvenable dlvenable added enhancement New feature or request plugin - source A plugin to receive data from a service or location. labels Mar 21, 2022
@daixba
Contributor

daixba commented Jun 13, 2023

I would like to propose the design below for Kinesis Data Streams (KDS) as a source:

pipeline:
  source:
    kinesis:
      streams:
        - stream_name: "test-stream1"
          initial_position: "TRIM_HORIZON"
          consumer_strategy: "Polling"
        - stream_name: "test-stream2"
          initial_position: "LATEST"
          consumer_strategy: "Enhanced-Fan-Out"
      application_name: "my-app"
      aws:
        region: "us-west-2"
        sts_role_arn: "xxx"   

Here are the configuration options:

  • stream_name - (required) - The Kinesis Data Stream name to be consumed. A single pipeline can consume data from multiple streams.
  • initial_position - (optional) - Supported values: LATEST, TRIM_HORIZON. Defaults to LATEST.
  • consumer_strategy - (optional) - Supported values: Polling, Enhanced-Fan-Out. Defaults to Polling.
  • application_name - (required) - Consumer client application name. This name will also be used as the KCL coordination table name.
  • aws - (required) - AWS authentication settings used to read from the streams. All streams must be in the same AWS region.
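The options above could be validated roughly as follows. This is a minimal sketch in Python, not Data Prepper code: `StreamConfig`, `KinesisSourceConfig`, and `parse_kinesis_source` are hypothetical names, and the input is assumed to be the `kinesis` block already parsed from YAML into a dict. Treating `streams` as required (at least one entry) is also an assumption.

```python
from dataclasses import dataclass

# Allowed values taken from the option list in this proposal.
INITIAL_POSITIONS = {"LATEST", "TRIM_HORIZON"}
CONSUMER_STRATEGIES = {"Polling", "Enhanced-Fan-Out"}


@dataclass
class StreamConfig:
    stream_name: str
    initial_position: str = "LATEST"    # proposed default
    consumer_strategy: str = "Polling"  # proposed default

    def __post_init__(self):
        if not self.stream_name:
            raise ValueError("stream_name is required")
        if self.initial_position not in INITIAL_POSITIONS:
            raise ValueError(f"unsupported initial_position: {self.initial_position}")
        if self.consumer_strategy not in CONSUMER_STRATEGIES:
            raise ValueError(f"unsupported consumer_strategy: {self.consumer_strategy}")


@dataclass
class KinesisSourceConfig:
    application_name: str
    aws: dict
    streams: list

    @property
    def coordination_table_name(self):
        # Per the proposal, the application name doubles as the KCL table name.
        return self.application_name


def parse_kinesis_source(raw):
    """Hypothetical helper: validate the `kinesis` block of a parsed pipeline."""
    for key in ("application_name", "aws", "streams"):
        if not raw.get(key):
            raise ValueError(f"{key} is required")
    # A missing stream_name becomes "" so StreamConfig reports it as a ValueError.
    streams = [StreamConfig(**{"stream_name": "", **s}) for s in raw["streams"]]
    return KinesisSourceConfig(raw["application_name"], raw["aws"], streams)
```

A stream entry that omits `initial_position` or `consumer_strategy` picks up LATEST and Polling, and an unsupported value such as `EARLIEST` is rejected with a `ValueError`.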

@cameronattard

This would be great for processing logs from CloudWatch Logs.

@dlvenable
Member Author

It would be nice to support multi-tenancy in KCL once awslabs/amazon-kinesis-client#1368 is available.

source:
  kinesis:
    tenancy: multi

The default value for tenancy could be single as this works for most use-cases.

Also, it would be nice to support the KCL configurations for DDB in data-prepper-config.yaml. This would allow multiple sources to share a single KCL library and thus the same DDB table.
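As a rough sketch of how these two ideas might resolve at startup, consider the following. It is Python for illustration only: the function names and the `kinesis_coordination` / `table_name` keys in data-prepper-config.yaml are hypothetical and are not existing Data Prepper settings. The source-level `coordination_table_name` key comes from the original proposal in this issue.

```python
VALID_TENANCY = ("single", "multi")


def resolve_tenancy(source_cfg):
    """Return the tenancy mode for a source, defaulting to 'single'."""
    tenancy = source_cfg.get("tenancy", "single")
    if tenancy not in VALID_TENANCY:
        raise ValueError(f"unsupported tenancy: {tenancy}")
    return tenancy


def resolve_coordination_table(source_cfg, global_cfg):
    """Prefer a table declared globally so that every source shares it."""
    # 'kinesis_coordination' is an assumed key in data-prepper-config.yaml.
    shared = global_cfg.get("kinesis_coordination", {}).get("table_name")
    return shared or source_cfg.get("coordination_table_name")
```

With a global table configured, two sources that name different tables both resolve to the shared one. Without it, each source keeps its own table.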

@sb2k16
Copy link
Member

sb2k16 commented Sep 17, 2024

I would like to work on this issue. Could you please assign this to me?
