---
layout: default
title: kafka
parent: Buffers
grand_parent: Pipelines
nav_order: 80
---

kafka

The kafka buffer buffers data into an Apache Kafka topic. It uses the Kafka topic to persist data while the data is in transit.

The following example shows how to run the Kafka buffer in an HTTP pipeline. It runs against a locally running Kafka cluster.

kafka-buffer-pipeline:
  source:
    http:
  buffer:
    kafka:
      bootstrap_servers: ["localhost:9092"]
      encryption:
        type: none
      topics:
        - name: my-buffer-topic
          group_id: data-prepper
          create_topic: true
  processor:
    - grok:
        match:
          message: [ "%{COMMONAPACHELOG}" ]
  sink:
    - stdout:

Configuration options

Use the following configuration options with the kafka buffer.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| authentication | No | Authentication | Sets the authentication options for both the pipeline and Kafka. For more information, see Authentication. |
| aws | No | AWS | The AWS configuration. For more information, see aws. |
| bootstrap_servers | Yes | String list | The host and port for the initial connection to the Kafka cluster. To connect to multiple Kafka brokers, list the host name or IP address and the port number of each broker. When using Amazon Managed Streaming for Apache Kafka (Amazon MSK) as your Kafka cluster, the bootstrap server information is obtained from Amazon MSK using the Amazon Resource Name (ARN) provided in the configuration. |
| encryption | No | Encryption | The encryption configuration for encryption in transit. For more information, see Encryption. |
| producer_properties | No | Producer Properties | A list of configurable Kafka producer properties. |
| topics | Yes | List | A list of topics for the buffer to use. You must supply one topic per buffer. |
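
As an illustration of the bootstrap_servers and topics options, the following sketch lists two brokers and a single topic. The broker host names are hypothetical placeholders, not values from this documentation.

```yaml
buffer:
  kafka:
    # Hypothetical broker addresses; replace with your own hosts and ports.
    bootstrap_servers:
      - "kafka-broker-1.example.com:9092"
      - "kafka-broker-2.example.com:9092"
    topics:
      # Supply exactly one topic per buffer.
      - name: my-buffer-topic
        group_id: data-prepper
```

Because no encryption block is given, this sketch relies on the default encryption type, ssl.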

topic

The topic option configures a single Kafka topic and tells the kafka buffer how to use that topic.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| name | Yes | String | The name of the Kafka topic. |
| group_id | Yes | String | Sets Kafka's group.id option. |
| workers | No | Integer | The number of multithreaded consumers associated with each topic. Default is 2. The maximum value is 200. |
| encryption_key | No | String | An Advanced Encryption Standard (AES) encryption key used to encrypt and decrypt data within Data Prepper before sending it to Kafka. This value must be plain text or encrypted using AWS Key Management Service (AWS KMS). |
| kms | No | AWS KMS key | When configured, uses an AWS KMS key to encrypt data. See kms for more information. |
| auto_commit | No | Boolean | When false, the consumer offset will not be periodically committed to Kafka in the background. Default is false. |
| commit_interval | No | Integer | When auto_commit is set to true, sets how often, in seconds, the consumer offsets are auto-committed to Kafka through Kafka's auto.commit.interval.ms option. Default is 5s. |
| session_timeout | No | Integer | The timeout used to detect client failures when using Kafka's group management features, which can be used to balance the data stream. Default is 45s. |
| auto_offset_reset | No | String | Automatically resets the offset to the earliest or the latest offset through Kafka's auto.offset.reset option. Default is latest. |
| thread_waiting_time | No | Integer | The amount of time that a thread waits for the preceding thread to complete its task and to signal the next thread. The Kafka consumer API poll timeout value is set to half of this setting. Default is 5s. |
| max_partition_fetch_bytes | No | Integer | Sets the maximum limit, in megabytes, for data returned from each partition through Kafka's max.partition.fetch.bytes setting. Default is 1mb. |
| heart_beat_interval | No | Integer | The expected amount of time between heartbeats to the consumer coordinator when using Kafka's group management facilities through Kafka's heartbeat.interval.ms setting. Default is 5s. |
| fetch_max_wait | No | Integer | The maximum amount of time during which the server blocks a fetch request when there isn't sufficient data to satisfy the fetch_min_bytes requirement through Kafka's fetch.max.wait.ms setting. Default is 500ms. |
| fetch_max_bytes | No | Integer | The maximum amount of data the server returns for a fetch request through Kafka's fetch.max.bytes setting. Default is 50mb. |
| fetch_min_bytes | No | Integer | The minimum amount of data the server returns during a fetch request through Kafka's fetch.min.bytes setting. Default is 1b. |
| retry_backoff | No | Integer | The amount of time to wait before attempting to retry a failed request to a given topic partition through Kafka's retry.backoff.ms setting. Default is 10s. |
| max_poll_interval | No | Integer | The maximum delay between invocations of a poll() when using group management through Kafka's max.poll.interval.ms option. Default is 300s. |
| consumer_max_poll_records | No | Integer | The maximum number of records returned in a single poll() call through Kafka's max.poll.records setting. Default is 500. |
| max_message_bytes | No | Integer | The maximum size of the message, in bytes. Default is 1 MB. |
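
The following sketch shows how a few of these consumer options might be combined on one topic. The value formats (for example, 45s and 300s) are assumptions inferred from the defaults listed in the table, and the specific values are illustrative rather than recommended.

```yaml
buffer:
  kafka:
    bootstrap_servers: ["localhost:9092"]
    encryption:
      type: none
    topics:
      - name: my-buffer-topic
        group_id: data-prepper
        create_topic: true
        # Four consumer threads instead of the default of 2.
        workers: 4
        # Start from the oldest available offset when no committed offset exists.
        auto_offset_reset: earliest
        # Duration format assumed from the documented defaults.
        session_timeout: 45s
        max_poll_interval: 300s
        consumer_max_poll_records: 500
```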

kms

When using AWS KMS, the AWS KMS key can decrypt the encryption_key so that it is not stored in plain text. To configure AWS KMS with the kafka buffer, use the following options.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| key_id | Yes | String | The ID of the AWS KMS key. It may be the full key ARN or a key alias. |
| region | No | String | The AWS Region of the AWS KMS key. |
| sts_role_arn | No | String | The AWS Security Token Service (AWS STS) role ARN to use to access the AWS KMS key. |
| encryption_context | No | Map | When provided, messages sent to the topic will include this map as an AWS KMS encryption context. |
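
A minimal sketch of a topic that combines encryption_key with kms might look like the following. The key alias, role ARN, account ID, Region, and encryption context entry are all hypothetical placeholders, and the encryption_key value stands in for a key that you have already encrypted with AWS KMS.

```yaml
topics:
  - name: my-buffer-topic
    group_id: data-prepper
    # Placeholder for an AES key that was encrypted with the AWS KMS key below.
    encryption_key: "<KMS-encrypted AES key>"
    kms:
      # Hypothetical key alias; a full key ARN can be used instead.
      key_id: alias/example-buffer-key
      region: us-east-1
      # Hypothetical role that is allowed to use the key.
      sts_role_arn: arn:aws:iam::123456789012:role/example-kms-access-role
      # Hypothetical encryption context entry.
      encryption_context:
        pipeline: kafka-buffer-pipeline
```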

Authentication

The following option is required inside the authentication object.

| Option | Type | Description |
| --- | --- | --- |
| sasl | JSON object | The Simple Authentication and Security Layer (SASL) authentication configuration. |

SASL

Use one of the following options when configuring SASL authentication.

| Option | Type | Description |
| --- | --- | --- |
| plaintext | JSON object | The PLAINTEXT authentication configuration. |
| aws_msk_iam | String | The Amazon MSK AWS Identity and Access Management (IAM) configuration. If set to role, the sts_role_arn set in the aws configuration is used. Default is default. |

SASL PLAINTEXT

The following options are required when using the SASL PLAINTEXT protocol.

| Option | Type | Description |
| --- | --- | --- |
| username | String | The username for the PLAINTEXT authentication. |
| password | String | The password for the PLAINTEXT authentication. |
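
Putting the authentication, sasl, and plaintext options together, a configuration might be sketched as follows. The user name and password are hypothetical placeholders, and the nesting follows the option tables above.

```yaml
buffer:
  kafka:
    bootstrap_servers: ["localhost:9092"]
    authentication:
      sasl:
        plaintext:
          # Hypothetical credentials; replace with your own.
          username: example-user
          password: example-password
    topics:
      - name: my-buffer-topic
        group_id: data-prepper
```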

Encryption

Use the following options when setting SSL encryption.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| type | No | String | The encryption type. Use none to disable encryption. Default is ssl. |
| insecure | No | Boolean | A Boolean flag used to turn off SSL certificate verification. If set to true, certificate authority (CA) certificate verification is turned off and insecure HTTP requests are sent. Default is false. |
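
For example, a test cluster that uses a self-signed certificate might be configured as in the following sketch. Setting insecure to true turns off certificate verification, so this is an illustration for non-production use only.

```yaml
buffer:
  kafka:
    bootstrap_servers: ["localhost:9092"]
    encryption:
      # ssl is the default and is shown here only for clarity.
      type: ssl
      # Skips CA certificate verification; avoid this in production.
      insecure: true
    topics:
      - name: my-buffer-topic
        group_id: data-prepper
```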

producer_properties

Use the following configuration options to configure a Kafka producer.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| max_request_size | No | Integer | The maximum size of the request that the producer sends to Kafka. Default is 1 MB. |
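
The following sketch raises the producer request limit to 2 MB. It assumes that max_request_size accepts an integer number of bytes, which is inferred from the Integer type in the table and is not confirmed here.

```yaml
buffer:
  kafka:
    bootstrap_servers: ["localhost:9092"]
    producer_properties:
      # Assumed to be a byte count: 2 * 1024 * 1024 = 2097152.
      max_request_size: 2097152
    topics:
      - name: my-buffer-topic
        group_id: data-prepper
```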

aws

Use the following options when setting up authentication for AWS services.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| region | No | String | The AWS Region to use for credentials. Defaults to the standard SDK behavior for determining the Region. |
| sts_role_arn | No | String | The AWS STS role to assume for requests to Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Storage Service (Amazon S3). Default is null, which will use the standard SDK behavior for credentials. |
| msk | No | JSON object | The Amazon MSK configuration settings. |

msk

Use the following options inside the msk object.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| arn | Yes | String | The Amazon MSK ARN to use. |
| broker_connection_type | No | String | The type of connector to use with the Amazon MSK broker, either public, single_vpc, or multi_vpc. Default is single_vpc. |
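
As a sketch of how the aws, msk, and aws_msk_iam options might fit together for an Amazon MSK cluster, consider the following. The account ID, role name, and cluster ARN are hypothetical placeholders. The example omits bootstrap_servers on the assumption, based on the bootstrap_servers description, that the broker addresses are obtained from Amazon MSK using the ARN. Because the options table marks bootstrap_servers as required, verify this against your Data Prepper version.

```yaml
buffer:
  kafka:
    aws:
      region: us-east-1
      # Hypothetical role; used for IAM authentication because aws_msk_iam is set to role.
      sts_role_arn: arn:aws:iam::123456789012:role/example-msk-access-role
      msk:
        # Hypothetical cluster ARN; replace with the ARN of your MSK cluster.
        arn: arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/00000000-0000-0000-0000-000000000000-0
        broker_connection_type: single_vpc
    authentication:
      sasl:
        aws_msk_iam: role
    topics:
      - name: my-buffer-topic
        group_id: data-prepper
```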