colinlouie/streaming-kafka-reader-parquet-writer

Streaming Kafka reader, Parquet writer.

This Apache Spark Streaming job reads from Kafka at time-based intervals and writes the results to Parquet files for offline processing.

This was tested on Apache Spark 2.4.4 and AWS MSK (Amazon Managed Streaming for Apache Kafka).

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
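For orientation, the core of such a job looks roughly like the sketch below. The broker list, topic name, output and checkpoint paths, and trigger interval are illustrative placeholders, not this repository's actual configuration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object StreamingKafkaReaderParquetWriter {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("StreamingKafkaReaderParquetWriter")
          .getOrCreate()

        // Read from Kafka as a streaming source. Point the broker list
        // at your cluster (e.g. the AWS MSK bootstrap servers).
        val kafkaStream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder
          .option("subscribe", "my-topic")                                // placeholder
          .option("startingOffsets", "latest")
          .load()

        // Kafka delivers key/value as binary; cast to strings before writing.
        val records = kafkaStream
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

        // Write Parquet files on a time-based trigger. A checkpoint
        // location is required for streaming file sinks.
        val query = records.writeStream
          .format("parquet")
          .option("path", "/data/parquet-out")               // placeholder
          .option("checkpointLocation", "/data/checkpoints") // placeholder
          .trigger(Trigger.ProcessingTime("60 seconds"))
          .start()

        query.awaitTermination()
      }
    }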

How to run.

Create the fat (uber) JAR.

shell$ sbt assembly
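The assembly task comes from the sbt-assembly plugin, so the build needs it on the plugin classpath. A minimal setup might look like the following sketch; the artifact name and library versions are inferred from the spark-submit command below, and the plugin version is an assumption.

    // project/plugins.sbt -- provides the `assembly` task.
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10") // version is an assumption

    // build.sbt -- a sketch, not this repository's actual build file.
    name := "StreamingKafkaReaderParquetWriter"
    version := "1.0.0"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      // "provided" keeps Spark out of the fat JAR; the cluster supplies it,
      // and the Kafka connector is pulled in at runtime via --packages.
      "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.4" % "provided"
    )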

Run the Spark job.

shell$ spark-submit --driver-memory 8g \
    --class StreamingKafkaReaderParquetWriter \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
    target/scala-2.11/StreamingKafkaReaderParquetWriter-assembly-1.0.0.jar
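Once the job is running, the Parquet output can be read back from spark-shell for offline processing. The path below is a placeholder for wherever the job is configured to write.

    // Inspect the files the streaming job has written so far.
    val df = spark.read.parquet("/data/parquet-out") // placeholder path
    df.printSchema()
    df.show(10, truncate = false)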

Why?

There are plenty of examples of reading from Kafka and writing to Parquet. So why write this one? Most examples are given in the context of spark-shell, or assume you already know the Apache Spark ecosystem well. This is a fully working example (once you configure it) that should compile and execute without a hitch.

About

Kafka, Apache Spark, Spark Streaming, Parquet.
