apachespark

Here are 8 public repositories matching this topic...

martandsingh / ApacheSpark

This repository will help you to learn about databricks concept with the help of examples. It will include all the important topics which we need in our real life experience as a data engineer. We will be using pyspark & sparksql for the development. At the end of the course we also cover few case studies.

sql database spark hive hadoop etl pyspark data-engineering spark-streaming data-analysis databricks datalake spark-sql timetravel apachespark etl-pipeline deltalake

Updated Jul 28, 2024
Python

Thapep / ApacheSpark

Star

Apache Spark project for Advanced Topics on Databases course

databases ntua spark-sql dataframes-api apachespark apachespark-rdd

Updated Mar 19, 2021
Python

ZeroTwoDataRW / DE-Stream-Project-Random-Generated-User-Data

Star

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

python docker airflow kafka cassandra-database apachespark postgesql

Updated Mar 22, 2024
Python

ravishankar324 / Washington-state-electric-vehicles-ETL-pipeline

Star

ETL Datapipeline to process Washington's EV data using Apache Spark, Docker, Snowflake, Airflow, AWS services and visualize the transformed parquet data by creating Tableau Dashboards.

python emr docker airflow ec2 s3 iam snowflake pyspark sparksql tableau apachespark

Updated Aug 24, 2024
Python

Cyang18 / MusicProducer

Star

This is a distributed system that utilizes Apache Spark through Dataproc. We use the Spotify API to send song data to Apache Spark, which then forwards the information to Google Cloud Services. The system processes this data to recommend songs based on the extracted information.

javascript hive apache python3 dataproc-cluster apachespark

Updated Oct 14, 2024
Python

gilga001 / HPCandBigDataPipeline

Star

A published paper in PEARC18: Combining HPC and Big Data Infrastructures in Large-Scale Post-Processing of SimulaBon Data: A Case Study

python simulation hpc bigdata mdtraj postprocessing apachespark

Updated Jul 23, 2018
Python

payamrastogi / SparkCourse

Star

python apachespark

Updated Sep 15, 2017
Python

urvashiforreal / Retail-Data-Analysis

Star

Developed a real-time streaming analytics pipeline using Apache Spark to calculate and store KPIs for e-commerce sales data, including total volume of sales, orders per minute, rate of return, and average transaction size. Used Spark Streaming to read data from Kafka, Spark SQL to calculate KPIs, and Spark DataFrame to write KPIs to JSON files.

sparksql sparkstreaming apachespark sparkdataframe

Updated Oct 15, 2023
Python

Improve this page

Add a description, image, and links to the apachespark topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the apachespark topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

apachespark

Here are 8 public repositories matching this topic...

martandsingh / ApacheSpark

Thapep / ApacheSpark

ZeroTwoDataRW / DE-Stream-Project-Random-Generated-User-Data

ravishankar324 / Washington-state-electric-vehicles-ETL-pipeline

Cyang18 / MusicProducer

gilga001 / HPCandBigDataPipeline

payamrastogi / SparkCourse

urvashiforreal / Retail-Data-Analysis

Improve this page

Add this topic to your repo