pyspark

Here are 3,703 public repositories matching this topic...

aehabV / Indeed-fake-job-posting-prediction

A machine learning model is built using PySpark's MLlib library to automatically flag suspicious job postings on Indeed.com. The dataset includes 18,000 job descriptions, out of which about 800 are fake.

nlp natural-language-processing pyspark indeed pyspark-mllib fake-jobposts-prediction job-postings

Updated May 18, 2023
Jupyter Notebook

basel-ay / Hands-on-Apache-Spark

Dummy code snippets that read and manipulate data and build a simple ML model with PySpark.

apache-spark linear-regression pyspark

Updated Jul 18, 2023
Jupyter Notebook

zuliani99 / All-Pairs-Docs-Similarity

Given a set of documents and a minimum similarity threshold, find the number of document pairs whose similarity exceeds the threshold.

sklearn pyspark tf-idf cosine-similarity document-similarity beir

Updated May 26, 2023
Jupyter Notebook
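
The all-pairs task described in this entry can be illustrated with a minimal sketch. It is not taken from the repository: it uses plain Python rather than PySpark, compares every pair by brute force, and assumes whitespace tokenisation with a smoothed TF-IDF weighting. The names `tfidf_vectors` and `count_similar_pairs` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(token_lists):
    """Return one L2-normalised TF-IDF vector (term -> weight) per document."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch, not the repository's choice).
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def count_similar_pairs(documents, threshold):
    """Count document pairs whose cosine similarity exceeds the threshold."""
    vectors = tfidf_vectors([doc.lower().split() for doc in documents])
    count = 0
    for left, right in combinations(vectors, 2):
        # Vectors are unit length, so the dot product is the cosine similarity.
        similarity = sum(w * right.get(term, 0.0) for term, w in left.items())
        if similarity > threshold:
            count += 1
    return count


docs = [
    "spark makes big data simple",
    "spark makes big data simple",
    "cats sleep all day",
]
print(count_similar_pairs(docs, 0.9))
```

Brute-force comparison grows quadratically with the number of documents, which is why a distributed approach such as the one the entry describes becomes attractive for larger collections.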

JonathanPollyn / Spark

This notebook contains detailed code for Spark, machine learning, and Databricks.

python spark pyspark spark-sql pyspark-python

Updated Mar 15, 2023
Jupyter Notebook

data-miner00 / spark

A laboratory to carry out experiments with PySpark

python pyspark databricks

Updated Nov 5, 2023
Jupyter Notebook

khaledshabasy / Data-Modeling-Spark-udacity-capstone

An ETL pipeline for I94 immigration, global land temperatures and US demographics datasets is created to form an analytics database on immigration events. A data model is established with pandas and pyspark to find patterns of immigration to the United States.

aws s3 pandas pyspark sas7bdat

Updated Apr 6, 2023
Jupyter Notebook

furkancets / PrescreiberPipelineSpark

An attempt at a best-case Apache Spark working environment for robust data pipelines.

spark apache-spark hadoop pyspark

Updated Apr 1, 2023
Python

simonediluna / Distributed-Data-Analysis-and-Mining

An academic project carried out for the Distributed Data Analysis and Mining course (academic year 2022/2023).

distributed-systems data-science pyspark

Updated May 18, 2023
Jupyter Notebook

Ayoub-etoullali / Activites-Pratiques-BigData

MapReduce job development, RDD programming, medical data management, sales analysis, and efficient data integration for big data analysis. Spark: big data processing, Sqoop integration, and Spark Structured Streaming for real-time data.

real-time spark apache-spark pyspark data-integration mapreduce real-time-data sqoop mapreduce-jobs sales-analysis spark-structured-streaming mapreduce-java real-time-database big-data-processing rdds sqoop-export sqoop-import big-data-analysis medical-data-management

Updated Jun 7, 2023
Java

mohankrishna02 / pyspark-transformations

This project demonstrates various data manipulation techniques on Spark dataframes such as reading and processing data from different file formats, applying filters and maps, and creating unified column lists.

python spark pyspark

Updated Apr 24, 2023
Python

milesgranger / pontem

Treat Spark like pandas.

pandas pyspark dataframes dataframe-api spark-dataframes distributed-dataframe

Updated Sep 3, 2017
Python

SreekarJammula / tf-idf-

The assignment is to write Python scripts for Apache Spark. The tasks are divided into three parts: WordCount, to count the occurrences of words on a per-book basis and compare the results with those of Assignment 1; pyspark.ml.feature, to compute the TF-IDF values for unigrams and bigrams using the pyspark.ml.feat…

apache-spark pyspark tf-idf spark-ml