Implementation of MapReduce design patterns in Spark (PySpark)
Summarization pattern
- Min, max and count
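
As a rough illustration of the min, max and count summarization, the sketch below uses plain Python so it runs without a cluster. The function names `to_summary` and `merge_summaries` and the sample records are hypothetical, not taken from this repository. Because the merge function is associative and commutative, the same two functions could plausibly be used in PySpark as `rdd.mapValues(to_summary).reduceByKey(merge_summaries)`.

```python
def to_summary(value):
    """Map step: wrap one value as a (min, max, count) tuple."""
    return (value, value, 1)


def merge_summaries(a, b):
    """Reduce step: combine two partial (min, max, count) tuples."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


# Hypothetical (user_id, comment_length) records, invented for illustration.
records = [("u1", 40), ("u2", 12), ("u1", 7), ("u1", 95), ("u2", 30)]

# Local stand-in for a per-key reduce: fold each record into its key's summary.
summaries = {}
for key, value in records:
    partial = to_summary(value)
    if key in summaries:
        summaries[key] = merge_summaries(summaries[key], partial)
    else:
        summaries[key] = partial

print(summaries)  # {'u1': (7, 95, 3), 'u2': (12, 30, 2)}
```

Carrying all three statistics in one tuple means a single pass over the data is enough, and partial results can be merged in any order.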
Filter pattern
- Bloom filter
- Top 10
- Distinct
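
To make the Bloom filter idea concrete, here is a minimal pure-Python sketch. The `BloomFilter` class, its default sizes, and the sample words are assumptions made for illustration and are not this repository's code. In PySpark, a filter like this would typically be built once on the driver, shared with `sc.broadcast(...)`, and then consulted inside `rdd.filter(...)`.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple (not memory-optimal).
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical set of words to keep.
hot_words = BloomFilter()
for word in ["spark", "hadoop", "mapreduce"]:
    hot_words.add(word)

# Local stand-in for rdd.filter(lambda r: bloom.value.might_contain(r)).
records = ["spark", "python", "hadoop", "java"]
kept = [r for r in records if hot_words.might_contain(r)]
print(kept)
```

Every added word is guaranteed to pass the filter. An unrelated word can occasionally pass as well, so a later exact check is needed when false positives matter.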
Data organization pattern
- Structured to hierarchical
- Partitioning
- Binning
- Shuffling
Join pattern
- Map-side join
- Reduce-side join
- Replicated join
- Composite join
- Cartesian join
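
As one example from the join family, a replicated join can be sketched in plain Python as a map-only lookup against a small in-memory table. The `users` and `posts` data and the `replicated_join` helper below are hypothetical and only illustrate the shape of the pattern. In PySpark, the small table would usually be shared with `sc.broadcast(users)` and the helper applied with `posts_rdd.flatMap(...)`.

```python
# Hypothetical small table (fits in memory): user_id -> display name.
users = {1: "alice", 2: "bob"}

# Hypothetical large side: (user_id, post_title) records.
posts = [
    (1, "What is a monad?"),
    (2, "P vs NP"),
    (3, "Orphan post"),
    (1, "Big-O basics"),
]


def replicated_join(record, lookup):
    """Map-only inner join: emit zero or one joined row per record."""
    user_id, title = record
    if user_id in lookup:
        return [(user_id, lookup[user_id], title)]
    return []  # no matching user, so the record is dropped


# Local stand-in for posts_rdd.flatMap(lambda r: replicated_join(r, lookup.value)).
joined = [row for post in posts for row in replicated_join(post, users)]
print(joined)
# [(1, 'alice', 'What is a monad?'), (2, 'bob', 'P vs NP'), (1, 'alice', 'Big-O basics')]
```

Since each record is joined independently, no shuffle of the large side is required. The trade-off is that the small table must fit in every worker's memory.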
Dataset: CS Stack Exchange dataset
Reference: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, by Donald Miner and Adam Shook