diff --git a/map-reduce/README.md b/map-reduce/README.md
new file mode 100644
index 000000000000..6187608fa2be
--- /dev/null
+++ b/map-reduce/README.md
@@ -0,0 +1,124 @@
+---
+title: "MapReduce Pattern in Java"
+shortTitle: MapReduce
+description: "Learn the MapReduce pattern in Java with real-world examples, class diagrams, and tutorials. Understand its intent, applicability, benefits, and known uses to enhance your design pattern knowledge."
+category: Performance optimization
+language: en
+tag:
+  - Data processing
+  - Code simplification
+  - Delegation
+  - Performance
+---
+
+# MapReduce in Java
+
+## Intent of MapReduce Pattern
+
+The MapReduce design pattern simplifies the processing of large-scale data sets by breaking complex tasks down into smaller, manageable units of work. It leverages parallel processing to increase efficiency, scalability, and fault tolerance across distributed systems.
+
+## Detailed Explanation of MapReduce with Real-World Examples
+
+1. Map phase: The first step of the process. It takes the input data (often unstructured or semi-structured) and transforms it into intermediate key-value pairs. Each data element is processed independently and converted into one or more key-value pairs.
+2. Reduce phase: Aggregates the intermediate key-value pairs produced in the Map phase. All values associated with the same key are passed to the Reduce function, which processes them to produce the final output.
+
+Real-world example
+
+> Imagine you work for a company that processes millions of customer reviews, and you want to find out how often each word is used across all the reviews. Doing this on one machine would take a long time, so you use a MapReduce process to speed it up by distributing the work across multiple computers.
+
+In plain words
+
+> By abstracting the distribution and coordination of these tasks, MapReduce lets developers focus on the processing logic rather than the complexities of managing concurrency.
+
+Wikipedia says
+
+> MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
+
+## Programmatic Example of MapReduce Pattern in Java
+
+Let's start with the map method:
+
+```java
+public static List<Map.Entry<String, Integer>> map(List<String> sentences) {
+  List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
+  for (String sentence : sentences) {
+    String[] words = sentence.split("\\s+");
+    for (String word : words) {
+      mapped.add(new AbstractMap.SimpleEntry<>(word.toLowerCase(), 1));
+    }
+  }
+  return mapped;
+}
+```
+
+The purpose of the map function is to process the input data and convert it into key-value pairs.
+
+Now let's move on to the reduce function:
+
+```java
+public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> mappedWords) {
+  Map<String, Integer> reduced = new HashMap<>();
+  for (Map.Entry<String, Integer> entry : mappedWords) {
+    reduced.merge(entry.getKey(), entry.getValue(), Integer::sum);
+  }
+  return reduced;
+}
+```
+
+In the Reduce phase, the grouped key-value pairs are processed to combine or summarize the data: all values for each key are aggregated to produce the final result.
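+
+For comparison, the same word count can be sketched in a single JVM with the Java Streams API. This is only an illustrative, non-distributed equivalent of the map/group/reduce flow (it assumes the `sentences` list used in the example below, plus the usual `java.util`, `java.util.function`, and `java.util.stream` imports); it is not part of the pattern's distributed machinery:
+
+```java
+Map<String, Long> wordCounts = sentences.stream()
+    // "Map" step: emit one lowercase word per whitespace-separated token
+    .flatMap(sentence -> Arrays.stream(sentence.split("\\s+")))
+    .map(String::toLowerCase)
+    // "Reduce" step: group identical words and count them
+    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
+```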
+
+Running the algorithm with the following input:
+
+```java
+List<String> sentences = Arrays.asList(
+    "hello world",
+    "hello java java",
+    "map reduce pattern in java",
+    "world of java map reduce"
+);
+```
+
+produces console output like this (the ordering may vary, since a `HashMap` does not preserve insertion order):
+
+```
+reduce: 2
+java: 4
+world: 2
+in: 1
+of: 1
+pattern: 1
+hello: 2
+map: 2
+```
+
+## When to Use the MapReduce Pattern in Java
+
+Use the MapReduce pattern when:
+
+* You are working with massive datasets that can't fit on a single machine.
+* You need a batch-processing solution to analyze or transform large volumes of data.
+* Fault tolerance and high availability are important for your use case.
+
+## MapReduce Pattern Java Tutorials
+
+* [MapReduce Algorithm (Baeldung)](https://www.baeldung.com/cs/mapreduce-algorithm)
+* [MapReduce Tutorial (javatpoint)](https://www.javatpoint.com/mapreduce)
+
+## Real-World Applications of MapReduce Pattern in Java
+
+MapReduce breaks complex processing tasks into smaller, manageable parts, allowing efficient parallel computation across distributed systems. Well-known implementations of the model include Google's original MapReduce infrastructure and Apache Hadoop's MapReduce framework.
+
+## Benefits of MapReduce Pattern
+
+1. Developers only need to define two simple functions, Map and Reduce. They don't have to worry about how tasks are distributed, how failures are handled, or how the data is split.
+2. MapReduce scales horizontally: you can add more machines (nodes) to the cluster to handle larger data sets or process more tasks in parallel, which makes it well suited to datasets in the order of terabytes or petabytes (a simplified single-machine sketch of this fan-out appears after the trade-offs list below).
+3. The system is designed to handle failures. If a worker node crashes or a task fails, the task is automatically retried on another available node without developer intervention.
+
+## Trade-offs of MapReduce Pattern
+
+1. MapReduce is designed for batch processing, so it is not suitable for real-time data processing or low-latency jobs. Splitting the data, shuffling it between nodes, and reducing it can introduce significant delays.
+2. If the dataset is small or fits in memory on a single machine, MapReduce is overkill: the overhead of distributing tasks across nodes outweighs the benefits, and single-node solutions such as a relational database or an in-memory processing tool are usually a better fit.
+3. MapReduce frequently reads from and writes to disk (between the Map and Reduce phases), which can become a performance bottleneck. This is particularly problematic for jobs with lots of intermediate data.
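+
+To give a flavour of the fan-out mentioned above, here is a minimal single-machine sketch that runs the map phase of the example in parallel with an `ExecutorService` and then reduces the combined results. It assumes the `map` and `reduce` methods shown earlier, the usual `java.util` and `java.util.concurrent` imports, and a hypothetical helper name `countWordsInParallel`; it is a simplification for illustration, not a distributed implementation:
+
+```java
+static Map<String, Integer> countWordsInParallel(List<String> sentences) throws Exception {
+  ExecutorService pool = Executors.newFixedThreadPool(4);
+  try {
+    // Submit one "map task" per sentence
+    List<Future<List<Map.Entry<String, Integer>>>> futures = new ArrayList<>();
+    for (String sentence : sentences) {
+      futures.add(pool.submit(() -> map(List.of(sentence))));
+    }
+    // Collect the intermediate key-value pairs from every task
+    List<Map.Entry<String, Integer>> mappedWords = new ArrayList<>();
+    for (Future<List<Map.Entry<String, Integer>>> future : futures) {
+      mappedWords.addAll(future.get());
+    }
+    // Reduce exactly as in the sequential version
+    return reduce(mappedWords);
+  } finally {
+    pool.shutdown();
+  }
+}
+```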
+
+## References and Credits
+
+* [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/ref=sr_1_1?s=books&sr=1-1)
+* [MapReduce Design Patterns](https://www.amazon.com/MapReduce-Design-Patterns-Effective-Algorithms/dp/1449327176/ref=sr_1_1?crid=3N6I3219DQBM&dib=eyJ2IjoiMSJ9.v6J5LaH30wtWyGQ7t20oSWIhd3rZs9GOaU3r-fSfZbd11rwjP0d0lL4tdcsD_yMt-WY6-XDWWakgkvMv38W9YD7CZDIgJ1G-LuazC8rNILObJBIRg09-7-ugQHZbtkqZFEt1ZCyFiDV4E3Iq2Db41vOpjbrU_B-phwzNQoRU175m1i-WvzTdcWL5GwVcbIWClmYB99kszZ1wX76nfjfq9YUHAFZtlpvLNMavBY4KTjI.QhcDrdrN5Bdd5ZVRTf9cZw0lAXNX83ncVVws8UbVDKU&dib_tag=se&keywords=MapReduce+Design+Patterns&qid=1728522338&s=books&sprefix=mapreduce+design+patterns%2Cstripbooks-intl-ship%2C198&sr=1-1)
\ No newline at end of file
diff --git a/map-reduce/etc/map-reduce.png b/map-reduce/etc/map-reduce.png
new file mode 100644
index 000000000000..fee76e22b311
Binary files /dev/null and b/map-reduce/etc/map-reduce.png differ
diff --git a/map-reduce/etc/model-view-intent.urm.puml b/map-reduce/etc/model-view-intent.urm.puml
new file mode 100644
index 000000000000..f9f3533121b7
--- /dev/null
+++ b/map-reduce/etc/model-view-intent.urm.puml
@@ -0,0 +1,10 @@
+@startuml
+package com.iluwatar.mapreduce {
+
+  class App {
+    + App()
+    + main(args : String[]) {static}
+    + map(sentences : List<String>) : List<Map.Entry<String, Integer>> {static}
+    + reduce(mappedWords : List<Map.Entry<String, Integer>>) : Map<String, Integer> {static}
+  }
+}
+@enduml
\ No newline at end of file
diff --git a/map-reduce/pom.xml b/map-reduce/pom.xml
new file mode 100644
index 000000000000..43ee4bf78fd7
--- /dev/null
+++ b/map-reduce/pom.xml
@@ -0,0 +1,67 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <parent>
+    <artifactId>java-design-patterns</artifactId>
+    <groupId>com.iluwatar</groupId>
+    <version>1.26.0-SNAPSHOT</version>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+  <artifactId>map-reduce</artifactId>
+  <dependencies>
+    <dependency>
+      <groupId>org.junit.jupiter</groupId>
+      <artifactId>junit-jupiter-engine</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.hamcrest</groupId>
+      <artifactId>hamcrest-core</artifactId>
+      <scope>test</scope>
+    </dependency>
+  </dependencies>
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-assembly-plugin</artifactId>
+        <executions>
+          <execution>
+            <configuration>
+              <archive>
+                <manifest>
+                  <mainClass>com.iluwatar.mapreduce.App</mainClass>
+                </manifest>
+              </archive>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
+    </plugins>
+  </build>
+</project>
diff --git a/map-reduce/src/main/java/com/iluwatar/mapreduce/App.java b/map-reduce/src/main/java/com/iluwatar/mapreduce/App.java
new file mode 100644
index 000000000000..117247834083
--- /dev/null
+++ b/map-reduce/src/main/java/com/iluwatar/mapreduce/App.java
@@ -0,0 +1,80 @@
+package com.iluwatar.mapreduce;
+
+import java.util.AbstractMap;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import lombok.extern.slf4j.Slf4j;
+
+/**
+ * The main intent of the MapReduce design pattern is to allow for the processing of large data
+ * sets with a distributed algorithm, minimizing the overall time of computation by exploiting
+ * various parallel computing nodes. This design pattern simplifies the complexity of concurrency
+ * and hides the details of data distribution, fault tolerance, and load balancing, making it an
+ * effective model for processing vast amounts of data.
+ */
+@Slf4j
+public class App {
+
+  /**
+   * Program entry point.
+   *
+   * @param args command line args
+   */
+  public static void main(String[] args) {
+
+    // Sample input: list of sentences
+    List<String> sentences = Arrays.asList(
+        "hello world",
+        "hello java java",
+        "map reduce pattern in java",
+        "world of java map reduce"
+    );
+
+    // Step 1: Map phase
+    List<Map.Entry<String, Integer>> mappedWords = map(sentences);
+
+    // Step 2: Reduce phase
+    Map<String, Integer> wordCounts = reduce(mappedWords);
+
+    // Step 3: Output the final result
+    wordCounts.forEach((word, count) -> LOGGER.info("{}: {}", word, count));
+  }
+
+  /**
+   * The map function processes a list of input data and produces key-value pairs.
+ * + * @param sentences The input data to be processed by the map function. + * @return A List of maps entries containing keys (e.g., words) and their occurrences. + */ + public static List> map(List sentences) { + List> mapped = new ArrayList<>(); + for (String sentence : sentences) { + // Split the sentence into words using whitespace as a delimiter + String[] words = sentence.split("\\s+"); + for (String word : words) { + // Create a key-value pair where the key is the word and the value is 1 + mapped.add(new AbstractMap.SimpleEntry<>(word.toLowerCase(), 1)); + } + } + return mapped; + } + + /** + * The reduce function processes the grouped data and aggregates the values + * (e.g., sums up the occurrences for each word). + * + * @param mappedWords A List of maps where each key has a list of associated values. + * @return A final map with each key and its aggregated result. + */ + public static Map reduce(List> mappedWords) { + Map reduced = new HashMap<>(); + for (Map.Entry entry : mappedWords) { + // If the word is already in the map, increment the count, otherwise set it to 1 + reduced.merge(entry.getKey(), entry.getValue(), Integer::sum); + } + return reduced; + } + +} \ No newline at end of file diff --git a/map-reduce/src/test/java/com/iluwatar/AppTest.java b/map-reduce/src/test/java/com/iluwatar/AppTest.java new file mode 100644 index 000000000000..a6ec19f1cd88 --- /dev/null +++ b/map-reduce/src/test/java/com/iluwatar/AppTest.java @@ -0,0 +1,103 @@ +/* + * This project is licensed under the MIT license. Module model-view-viewmodel is using ZK framework licensed under LGPL (see lgpl-3.0.txt). + * + * The MIT License + * Copyright © 2014-2022 Ilkka Seppälä + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. 
+ */ +package com.iluwatar; + +import com.iluwatar.mapreduce.App; +import org.junit.jupiter.api.Test; + +import java.util.AbstractMap; +import java.util.Arrays; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class AppTest { + + @Test + void testMap_singleSentence() { + + List sentences = Arrays.asList("hello world"); + List> result = App.map(sentences); + + // Assert + assertEquals(2, result.size(), "Should have 2 entries"); + assertEquals("hello", result.get(0).getKey()); + assertEquals(1, result.get(0).getValue()); + assertEquals("world", result.get(1).getKey()); + assertEquals(1, result.get(1).getValue()); + } + + @Test + void testMap_multipleSentences() { + + List sentences = Arrays.asList("hello world", "hello java"); + List> result = App.map(sentences); + + // Assert + assertEquals(4, result.size(), "Should have 4 entries (2 words per sentence)"); + assertEquals("hello", result.get(0).getKey()); + assertEquals(1, result.get(0).getValue()); + assertEquals("world", result.get(1).getKey()); + assertEquals(1, result.get(1).getValue()); + assertEquals("hello", result.get(2).getKey()); + assertEquals(1, result.get(2).getValue()); + assertEquals("java", result.get(3).getKey()); + assertEquals(1, result.get(3).getValue()); + } + + @Test + void testReduce_singleWordMultipleEntries() { + + List> mappedWords = Arrays.asList( + new AbstractMap.SimpleEntry<>("hello", 1), + new AbstractMap.SimpleEntry<>("hello", 1) + ); + Map result = App.reduce(mappedWords); + + // Assert + assertEquals(1, result.size(), "Should only contain one unique word"); + assertEquals(2, result.get("hello"), "The count of 'hello' should be 2"); + } + + @Test + void testReduce_multipleWords() { + List> mappedWords = Arrays.asList( + new AbstractMap.SimpleEntry<>("hello", 1), + new AbstractMap.SimpleEntry<>("world", 1), + new AbstractMap.SimpleEntry<>("hello", 1), + new AbstractMap.SimpleEntry<>("java", 1) + ); + Map result = App.reduce(mappedWords); + + // Assert + assertEquals(3, result.size(), "Should contain 3 unique words"); + assertEquals(2, result.get("hello"), "The count of 'hello' should be 2"); + assertEquals(1, result.get("world"), "The count of 'world' should be 1"); + assertEquals(1, result.get("java"), "The count of 'java' should be 1"); + } + +} + diff --git a/pom.xml b/pom.xml index 39a36ce197e3..c4506f5070eb 100644 --- a/pom.xml +++ b/pom.xml @@ -217,6 +217,7 @@ virtual-proxy function-composition microservices-distributed-tracing + map-reduce