Word count is a fundamental task in Natural Language Processing (NLP). Zipf's Law posits that within any natural language corpus, the frequency of any word is inversely proportional to its rank in the frequency table. This means the most frequent word appears roughly twice as often as the second most frequent word, three times as often as the third, and so on.
To test this theory, Wikipedia articles make an ideal dataset, given their vast and diverse content. Using Apache Spark, we can efficiently process this large corpus to identify and rank the 20 most frequent words, giving a practical demonstration of Zipf's Law.
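As a minimal sketch of that pipeline, the PySpark snippet below assumes the corpus is available as a plain-text file at a hypothetical path `wikipedia_articles.txt`; the tokenization (lowercasing and splitting on non-letter characters) is deliberately simple and would likely need refinement for a real analysis.

```python
import re

from pyspark.sql import SparkSession

# Hypothetical input path; replace with the actual location of the corpus.
INPUT_PATH = "wikipedia_articles.txt"

spark = SparkSession.builder.appName("ZipfWordCount").getOrCreate()

lines = spark.sparkContext.textFile(INPUT_PATH)

top20 = (
    lines
    # Lowercase each line and split on runs of non-letter characters.
    .flatMap(lambda line: re.split(r"[^a-z]+", line.lower()))
    .filter(lambda word: word)                    # drop empty tokens
    .map(lambda word: (word, 1))                  # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)              # sum counts per word
    .takeOrdered(20, key=lambda pair: -pair[1])   # 20 most frequent words
)

# Print rank, word, and count; under Zipf's Law the counts should fall
# off roughly in proportion to 1/rank.
for rank, (word, count) in enumerate(top20, start=1):
    print(f"{rank:2d}. {word:<15} {count}")

spark.stop()
```

Using `reduceByKey` followed by `takeOrdered` keeps the aggregation distributed and only brings the final 20 `(word, count)` pairs back to the driver, which is what makes this approach practical for a corpus as large as Wikipedia.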