Testing Zipf's Law on Wikipedia Corpus

Word counting is a fundamental task in Natural Language Processing (NLP). Zipf's Law states that, in a natural language corpus, the frequency of a word is approximately inversely proportional to its rank in the frequency table. The most frequent word therefore appears roughly twice as often as the second most frequent word, three times as often as the third, and so on.
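The relationship can be checked on a small scale before moving to a large corpus. The sketch below is illustrative only: `zipf_table` is a hypothetical helper name, and the toy token list is made up so that it follows the law exactly. It compares each word's observed count with the count Zipf's Law would predict from the top word (top count divided by rank).

```python
from collections import Counter


def zipf_table(tokens, top_n=5):
    """Return (rank, word, observed_count, predicted_count) rows.

    predicted_count is the count of the rank-1 word divided by the rank,
    which is what an exact inverse-proportional law would give.
    """
    counts = Counter(tokens).most_common(top_n)
    if not counts:
        return []
    top_count = counts[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(counts, start=1)
    ]


# Toy input (an assumption for illustration, not real corpus data).
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
for row in zipf_table(tokens):
    print(row)
```

On real text the observed and predicted columns will only agree approximately, and the gap between them is what a test of the law would examine.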

To test this, the collection of Wikipedia articles makes a suitable dataset because of its size and diverse content. Using Apache Spark, we can process this large corpus efficiently to identify and rank the 20 most frequent words and check how closely their counts follow Zipf's Law.
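A word count of this kind might be sketched in PySpark as follows. This is not the notebook's actual code. The input path `wikipedia_plain.txt`, the helper names `tokenize`, `top_words`, and `run`, and the simple lowercase-letters tokenizer are all assumptions made for illustration. The sketch also assumes the articles have already been extracted to plain text, one passage per line.

```python
import re

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(line):
    """Split one line of text into lowercase word tokens."""
    return TOKEN_RE.findall(line.lower())


def top_words(spark, path, n=20):
    """Return the n most frequent (word, count) pairs in the text file at path."""
    lines = spark.sparkContext.textFile(path)
    counts = (
        lines.flatMap(tokenize)                 # one record per word
        .map(lambda word: (word, 1))            # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)        # sum the counts per word
    )
    # takeOrdered sorts ascending by key, so negate the count for a descending top-n.
    return counts.takeOrdered(n, key=lambda pair: -pair[1])


def run(path="wikipedia_plain.txt"):
    """Print rank, word, count, and rank * count for the top 20 words.

    The default path is a hypothetical placeholder.
    """
    from pyspark.sql import SparkSession  # imported here so tokenize works without Spark

    spark = SparkSession.builder.appName("zipf-sketch").getOrCreate()
    try:
        for rank, (word, count) in enumerate(top_words(spark, path), start=1):
            print(rank, word, count, rank * count)
    finally:
        spark.stop()
```

Calling `run()` with a real path starts a Spark session and prints the table. If the corpus follows Zipf's Law, the `rank * count` column should stay roughly constant across the 20 rows.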

sjjpo2002/Zipf_Law

Folders and files

Latest commit

History

Repository files navigation

Testing Zipf's Law on Wikipedia Corpus

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages