sjjpo2002/Zipf_Law

Testing Zipf's Law on Wikipedia Corpus

Word counting is a fundamental task in Natural Language Processing (NLP). Zipf's Law is the empirical observation that, in a natural language corpus, the frequency of a word is approximately inversely proportional to its rank in the frequency table. The most frequent word therefore appears roughly twice as often as the second most frequent word, three times as often as the third, and so on.
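To make the relationship concrete, here is a minimal sketch of the idealized law, f(r) = f(1) / r, together with a quick check: if a set of counts follows Zipf's Law, then rank times frequency should stay roughly constant. The function names and the toy counts are illustrative assumptions, not taken from this repository.

```python
from typing import List


def zipf_expected(top_count: int, n_ranks: int) -> List[float]:
    """Ideal Zipf frequencies f(r) = f(1) / r for ranks 1..n_ranks."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


def rank_frequency_products(counts: List[int]) -> List[int]:
    """Return rank * frequency for each count, ranked in descending order.

    Under an exact Zipf distribution every product equals the top count.
    """
    ordered = sorted(counts, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]


# Made-up counts chosen to follow the law exactly (not real Wikipedia data).
toy_counts = [1200, 600, 400, 300, 240]

print(zipf_expected(1200, 5))
print(rank_frequency_products(toy_counts))
```

On real text the products drift rather than staying exactly constant, so the check is best read as a rough diagnostic.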

The collection of Wikipedia articles is a good dataset for testing this claim because of its size and the breadth of its content. Using Apache Spark, we can process this large corpus efficiently, then identify and rank the 20 most frequent words to see how closely their counts follow Zipf's Law.
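The counting step can be sketched on a single machine with the Python standard library. This is a stand-in for the Spark job, not the repository's code: the tokenizer regex, the function names, and the sample lines are all assumptions made for illustration. The structure mirrors what a distributed word count does: split lines into words, count each word, and take the top N.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed tokenization rule: lowercase runs of letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def tokenize(line: str) -> List[str]:
    """Lowercase a line and split it into word tokens."""
    return TOKEN.findall(line.lower())


def top_words(lines: Iterable[str], n: int = 20) -> List[Tuple[str, int]]:
    """Count every token across all lines and return the n most frequent."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(n)


# Tiny made-up sample; a real run would stream lines from a Wikipedia dump.
sample = [
    "The cat and the dog",
    "the dog and the bird",
    "The end",
]

for rank, (word, count) in enumerate(top_words(sample, n=3), start=1):
    # rank * count should be roughly constant if the counts follow Zipf's Law
    print(rank, word, count, rank * count)
```

A corpus the size of Wikipedia would not be processed this way in practice; the same split, count, and rank steps are what a Spark job distributes across workers.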
