My BSc thesis in Computer Science - IT Analyst written at Jagiellonian University in 2019.
The paper outlines common properties of networks, formalizes the concept of community detection by introducing a measure known as modularity and provides descriptions of the most popular modularity maximization algorithms, and their pseudocodes. Finally, it includes multiple tests and their analysis, showcasing how different algorithms compare in terms of execution time and the quality of partitions detected. The paper also exemplifies the existence of a resolution limit - the fact that modularity maximization algorithms fail to detect communities smaller than a certain scale. Lastly, it includes a thorough analysis of how the vertex ordering of the network impacts the execution time of the Blondel et al. algorithm and provides a useful heuristic for improving it.
This project consists of:
- The thesis itself
- The implementations of algorithms described in it, which include:
- the Blondel et al. algorithm
- the Clauset-Newman-Moore algorithm
- the Girvan-Newman algorithm
- the hierarchical clustering algorithm
- The tests allowing for comparison of the algorithms' execution time and accuracy in finding proper community decomposition of networks. Those include:
- randomized tests - run on networks generated in accordance with the Girvan-Newman (paper) and the Toivonen et al. models (paper)
- real-world tests - run on real-world networks taken from the Mark Newman's website and the Stanford Large Networks Dataset Collection
- resolution limit test - exemplifying the existence of a so-called resolution limit in community detection
In order to test the algorithms, use Python 3 to run the scripts located in subdirectories of networks/test directory. They do not take any arguments but some require the networkx library to load contents of .gml files with network data. The descriptions of what the scripts do is provided in comments to respective test files.
The algorithms can also be run directly by executing corresponding functions in each of the four files in networks/algorithms directory. They each take two arguments - a list of integers from 0 to n representing nodes of the network and a list of pairs of integers from 0 to n representing its edges. Each of the algorithms returns two values: a list of lists representing communities found by the algorithm and the modularity value of the community partition found.