Rank List Fusion using multiple Text Representations Strategies for Authorship Clustering.
This repository contains files related to Margueron's Authorship clustering master thesis (2021).
This study aim to select different methods, text representations and distance metrics to implement and evaluate different strategies to solve the authorship clustering problem.
In this case, having a set of
To represent the underlying style of each document various text representation have been suggested: words frequencies, lemmas frequencies, letters
In a first part, the rank lists obtained by each individual representation are used to solve the authorship clustering task. From a rank list, an automatic clustering algorithm can generate the corresponding clusters, hopefully regrouping all the texts written by a given author under the same cluster. The clustering task is achieved by using three different models based on hierarchical clustering. A Silhouette score based model, a distribution-based model and a regression-based clustering models are proposed and evaluated. The clustering obtained by these models are close to the best achievable clustering for the rank lists used. In a second part, a new rank list is created by fusing rank lists obtained using different strategies. Two approaches are explored for the fusion: one use Z-Score normalization and another one use logistic regression. The rank list obtained by fusion is shown to have better performances than the best individual rank list. In addition to the fusion, two veto strategy are proposed to try to enhance the fusion quality. The veto does not show any significant improvements with the rank lists used. Lastly, we showed that using rank lists obtained by fusion yield better clustering results than the ones obtained by individual methods.
.
├── README.md | This document
├── .githooks | Hook to avoid commiting Jupyter notebook outputs
├── code | Python source code for the experiments realized
│ ├── corpus | Corpora used
│ ├── *.py | Proposed implemented methods
│ ├── experiments.ipynb | Experiments source code
│ └── requirement.txt | Requirement files for external Python packages
└── report | Thesis report latex source code
- Python >= 3.8.8 : Programming language
- Pip >= 21.1.3 : Python packet manager
- the packages in requirements.txt : See Installation
cd code
pip install -r requirements.txt
The experiments (runnable scripts) were written in a Jupyter notebook file format.
Jupyter is installed with the requirements, if you do not already have it. After installing it, you can simply run, for example, JupyterLab, a web IDE for Jupyter notebooks, with the command below.
cd code
jupyter lab
This repository history may be not accurate. This is due to a history rewrite that happened on July the 12th. The repository was private until this date.
During the git history rewrite, copyrighted files were removed from the tree (copyrighted papers .pdf files), to be allowed on GitHub.
git filter-branch
was used to the rewrite the history.