NLP Authorship Detection

Problem

Given texts from two songwriters, detect the authorship of a new sample.

Approach

  1. Preprocess the samples (see the first sketch after this list)
  • clean the documents of apostrophes, which were misleading the tokenizer
  • tokenize into words
  • lemmatize the words so that different forms of the same word count as one; use the pymorphy2 morphological analyser (github.com/pymorphy2/pymorphy2/issues/80), which supports Ukrainian, since all of our data is in Ukrainian
  2. Extract features (see the second sketch after this list)
  • build a vocabulary from all the texts we have
  • use the bag-of-words technique, storing the counts of vocabulary words weighted with Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF helps to avoid over-weighting topic-specific words or frequently used parts of speech by assuming that a rarer word says more about a text than a common one.
  3. Train and evaluate several classifiers to determine the best model for this task.
  4. Write up the outcomes ⬇️
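
A minimal sketch of the preprocessing step, assuming pymorphy2 is installed together with its Ukrainian dictionaries (`pip install pymorphy2 pymorphy2-dicts-uk`); the regex tokenizer and the `preprocess` helper are illustrative, not necessarily the exact code in this repo:

```python
import re
import pymorphy2

# Ukrainian morphological analyser (requires pymorphy2-dicts-uk)
morph = pymorphy2.MorphAnalyzer(lang='uk')

def preprocess(text):
    """Clean apostrophes, tokenize into words, and lemmatize each token."""
    text = text.replace("'", "").replace("’", "")     # apostrophes mislead the tokenizer
    tokens = re.findall(r"[а-яіїєґ]+", text.lower())  # crude Ukrainian word tokenizer
    return " ".join(morph.parse(t)[0].normal_form for t in tokens)
```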
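
Feature extraction can then be a plain scikit-learn TfidfVectorizer over the lemmatized documents; `documents` and `labels` below are hypothetical placeholders for the actual lyrics and author labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["...", "..."]  # hypothetical: song lyrics, one string per sample
labels = [0, 1]             # hypothetical: 0 / 1 for the two songwriters

# Bag-of-words counts over the shared vocabulary, weighted by TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocess(d) for d in documents)
```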

Writeup

Whichever feature extraction technique we use, the input to an NLP model is usually very high-dimensional. This degrades performance in many classifiers that assume the features are independent. For NLP tasks, however, the features (e.g. word occurrences) may actually covary, due to the natural hierarchy of sentences and common phrases shared across texts. Moreover, such data often suffers from the empty-space phenomenon: more data is needed to 'fill' the feature space for decent training, otherwise some samples may look like outliers and some cases remain underrepresented.

The first model that comes to mind to deal with the problems above is the Support Vector Machine (SVM). It still works well in a high-dimensional space because it does not use all the samples (vectors in the space) to define the hyperplane, only the support vectors that help to separate the classes.

And, as you can see from the AUC scores, sklearn.svm.SVC (scikit-learn's SVM implementation; note that it uses an RBF kernel by default) performs better than the other models.

Another theoretically good algorithm choice could be the Random Forest model, but it did not quite deliver, with only a 0.78 AUC score on average.
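
For reference, a cross-validated AUC comparison of the two models could look like the sketch below (`X` and `labels` come from the feature-extraction sketch above; this is an illustration under those assumptions, not necessarily the exact evaluation code of this repo):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Average ROC AUC over 5 folds for each candidate model
for model in (SVC(), RandomForestClassifier()):
    scores = cross_val_score(model, X, labels, cv=5, scoring='roc_auc')
    print(type(model).__name__, scores.mean())
```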

I also tried some other models to check whether one would surprisingly turn out to work better on our specific dataset, but SVM still kept its championship :)
