Datasets are taken from https://github.com/mannefedov/ru_kw_eval_datasets:
- `habr` -- Habrahabr, https://habr.com/
- `ng` -- Независимая Газета (Nezavisimaya Gazeta), http://www.ng.ru/
- `rt` -- Russia Today, https://russian.rt.com/
- `cl` -- CyberLeninka, https://cyberleninka.ru/
Preprocessing:
- tokenization
- lemmatization
- extraction of nouns and adjectives in the nominative case, with adjective-noun agreement: "ясная ночь" ("clear night"), not "ясный ночь"
- splitting into n-grams (n = 2)
Further improvements: use UDPipe for tokenization and lemmatization, combined with pymorphy to extract the nouns and adjectives.
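For illustration, a minimal sketch of such a pipeline with pymorphy2 (the regex tokenizer and the agreement logic are assumptions, not the repository's exact code):

```python
# Minimal preprocessing sketch: tokenize, lemmatize, keep nouns/adjectives,
# and build agreed bigrams. pymorphy2 is assumed; the repo may differ.
import re
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def preprocess(text):
    # Naive tokenizer over Cyrillic/Latin word characters (an assumption).
    tokens = re.findall(r"[а-яёa-z]+", text.lower())
    parses = [morph.parse(t)[0] for t in tokens]
    # Keep only nouns and full adjectives.
    content = [p for p in parses if p.tag.POS in ("NOUN", "ADJF")]
    lemmas = [p.normal_form for p in content]
    # Bigrams (n = 2): inflect an adjective to the nominative case and the
    # gender of the following noun, so we get "ясная ночь", not "ясный ночь".
    bigrams = []
    for adj, noun in zip(content, content[1:]):
        if adj.tag.POS == "ADJF" and noun.tag.POS == "NOUN":
            grammemes = {"nomn"} | ({noun.tag.gender} if noun.tag.gender else set())
            agreed = adj.inflect(grammemes)
            form = agreed.word if agreed else adj.normal_form
            bigrams.append(f"{form} {noun.normal_form}")
    return lemmas, bigrams
```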
Implemented approaches (sketches of the first two are shown after this list):
- simple TFIDF method
- sCAKE, a graph-based method: https://arxiv.org/pdf/1811.10831v1.pdf
- NN approach (in progress)
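A minimal sketch of a TFIDF ranker with scikit-learn, assuming documents are already normalized (the repository's implementation may differ):

```python
# TFIDF keyword sketch: weight each document's terms against the corpus
# and take the highest-weighted ones. scikit-learn is an assumption here.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_keywords(normalized_docs, top_n=10):
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(normalized_docs)
    terms = vectorizer.get_feature_names_out()
    keywords = []
    for row in matrix:                      # one sparse row per document
        weights = row.toarray().ravel()
        top = weights.argsort()[::-1][:top_n]
        keywords.append([terms[i] for i in top if weights[i] > 0])
    return keywords
```

The sCAKE paper builds a word co-occurrence graph and scores nodes by semantic connectivity. As a much-simplified illustration of the graph idea only (plain co-occurrence edges plus PageRank, not the paper's actual scoring):

```python
# Simplified graph-based ranking: co-occurrence graph over lemmas, scored
# with PageRank. This only illustrates the graph idea behind sCAKE.
import networkx as nx

def graph_keywords(lemmas, window=3, top_n=10):
    graph = nx.Graph()
    for i, a in enumerate(lemmas):
        # Connect words that co-occur within a small sliding window.
        for b in lemmas[i + 1:i + window]:
            if a != b:
                weight = graph.get_edge_data(a, b, {"weight": 0})["weight"]
                graph.add_edge(a, b, weight=weight + 1)
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```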
```
usage: model_trainer.py [-h] [-d [{all,rt,ng,habr,cl}]] [-o OUTPUT_PATH]
                        [-m [{tfidf,scake}]]

Keyword extractor

optional arguments:
  -h, --help            show this help message and exit
  -d [{all,rt,ng,habr,cl}], --dataset [{all,rt,ng,habr,cl}]
  -o OUTPUT_PATH, --output OUTPUT_PATH
  -m [{tfidf,scake}], --model [{tfidf,scake}]
```
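For example, to train the TFIDF model on the Habr dataset (the output path here is illustrative):

```
python model_trainer.py -d habr -m tfidf -o output/
```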
TFIDF:

| Metric | Value |
|---|---|
| Precision | 0.1385 |
| Recall | 0.2649 |
| F1 | 0.1733 |
| Jaccard | 0.1014 |
sCAKE (much slower than TFIDF):
| Metric | Value |
|---|---|
| Precision | 0.1717 |
| Recall | 0.2833 |
| F1 | 0.2021 |
| Jaccard | 0.1199 |
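These are set-based scores; a minimal sketch of how they can be computed, assuming exact matching between predicted and gold keyword sets (the repository's evaluation may differ):

```python
# Set-based evaluation sketch: precision, recall, F1, and Jaccard over
# predicted vs. gold keyword sets (exact matching is an assumption).
def evaluate(predicted, gold):
    p, g = set(predicted), set(gold)
    tp = len(p & g)                                   # shared keywords
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    jaccard = tp / len(p | g) if p | g else 0.0
    return precision, recall, f1, jaccard
```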
NN: the model is implemented (training is available), but it does not perform well; further investigation is needed.
File "main.py" contains text that is normalized and then kws are extracted using different approaches in parallel with the usage of celery