This repository is a fork of the EWISER repository. The original README is below. The installation guide from the original README has been updated to reflect the changes made to the original repository.
Note that this fork repository is intended to be used within the context of my Master's thesis repository msc-thesis-ai-imp. The changes made to this repository are not intended to be used outside of this context.
The installation guide has been altered slightly to work with the pipenv setup of the main repository. We'll need some pytorch-geometric dependencies (torch-scatter and torch-sparse), built for the PyTorch and CUDA versions used in the main repository. All other dependencies come from requirements.txt. The installation steps are as follows:
python -m pip install -r requirements.txt
python -m pip install torch-scatter torch-sparse -f https://pytorch-geometric.com/whl/torch-2.0.1+cu118.html
python -m pip install -e .
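The wheel index passed to -f in the second command has to match the installed PyTorch version and CUDA build. As a minimal sketch, the index URL can be derived from those two values. The helper name wheel_index_url is hypothetical, and the URL pattern is only inferred from the command above, so treat it as an assumption rather than a documented scheme:

```python
def wheel_index_url(torch_version: str, cuda_tag: str) -> str:
    """Build the wheel index URL passed to `pip install -f`.

    Assumption: the URL follows the pattern of the command above,
    e.g. torch-2.0.1+cu118.html for PyTorch 2.0.1 with CUDA 11.8.
    """
    # A version string may already carry a local suffix such as "+cu118"; drop it.
    base_version = torch_version.split("+", 1)[0]
    return f"https://pytorch-geometric.com/whl/torch-{base_version}+{cuda_tag}.html"


if __name__ == "__main__":
    print(wheel_index_url("2.0.1", "cu118"))
```

If the main repository moves to a different PyTorch or CUDA version, the same two values would need to change in the pip command.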
It is recommended to use a virtual environment, such as pipenv or conda. Now that we've installed all packages, we need to download a spaCy model. This is done by running the following command:
python -m spacy download en_core_web_sm
Now you are ready to start!
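Before moving on, it can help to confirm that the environment actually contains everything the steps above installed. The following is a small sketch using only the standard library. The list of import names is an assumption based on the packages mentioned in this guide and may need adjusting for your setup:

```python
import importlib.util

# Import names assumed from the installation steps above; adjust if your setup differs.
REQUIRED_MODULES = [
    "torch",
    "torch_scatter",
    "torch_sparse",
    "spacy",
    "en_core_web_sm",
    "ewiser",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks up whether each module can be located; it does not import the modules or verify their versions or CUDA support.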
- Add bin/annotate_cwsd.py to annotate a corpus with the EWISER model.
- Add bin/annotate_bookcorpus.py to annotate the bookcorpus (and, probably, other huggingface datasets) with the EWISER model.
- Add find_packages() to setup.py to include the ewiser package.
- Add spacy as a dependency in requirements.txt.
- Update .gitignore to ignore the build directory.
- Update README.md to reflect changes made to the installation guide.
This repo hosts the code necessary to reproduce the results of our ACL 2020 paper, Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information, by Michele Bevilacqua and Roberto Navigli, which you can read on ACL Anthology.
You will also find a simple spacy plugin that makes it easy to use EWISER in your own project!
EWISER relies on the fairseq library.
Check out the Multilingual section below!
@inproceedings{bevilacqua-navigli-2020-breaking,
title = "Breaking Through the 80{\%} Glass Ceiling: {R}aising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information",
author = "Bevilacqua, Michele and Navigli, Roberto",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.255",
pages = "2854--2864",
abstract = "Neural architectures are the current state of the art in Word Sense Disambiguation (WSD). However, they make limited use of the vast amount of relational information encoded in Lexical Knowledge Bases (LKB). We present Enhanced WSD Integrating Synset Embeddings and Relations (EWISER), a neural supervised architecture that is able to tap into this wealth of knowledge by embedding information from the LKB graph within the neural architecture, and to exploit pretrained synset embeddings, enabling the network to predict synsets that are not in the training set. As a result, we set a new state of the art on almost all the evaluation settings considered, also breaking through, for the first time, the 80{\%} ceiling on the concatenation of all the standard all-words English WSD evaluation benchmarks. On multilingual all-words WSD, we report state-of-the-art results by training on nothing but English.",
}
EWISER English checkpoints:
EWISER multilingual checkpoints:
Datasets:
- WSD Evaluation Framework: contains the SemCor training corpus, along with the evaluation datasets from Senseval and SemEval.
- Multilingual Evaluation Datasets: the repo contains the French, German, Italian and Spanish datasets from SemEval 2013 and 2015.
- The other datasets used are in res/corpora/*/orig.
Preprocessed SensEmBERT + LMMS embeddings (needed to train your own EWISER model):
EWISER supports all the languages for which you are able to create a mapping starting from BabelNet indices 4.0.1.
- Download the BabelNet indices (ver. 4.0.1);
- cd multilinguality;
- Set your BabelNet indices path in multilinguality/config/babelnet.var.properties;
- bash enable.sh.
The mapping is limited to the Princeton WordNet subgraph (so you need to use the wn split if you plan to evaluate on mwsd-datasets).
Please download the multilingual mapper from Google Drive and find the instructions contained there.
Evaluation is run using bin/eval_wsd.py:
# Download the WSD framework
# wget -c http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip -P res
# unzip
# WSD_FRAMEWORK=res/WSD_Evaluation_Framework
python bin/eval_wsd.py --checkpoints <your_checkpoint.pt> --xmls ${WSD_FRAMEWORK}/Evaluation_Datasets/ALL/ALL.data.xml ${WSD_FRAMEWORK}/Evaluation_Datasets/semeval2007/semeval2007.data.xml
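When evaluating on several datasets, the command line above can get long. As a sketch, the argument list can be assembled in Python. The helper build_eval_command is hypothetical, and it assumes every dataset follows the Evaluation_Datasets/<name>/<name>.data.xml layout seen in the command above:

```python
from pathlib import Path


def build_eval_command(checkpoint, framework_dir, dataset_names):
    """Return the argument list for bin/eval_wsd.py on the given datasets.

    Assumption: each dataset lives at
    <framework_dir>/Evaluation_Datasets/<name>/<name>.data.xml.
    """
    root = Path(framework_dir) / "Evaluation_Datasets"
    xmls = [str(root / name / f"{name}.data.xml") for name in dataset_names]
    return ["python", "bin/eval_wsd.py", "--checkpoints", str(checkpoint), "--xmls", *xmls]


if __name__ == "__main__":
    cmd = build_eval_command(
        "your_checkpoint.pt",
        "res/WSD_Evaluation_Framework",
        ["ALL", "semeval2007"],
    )
    # Print the command; to execute it, pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Only the --checkpoints and --xmls options shown above are used here; any other options of the script are not covered by this sketch.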
EWISER can be used as a spaCy plugin. Please check bin/annotate.py.
To train a model from scratch, you need to set up an experiment folder containing:
- the dict.txt file from res/dictionaries/dict.txt
- the preprocessed training corpora with name train, train1, train2 etc.
- the preprocessed validation dataset with name valid.
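The folder layout described above can be checked before launching a run. This is a minimal sketch: the helper check_experiment_dir is hypothetical, and because the exact file extensions produced by preprocessing are not stated here, it only matches on the train and valid name prefixes:

```python
from pathlib import Path


def check_experiment_dir(path):
    """Return a list of human-readable problems found in an experiment folder."""
    root = Path(path)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    names = [entry.name for entry in root.iterdir()]
    problems = []
    if "dict.txt" not in names:
        problems.append("missing dict.txt")
    # Assumption: preprocessed files start with "train" / "valid"; extensions are not checked.
    if not any(name.startswith("train") for name in names):
        problems.append("no training corpus (train, train1, ...)")
    if not any(name.startswith("valid") for name in names):
        problems.append("no validation dataset (valid)")
    return problems


if __name__ == "__main__":
    import sys

    for problem in check_experiment_dir(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("problem:", problem)
```

An empty result only means the expected names are present; it says nothing about whether the files were preprocessed correctly.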
We have included our experiment directories in res/experiments/.
Should you need to preprocess your own corpus, you can use bin/preprocess_wsd.py (check out python bin/preprocess_wsd.py --help)!
To launch a training run, execute:
cd bin
bash train-ewiser.sh
This will train EWISER on SemCor + tagged glosses + WordNet Examples. It assumes you have downloaded the LMMS+SensEmBERT embeddings and put them in res/embeddings/.
You can modify hyperparameters or change the training corpora by modifying train-ewiser.sh. Arguments are documented in ewiser/fairseq_ext/models/sequence_tagging.py.
If you want to use your own sense embeddings in EWISER, you have to preprocess them as follows:
python bin/get_centroids.py ${EMBEDDINGS} ${EMBEDDINGS}.centroids.txt bin/sensekeys2offsets.txt
python bin/reduce_dims.py ${EMBEDDINGS}.centroids.txt ${EMBEDDINGS}.centroids.svd512.txt -d 512
The sense embeddings must be in GloVe .txt format, without a header row, and with WN 3.0 sensekeys as identifiers.
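To illustrate the expected input format, here is a small sketch that validates an embedding file before it is passed to the preprocessing scripts. The function names are hypothetical, and the checks (a '%' in the key, a constant number of values per line) are assumptions about what a well-formed file looks like, not a description of what the scripts themselves verify:

```python
def parse_embedding_line(line):
    """Split one GloVe-style line into (sensekey, vector)."""
    parts = line.split()
    key, values = parts[0], parts[1:]
    if not values:
        raise ValueError(f"no vector components for key {key!r}")
    return key, [float(v) for v in values]


def check_embeddings(lines):
    """Check that every line parses and all vectors share one dimensionality.

    Returns (number_of_vectors, dimensionality).
    """
    dim = None
    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        key, vector = parse_embedding_line(line)
        # WordNet sensekeys contain a '%' between the lemma and the sense fields,
        # so a header row such as "<count> <dim>" is rejected here.
        if "%" not in key:
            raise ValueError(f"line {line_no}: {key!r} does not look like a sensekey")
        if dim is None:
            dim = len(vector)
        elif len(vector) != dim:
            raise ValueError(f"line {line_no}: expected {dim} values, got {len(vector)}")
        count += 1
    return count, dim


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        print(check_embeddings(handle))
```

For example, a file whose lines look like a sensekey followed by space-separated floats passes the check, while a file that starts with a header row raises a ValueError.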
The adjacency matrix A in EWISER is stored as an edgelist. Each line is an edge, with three \t-separated values. Check res/edges/ for examples.
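A reader for the edgelist format described above could look like the following sketch. The function read_edgelist is hypothetical, and since the meaning of the three columns is not spelled out here, the fields are returned as raw strings; consult the files in res/edges/ for their actual content:

```python
def read_edgelist(lines):
    """Parse an edgelist into a list of 3-tuples of strings.

    Assumption: the meaning of the three columns is not specified here,
    so no conversion is applied to the fields.
    """
    edges = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated values, got {len(fields)}"
            )
        edges.append(tuple(fields))
    return edges


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8") as handle:
        print(len(read_edgelist(handle)), "edges read")
```

Running it on a custom edgelist is a quick way to catch lines that were separated with spaces instead of tabs.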
This project is released under the CC-BY-NC-SA 4.0 license (see LICENSE.txt). If you use EWISER, please put a link to this repo.
The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union's Horizon 2020 research and innovation programme.
This work was supported in part by the MIUR under the grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of the Sapienza University of Rome.