A Word Sense Disambiguation system integrating implicit and explicit external knowledge.


EWISER Fork

This repository is a fork of the original EWISER repository. The original README follows below; its installation guide has been updated to reflect the changes made in this fork.

Note that this fork is intended to be used within the context of my Master's thesis repository, msc-thesis-ai-imp. The changes made here are not intended for use outside of that context.

Installation

The installation guide has been altered slightly to work with the pipenv setup of the main repository. We need the PyTorch Geometric companion packages torch-scatter and torch-sparse, built against the PyTorch and CUDA versions used in the main repository. All other dependencies come from requirements.txt. The installation steps are as follows:

python -m pip install -r requirements.txt
python -m pip install torch-scatter torch-sparse -f https://pytorch-geometric.com/whl/torch-2.0.1+cu118.html
python -m pip install -e .

It is recommended to use a virtual environment, such as pipenv or conda. With all packages installed, the last step is to download a spaCy model:

python -m spacy download en_core_web_sm
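
To verify the setup, a quick sanity check like the following should run without errors (a minimal sketch; the CUDA line only matters if you installed the cu118 wheels and have a GPU available):

import spacy
import torch
import torch_scatter  # noqa: F401 -- imported only to confirm the wheel installed
import torch_sparse  # noqa: F401

print(torch.__version__)  # expected: 2.0.1, matching the wheels above
print(torch.cuda.is_available())  # True if the cu118 build can see a GPU
print(spacy.load("en_core_web_sm")("EWISER is installed."))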

Now you are ready to start!

Changes

  • Add bin/annotate_cwsd.py to annotate a corpus with the EWISER model.
  • Add bin/annotate_bookcorpus.py to annotate BookCorpus (and, most likely, other Hugging Face datasets) with the EWISER model.
  • Add find_packages() to setup.py so that the ewiser package is included.
  • Add spacy as a dependency in requirements.txt.
  • Update .gitignore to ignore the build directory.
  • Update README.md to reflect the changes to the installation guide.

EWISER (Enhanced WSD Integrating Synset Embeddings and Relations)

This repo hosts the code necessary to reproduce the results of our ACL 2020 paper, Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information, by Michele Bevilacqua and Roberto Navigli, which you can read on ACL Anthology.

You will also find a simple spacy plugin that makes it easy to use EWISER in your own project!

EWISER relies on the fairseq library.

NEWS:

Check out the Multilinguality section below!

How to Cite

@inproceedings{bevilacqua-navigli-2020-breaking,
    title = "Breaking Through the 80{\%} Glass Ceiling: {R}aising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information",
    author = "Bevilacqua, Michele  and Navigli, Roberto",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.255",
    pages = "2854--2864",
    abstract = "Neural architectures are the current state of the art in Word Sense Disambiguation (WSD). However, they make limited use of the vast amount of relational information encoded in Lexical Knowledge Bases (LKB). We present Enhanced WSD Integrating Synset Embeddings and Relations (EWISER), a neural supervised architecture that is able to tap into this wealth of knowledge by embedding information from the LKB graph within the neural architecture, and to exploit pretrained synset embeddings, enabling the network to predict synsets that are not in the training set. As a result, we set a new state of the art on almost all the evaluation settings considered, also breaking through, for the first time, the 80{\%} ceiling on the concatenation of all the standard all-words English WSD evaluation benchmarks. On multilingual all-words WSD, we report state-of-the-art results by training on nothing but English.",
}

Externally downloadable resources

EWISER English checkpoints:

EWISER multilingual checkpoints:

Datasets:

  • WSD Evaluation Framework: contains the SemCor training corpus, along with the evaluation datasets from Senseval and SemEval.
  • Multilingual Evaluation Datasets: the repo contains the French, German, Italian and Spanish datasets from SemEval 2013 and 2015.
  • The other datasets used are in res/corpora/*/orig.

Preprocessed SensEmBERT + LMMS embeddings (needed to train your own EWISER model):

Multilinguality

EWISER supports every language for which you can create a mapping starting from the BabelNet 4.0.1 indices.

French/German/Italian/Spanish

  1. Download the BabelNet indices (ver. 4.0.1);
  2. cd multilinguality;
  3. Set your BabelNet indices path in multilinguality/config/babelnet.var.properties;
  4. bash enable.sh. The mapping is limited to the Princeton WordNet subgraph (so you need to use the wn split if you plan to evaluate on mwsd-datasets).

Other languages

Please download the multilingual mapper from Google Drive and follow the instructions included there.

Evaluate

Evaluation is run using bin/eval_wsd.py:

# Download the WSD framework
# wget -c http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip -P res
# unzip res/WSD_Evaluation_Framework.zip -d res
# WSD_FRAMEWORK=res/WSD_Evaluation_Framework

python bin/eval_wsd.py --checkpoints <your_checkpoint.pt> --xmls ${WSD_FRAMEWORK}/Evaluation_Datasets/ALL/ALL.data.xml ${WSD_FRAMEWORK}/Evaluation_Datasets/semeval2007/semeval2007.data.xml

Spacy plugin

EWISER can be used as a spacy plugin. Please check bin/annotate.py.
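
For a quick impression, here is a minimal sketch along the lines of bin/annotate.py; the checkpoint path is a placeholder, and the exact Disambiguator arguments and extension attributes may differ slightly between versions:

import spacy
from ewiser.spacy.disambiguate import Disambiguator

# Load an EWISER checkpoint and register it as a spaCy pipeline component.
wsd = Disambiguator("path/to/ewiser_checkpoint.pt", lang="en").eval()
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
wsd.enable(nlp, "wsd")

doc = nlp("The bank raised its interest rates.")
for token in doc:
    if token._.offset:  # WordNet offset assigned by the plugin
        print(token.text, token.lemma_, token._.offset)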

Train Your Model

Experiment Folder and Preprocessing

To train a model from scratch, you need to set up an experiment folder containing:

  • the dict.txt file from res/dictionaries/dict.txt;
  • the preprocessed training corpora, named train, train1, train2, etc.;
  • the preprocessed validation dataset, named valid.

We have included our experiment directories in res/experiments/.

Should you need to preprocess your own corpus, you can use bin/preprocess_wsd.py (check out python bin/preprocess_wsd.py --help)!

Training

To launch a training run, execute:

cd bin
bash train-ewiser.sh

This will train EWISER on SemCor + tagged glosses + WordNet Examples. It assumes you have downloaded the LMMS+SensEmBERT embeddings and put them in res/embeddings/.

You can modify hyperparameters or change the training corpora by modifying train-ewiser.sh. Arguments are documented in ewiser/fairseq_ext/models/sequence_tagging.py.

Training Resources

Sense Embeddings

If you want to use your own sense embeddings in EWISER, you have to preprocess them as follows:

python bin/get_centroids.py ${EMBEDDINGS} ${EMBEDDINGS}.centroids.txt bin/sensekeys2offsets.txt
python bin/reduce_dims.py ${EMBEDDINGS}.centroids.txt ${EMBEDDINGS}.centroids.svd512.txt -d 512

The sense embeddings have to be in GloVe .txt format, without a header row, and with WN 3.0 sensekeys as identifiers.
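
For illustration, each line pairs a sensekey such as dog%1:05:00:: with its space-separated vector components. A minimal reader sketch (the file name below is hypothetical):

# Read embeddings in the expected format: "<sensekey> <v1> <v2> ...", no header.
def read_sense_embeddings(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, *values = line.rstrip("\n").split(" ")
            embeddings[key] = [float(v) for v in values]
    return embeddings

vectors = read_sense_embeddings("embeddings.centroids.svd512.txt")
print(len(next(iter(vectors.values()))))  # 512 after reduce_dims.py -d 512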

Edges

The adjacency matrix A in EWISER is stored as an edgelist. Each line is an edge, with three \t-separated values. Check res/edges/ for examples.
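
A minimal parsing sketch, under the assumption that the three columns are source, target, and relation (inspect res/edges/ to confirm the actual layout; the file name below is hypothetical):

# Parse an edgelist: one edge per line, three tab-separated fields.
def read_edgelist(path):
    edges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            source, target, relation = line.rstrip("\n").split("\t")
            edges.append((source, target, relation))
    return edges

for source, target, relation in read_edgelist("res/edges/example.tsv")[:5]:
    print(source, target, relation)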

License

This project is released under the CC-BY-NC-SA 4.0 license (see LICENSE.txt). If you use EWISER, please put a link to this repo.

Acknowledgements

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union's Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under the grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of the Sapienza University of Rome.
