Modeling Fine-Grained Entity Types with Box Embeddings

Yasumasa Onoe, Michael Boratko, Andrew McCallum, Greg Durrett
ACL 2021

@inproceedings{onoe2021boxet,
 title={Modeling Fine-Grained Entity Types with Box Embeddings},
 author={Yasumasa Onoe and Michael Boratko and Andrew McCallum and Greg Durrett},
 booktitle={ACL},
 year={2021}
}

Getting Started

Dependencies

$ git clone https://github.com/yasumasaonoe/Box4Types.git

This code has been tested with Python 3.7 and the following dependencies:

  • torch==1.7.1 (Please install the PyTorch build that matches your CUDA version.)
  • transformers==4.9.2
  • wandb==0.12.1

If you're using a conda environment, please use the following commands:

$ conda create -n box4et python=3.7
$ conda activate box4et
$ pip install [package name]

File Descriptions

  • box4et/main.py: Main script for training and evaluating models, and writing predictions to an output file.
  • box4et/models.py: Defines a Transformer-based entity typing model.
  • box4et/data_utils.py: Contains data loader and utility functions.
  • box4et/constant.py: Defines paths etc.
  • box4et/scorer.py: Computes precision, recall, and F1 given an output file.
  • box4et/train_*.sh: Sample training commands.
  • box4et/eval_*.sh: Sample evaluation commands.
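
For readers unfamiliar with the metrics mentioned above, here is a minimal sketch of example-level (macro-averaged) precision, recall, and F1 for multi-label entity typing. It is an illustration only and is not taken from box4et/scorer.py: the function names `f1` and `macro_prf` are hypothetical, and the averaging convention (precision over examples with at least one predicted type, recall over examples with at least one gold type) is an assumption.

```python
from typing import Sequence, Set, Tuple


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def macro_prf(
    gold: Sequence[Set[str]], pred: Sequence[Set[str]]
) -> Tuple[float, float, float]:
    """Average per-example precision and recall, then combine them into F1.

    Assumption: examples with no predicted types are skipped when averaging
    precision, and examples with no gold types are skipped for recall.
    """
    precisions, recalls = [], []
    for g, p in zip(gold, pred):
        overlap = len(g & p)
        if p:
            precisions.append(overlap / len(p))
        if g:
            recalls.append(overlap / len(g))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall, f1(precision, recall)
```

To compare with the numbers a paper reports, check the actual averaging scheme in scorer.py first, since micro- and macro-averaged scores can differ noticeably.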

Datasets / Models

This code assumes the three directories listed below. Paths to these directories are specified in box4et/constant.py.

  • ./data: This directory contains train/dev data files.
  • ./data/ontology: This directory contains type vocab files.
  • ./model: Trained models are saved in this directory. When you run main.py in test mode, the trained model is loaded from here.

Additional notes:

  • Download model checkpoints (box and vector models for 4 datasets) from here (NOTE: total size is around 30GB).
  • UFET: We do not include the augmented UFET training set since it is derived from English Gigaword, which belongs to LDC. If you have an LDC membership and want to use the augmented data, please contact us at [email protected].

Run the following script to download these directories:

$ bash download_data.sh

The data files are formatted as jsonlines. Here is an example from UFET:

{
    "ex_id": "dev_190", 
    "right_context": ["."], 
    "left_context": ["For", "this", "handpicked", "group", "of", "jewelry", "savvy", "Etsy", "artisans", ",", "their", "passion", "is", "The", "Hunger", "Games", ",", "the", "first", "of", "3", "best", "selling", "young", "adult", "books", "by"], 
    "right_context_text": ".", 
    "left_context_text": "For this handpicked group of jewelry savvy Etsy artisans , their passion is The Hunger Games , the first of 3 best selling young adult books by",
    "y_category": ["name", "person", "writer", "author"],
    "word": "Suzanne Collins", 
    "mention_as_list": ["Suzanne", "Collins"]
}

| Field | Description |
| --- | --- |
| ex_id | Unique example ID. |
| right_context | Tokenized right context of a mention. |
| left_context | Tokenized left context of a mention. |
| word | A mention. |
| right_context_text | Right context of a mention. |
| left_context_text | Left context of a mention. |
| y_category | The gold entity types derived from Wikipedia categories. |
| y_title | Wikipedia title of the gold Wiki entity. |
| mention_as_list | A tokenized mention. |
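
The format described above can be read with the standard library alone. The sketch below assumes one JSON object per line and uses only the field names listed in the table. The helper names `read_jsonlines` and `mention_in_context` are hypothetical and not part of this repository, and the square brackets around the mention are purely illustrative.

```python
import json
from typing import Dict, Iterator, List


def read_jsonlines(path: str) -> Iterator[Dict]:
    """Yield one example dict per non-empty line of a jsonlines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def mention_in_context(example: Dict) -> str:
    """Rebuild the tokenized sentence, marking the mention with brackets."""
    tokens: List[str] = (
        example["left_context"]
        + ["["]
        + example["mention_as_list"]
        + ["]"]
        + example["right_context"]
    )
    return " ".join(tokens)
```

Iterating over `read_jsonlines(...)` with the path to one of the downloaded data files and printing `mention_in_context(example)` alongside `example["y_category"]` is a quick way to sanity-check the data before training.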

Entity Typing Training and Evaluation

Training

main.py is the primary script for training and evaluating models. See box4et/train_*.sh.

$ cd box4et
$ bash train_box.sh

Evaluation

To evaluate the trained model on another dataset, set --mode to test and point to the test data using --eval_data. Make sure to pass -load so that the trained model is loaded. See box4et/eval_*.sh.

$ cd box4et
$ bash eval_box.sh