Yarn is a system for creating vector representations of concepts from an ontology that contains descriptions of these concepts. These concept representations can then be used to disambiguate terms in text and link them to the appropriate concept.
For more information, see the paper Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts by Stéphan Tulkens, Simon Šuster and Walter Daelemans, which was presented at the BioNLP Workshop at ACL 2016.
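The exact composition and weighting of the concept vectors are described in the paper; as a rough illustration only, the sketch below simply averages the word vectors of a context and links it to the most similar concept vector by cosine similarity. All names in it are hypothetical.

```python
import numpy as np

def average_vector(tokens, vectors, dim=300):
    """Average the word vectors of the tokens we have vectors for (hypothetical helper)."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def disambiguate(context_tokens, concept_vectors, vectors):
    """Link a context to the concept whose vector is most similar by cosine."""
    context = average_vector(context_tokens, vectors)
    best, best_sim = None, -np.inf
    for concept_id, concept_vec in concept_vectors.items():
        denom = np.linalg.norm(context) * np.linalg.norm(concept_vec)
        sim = np.dot(context, concept_vec) / denom if denom else 0.0
        if sim > best_sim:
            best, best_sim = concept_id, sim
    return best
```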
Yarn is licensed under the MIT license.
Yarn was created by Stéphan Tulkens, Simon Šuster, and Walter Daelemans. If you use this work or build upon it, please cite our paper as follows:
@inproceedings{tulkens2016using,
title={Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts},
author={Tulkens, St{\'e}phan and {\v{S}}uster, Simon and Daelemans, Walter},
booktitle={Proceedings of the 15th Workshop on Biomedical Natural Language Processing},
pages={77--82},
year={2016}
}
Yarn depends on:
- Python 3
- Numpy
- Reach
Numpy and Reach are both available from pip.
Yarn requires:
- A set of word vectors
- A set of concepts, with their descriptions
- A set of documents with their ambiguous terms marked
The word vectors we used can be downloaded from the BioASQ website.
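Reach can then be used to read these vectors. A minimal sketch, assuming the vectors are stored in word2vec text format and that Reach exposes a load classmethod for that format (the file name is a placeholder, and the Reach API may differ between versions):

```python
from reach import Reach

# Load word2vec-style text vectors; the path is a placeholder.
vectors = Reach.load("bioasq_word_vectors.vec")
```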
If you want to replicate the original experiments, you need to adhere to the formats below. If you want to use Yarn for your own experiments, e.g. just creating concept representations, you can choose your own format.
Concepts are represented as a nested dictionary: each term maps to the concepts that pertain to it, and each concept maps to a list of descriptions (strings) of that concept.
{"term":
{"concept id_1":
[description_1,
description_2,
...
description_n]
},
{"concept_id_2":
[description_1,
description_2,
...
description_n]
}
}
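As an illustration, a concept dictionary in this format could be assembled and stored as JSON as sketched below; the term, concept identifiers, descriptions, and file name are all made up for the example:

```python
import json

concepts = {
    "cold": {
        "concept_id_1": [
            "An acute viral infection of the upper respiratory tract.",
            "A contagious illness causing a runny nose and sore throat.",
        ],
        "concept_id_2": [
            "A sensation caused by the absence of heat or a low temperature.",
        ],
    }
}

# Store the dictionary as JSON (file name is a placeholder).
with open("concepts.json", "w") as f:
    json.dump(concepts, f, indent=2)
```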
Similarly, documents to be disambiguated are represented by a dictionary. Note that each document must contain at least one occurrence of the ambiguous term under which it is classified.
{"term":
{"concept id_1":
[document_1,
document_2,
...
document_n]
},
{"concept_id_2":
[document_1,
document_2,
...
document_n]
}
}
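The same structure works for the documents, and the constraint above is easy to check programmatically. A small sketch with made-up documents and file name:

```python
import json

documents = {
    "cold": {
        "concept_id_1": ["The patient presented with a cold and a mild fever."],
        "concept_id_2": ["The reagents were kept in a cold room overnight."],
    }
}

# Each document must mention the ambiguous term it is filed under.
for term, concept_docs in documents.items():
    for concept_id, docs in concept_docs.items():
        for doc in docs:
            assert term.lower() in doc.lower(), (term, concept_id)

with open("documents.json", "w") as f:
    json.dump(documents, f, indent=2)
```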
The original Yarn experiments were run on the MSH dataset (Jimeno-Yepes et al., 2011) and the 2015AB release of the UMLS. Because these resources are not freely distributable, we are not able to include them with this package.