This project provides a neural network (BiLSTM-CRF) approach for biomedical Named Entity Recognition (NER).
Our implementation is based on the TensorFlow library in Python.
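For orientation, the snippet below is a minimal sketch of a BiLSTM-CRF tagger in TensorFlow 1.x. It is illustrative only, not the project's actual model code; the placeholder names, layer sizes, and hyperparameters are assumptions.

```python
# Minimal BiLSTM-CRF sketch (TensorFlow 1.x); sizes below are assumptions.
import tensorflow as tf

num_words, word_dim, hidden, num_tags = 20000, 200, 300, 5

word_ids = tf.placeholder(tf.int32, [None, None])  # [batch, max_len]
seq_lens = tf.placeholder(tf.int32, [None])        # true sentence lengths
labels = tf.placeholder(tf.int32, [None, None])    # gold tag ids

word_emb = tf.get_variable("word_emb", [num_words, word_dim])
inputs = tf.nn.embedding_lookup(word_emb, word_ids)

# Bidirectional LSTM over the token representations
fw = tf.nn.rnn_cell.LSTMCell(hidden)
bw = tf.nn.rnn_cell.LSTMCell(hidden)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw, bw, inputs, sequence_length=seq_lens, dtype=tf.float32)
context = tf.concat([out_fw, out_bw], axis=-1)     # [batch, max_len, 2*hidden]

# Per-token tag scores, scored jointly by a CRF layer
logits = tf.layers.dense(context, num_tags)
log_lik, trans = tf.contrib.crf.crf_log_likelihood(logits, labels, seq_lens)
loss = tf.reduce_mean(-log_lik)
pred_tags, _ = tf.contrib.crf.crf_decode(logits, trans, seq_lens)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```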
- TITLE : CollaboNet: collaboration of deep neural networks for biomedical named entity recognition
* Accepted to the CIKM 2018 workshop - ACM 12th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 2018).
- AUTHOR : Wonjin Yoon1!, Chan Ho So2!, Jinhyuk Lee1 and Jaewoo Kang1*
- Author details
1 Department of Computer Science and Engineering, Korea University
2 Interdisciplinary Graduate Program in Bioinformatics, Korea University
! Equal contributor
* Corresponding author
At least one CUDA-compatible GPU device is strongly recommended for running this project's code.
python 2.7
numpy 1.14.2
tensorflow-gpu 1.7.0
The code is distributed under the MIT license.
A citable version of the paper can be found on the pre-print server [here].
This software includes third party software.
See License-thirdparty.txt for details.
[LEFT] Character-level word embedding using a CNN, and an overview of the Bidirectional LSTM with Conditional Random Field (BiLSTM-CRF).
[RIGHT] Structure of CollaboNet when the Gene model acts as the target model. The rhombus represents the CRF layer. Solid arrows show the flow of information while the target model is training; dashed arrows indicate connections through which information does not flow while the target model is training.
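The character-level CNN in the left panel can be sketched as below; this is an illustrative assumption rather than the project code. Each word's characters are embedded, convolved, and max-pooled over time, yielding a fixed-size per-word feature vector that is concatenated with the word embedding before the BiLSTM.

```python
# Character-level word features via CNN (assumed sizes: char vocab 100,
# 30-dim char embeddings, 50 filters of width 3).
import tensorflow as tf

char_ids = tf.placeholder(tf.int32, [None, None, None])  # [batch, max_len, max_word_len]
char_emb = tf.get_variable("char_emb", [100, 30])
chars = tf.nn.embedding_lookup(char_emb, char_ids)

# Merge batch and sentence dimensions so conv1d sees one word per row
shape = tf.shape(chars)
flat = tf.reshape(chars, [-1, shape[2], 30])
conv = tf.layers.conv1d(flat, filters=50, kernel_size=3,
                        padding="same", activation=tf.nn.relu)
pooled = tf.reduce_max(conv, axis=1)                      # max-over-time pooling
char_repr = tf.reshape(pooled, [shape[0], shape[1], 50])  # [batch, max_len, 50]
```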
We used the datasets collected by Crichton et al.
These datasets are available here.
We found that the JNLPBA dataset from Crichton et al. contains sentences that were incorrectly split, so we re-generated the dataset from the original corpus by Kim et al.
The details of each dataset are shown below:
Corpora | Entity type | No. of sentences | No. of annotations | Data size |
---|---|---|---|---|
NCBI-Disease (Dogan et al., 2014) | Disease | 7,639 | 6,881 | 793 abstracts |
JNLPBA (Kim et al., 2004) | Gene/Proteins | 22,562 | 35,336 | 2,404 abstracts |
BC5CDR (Li et al., 2016) | Chemicals | 14,228 | 15,935 | 1,500 articles |
BC5CDR (Li et al., 2016) | Diseases | 14,228 | 12,852 | 1,500 articles |
BC4CHEMD (Krallinger et al., 2015a) | Chemicals | 86,679 | 84,310 | 10,000 abstracts |
BC2GM (Akhondi et al., 2014) | Gene/Proteins | 20,510 | 24,583 | 20,000 sentences |
The datasets can be downloaded by executing download.sh.
We used pre-trained word embeddings from Pyysalo et al., which were trained on PubMed, PubMed Central (PMC), and Wikipedia text. They will be downloaded automatically by executing download.sh.
bash download.sh
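For reference, here is a hedged sketch of how the downloaded vectors could be loaded into an embedding matrix; the word2vec text format, the file path, and the helper name are assumptions, and the project's own loader may differ.

```python
# Hypothetical loader: fill an embedding matrix from word2vec-style text vectors.
import numpy as np

def load_embeddings(path, vocab, dim=200):
    # vocab maps word -> row index; words missing from the file keep a random init
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocab and len(parts) == dim + 1:
                matrix[vocab[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return matrix
```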
Preparation phase (Phase 0) of CollaboNet
python run.py --ncbi --jnlpba --bc5_chem --bc5_disease --bc4 --bc2 --lr_pump --lr_decay 0.05
You can also refer to stm.sh for detailed usage.
You must produce a pre-trained STM model by running the preparation phase (Phase 0) before running CollaboNet.
python run.py --ncbi --jnlpba --bc5_chem --bc5_disease --bc4 --bc2 --lr_pump --lr_decay 0.05 --pretrained STM_MODEL_DIRECTORY_NAME (e.g. 201806210605)
You can find STM_MODEL_DIRECTORY_NAME in the ./modelSave folder.
You can also refer to collabo.sh for detailed usage.
Model | Metric | NCBI-disease | JNLPBA | BC5CDR-chem | BC5CDR-disease | BC4CHEMD | BC2GM | Average |
---|---|---|---|---|---|---|---|---|
Habibi et al. (2017) STM | F1 Score | 84.44 | 77.25 | 90.63 | **83.49** | 86.62 | 77.82 | 83.38 |
Wang et al. (2018) STM | F1 Score | 83.92 | 72.17 | *89.85 | *82.68 | **88.75** | **80.00** | 82.90 |
Our STM | F1 Score | **84.69** | **77.39** | **92.74** | 82.61 | 88.40 | 79.27 | **84.03** |
- Scores in asterisked (*) cells were obtained from experiments that we conducted; these scores are not reported in the original papers.
- The best scores from these experiments are shown in bold.
Model | Metric | NCBI-disease | JNLPBA | BC5CDR-chem | BC5CDR-disease | BC4CHEMD | BC2GM | Average |
---|---|---|---|---|---|---|---|---|
Wang et al. (2018) MTM | F1 Score | 86.14 | 73.52 | *91.29 | *83.33 | **89.37** | **80.74** | 84.07 |
Our CollaboNet | F1 Score | **86.36** | **78.58** | **93.31** | **84.08** | 88.85 | 79.73 | **85.15** |
- Scores in asterisked (*) cells were obtained from experiments that we conducted; these scores are not reported in the original papers.
- The best scores from these experiments are shown in bold.