This project provides a neural network (BiLSTM-CRF) approach for biomedical Named Entity Recognition (NER).
Our implementation is based on the TensorFlow library in Python.
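For orientation, the snippet below is a minimal sketch of a BiLSTM-CRF tagger in TensorFlow 1.x. It is illustrative only, not the project's actual model code; the placeholder names, layer sizes, and hyperparameters are assumptions.

```python
# Minimal BiLSTM-CRF sketch (TensorFlow 1.x); sizes below are assumptions.
import tensorflow as tf

num_words, word_dim, hidden, num_tags = 20000, 200, 300, 5

word_ids = tf.placeholder(tf.int32, [None, None])  # [batch, max_len]
seq_lens = tf.placeholder(tf.int32, [None])        # true sentence lengths
labels = tf.placeholder(tf.int32, [None, None])    # gold tag ids

word_emb = tf.get_variable("word_emb", [num_words, word_dim])
inputs = tf.nn.embedding_lookup(word_emb, word_ids)

# Bidirectional LSTM over the token representations
fw = tf.nn.rnn_cell.LSTMCell(hidden)
bw = tf.nn.rnn_cell.LSTMCell(hidden)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw, bw, inputs, sequence_length=seq_lens, dtype=tf.float32)
context = tf.concat([out_fw, out_bw], axis=-1)     # [batch, max_len, 2*hidden]

# Per-token tag scores, scored jointly by a CRF layer
logits = tf.layers.dense(context, num_tags)
log_lik, trans = tf.contrib.crf.crf_log_likelihood(logits, labels, seq_lens)
loss = tf.reduce_mean(-log_lik)
pred_tags, _ = tf.contrib.crf.crf_decode(logits, trans, seq_lens)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```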
- TITLE : CollaboNet: collaboration of deep neural networks for biomedical named entity recognition
* Accepted to the CIKM 2018 workshop - ACM 12th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 2018).
- AUTHOR : Wonjin Yoon1!, Chan Ho So2!, Jinhyuk Lee1 and Jaewoo Kang1*
- Author details
1 Department of Computer Science and Engineering, Korea University
2 Interdisciplinary Graduate Program in Bioinformatics, Korea University
! Equal contributor
* Corresponding author
At least one CUDA-compatible GPU device is strongly recommended for running this project's code.
python 2.7
numpy 1.14.2
tensorflow-gpu 1.7.0
The code is distributed under the MIT license.
A citable version of the paper can be found on the pre-print server [here].
This software includes third party software.
See License-thirdparty.txt for details.
[LEFT] Character-level word embedding using a CNN, and an overview of the Bidirectional LSTM with Conditional Random Field (BiLSTM-CRF).
[RIGHT] Structure of CollaboNet when the Gene model acts as the target model. The rhombus represents the CRF layer. Solid arrows show the flow of information while the target model is training; dashed arrows indicate connections through which information does not flow while the target model is training.
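The character-level CNN in the left panel can be sketched as below; this is an illustrative assumption rather than the project code. Each word's characters are embedded, convolved, and max-pooled over time, yielding a fixed-size per-word feature vector that is concatenated with the word embedding before the BiLSTM.

```python
# Character-level word features via CNN (assumed sizes: char vocab 100,
# 30-dim char embeddings, 50 filters of width 3).
import tensorflow as tf

char_ids = tf.placeholder(tf.int32, [None, None, None])  # [batch, max_len, max_word_len]
char_emb = tf.get_variable("char_emb", [100, 30])
chars = tf.nn.embedding_lookup(char_emb, char_ids)

# Merge batch and sentence dimensions so conv1d sees one word per row
shape = tf.shape(chars)
flat = tf.reshape(chars, [-1, shape[2], 30])
conv = tf.layers.conv1d(flat, filters=50, kernel_size=3,
                        padding="same", activation=tf.nn.relu)
pooled = tf.reduce_max(conv, axis=1)                      # max-over-time pooling
char_repr = tf.reshape(pooled, [shape[0], shape[1], 50])  # [batch, max_len, 50]
```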
We used the datasets collected by Crichton et al.
These datasets are available here.
We found that the JNLPBA dataset from Crichton et al. contains sentences that were incorrectly split, so we re-generated the dataset from the original corpus by Kim et al.
The details of each dataset are shown below:
Corpora | Entity type | No. of sentences | No. of annotations | Data size |
---|---|---|---|---|
NCBI-Disease (Dogan et al., 2014) | Disease | 7,639 | 6,881 | 793 abstracts |
JNLPBA (Kim et al., 2004) | Gene/Proteins | 22,562 | 35,336 | 2,404 abstracts |
BC5CDR (Li et al., 2016) | Chemicals | 14,228 | 15,935 | 1,500 articles |
BC5CDR (Li et al., 2016) | Diseases | 14,228 | 12,852 | 1,500 articles |
BC4CHEMD (Krallinger et al., 2015a) | Chemicals | 86,679 | 84,310 | 10,000 abstracts |
BC2GM (Akhondi et al., 2014) | Gene/Proteins | 20,510 | 24,583 | 20,000 sentences |
The datasets can be downloaded by executing download.sh.
We used pre-trained word embeddings from Pyysalo et al., which were trained on PubMed, PubMed Central (PMC), and Wikipedia text. They will be downloaded automatically by executing download.sh.
bash download.sh
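For reference, here is a hedged sketch of how the downloaded vectors could be loaded into an embedding matrix; the word2vec text format, the file path, and the helper name are assumptions, and the project's own loader may differ.

```python
# Hypothetical loader: fill an embedding matrix from word2vec-style text vectors.
import numpy as np

def load_embeddings(path, vocab, dim=200):
    # vocab maps word -> row index; words missing from the file keep a random init
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocab and len(parts) == dim + 1:
                matrix[vocab[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return matrix
```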
Preparation phase (Phase 0) of CollaboNet
python run.py --ncbi --jnlpba --bc5_chem --bc5_disease --bc4 --bc2 --lr_pump --lr_decay 0.05
You can also refer to stm.sh for detailed usage.
You must produce a pre-trained STM model by running the preparation phase (Phase 0) before running CollaboNet.
python run.py --ncbi --jnlpba --bc5_chem --bc5_disease --bc4 --bc2 --lr_pump --lr_decay 0.05 --pretrained STM_MODEL_DIRECTORY_NAME (e.g. 201806210605)
You can find STM_MODEL_DIRECTORY_NAME in the ./modelSave folder.
You can also refer to collabo.sh for detailed usage.
Model | Metric | NCBI-disease | JNLPBA | BC5CDR-chem | BC5CDR-disease | BC4CHEMD | BC2GM | Average |
---|---|---|---|---|---|---|---|---|
Habibi et al. (2017) STM | F1 Score | 84.44 | 77.25 | 90.63 | **83.49** | 86.62 | 77.82 | 83.38 |
Wang et al. (2018) STM | F1 Score | 83.92 | 72.17 | *89.85 | *82.68 | **88.75** | **80.00** | 82.90 |
Our STM | F1 Score | **84.69** | **77.39** | **92.74** | 82.61 | 88.40 | 79.27 | **84.03** |
- Scores in asterisked (*) cells were obtained from experiments that we conducted; these scores are not reported in the original papers.
- The best scores from these experiments are shown in bold.
Model | Metric | NCBI-disease | JNLPBA | BC5CDR-chem | BC5CDR-disease | BC4CHEMD | BC2GM | Average |
---|---|---|---|---|---|---|---|---|
Wang et al. (2018) MTM | F1 Score | 86.14 | 73.52 | *91.29 | *83.33 | **89.37** | **80.74** | 84.07 |
Our CollaboNet | F1 Score | **86.36** | **78.58** | **93.31** | **84.08** | 88.85 | 79.73 | **85.15** |
- Scores in asterisked (*) cells were obtained from experiments that we conducted; these scores are not reported in the original papers.
- The best scores from these experiments are shown in bold.