Disease clustering from phenotypic literature data through Document Understanding, Comprehension and Knowledge
(doi: )
Typically, SNPs are studied in terms of a "one disease - one SNP" relationship. This leaves researchers and clinicians with deep knowledge of a disease but often incomplete knowledge of all potentially relevant SNPs.
Knowing a larger set of SNPs potentially relevant to a collection of phenotypes would make it possible to find a novel set of relevant publications.
ClusterDuck is a tool to automatically identify genetically-relevant publications and returns relevant
- Python 3
- EDirect
- Install the required Python packages:
pip3 install -r requirements.txt
- Download the PubMed database and the required NLTK data:
python3 setup.py
- Use the easy-to-start command-line tool, ClusterDuck.py:
python3 ClusterDuck.py "Autistic behavior" "Restrictive behavior" "Impaired social interactions" "Poor eye contact" "Impaired ability to form peer relationships" "No social interaction" "Impaired use of nonverbal behaviors" "Lack of peer relationships" "Stereotypy"
- A case study:
python3 generate_csv.py
- Train Topic Models
After you have the corpora, you can run the following function in train_lda.py to obtain topic models:
lda1, lda2 = train_ldas(corpus1, corpus2, n_topics=N_TOPICS, alpha=ALPHA, eta=ETA)
where N_TOPICS, ALPHA and ETA parameterize both topic models.
python3 ./dc/test_utils.py
A set of phenotypic terms from the Human Phenotype Ontology (HPO).
- A 'phenotypic' corpus of literature is extracted from PubMed using the user-input HPO phenotypic terms.
- All SNPs mentioned in the 'phenotypic' corpus are identified.
- PubMed is queried using the phenotypically-relevant SNPs to extract a second 'phenotypic + genetic' corpus.
- Topic modeling is run on each corpus separately.
- Topic distributions are compared to discover new genetically-inspired and relevant topics.
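The SNP identification and second-query steps above can be sketched as follows. This is a minimal illustration, not ClusterDuck's actual implementation: it assumes SNPs appear in abstracts as dbSNP rs identifiers (for example rs4680), the function names extract_snps and build_snp_query are hypothetical, and the abstracts are toy sentences written for this example.

```python
import re
from collections import Counter

# Assumption: SNPs are written as dbSNP reference IDs such as "rs4680".
RS_ID = re.compile(r"\brs\d+\b", re.IGNORECASE)


def extract_snps(abstracts):
    """Count the number of abstracts in which each rs ID appears."""
    counts = Counter()
    for text in abstracts:
        # Use a set so an ID repeated within one abstract is counted once.
        ids = {match.group(0).lower() for match in RS_ID.finditer(text)}
        counts.update(ids)
    return counts


def build_snp_query(snp_ids):
    """Join SNP IDs into a single OR query string for a literature search."""
    return " OR ".join(sorted(snp_ids))


# Toy abstracts (hypothetical text, for illustration only).
abstracts = [
    "We genotyped rs4680 and RS6265 in 200 probands.",
    "No association was found for rs4680.",
    "This abstract mentions no variants; hrs123 is not a SNP.",
]

snp_counts = extract_snps(abstracts)
query = build_snp_query(snp_counts)
print(snp_counts)
print(query)
```

The word-boundary anchors keep strings such as "hrs123" from being mistaken for an rs ID. Counting per abstract rather than per mention is a design choice made here so that a single paper repeating one SNP does not dominate the ranking.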
A list of novel topics that are genetically related to the initial phenotypic input.
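One possible way to decide which topics count as novel is sketched below. The source does not state how topic distributions are compared, so everything here is an assumption: each topic is represented as a word-to-probability dictionary, similarity is measured with cosine similarity, and a topic from the second model is flagged as novel when its best match in the first model falls below a threshold of 0.5. The names cosine and novel_topics are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(p[word] * q[word] for word in p.keys() & q.keys())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def novel_topics(phenotypic_topics, genetic_topics, threshold=0.5):
    """Return indices of genetic topics with no close phenotypic match."""
    novel = []
    for index, topic in enumerate(genetic_topics):
        best = max(
            (cosine(topic, reference) for reference in phenotypic_topics),
            default=0.0,
        )
        if best < threshold:
            novel.append(index)
    return novel


# Toy topics (hypothetical weights, for illustration only).
phenotypic = [{"autism": 0.6, "social": 0.4}]
genetic = [
    {"autism": 0.5, "social": 0.5},      # close to the phenotypic topic
    {"dopamine": 0.7, "receptor": 0.3},  # shares no words with it
]

result = novel_topics(phenotypic, genetic)
print(result)
```

Other measures, such as Jensen-Shannon divergence, would fit the same structure; cosine similarity is used here only because it needs no smoothing for words missing from one topic.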
- Synonym search from user input: HPO provides a synonym list for each of its controlled vocabulary terms. This can be incorporated as a preprocessor with the user input to allow
- Make use of hierarchy: HPO is an ontology of terms, and user-input terms are likely to have sub- and super-class terms.
- Filtering different types of research articles: Optionally add a [PT] query filter to the PubMed query to limit the types of publications returned.
- Use of EMR-type data to build the corpus as opposed to PubMed: An EMR-based corpus is more likely to be associated with diseases (especially with ICD terms) than a PubMed-based corpus.
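The synonym and [PT] filter ideas above could be combined in a small query builder along these lines. This is only a sketch under stated assumptions: the function name build_pubmed_query is hypothetical, terms are joined with OR (the tool may combine them differently), and the synonym shown is a placeholder rather than an entry taken from HPO.

```python
def build_pubmed_query(terms, synonyms=None, publication_type=None):
    """Build a PubMed-style query string from phenotype terms.

    terms: list of phenotype term strings.
    synonyms: optional dict mapping a term to a list of alternative names.
    publication_type: optional value for a [PT] filter, e.g. "Review".
    """
    synonyms = synonyms or {}
    groups = []
    for term in terms:
        # Each term is grouped with its synonyms, all quoted as phrases.
        variants = [term] + list(synonyms.get(term, []))
        groups.append("(" + " OR ".join(f'"{v}"' for v in variants) + ")")
    query = " OR ".join(groups)
    if publication_type:
        # Restrict results to one publication type using the [PT] tag.
        query = f'({query}) AND "{publication_type}"[PT]'
    return query


terms = ["Stereotypy", "Poor eye contact"]
# Placeholder synonym for illustration; not taken from HPO.
synonyms = {"Stereotypy": ["Stereotyped behavior"]}

plain_query = build_pubmed_query(terms)
filtered_query = build_pubmed_query(terms, synonyms, publication_type="Review")
print(plain_query)
print(filtered_query)
```

Sub- and super-class terms from the HPO hierarchy could be passed in through the same synonyms argument, since they expand a term in the same way.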
- Jennifer Dong
- Larry Gray
- Joseph Halstead
- Yi Hsiao
- Wayne Pereanu
- Neelay Trivedi
- Nathan Wan
- Donghui Wu