Add disease ontology #473
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Importing the Disease Ontology (DO) data | ||
|
||
## Table of Contents | ||
|
||
- [Importing the Disease Ontology (DO) data](#importing-the-disease-ontology-do-data) | ||
- [Table of Contents](#table-of-contents) | ||
- [About the Dataset](#about-the-dataset) | ||
- [Download Data](#download-data) | ||
- [Overview](#overview) | ||
- [Notes and Caveats](#notes-and-caveats) | ||
- [License](#license) | ||
- [About the import](#about-the-import) | ||
- [Artifacts](#artifacts) | ||
- [Scripts](#scripts) | ||
- [Files](#files) | ||
- [Examples](#examples) | ||
- [Run Tests](#run-tests) | ||
- [Import](#import) | ||
|
||
## About the Dataset | ||
[Disease Ontology](https://disease-ontology.org) (DO) is a standardized ontology for human disease that was developed "with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts through collaborative efforts of biomedical researchers, coordinated by the University of Maryland School of Medicine, Institute for Genome Sciences. | ||
|
||
The Disease Ontology semantically integrates disease and medical vocabularies through extensive cross mapping of DO terms to MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM." | ||
|
||
### Download Data | ||
|
||
The Human Disease Ontology data can be downloaded from the project's official GitHub repository [here](https://github.com/DiseaseOntology/HumanDiseaseOntology). The data is in `.owl` format and had to be parsed into `.csv` format (see [Notes and Caveats](#notes-and-caveats) for additional information on formatting). | ||
|
||
### Overview | ||
|
||
This directory stores the script used to download, clean, and convert the Disease Ontology data into a `.csv` format, which is ready for ingestion into the Data Commons knowledge graph alongside a `.tmcf` file that maps the `.csv` to the defined schema. In this import the data is ingested as [Disease](https://datacommons.org/browser/Disease) entities into the graph. | ||
|
||
### Notes and Caveats | ||
|
||
The original format of the data was `.owl` and it was converted to a `.csv` file prior to ingestion into Data Commons. | ||
Review comment: Please confirm that the only caveat of the dataset and data cleaning process is that it needs to be converted from an `.owl` file to a `.csv` file. Was nothing acknowledged in the Disease Ontology documentation itself, and did you encounter anything unusual while cleaning the data? For example, this is where you should note that a node can have multiple parents. |
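The multiple-parent caveat raised in the review comment above can be illustrated with a minimal sketch. It assumes the `.owl` file is serialized as RDF/XML with `owl:Class` elements directly under the root, and it joins several `rdfs:subClassOf` parents into one CSV cell. The function names, the `|` separator, and the `example.org` identifiers are hypothetical and are not taken from `format_disease_ontology.py`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# XML namespaces commonly used by OWL files serialized as RDF/XML.
NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = "{" + NS["rdf"] + "}about"
RDF_RESOURCE = "{" + NS["rdf"] + "}resource"


def owl_classes_to_rows(owl_xml, sep="|"):
    """Return one dict per named owl:Class; multiple parents are joined with `sep`."""
    root = ET.fromstring(owl_xml)
    rows = []
    for cls in root.findall("owl:Class", NS):
        iri = cls.get(RDF_ABOUT)
        if iri is None:  # skip anonymous classes
            continue
        label = cls.findtext("rdfs:label", default="", namespaces=NS)
        parents = [
            parent.get(RDF_RESOURCE)
            for parent in cls.findall("rdfs:subClassOf", NS)
            if parent.get(RDF_RESOURCE) is not None
        ]
        rows.append({"id": iri, "label": label, "subClassOf": sep.join(parents)})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "label", "subClassOf"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical input: one class with two parents (identifiers are made up).
SAMPLE_OWL = """<rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://example.org/obo/DOID_0000003">
    <rdfs:label>example child disease</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://example.org/obo/DOID_0000001"/>
    <rdfs:subClassOf rdf:resource="http://example.org/obo/DOID_0000002"/>
  </owl:Class>
</rdf:RDF>
"""

rows = owl_classes_to_rows(SAMPLE_OWL)
print(rows_to_csv(rows))
```

Joining parents into a single cell keeps one row per disease. Emitting one row per (child, parent) pair would be an equally reasonable choice, depending on how the tMCF maps the column.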
||
|
||
### License | ||
|
||
This data is under a Creative Commons Public Domain Dedication [CC0 1.0 Universal license](https://disease-ontology.org/resources/do-resources). | ||
Review comment: Missing link to the Disease Ontology website where it states that the data is under the Creative Commons Public Domain license. |
Reply: @spiekos, I'm sorry, I didn't quite understand that, because https://disease-ontology.org/resources/do-resources directs the user to the license page on the DO website. |
||
|
||
## About the import | ||
|
||
### Artifacts | ||
Review comment: Missing test data and unit tests. |
||
|
||
#### Scripts | ||
Review comment: Add short descriptions to all scripts and files. Link each script and file to its location in the directory. |
||
|
||
##### Shell Script | ||
|
||
`download.sh` | ||
|
||
##### Python Script | ||
|
||
`format_disease_ontology.py` | ||
|
||
##### Test Script | ||
Review comment: Update all of the script and file names to match what you added to the directory. |
||
|
||
`format_disease_ontology_test.py` | ||
|
||
#### Files | ||
|
||
##### Test File | ||
|
||
`input_file.txt` | ||
Review comment: Missing a small input file and an expected output file that can be used to run the script and test that it generates the expected output. |
||
|
||
`expected_output_file.txt` | ||
|
||
##### tMCF File | ||
|
||
`my_tmcf_file.tmcf` | ||
|
||
|
||
### Examples | ||
|
||
#### Run Tests | ||
|
||
To test `format_disease_ontology.py`, run: | ||
|
||
``` | ||
python format_disease_ontology.py input_file.owl expected_output.csv | ||
``` | ||
Review comment: Update this command to reflect the file names from the download file and the name you want for the final CSV file. |
|
||
#### Import | ||
|
||
1. Download data to scratch/. | ||
|
||
``` | ||
bash download.sh | ||
``` | ||
|
||
2. Clean and convert the downloaded Disease Ontology data into `.csv` format | ||
|
||
``` | ||
python format_disease_ontology.py humanDO.owl humanDO.csv | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
Node: E:DiseaseOntology->E1 | ||
typeOf: dcs:Disease | ||
Review comment: Update the script to include a column of text values with the Disease Ontology IDs, e.g. "DOID:0060329", then add to the tmcf the line |
||
dcid: C:DiseaseOntology->dcid | ||
name: C:DiseaseOntology->label | ||
parent: C:DiseaseOntology->subClassOf | ||
description: C:DiseaseOntology->IAO_0000115 | ||
alternativeDiseaseOntologyID: C:DiseaseOntology->hasAlternativeId | ||
diseaseSynonym: C:DiseaseOntology->hasExactSynonym | ||
internationalClassificationOfDiseaseID: C:DiseaseOntology->ICDO | ||
medicalSubjectHeadingDescriptorID: C:DiseaseOntology->MESH | ||
Review comment: Can you please confirm that all of these IDs start with D? If not, then they aren't all MeSH Descriptors and we should switch to using the more general "medicalSubjectHeadingID" property here. |
||
nationalCancerInstituteID: C:DiseaseOntology->NCI | ||
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20200901 | ||
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20200301 | ||
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20180301 | ||
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20190901 | ||
unifiedMedicalLanguageSystemConceptUniqueIdentifier: C:DiseaseOntology->UMLSCUI | ||
icd10CMCode: C:DiseaseOntology->ICD10CM | ||
icd9CMCode: C:DiseaseOntology->ICD9CM | ||
orphaNumber: C:DiseaseOntology->ORDO | ||
geneticAndRareDiseasesID: C:DiseaseOntology->GARD | ||
omimID: C:DiseaseOntology->OMIM | ||
experimentalFactorOntologyID: C:DiseaseOntology->EFO | ||
medDraID: C:DiseaseOntology->MEDDRA |
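The two review requests on this tMCF file (a text column of Disease Ontology IDs such as "DOID:0060329", and a check that every MeSH ID starts with D) could be addressed along the following lines. This is a minimal sketch, assuming the class identifiers end in `DOID_<digits>`. Both helper names are hypothetical, and the MeSH values in the usage lines are made up for illustration.

```python
import re


def iri_to_doid(identifier):
    """Turn '.../DOID_0060329' (or 'DOID_0060329') into the text ID 'DOID:0060329'."""
    match = re.search(r"DOID_(\d+)$", identifier)
    if match is None:
        raise ValueError(f"not a DOID identifier: {identifier!r}")
    return f"DOID:{match.group(1)}"


def non_descriptor_mesh_ids(mesh_ids):
    """Return the non-empty MeSH IDs that do not start with 'D'.

    MeSH descriptor IDs start with 'D'; other record types, such as
    supplementary concept records ('C'), use different prefixes.
    """
    return [mesh_id for mesh_id in mesh_ids if mesh_id and not mesh_id.startswith("D")]


# Usage with illustrative values (the MeSH IDs below are made up).
print(iri_to_doid("http://purl.obolibrary.org/obo/DOID_0060329"))
print(non_descriptor_mesh_ids(["D000001", "C000002", ""]))
```

If the second call returns anything, the column mixes record types, and the more general property suggested in the review would be the safer mapping.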
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# Copyright 2022 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
''' | ||
Author: Suhana Bedi | ||
Date: 07/08/2022 | ||
Name: disease_ontology_test.py | ||
Description: runs unit tests for format_disease_ontology.py | ||
Run: python3 disease_ontology_test.py | ||
''' | ||
|
||
import unittest | ||
import pandas as pd | ||
from pandas.testing import assert_frame_equal | ||
from format_disease_ontology import * | ||
|
||
class TestParseMesh(unittest.TestCase): | ||
    """Test the functions in format_disease_ontology""" | ||
|
||
    def test_main(self): | ||
        """Test the main wrapper function""" | ||
        # Read the expected output file into a pandas dataframe | ||
        df1_expected = pd.read_csv('unit-tests/test-output.csv') | ||
        # Run all the functions in format_disease_ontology.py | ||
        df_actual = wrapper_fun('unit-tests/test-do.xml') | ||
        # Compare expected and actual output files | ||
        assert_frame_equal(df1_expected.reset_index(drop=True), df_actual.reset_index(drop=True)) | ||
|
||
if __name__ == '__main__': | ||
    unittest.main() |
Review comment: Please create a shell script that downloads the data. If the data is converted from .owl to .csv outside of your `format_disease_ontology.py` script, then also do that here.
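The review asks for a shell script. As an illustration only, the same download step can be sketched in Python. The `scratch/` directory and the `humanDO.owl` file name come from the README's import steps. The source URL is an assumption and should be confirmed against the Disease Ontology repository before use, and the `download` helper is hypothetical.

```python
import pathlib
import shutil
import urllib.request

# Assumed source URL (not stated in the PR); verify before relying on it.
DO_OWL_URL = "http://purl.obolibrary.org/obo/doid.owl"


def download(url, dest_dir="scratch", filename="humanDO.owl"):
    """Fetch `url` into dest_dir/filename and return the path of the saved file."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / filename
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (performs a network request, so it is left commented out):
# download(DO_OWL_URL)
```

A `download.sh` with a single `curl` or `wget` call would match the reviewer's request more directly. The Python form is shown only to keep the examples in one language.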