Add disease ontology #473
base: master
@@ -0,0 +1,57 @@

# Importing the Disease Ontology (DO) data

## Table of Contents

- [Importing the Disease Ontology (DO) data](#importing-the-disease-ontology-do-data)
  - [Table of Contents](#table-of-contents)
  - [About the Dataset](#about-the-dataset)
    - [Download URL](#download-url)
    - [Overview](#overview)
    - [Schema Overview](#schema-overview)
    - [Notes and Caveats](#notes-and-caveats)
  - [About the import](#about-the-import)
    - [Artifacts](#artifacts)
    - [Scripts](#scripts)
  - [Examples](#examples)

## About the Dataset

### Download URL

The Human Disease Ontology data can be downloaded from the official GitHub repository [here](https://github.com/DiseaseOntology/HumanDiseaseOntology). The data is distributed in `.owl` format and is parsed into a `.csv` format prior to import (see [Notes and Caveats](#notes-and-caveats) for additional information on formatting).

### Overview

The Disease Ontology database provides a standardized ontology for human diseases, for the purposes of consistency and reusability. It contains extensive cross-mapping of DO terms to other databases, namely MeSH, ICD, NCI's thesaurus, SNOMED, and OMIM. More information on the database can be found [here](https://disease-ontology.org).

This directory stores the script used to convert the dataset obtained from DO into a modified version, suitable for ingestion into the Data Commons knowledge graph.

### Schema Overview

The schema representing the disease data from DO is defined in [combined_list.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list.mcf) and [combined_list_enum.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list_enum.mcf).

> Review comment: The schema will be stored in ChemicalCompounds.mcf and ChemicalCompoundsEnum.mcf.
>
> Reply: @spiekos, should it be chemical_compounds.mcf or disease.mcf?

This dataset contains several instances of the class `Disease`, each of which has multiple properties, namely "parent", "diseaseDescription", "alternativeDOIDs", "diseaseSynonym", "commonName", "icdoID", "meshID", "nciID", "snowmedctusID", "umlscuiID", "icd10CMID", "icd9CMID", "orDOID", "gardID", "omimID", "efoID", "keggDiseaseID", and "medDraID".

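As a rough illustration of how these properties come together, a single `Disease` node written out in MCF might look like the sketch below. The `bio/DOID_...` dcid pattern follows the dcid generation in `format_disease_ontology.py` (":" replaced by "_", prefixed with "bio/"), but the DOID and all property values here are placeholders rather than records taken from the dataset:

```
Node: dcid:bio/DOID_0000001
typeOf: dcs:Disease
commonName: "example disease"
diseaseDescription: "placeholder description of the disease"
diseaseSynonym: "example synonym"
alternativeDOIDs: "DOID:0000002"
```
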
### Notes and Caveats

The original format of the data was `.owl`, and it was converted to a `.csv` file prior to ingestion into Data Commons.

> Review comment: Please confirm that the only caveat of the dataset and data cleaning process is that it needs to be converted from an .owl file to a .csv file. Was there nothing acknowledged by the Disease Ontology documentation itself, or any strange things that you encountered in cleaning the data? E.g. here is where you should note that a node can have multiple parents.

### License

This data is under a Creative Commons Public Domain Dedication, [CC0 1.0 Universal license](https://disease-ontology.org/resources/do-resources).

> Review comment: Missing link to the Disease Ontology website where it states that it's under the Creative Commons Public Domain license.
>
> Reply: @spiekos, I'm sorry, I didn't quite understand that, because https://disease-ontology.org/resources/do-resources directs the user to the license page on the DO website.

## About the import

### Artifacts

> Review comment: Missing test data and unittests.

#### Scripts

> Review comment: Add short descriptions to all scripts and files. Internally link the scripts and files to themselves in the directory.

[`format_disease_ontology.py`](format_disease_ontology.py): converts the `.owl` disease ontology file into a `.csv`, creates dcids for each disease, and links the dcids of current MeSH and ICD10 codes to the corresponding properties in the dataset.

## Examples

To generate the formatted csv file from the owl file:

```
python format_disease_ontology.py humanDO.owl humanDO.csv
```

@@ -0,0 +1,23 @@

Node: E:DiseaseOntology->E1
typeOf: dcs:Disease

> Review comment: Update the script to include a column of text values with the disease ontology ids, e.g. "DOID:0060329", then add to the tmcf the line

dcid: C:DiseaseOntology->dcid
parent: C:DiseaseOntology->subClassOf
diseaseDescription: C:DiseaseOntology->IAO_0000115
alternativeDOIDs: C:DiseaseOntology->hasAlternativeId
diseaseSynonym: C:DiseaseOntology->hasExactSynonym
commonName: C:DiseaseOntology->label
icdoID: C:DiseaseOntology->ICDO
meshID: C:DiseaseOntology->MESH
nciID: C:DiseaseOntology->NCI
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20200901

> Review comment: Are these SNOMEDCTUS20200901 different columns in the file? Why are they stored under such odd column names?
>
> Reply: The columns represent different versions of SNOMED, so SNOMEDCTUS20200901 refers to the 09/01/2020 version of the data.

snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20200301
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20180301
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20190901
umlscuiID: C:DiseaseOntology->UMLSCUI
icd10CMID: C:DiseaseOntology->ICD10CM
icd9CMID: C:DiseaseOntology->ICD9CM
orDOID: C:DiseaseOntology->ORDO
gardID: C:DiseaseOntology->GARD
omimID: C:DiseaseOntology->OMIM
efoID: C:DiseaseOntology->EFO
medDraID: C:DiseaseOntology->MEDDRA

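The `C:DiseaseOntology->` references above point at columns of the cleaned CSV, so the generated file is expected to carry a header roughly along these lines. This header is inferred from the TMCF rather than copied from an actual output file, and the column order is not significant:

```
dcid,subClassOf,IAO_0000115,hasAlternativeId,hasExactSynonym,label,ICDO,MESH,NCI,SNOMEDCTUS20200901,SNOMEDCTUS20200301,SNOMEDCTUS20180301,SNOMEDCTUS20190901,UMLSCUI,ICD10CM,ICD9CM,ORDO,GARD,OMIM,EFO,MEDDRA
```
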
@@ -0,0 +1,287 @@

# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Author: Suhana Bedi
Date: 08/03/2021
Name: format_disease_ontology
Description: converts a .owl disease ontology file into
a csv format, creates dcids for each disease and links the dcids
of current MeSH and ICD10 codes to the corresponding properties in
the dataset.
@file_input: input .owl from Human DO database
@file_output: formatted .csv with disease ontology
"""

from xml.etree import ElementTree
from collections import defaultdict
import pandas as pd
import re
import numpy as np
import datacommons as dc
import sys


def format_tag(tag: str) -> str:
    """Extract human-readable tag from xml tag
    Args:
        tag: tag of an element in xml file,
        containing human-readable string after '}'
    Returns:
        tag_readable: human-readable string after '}'
    """
    tag_readable = tag.split("}")[1]
    return tag_readable


def format_attrib(attrib: dict) -> str:
    """Extract text from xml attributes dictionary
    Args:
        attrib: attribute of an xml element
    Returns:
        text: extracted text from attribute values,
        either after '#' or after the final '/'
        if '#' does not exist
    """
    attrib = list(attrib.values())[0]
    text = None
    if "#" in attrib:
        text = attrib.split("#")[-1]
    else:
        text = attrib.split("/")[-1]
    return text


def parse_do_info(info: list) -> dict:
    """Parse owl class children
    into a human-readable dictionary
    Args:
        info: list of owl class children
    Returns:
        info_dict: human-readable dictionary
        containing information on owl class children
    """
    info_dict = defaultdict(list)
    for element in info:
        tag = format_tag(element.tag)
        if element.text is None:
            text = format_attrib(element.attrib)
            info_dict[tag].append(text)
        else:
            info_dict[tag].append(element.text)
    return info_dict


def format_cols(df):
    """
    Converts all columns to string type and
    replaces all special characters
    Args:
        df = dataframe to change
    Returns:
        None
    """
    for i, col in enumerate(df.columns):
        df[col] = df[col].astype(str)
        df[col] = df[col].map(lambda x: re.sub(r'[\([{})\]]', '', x))
        df.iloc[:, i] = df.iloc[:, i].str.replace("'", '')
        df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
        df[col] = df[col].replace('nan', np.nan)
    df['id'] = df['id'].str.replace(':', '_')


def col_explode(df):
    """
    Splits the hasDbXref column into multiple columns
    based on the prefix identifying the database from which
    the ID originates
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe
    """
    df = df.assign(hasDbXref=df.hasDbXref.str.split(",")).explode('hasDbXref')
    df[['A', 'B']] = df['hasDbXref'].str.split(':', n=1, expand=True)
    df['A'] = df['A'].astype(str).map(lambda x: re.sub('[^A-Za-z0-9]+', '', x))
    col_add = list(df['A'].unique())
    for newcol in col_add:
        df[newcol] = np.nan
        df[newcol] = np.where(df['A'] == newcol, df['B'], np.nan)
        df[newcol] = df[newcol].astype(str).replace("nan", np.nan)
    return df


def shard(list_to_shard, shard_size):
    """
    Breaks down a list into smaller
    sublists, converts each into an array
    and appends the array to the master
    list
    Args:
        list_to_shard = original list
        shard_size = size of sublist
    Returns:
        sharded_list = master list with
        smaller sublists
    """
    sharded_list = []
    for i in range(0, len(list_to_shard), shard_size):
        shard = list_to_shard[i:i + shard_size]
        arr = np.array(shard)
        sharded_list.append(arr)
    return sharded_list


def col_string(df):
    """
    Adds string quotes to columns in a dataframe
    Args:
        df = dataframe whose columns are modified
    Returns:
        None
    """
    col_add = list(df['A'].unique())
    for newcol in col_add:
        df[newcol] = str(newcol) + ":" + df[newcol].astype(str)
        col_rep = str(newcol) + ":" + "nan"
        df[newcol] = df[newcol].replace(col_rep, np.nan)
    col_names = [
        'hasAlternativeId', 'hasExactSynonym', 'label', 'ICDO', 'MESH', 'NCI',
        'SNOMEDCTUS20200901', 'UMLSCUI', 'ICD10CM', 'ICD9CM',
        'SNOMEDCTUS20200301', 'ORDO', 'SNOMEDCTUS20180301', 'GARD', 'OMIM',
        'EFO', 'KEGG', 'MEDDRA', 'SNOMEDCTUS20190901'
    ]
    for col in col_names:
        df.update('"' + df[[col]].astype(str) + '"')


def mesh_query(df):
    """
    Queries the MESH ids present in the dataframe
    on Data Commons, fetches their dcids and adds
    them to the same column.
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe with MESH dcids added
    """
    df_temp = df[df.MESH.notnull()]
    list_mesh = list(df_temp['MESH'])
    arr_mesh = shard(list_mesh, 1000)
    for i in range(len(arr_mesh)):
        query_str = """
        SELECT DISTINCT ?id ?element_name
        WHERE {{
        ?element typeOf MeSHDescriptor .
        ?element dcid ?id .
        ?element name ?element_name .
        ?element name {0} .
        }}
        """.format(arr_mesh[i])
        result = dc.query(query_str)
        result_df = pd.DataFrame(result)
        result_df.columns = ['id', 'element_name']
        df.MESH.update(df.MESH.map(result_df.set_index('element_name').id))
    return df

> Review comment (on the `?element typeOf MeSHDescriptor .` line): Not sure if there is a typo or "MeSHDescriptor" is supposed to be cased like that?
>
> Reply: Yes, it's MeSHDescriptor - https://datacommons.org/browser/MeSHDescriptor


def icd10_query(df):
    """
    Queries the ICD10 ids present in the dataframe
    on Data Commons, fetches their dcids and adds
    them to the same column.
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe with ICD dcids added
    """
    df_temp = df[df.ICD10CM.notnull()]
    list_icd10 = "ICD10/" + df_temp['ICD10CM'].astype(str)
    arr_icd10 = shard(list_icd10, 1000)
    for i in range(len(arr_icd10)):
        query_str = """
        SELECT DISTINCT ?id
        WHERE {{
        ?element typeOf ICD10Code .
        ?element dcid ?id .
        ?element dcid {0} .
        }}
        """.format(arr_icd10[i])
        result1 = dc.query(query_str)
        result1_df = pd.DataFrame(result1)
        result1_df['element'] = result1_df['?id'].str.split(pat="/").str[1]
        result1_df.columns = ['id', 'element']
        df.ICD10CM.update(df.ICD10CM.map(result1_df.set_index('element').id))
    return df


def remove_newline(df):
    """
    Strips escaped newline characters from the disease
    descriptions (IAO_0000115) of specific rows
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe
    """
    df.loc[2505, 'IAO_0000115'] = df.loc[2505, 'IAO_0000115'].replace("\\n", "")
    df.loc[2860, 'IAO_0000115'] = df.loc[2860, 'IAO_0000115'].replace("\\n", "")
    df.loc[2895, 'IAO_0000115'] = df.loc[2895, 'IAO_0000115'].replace("\\n", "")
    df.loc[2934, 'IAO_0000115'] = df.loc[2934, 'IAO_0000115'].replace("\\n", "")
    df.loc[3036, 'IAO_0000115'] = df.loc[3036, 'IAO_0000115'].replace("\\n", "")
    df.loc[11305, 'IAO_0000115'] = df.loc[11305,
                                          'IAO_0000115'].replace("\\n", "")
    return df


> Review comment (on wrapper_fun): Too much is stored under main. Put into other functions and then call in main.
>
> Reply: Done!

def wrapper_fun(file_input, file_output):
    """
    Runs the end-to-end conversion from the input
    .owl file to the formatted output .csv file
    Args:
        file_input = path of the input .owl file
        file_output = path of the output .csv file
    Returns:
        None
    """
    # Read disease ontology .owl file
    tree = ElementTree.parse(file_input)
    # Get file root
    root = tree.getroot()
    # Find owl classes elements
    all_classes = root.findall('{http://www.w3.org/2002/07/owl#}Class')
    # Parse owl classes to human-readable dictionary format
    parsed_owl_classes = []
    for owl_class in all_classes:
        info = list(owl_class.iter())
        parsed_owl_classes.append(parse_do_info(info))
    # Convert to pandas Dataframe
    df_do = pd.DataFrame(parsed_owl_classes)
    format_cols(df_do)
    df_do = df_do.drop([
        'Class', 'exactMatch', 'deprecated', 'hasRelatedSynonym', 'comment',
        'OBI_9991118', 'narrowMatch', 'hasBroadSynonym', 'disjointWith',
        'hasNarrowSynonym', 'broadMatch', 'created_by', 'creation_date',
        'inSubset', 'hasOBONamespace'
    ],
                       axis=1)
    df_do = col_explode(df_do)
    df_do = mesh_query(df_do)
    df_do = icd10_query(df_do)
    col_string(df_do)
    df_do = df_do.drop(['A', 'B', 'nan', 'hasDbXref', 'KEGG'], axis=1)
    df_do = df_do.drop_duplicates(subset='id', keep="last")
    df_do = df_do.reset_index(drop=True)
    df_do = df_do.replace('"nan"', np.nan)
    # Generate dcids
    df_do['id'] = "bio/" + df_do['id']
    df_do = remove_newline(df_do)
    df_do['IAO_0000115'] = df_do['IAO_0000115'].str.replace("_", " ")
    df_do.to_csv(file_output)


> Review comment (on main): Please confirm that this is the most up-to-date version of the script that includes changes like how to handle lists of text values that have commas within a single cell value and other changes that we've previously discussed.

def main():
    file_input = sys.argv[1]
    file_output = sys.argv[2]
    wrapper_fun(file_input, file_output)


if __name__ == '__main__':
    main()

> Review comment: Please create a shell script which downloads the data. If the data is converted from .owl to .csv outside of your `format_disease_ontology.py` script, then also do that here.

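A minimal sketch of such a download-and-convert script (a hypothetical `download.sh`) is shown below. It assumes the raw `.owl` file is published in the DiseaseOntology/HumanDiseaseOntology repository under `src/ontology/HumanDO.owl` and that the script sits next to `format_disease_ontology.py`; the exact URL and filename are assumptions to verify against the DO repository, not details taken from this PR:

```
#!/bin/bash
# download.sh -- fetch the Human Disease Ontology .owl file and convert it to .csv.
# NOTE: the download URL below is an assumed location; confirm the current path in
# the DiseaseOntology/HumanDiseaseOntology repository before relying on it.
set -e

# Download the raw ontology file.
curl -L -o humanDO.owl \
  "https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/HumanDO.owl"

# Convert the .owl file into the formatted .csv used for import.
python format_disease_ontology.py humanDO.owl humanDO.csv
```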