Add disease ontology #473

Open
Wants to merge 31 commits into base: master
Changes from 9 commits

Commits
2b65350
feat: add disease_ontology.tmcf
Aug 3, 2021
e952017
feat: add format_disease_ontology.py
Aug 3, 2021
bc28cca
feat: add README
Aug 3, 2021
91667cc
Merge branch 'master' into add_disease_ontology
spiekos Aug 6, 2021
f8efa54
Update README.md
spiekos Aug 6, 2021
9c12a2d
feat: add helper function
Aug 6, 2021
8d4f7f2
fix: nits
Aug 6, 2021
6a5cc0c
fix: property in tmcf
Sep 27, 2021
2aef466
feat: format cols
Oct 8, 2021
06f125e
Merge branch 'master' into add_disease_ontology
chejennifer Apr 29, 2022
4832d53
add unittests
Jul 8, 2022
702e9be
Update README.md
spiekos Jul 26, 2022
75f9256
Update README.md
spiekos Jul 26, 2022
86502c0
Merge branch 'master' into add_disease_ontology
spiekos Jul 27, 2022
1329557
Update .tmcf
spiekos Aug 1, 2022
8e6f5ce
update readme
Aug 5, 2022
377841a
feat: add download file
Aug 5, 2022
15cdeb1
add function edits to the script
Aug 5, 2022
370a2e5
fix: ICD10 formatting
Sep 19, 2022
75dc2d6
feat: update tmcf
Sep 19, 2022
e783ba9
fix: line number for formatting
Sep 19, 2022
1788784
Update disease_ontology.tmcf
spiekos Sep 20, 2022
3db959a
fix: column formatting
Sep 20, 2022
c3eac4a
Update disease_ontology.tmcf
spiekos Sep 20, 2022
526266f
add diseaseID column
Sep 22, 2022
10bf338
fix column formatting
Sep 26, 2022
010be38
fix unit tests
Oct 24, 2022
03926bc
remove old test file
Oct 24, 2022
caa3e3e
feat: add missing synonyms for disease terms
Feb 2, 2023
4d0c493
feat:update format_disease_ontology.py
Aug 1, 2023
5f0c1a1
feat: add illegal char check
Aug 14, 2023
57 changes: 57 additions & 0 deletions scripts/biomedical/diseaseOntology/README.md
@@ -0,0 +1,57 @@
# Importing the Disease Ontology (DO) data

## Table of Contents

- [Importing the Disease Ontology (DO) data](#importing-the-disease-ontology-do-data)
- [Table of Contents](#table-of-contents)
- [About the Dataset](#about-the-dataset)
- [Download URL](#download-url)
- [Overview](#overview)
- [Schema Overview](#schema-overview)
- [Notes and Caveats](#notes-and-caveats)
- [About the import](#about-the-import)
- [Artifacts](#artifacts)
- [Scripts](#scripts)
- [Examples](#examples)

## About the Dataset

### Download URL

The Human Disease Ontology data can be downloaded from the project's official GitHub repository [here](https://www.vmh.life/#human/all). The data is in `.owl` format and must be parsed into `.csv` format (see [Notes and Caveats](#notes-and-caveats) for additional information on formatting).

Contributor:
Please create a shell script, which downloads the data. If the data is converted from .owl to .csv outside of your format_disease_ontology.py script, then also do that here.


### Overview

The Disease Ontology database provides a standardized ontology for human diseases, for the purposes of consistency and reusability. It contains extensive cross mapping of DO terms to other databases, namely, MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM. More information on the database can be found [here](https://disease-ontology.org).

This directory stores the script used to convert the dataset obtained from DO into a modified version, for effective ingestion of data into the Data Commons knowledge graph.

### Schema Overview

The schema representing reaction, metabolite and microbiome data from VMH is defined in [DO.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list.mcf) and [DO.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list_enum.mcf).

Contributor:
Schema will be stored in ChemicalCompounds.mcf and ChemicalCompoundsEnum.mcf.

Author:
@spiekos , should it be chemical_compounds.mcf or disease.mcf?


This dataset contains several instances of the class `Disease`, which has multiple properties, namely "parent", "diseaseDescription", "alternativeDOIDs", "diseaseSynonym", "commonName", "icdoID", "meshID", "nciID", "snowmedctusID", "umlscuiID", "icd10CMID", "icd9CMID", "orDOID", "gardID", "omimID", "efoID", "keggDiseaseID", and "medDraID".

### Notes and Caveats

The original format of the data was `.owl` and it was converted to a `.csv` file prior to ingestion into Data Commons.

Contributor:
Please confirm that the only caveat of the dataset and data cleaning process is that it needs to be converted from an .owl file to a .csv file. Was there nothing acknowledged by Disease Ontology documentation itself or any strange things that you encountered in cleaning the data? E.g. here is where you should note that a node can have multiple parents.
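
The multiple-parents caveat raised above can be illustrated with a small sketch. It assumes that parents appear as repeated `rdfs:subClassOf` children of an `owl:Class` element, as the script's use of `subClassOf` suggests. The OWL fragment and the DOID values below are invented for illustration, and `parents_of` is a hypothetical helper that is not part of this PR.

```python
from xml.etree import ElementTree

OWL_NS = "{http://www.w3.org/2002/07/owl#}"
RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS_NS = "{http://www.w3.org/2000/01/rdf-schema#}"

# Invented fragment: one class that declares two parents.
SAMPLE = """<rdf:RDF
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0000003">
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0000001"/>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0000002"/>
    <rdfs:label>example disease</rdfs:label>
  </owl:Class>
</rdf:RDF>"""


def parents_of(owl_class):
    """Return the IDs of every parent referenced by an owl:Class element."""
    parents = []
    for child in owl_class.findall(RDFS_NS + "subClassOf"):
        resource = child.attrib.get(RDF_NS + "resource")
        if resource is not None:
            # Keep only the part after the final '/', e.g. 'DOID_0000001'.
            parents.append(resource.rsplit("/", 1)[-1])
    return parents


root = ElementTree.fromstring(SAMPLE)
owl_class = root.find(OWL_NS + "Class")
parent_ids = parents_of(owl_class)
print(parent_ids)
```

Because the result is a list, the formatted CSV needs some convention for storing more than one parent in a single cell, which is the kind of detail the Notes and Caveats section could record.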


### License

This data is under a Creative Commons Public Domain Dedication [CC0 1.0 Universal license](https://disease-ontology.org/resources/do-resources).

Contributor:
Missing link to Disease Ontology website where it states that it's under the Creative Commons Public Domain license

Author:
@spiekos , I'm sorry I didn't quite understand that because https://disease-ontology.org/resources/do-resources directs the user to the license page on DO website.


## About the import

### Artifacts

Contributor:
Missing test data and unittests


#### Scripts

Contributor:
Add short descriptions to all scripts and files. Internally link the scripts and files to itself in the directory.


`format_disease_ontology.py`

## Examples

To generate the formatted csv file from owl:

```
python format_disease_ontology.py humanDO.owl humanDO.csv
```
23 changes: 23 additions & 0 deletions scripts/biomedical/diseaseOntology/disease_ontology.tmcf
@@ -0,0 +1,23 @@
Node: E:DiseaseOntology->E1
typeOf: dcs:Disease

Contributor:
Update script to include a column of text values with the disease ontology ids eg "DOID:0060329" then add to the tmcf the line diseaseOntologyID: C:DiseaseOntology->diseaseOntologyID

dcid: C:DiseaseOntology->dcid
parent: C:DiseaseOntology->subClassOf
diseaseDescription: C:DiseaseOntology->IAO_0000115
alternativeDOIDs : C:DiseaseOntology->hasAlternativeId
diseaseSynonym: C:DiseaseOntology->hasExactSynonym
commonName: C:DiseaseOntology->label
icdoID: C:DiseaseOntology->ICDO
meshID: C:DiseaseOntology->MESH
nciID: C:DiseaseOntology->NCI
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20200901

Contributor:
are these SNOMEDCTUS20200901 different columns in the file? Why are they stored under such odd column names?

Author:
The columns represent different versions of SNOMED CT, so SNOMEDCTUS20200901 refers to the 09/01/2020 version of the data.

snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20200301
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20180301
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20190901
umlscuiID: C:DiseaseOntology->UMLSCUI
icd10CMID: C:DiseaseOntology->ICD10CM
icd9CMID: C:DiseaseOntology->ICD9CM
orDOID: C:DiseaseOntology->ORDO
gardID: C:DiseaseOntology->GARD
omimID: C:DiseaseOntology->OMIM
efoID: C:DiseaseOntology->EFO
medDraID: C:DiseaseOntology->MEDDRA
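
The tmcf above maps four dated `SNOMEDCTUS...` columns onto the same property. As one possible alternative, and only as a sketch, the dated columns could be coalesced into a single column before export, preferring the newest release. The column name `snomedCT`, the helper `first_present`, and all values below are assumptions made for illustration; none of them appear in the PR.

```python
import pandas as pd

# Invented miniature of the formatted table.
df = pd.DataFrame({
    "id": ["DOID_1", "DOID_2", "DOID_3"],
    "SNOMEDCTUS20200901": ["111", None, None],
    "SNOMEDCTUS20200301": [None, "222", None],
    "SNOMEDCTUS20180301": [None, None, None],
    "SNOMEDCTUS20190901": [None, "999", "333"],
})

# The date suffix sorts lexicographically, so reverse order is newest first.
snomed_cols = sorted(
    (c for c in df.columns if c.startswith("SNOMEDCTUS")), reverse=True)


def first_present(row):
    """Return the first non-missing value in a row, or None if all are missing."""
    return next((value for value in row if pd.notna(value)), None)


# Hypothetical merged column holding the newest available identifier.
df["snomedCT"] = df[snomed_cols].apply(first_present, axis=1)
print(df[["id", "snomedCT"]])
```

Whether collapsing versions is desirable depends on whether identifiers differ between releases, which this sketch does not check.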
287 changes: 287 additions & 0 deletions scripts/biomedical/diseaseOntology/format_disease_ontology.py
@@ -0,0 +1,287 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Author: Suhana Bedi
Date: 08/03/2021
Name: format_disease_ontology
Description: converts a .owl disease ontology file into
a csv format, creates dcids for each disease and links the dcids
of current MeSH and ICD10 codes to the corresponding properties in
the dataset.
@file_input: input .owl from Human DO database
@file_output: formatted .csv with disease ontology
"""

from xml.etree import ElementTree
from collections import defaultdict
import pandas as pd
import re
import numpy as np
import datacommons as dc
import sys


def format_tag(tag: str) -> str:
    """Extract human-readable tag from xml tag
    Args:
        tag: tag of an element in xml file,
             containing human-readable string after '}'
    Returns:
        tag_readable: human-readable string after '}'
    """
    tag_readable = tag.split("}")[1]
    return tag_readable


def format_attrib(attrib: dict) -> str:
    """Extract text from xml attributes dictionary
    Args:
        attrib: attribute of an xml element
    Returns:
        text: extracted text from attribute values,
              either after '#' or after the final '/'
              if '#' does not exist
    """
    attrib = list(attrib.values())[0]
    text = None
    if "#" in attrib:
        text = attrib.split("#")[-1]
    else:
        text = attrib.split("/")[-1]
    return text


def parse_do_info(info: list) -> dict:
    """Parse owl class children
    to human-readable dictionary
    Args:
        info: list of owl class children
    Returns:
        info_dict: human-readable dictionary
                   containing information of owl class children
    """
    info_dict = defaultdict(list)
    for element in info:
        tag = format_tag(element.tag)
        if element.text is None:
            text = format_attrib(element.attrib)
            info_dict[tag].append(text)
        else:
            info_dict[tag].append(element.text)
    return info_dict


def format_cols(df):
    """
    Converts all columns to string type and
    replaces all special characters
    Args:
        df = dataframe to change
    Returns:
        None
    """
    for i, col in enumerate(df.columns):
        df[col] = df[col].astype(str)
        df[col] = df[col].map(lambda x: re.sub(r'[\([{})\]]', '', x))
        df.iloc[:, i] = df.iloc[:, i].str.replace("'", '')
        df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
        df[col] = df[col].replace('nan', np.nan)
    df['id'] = df['id'].str.replace(':', '_')


def col_explode(df):
    """
    Splits the hasDbXref column into multiple columns
    based on the prefix identifying the database from which
    the ID originates
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe
    """
    df = df.assign(hasDbXref=df.hasDbXref.str.split(",")).explode('hasDbXref')
    # `n` is passed by keyword; newer pandas releases reject it positionally
    df[['A', 'B']] = df['hasDbXref'].str.split(':', n=1, expand=True)
    df['A'] = df['A'].astype(str).map(lambda x: re.sub('[^A-Za-z0-9]+', '', x))
    col_add = list(df['A'].unique())
    for newcol in col_add:
        df[newcol] = np.nan
        df[newcol] = np.where(df['A'] == newcol, df['B'], np.nan)
        df[newcol] = df[newcol].astype(str).replace("nan", np.nan)
    return df
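
The prefix-splitting idea in `col_explode` can be tried on a toy table. This is a sketch of a variant, not the function above: it collapses the result back to one row per disease with a `groupby` and `unstack`, whereas `col_explode` keeps one row per cross-reference. The column names `source` and `xref` and all identifiers below are invented for illustration.

```python
import pandas as pd

# Invented input: comma-separated "PREFIX:value" cross-references per disease.
df = pd.DataFrame({
    "id": ["DOID_1", "DOID_2"],
    "hasDbXref": ["MESH:D000001,ICD10CM:A00", "MESH:D000002"],
})

# One row per cross-reference.
long_df = df.assign(hasDbXref=df["hasDbXref"].str.split(",")).explode(
    "hasDbXref", ignore_index=True)

# Split only on the first ':' so values that contain ':' stay intact.
long_df[["source", "xref"]] = long_df["hasDbXref"].str.split(
    ":", n=1, expand=True)

# One column per source database, one row per disease.
wide = (long_df.groupby(["id", "source"])["xref"]
        .agg(",".join)
        .unstack("source"))
print(wide)
```

Diseases that lack a cross-reference for a given source end up with a missing value in that column, which is the same situation the script handles by replacing the string "nan".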


def shard(list_to_shard, shard_size):
    """
    Breaks down a list into smaller
    sublists, converts it into an array
    and appends the array to the master
    list
    Args:
        list_to_shard = original list
        shard_size = size of sublist
    Returns:
        sharded_list = master list with
                       smaller sublists
    """
    sharded_list = []
    for i in range(0, len(list_to_shard), shard_size):
        shard = list_to_shard[i:i + shard_size]
        arr = np.array(shard)
        sharded_list.append(arr)
    return sharded_list
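
The batching done by `shard` can also be written as a generator over plain list slices, shown here as a minimal sketch. `chunked` is a hypothetical name and the identifiers are invented; the slicing logic is the same as in `shard`, only without the NumPy conversion.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five invented identifiers split into batches of two.
batches = list(chunked(["D1", "D2", "D3", "D4", "D5"], 2))
print(batches)
```

Keeping the batches as lists leaves the choice of how to render them into a query string to the caller.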


def col_string(df):
    """
    Adds string quotes to columns in a dataframe
    Args:
        df = dataframe whose columns are modified
    Returns:
        None
    """
    col_add = list(df['A'].unique())
    for newcol in col_add:
        df[newcol] = str(newcol) + ":" + df[newcol].astype(str)
        col_rep = str(newcol) + ":" + "nan"
        df[newcol] = df[newcol].replace(col_rep, np.nan)
    col_names = [
        'hasAlternativeId', 'hasExactSynonym', 'label', 'ICDO', 'MESH', 'NCI',
        'SNOMEDCTUS20200901', 'UMLSCUI', 'ICD10CM', 'ICD9CM',
        'SNOMEDCTUS20200301', 'ORDO', 'SNOMEDCTUS20180301', 'GARD', 'OMIM',
        'EFO', 'KEGG', 'MEDDRA', 'SNOMEDCTUS20190901'
    ]
    for col in col_names:
        df.update('"' + df[[col]].astype(str) + '"')


def mesh_query(df):
    """
    Queries the MESH ids present in the dataframe,
    on datacommons, fetches their dcids and adds
    it to the same column.
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe with MESH dcid added
    """
    df_temp = df[df.MESH.notnull()]
    list_mesh = list(df_temp['MESH'])
    arr_mesh = shard(list_mesh, 1000)
    for i in range(len(arr_mesh)):
        query_str = """
        SELECT DISTINCT ?id ?element_name
        WHERE {{
        ?element typeOf MeSHDescriptor .

Contributor:
not sure if there is a typo or "MeSHDescriptor" is supposed to be cased like that?

        ?element dcid ?id .
        ?element name ?element_name .
        ?element name {0} .
        }}
        """.format(arr_mesh[i])
        result = dc.query(query_str)
        result_df = pd.DataFrame(result)
        result_df.columns = ['id', 'element_name']
        df.MESH.update(df.MESH.map(result_df.set_index('element_name').id))
    return df


def icd10_query(df):
    """
    Queries the ICD10 ids present in the dataframe,
    on datacommons, fetches their dcids and adds
    it to the same column.
    Args:
        df = dataframe to change
    Returns:
        df = modified dataframe with ICD dcid added
    """
    df_temp = df[df.ICD10CM.notnull()]
    list_icd10 = "ICD10/" + df_temp['ICD10CM'].astype(str)
    arr_icd10 = shard(list_icd10, 1000)
    for i in range(len(arr_icd10)):
        query_str = """
        SELECT DISTINCT ?id
        WHERE {{
        ?element typeOf ICD10Code .
        ?element dcid ?id .
        ?element dcid {0} .
        }}
        """.format(arr_icd10[i])
        result1 = dc.query(query_str)
        result1_df = pd.DataFrame(result1)
        result1_df['element'] = result1_df['?id'].str.split(pat="/").str[1]
        result1_df.columns = ['id', 'element']
        df.ICD10CM.update(df.ICD10CM.map(result1_df.set_index('element').id))
    return df


def remove_newline(df):
    df.loc[2505, 'IAO_0000115'] = df.loc[2505, 'IAO_0000115'].replace("\\n", "")
    df.loc[2860, 'IAO_0000115'] = df.loc[2860, 'IAO_0000115'].replace("\\n", "")
    df.loc[2895, 'IAO_0000115'] = df.loc[2895, 'IAO_0000115'].replace("\\n", "")
    df.loc[2934, 'IAO_0000115'] = df.loc[2934, 'IAO_0000115'].replace("\\n", "")
    df.loc[3036, 'IAO_0000115'] = df.loc[3036, 'IAO_0000115'].replace("\\n", "")
    df.loc[11305, 'IAO_0000115'] = df.loc[11305,
                                          'IAO_0000115'].replace("\\n", "")
    return df
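
`remove_newline` addresses six rows by position, which ties it to one particular release of the input file. A column-wide replacement is one way to avoid that dependency. The sketch below is an assumption about what the function is meant to achieve (removing the literal two-character sequence `\n` from descriptions), and the sample rows are invented.

```python
import pandas as pd

# Invented descriptions; the first contains a literal backslash followed by 'n'.
df = pd.DataFrame({
    "IAO_0000115": ["A disease\\nof the lung.", "No newline here.", None]
})

# regex=False treats "\\n" as plain text rather than as a pattern.
df["IAO_0000115"] = df["IAO_0000115"].str.replace("\\n", "", regex=False)
print(df["IAO_0000115"].tolist())
```

Missing descriptions stay missing, and no row numbers need to be updated when the ontology changes.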


def wrapper_fun(file_input, file_output):
    file_input = sys.argv[1]

Contributor:
too much is stored under main. Put into other functions and then call in main

Author:
Done!

    file_output = sys.argv[2]
    # Read disease ontology .owl file
    tree = ElementTree.parse(file_input)
    # Get file root
    root = tree.getroot()
    # Find owl classes elements
    all_classes = root.findall('{http://www.w3.org/2002/07/owl#}Class')
    # Parse owl classes to human-readable dictionary format
    parsed_owl_classes = []
    for owl_class in all_classes:
        # Element.getiterator() was removed in Python 3.9; iter() replaces it
        info = list(owl_class.iter())
        parsed_owl_classes.append(parse_do_info(info))
    # Convert to pandas Dataframe
    df_do = pd.DataFrame(parsed_owl_classes)
    format_cols(df_do)
    df_do = df_do.drop([
        'Class', 'exactMatch', 'deprecated', 'hasRelatedSynonym', 'comment',
        'OBI_9991118', 'narrowMatch', 'hasBroadSynonym', 'disjointWith',
        'hasNarrowSynonym', 'broadMatch', 'created_by', 'creation_date',
        'inSubset', 'hasOBONamespace'
    ],
                       axis=1)
    df_do = col_explode(df_do)
    df_do = mesh_query(df_do)
    df_do = icd10_query(df_do)
    col_string(df_do)
    df_do = df_do.drop(['A', 'B', 'nan', 'hasDbXref', 'KEGG'], axis=1)
    df_do = df_do.drop_duplicates(subset='id', keep="last")
    df_do = df_do.reset_index(drop=True)
    df_do = df_do.replace('"nan"', np.nan)
    # generate dcids
    df_do['id'] = "bio/" + df_do['id']
    ##df_do.loc[2505, 'IAO_0000115'] = df_do.loc[2505, 'IAO_0000115'].replace("\\n", "")
    df_do = remove_newline(df_do)
    df_do['IAO_0000115'] = df_do['IAO_0000115'].str.replace("_", " ")
    df_do.to_csv(file_output)


def main():

Contributor:
Please confirm that this is the most up-to-date version of the script that includes changes like how to handle lists of texts values that have commas within a single cell value and other changes that we've previously discussed.

    file_input = sys.argv[1]
    file_output = sys.argv[2]
    wrapper_fun(file_input, file_output)


if __name__ == '__main__':
    main()
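
Both `main` and `wrapper_fun` read `sys.argv` directly, so the function arguments are overwritten by the command line. One common alternative is to parse the arguments once and pass them along. The sketch below uses the standard `argparse` module; `parse_args` is a hypothetical helper, and the argument names simply mirror the ones already used in the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output paths; `argv=None` falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        description="Convert a Disease Ontology .owl file to .csv")
    parser.add_argument("file_input", help="path to the input .owl file")
    parser.add_argument("file_output", help="path to the output .csv file")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to exercise without a real command line.
args = parse_args(["humanDO.owl", "humanDO.csv"])
print(args.file_input, args.file_output)
```

With this shape, `wrapper_fun(args.file_input, args.file_output)` could rely on its parameters alone, which also makes it callable from unit tests.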