Add disease ontology #473

Open
wants to merge 31 commits into
base: master
Changes from 23 commits
2b65350
feat: add disease_ontology.tmcf
Aug 3, 2021
e952017
feat: add format_disease_ontology.py
Aug 3, 2021
bc28cca
feat: add README
Aug 3, 2021
91667cc
Merge branch 'master' into add_disease_ontology
spiekos Aug 6, 2021
f8efa54
Update README.md
spiekos Aug 6, 2021
9c12a2d
feat: add helper function
Aug 6, 2021
8d4f7f2
fix: nits
Aug 6, 2021
6a5cc0c
fix: property in tmcf
Sep 27, 2021
2aef466
feat: format cols
Oct 8, 2021
06f125e
Merge branch 'master' into add_disease_ontology
chejennifer Apr 29, 2022
4832d53
add unittests
Jul 8, 2022
702e9be
Update README.md
spiekos Jul 26, 2022
75f9256
Update README.md
spiekos Jul 26, 2022
86502c0
Merge branch 'master' into add_disease_ontology
spiekos Jul 27, 2022
1329557
Update .tmcf
spiekos Aug 1, 2022
8e6f5ce
update readme
Aug 5, 2022
377841a
feat: add download file
Aug 5, 2022
15cdeb1
add function edits to the script
Aug 5, 2022
370a2e5
fix: ICD10 formatting
Sep 19, 2022
75dc2d6
feat: update tmcf
Sep 19, 2022
e783ba9
fix: line number for formatting
Sep 19, 2022
1788784
Update disease_ontology.tmcf
spiekos Sep 20, 2022
3db959a
fix: column formatting
Sep 20, 2022
c3eac4a
Update disease_ontology.tmcf
spiekos Sep 20, 2022
526266f
add diseaseID column
Sep 22, 2022
10bf338
fix column formatting
Sep 26, 2022
010be38
fix unit tests
Oct 24, 2022
03926bc
remove old test file
Oct 24, 2022
caa3e3e
feat: add missing synonyms for disease terms
Feb 2, 2023
4d0c493
feat:update format_disease_ontology.py
Aug 1, 2023
5f0c1a1
feat: add illegal char check
Aug 14, 2023
99 changes: 99 additions & 0 deletions scripts/biomedical/diseaseOntology/README.md
@@ -0,0 +1,99 @@
# Importing the Disease Ontology (DO) data

## Table of Contents

- [Importing the Disease Ontology (DO) data](#importing-the-disease-ontology-do-data)
- [Table of Contents](#table-of-contents)
- [About the Dataset](#about-the-dataset)
- [Download Data](#download-data)
- [Overview](#overview)
- [Notes and Caveats](#notes-and-caveats)
- [License](#license)
- [About the import](#about-the-import)
- [Artifacts](#artifacts)
- [Scripts](#scripts)
- [Files](#files)
- [Examples](#examples)
- [Run Tests](#run-tests)
- [Import](#import)

## About the Dataset

[Disease Ontology](https://disease-ontology.org) (DO) is a standardized ontology for human disease that was developed "with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts through collaborative efforts of biomedical researchers, coordinated by the University of Maryland School of Medicine, Institute for Genome Sciences.

The Disease Ontology semantically integrates disease and medical vocabularies through extensive cross mapping of DO terms to MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM."

### Download Data

The Human Disease Ontology data can be downloaded from the project's official GitHub repository [here](https://github.com/DiseaseOntology/HumanDiseaseOntology/tree/main/src/ontology). The data is distributed in `.owl` format and is parsed into `.csv` format for this import (see [Notes and Caveats](#notes-and-caveats) for additional information on formatting). The data can also be downloaded by running the bash script [`download.sh`](download.sh).

### Overview

This directory stores the scripts used to download, clean, and convert the Disease Ontology data into a `.csv` file, which is ready for ingestion into the Data Commons knowledge graph alongside a `.tmcf` file that maps the `.csv` columns to the defined schema. In this import the data is ingested into the graph as [Disease](https://datacommons.org/browser/Disease) entities.

The Disease Ontology ID is mapped to identifiers in other ontologies, namely ICDO (International Classification of Diseases for Oncology), NCI (National Cancer Institute), SNOMED (Systematized Nomenclature of Medicine), UMLS CUI (Unified Medical Language System Concept Unique Identifier), ORDO (Orphanet Rare Disease Ontology), GARD (Genetic and Rare Diseases), OMIM (Online Mendelian Inheritance in Man), EFO (Experimental Factor Ontology), MEDDRA (Medical Dictionary for Regulatory Activities) and MeSH (Medical Subject Headings).

In addition, the data stores the parent class and alternative IDs for the disease of interest.
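
As a rough illustration of how the ID, label, parent class, and alternative IDs could be read out of an OWL (RDF/XML) file, here is a minimal sketch using only the Python standard library. It is not the code in `format_disease_ontology.py`: the function name `parse_classes`, the sample snippet, and the IDs in it are all hypothetical, and the real file contains many more tags than are handled here.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OBO-style OWL files serialized as RDF/XML.
NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Hypothetical sample for illustration; not copied from HumanDO.owl.
SAMPLE_OWL = """<?xml version="1.0"?>
<rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0000002">
    <rdfs:label>example child disease</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0000001"/>
    <oboInOwl:hasAlternativeId>DOID:0000003</oboInOwl:hasAlternativeId>
  </owl:Class>
</rdf:RDF>
"""


def parse_classes(owl_text):
    """Return one dict per named owl:Class found in the RDF/XML text."""
    root = ET.fromstring(owl_text)
    rows = []
    for cls in root.findall("owl:Class", NS):
        iri = cls.get(RDF_ABOUT)
        if iri is None:
            continue  # skip anonymous classes that have no IRI
        parents = [
            parent.get(RDF_RESOURCE).rsplit("/", 1)[-1]
            for parent in cls.findall("rdfs:subClassOf", NS)
            if parent.get(RDF_RESOURCE)
        ]
        alt_ids = [
            alt.text
            for alt in cls.findall("oboInOwl:hasAlternativeId", NS)
            if alt.text
        ]
        rows.append({
            "id": iri.rsplit("/", 1)[-1],  # last path segment of the IRI
            "label": cls.findtext("rdfs:label", default="", namespaces=NS),
            "subClassOf": parents,
            "hasAlternativeId": alt_ids,
        })
    return rows


for row in parse_classes(SAMPLE_OWL):
    print(row)
```

The dictionary keys mirror the column names referenced in `disease_ontology.tmcf` (`id`, `label`, `subClassOf`, `hasAlternativeId`); how the actual script formats these values may differ.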

### Notes and Caveats

The original data is in `.owl` format and was converted to a `.csv` file prior to ingestion into Data Commons. One key issue encountered during the import was that the IDs from all other ontologies were grouped under the same tag. To separate each ontology into its own column, the prefix of each ID was used. In addition, the disease description tag contained various special characters that had to be removed programmatically.
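
The prefix-based split described above can be sketched roughly as follows. This is an illustration under assumptions rather than the script's actual logic: the helper name `split_xrefs`, the `PREFIX:ID` shape of each cross-reference, and the sample IDs are all hypothetical.

```python
from collections import defaultdict


def split_xrefs(xrefs):
    """Group cross-reference strings of the form 'PREFIX:ID' by prefix.

    Entries without a prefix or without an ID are skipped.
    """
    columns = defaultdict(list)
    for xref in xrefs:
        prefix, sep, local_id = xref.partition(":")
        prefix, local_id = prefix.strip(), local_id.strip()
        if not sep or not prefix or not local_id:
            continue  # not in PREFIX:ID form
        columns[prefix].append(local_id)
    return dict(columns)


# Hypothetical cross-references that all arrived under a single tag.
sample_xrefs = ["NCI:C00001", "OMIM:000001", "GARD:0001", "NCI:C00002", "malformed"]
print(split_xrefs(sample_xrefs))
```

Each key of the returned dictionary would then become its own column in the `.csv`, with multiple IDs for the same ontology kept together in that column.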

### License

This data is under a Creative Commons Public Domain Dedication [CC0 1.0 Universal license](https://disease-ontology.org/resources/do-resources).
Contributor

Missing link to Disease Ontology website where it states that it's under the Creative Commons Public Domain license

Author

@spiekos , I'm sorry I didn't quite understand that because https://disease-ontology.org/resources/do-resources directs the user to the license page on DO website.


## About the import

### Artifacts
Contributor

Missing test data and unittests


#### Scripts
Contributor

Add short descriptions to all scripts and files. Link each script and file to its location in the directory.


##### Shell Script

[`download.sh`](download.sh) downloads the `HumanDO.owl` file into the `scratch` directory.

##### Python Script

[`format_disease_ontology.py`](format_disease_ontology.py) parses the .owl file and converts it into a .csv with disease ontology mappings to other ontologies.

##### Test Script
Contributor

Update all of the script and file names to match what you added to the directory


[`disease_ontology_test.py`](disease_ontology_test.py) tests the given script on some test data.

#### Files

##### Test File

[`test-do.xml`](unit-tests/test-do.xml) contains the test input data.

[`test-output.csv`](unit-tests/test-output.csv) contains the expected output.

##### tMCF File

[`disease_ontology.tmcf`](disease_ontology.tmcf) maps the columns of the generated `.csv` file to the Data Commons schema, forming the tmcf-csv pair used for ingestion.
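
Because every `C:DiseaseOntology->column` reference in the tmcf must correspond to a column in the `.csv`, a quick consistency check can catch mismatches before ingestion. The sketch below is one possible way to do that with the standard library; the helper names `referenced_columns` and `missing_columns` and the sample CSV row are hypothetical and are not part of this directory.

```python
import csv
import io
import re

# Matches column references such as "C:DiseaseOntology->label".
COLUMN_REF = re.compile(r"\bC:(\w+)->(\w+)")


def referenced_columns(tmcf_text, table):
    """Return the set of CSV column names the tmcf references for a table."""
    return {col for tbl, col in COLUMN_REF.findall(tmcf_text) if tbl == table}


def missing_columns(tmcf_text, csv_text, table):
    """Return referenced columns that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return sorted(referenced_columns(tmcf_text, table) - set(header))


TMCF = """Node: E:DiseaseOntology->E1
typeOf: dcs:Disease
dcid: C:DiseaseOntology->id
name: C:DiseaseOntology->label
specializationOf: C:DiseaseOntology->subClassOf
"""

# Hypothetical CSV that lacks the subClassOf column.
CSV_TEXT = "id,label\nDOID_0000001,example disease\n"

print(missing_columns(TMCF, CSV_TEXT, "DiseaseOntology"))  # prints ['subClassOf']
```

An empty list means every column named in the tmcf is present in the CSV header.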

### Examples

#### Run Tests

To run the unit tests for `format_disease_ontology.py`, run:

```
python3 disease_ontology_test.py
```

#### Import

1. Download data to scratch/.

```
bash download.sh
```

2. Clean and convert the downloaded Disease Ontology data into `.csv` format

```
python format_disease_ontology.py HumanDO.owl HumanDO.csv
```
24 changes: 24 additions & 0 deletions scripts/biomedical/diseaseOntology/disease_ontology.tmcf
@@ -0,0 +1,24 @@
Node: E:DiseaseOntology->E1
typeOf: dcs:Disease
Contributor

Update the script to include a column of text values with the Disease Ontology IDs (e.g. "DOID:0060329"), then add this line to the tmcf: `diseaseOntologyID: C:DiseaseOntology->diseaseOntologyID`

dcid: C:DiseaseOntology->id
name: C:DiseaseOntology->label
specializationOf: C:DiseaseOntology->subClassOf
description: C:DiseaseOntology->diseaseDescription
alternativeDiseaseOntologyID: C:DiseaseOntology->hasAlternativeId
diseaseSynonym: C:DiseaseOntology->hasExactSynonym
internationalClassificationOfDiseaseID: C:DiseaseOntology->ICDO
medicalSubjectHeadingDescriptorID: C:DiseaseOntology->meshDescriptor
medicalSubjectHeadingConceptID: C:DiseaseOntology->meshConcept
nationalCancerInstituteID: C:DiseaseOntology->NCI
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20210731
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20200301
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20200901
snowmedCT: C:DiseaseOntology->SNOMEDCTUS20220630
unifiedMedicalLanguageSystemConceptUniqueIdentifier: C:DiseaseOntology->UMLSCUI
icd10CMCode: C:DiseaseOntology->ICD10CM
icd9CMCode: C:DiseaseOntology->ICD9CM
orphaNumber: C:DiseaseOntology->ORDO
geneticAndRareDiseasesID: C:DiseaseOntology->GARD
onlineMendelianInheritanceInManID: C:DiseaseOntology->OMIM
experimentalFactorOntologyID: C:DiseaseOntology->EFO
medDraID: C:DiseaseOntology->MEDDRA
39 changes: 39 additions & 0 deletions scripts/biomedical/diseaseOntology/disease_ontology_test.py
@@ -0,0 +1,39 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Author: Suhana Bedi
Date: 07/08/2022
Name: disease_ontology_test.py
Description: runs unit tests for format_disease_ontology.py
Run: python3 disease_ontology_test.py
'''

import unittest

import pandas as pd
from pandas.testing import assert_frame_equal

from format_disease_ontology import wrapper_fun


class TestFormatDiseaseOntology(unittest.TestCase):
    """Test the functions in format_disease_ontology"""

    def test_main(self):
        """Test the wrapper function end to end"""
        # Read the expected output file into a pandas dataframe
        df_expected = pd.read_csv('unit-tests/test-output.csv')
        # Run all the functions in format_disease_ontology.py on the test input
        df_actual = wrapper_fun('unit-tests/test-do.xml')
        # Compare expected and actual output
        assert_frame_equal(df_expected.reset_index(drop=True),
                           df_actual.reset_index(drop=True))


if __name__ == '__main__':
    unittest.main()
4 changes: 4 additions & 0 deletions scripts/biomedical/diseaseOntology/download.sh
@@ -0,0 +1,4 @@
#!/bin/bash

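# Create the scratch directory and stop if it cannot be entered
mkdir -p scratch && cd scratch || exit 1
# -f fails on HTTP errors instead of saving an error page; -L follows redirects
curl -fL -o HumanDO.owl https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/HumanDO.owl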