ICTV Import #834

Merged
merged 65 commits into from
Mar 21, 2024
Commits
0424815
Create README.md
spiekos Mar 23, 2023
c4e65ed
Add VirusMasterSpeciesList.tmcf
spiekos Mar 23, 2023
08d04b0
Add tmcf files
spiekos Mar 23, 2023
caa5759
Update title
spiekos Mar 23, 2023
4ecc413
Add VMR dataset description
spiekos Mar 23, 2023
55fbb51
Mention script formatting taxonomic ranking enums
spiekos Mar 28, 2023
68d0bc5
format schema list
spiekos Mar 28, 2023
6e6d5de
update new enumerations lists
spiekos Mar 28, 2023
0f92262
update new schema summary formatting
spiekos Mar 28, 2023
ddb209a
update new schema overview formatting
spiekos Mar 28, 2023
20ef620
Add create_virus_taxonomic_ranking_enums.py
spiekos Apr 25, 2023
7e8d9dd
Add formatting scripts
spiekos Apr 25, 2023
01ea5ba
Update format_virus_metadata_resource.py
spiekos Apr 25, 2023
9154476
Add log file
spiekos Apr 25, 2023
348a297
Update README.md
spiekos Apr 25, 2023
0ab5772
Create download.sh
spiekos Apr 25, 2023
321677b
Update command to run download.sh
spiekos Apr 25, 2023
b21c9af
update illegal characters subsection
spiekos Apr 25, 2023
5af5a62
fix formatting error
spiekos Apr 25, 2023
4bdec75
Add header
spiekos Jun 7, 2023
8061182
Add header
spiekos Jun 7, 2023
011b3d1
add header
spiekos Jun 7, 2023
042812c
Add header
spiekos Jun 7, 2023
c9e5022
update header
spiekos Jun 7, 2023
667ecc2
Update header
spiekos Jun 7, 2023
4ae299c
Merge branch 'master' into ICTV_import
spiekos Jun 7, 2023
d2368b5
Update scripts
spiekos Jul 31, 2023
5948dd4
Delete log file
spiekos Jul 31, 2023
6404ff6
Update script
spiekos Jul 31, 2023
3b84cb8
Update create_virus_taxonomic_ranking_enums.py
spiekos Feb 21, 2024
a474fc2
Update format_virus_master_species_list.py
spiekos Feb 21, 2024
c298ab4
Update format_virus_master_species_list.py
spiekos Feb 21, 2024
394cda8
Update format_virus_metadata_resource.py
spiekos Feb 21, 2024
1c99e86
Add run.sh
spiekos Feb 21, 2024
e94562e
Update download.sh
spiekos Feb 21, 2024
4dc2aaa
Update format_virus_metadata_resource.log
spiekos Feb 21, 2024
31a429f
Update README.md
spiekos Feb 21, 2024
efd8bef
Update run.sh
spiekos Feb 21, 2024
c1c30de
Update create_virus_taxonomic_ranking_enums.py
spiekos Feb 21, 2024
fb396e9
Update format_virus_metadata_resource.py
spiekos Feb 21, 2024
eee41ab
Update execution bash files
spiekos Feb 26, 2024
8f8a9dd
Update README.md
spiekos Feb 26, 2024
4142015
Rename VirusMasterSpeciesList.tmcf to VirusSpecies.tmcf
spiekos Mar 4, 2024
0530ee6
Rename VirusGenomeSegment.tmcf to VirusGenomeSegments.tmcf
spiekos Mar 4, 2024
b88449f
Rename VirusTaxonomy.tmcf to VirusIsolates.tmcf
spiekos Mar 4, 2024
413656d
Update tmcf links in README.md
spiekos Mar 4, 2024
90fdf4b
Update bash scripts filepaths in README.md
spiekos Mar 5, 2024
4131f21
Update filepaths in README.md
spiekos Mar 5, 2024
4fcca30
Update README.md table of contents
spiekos Mar 5, 2024
0267a42
Update README.md table of contents
spiekos Mar 5, 2024
157f1e5
Update README.md Table of Contents
spiekos Mar 5, 2024
b733280
Update README.md
spiekos Mar 5, 2024
76d0d80
Merge branch 'master' into ICTV_import
spiekos Mar 5, 2024
799cc24
Add line creating CSVs directory
spiekos Mar 5, 2024
25d2740
Update create_virus_taxonomic_ranking_enums.py
spiekos Mar 5, 2024
e3ffedc
Update format_virus_master_species_list.py
spiekos Mar 5, 2024
6790998
Update format_virus_metadata_resource.py
spiekos Mar 5, 2024
b02e7c0
Update tests.sh
spiekos Mar 5, 2024
05ca8d9
Update README.md
spiekos Mar 5, 2024
17a0ba9
Update README.md
spiekos Mar 5, 2024
2479485
Update README.md
spiekos Mar 5, 2024
715a10c
Merge branch 'master' into ICTV_import
spiekos Mar 5, 2024
ff9c9c7
Merge branch 'master' into ICTV_import
pradh Mar 19, 2024
af4fad0
Merge branch 'master' into ICTV_import
spiekos Mar 20, 2024
b282d44
Fix lint (#1005)
pradh Mar 21, 2024
161 changes: 161 additions & 0 deletions scripts/biomedical/ICTV_Taxonomy/README.md
@@ -0,0 +1,161 @@

# Importing Master Species List and Virus Metadata Resource from the International Committee on Taxonomy of Viruses (ICTV)

## Table of Contents

1. [About the Dataset](#about-the-dataset)
    1. [Download URLs](#download-urls)
    2. [Overview](#overview)
    3. [Notes and Caveats](#notes-and-caveats)
    4. [dcid Generation](#dcid-generation)
        1. [Virus](#virus)
        2. [VirusIsolate](#virusisolate)
        3. [VirusGenomeSegment](#virusgenomesegment)
        4. [Illegal Characters](#illegal-characters)
    5. [License](#license)
    6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
2. [About the Import](#about-the-import)
    1. [Artifacts](#artifacts)
        1. [New Schema](#new-schema)
        2. [Scripts](#scripts)
        3. [tMCFs](#tmcfs)
        4. [Log Files](#log-files)
    2. [Import Procedure](#import-procedure)
    3. [Tests](#tests)


## About the Dataset
“The [International Committee on Taxonomy of Viruses (ICTV)](https://ictv.global/) authorizes and organizes the taxonomic classification of and the nomenclatures for viruses. The ICTV has developed a universal taxonomic scheme for viruses, and thus has the means to appropriately describe, name, and classify every virus that affects living organisms. The members of the International Committee on Taxonomy of Viruses are considered expert virologists. The ICTV was formed from and is governed by the Virology Division of the International Union of Microbiological Societies. Detailed work, such as delimiting the boundaries of species within a family, typically is performed by study groups of experts in the families.” Description from [Wikipedia](https://en.wikipedia.org/wiki/International_Committee_on_Taxonomy_of_Viruses).

The ICTV Master Species List is curated by virology experts, who have established over 100 international study groups. These groups organize discussions on emerging taxonomic issues in their field, oversee the submission of proposals for new taxonomy, and prepare or revise the relevant chapter(s) in ICTV reports. The ICTV accepts proposals for taxonomic changes from any individual; in practice, however, proposals are usually submitted by members of the relevant study groups.

The ICTV chooses an exemplar virus for each species and the Virus Metadata Resource provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source.

### Download URLs

The release history and the most recent release of the Master Species List can be found [here](https://ictv.global/msl).

The release history and the most recent release of the Virus Metadata Resource can be found [here](https://ictv.global/vmr).


### Overview

This directory stores all scripts used to import data on viruses and virus isolates from the ICTV. This includes the Master Species List, which contains the full viral taxonomy (realm -> species) along with the genomic composition and taxonomic history of every species. The import also includes the Virus Metadata Resource, which describes the exemplar isolates that the ICTV selected for each species as well as additional virus isolates within the ICTV dataset.


### Notes and Caveats
Viruses are not considered alive and are therefore not classified under “The Tree of Life”. They instead have their own taxonomic classification system, described here. However, the viral classification system mirrors “The Tree of Life” by copying its Kingdom -> Phylum -> Class -> Order -> Family -> Genus -> Species hierarchy, while adding a level above called Realm and a sublevel under each rank. This similarity in naming can lead to confusion between the two classification systems. In particular, datasets may include species of viruses alongside species of bacteria, archaea, or animals without distinction. To mitigate this potential confusion, viruses have their own distinct schema, which they do not share with non-viral biological entities.

Not all levels of the viral classification are currently in use. As of release 37, Subrealm, Subkingdom, and Subclass are not in use. These ranks are still defined in the schema in case they are used in future releases. In addition, every species has a classification for each of the main ranks (Realm, Kingdom, Phylum, Class, Order, Family, Genus, and Species), but some or all of the subranks (Subkingdom, Subphylum, Subclass, Suborder, Subfamily, and Subgenus) may be missing. To account for this, each classification references its parent at the next main rank in addition to its parent subrank.
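
The enum-generation script in this import handles missing ranks by linking each populated rank to the nearest populated rank above it. The following is a minimal sketch of that idea, not the script itself: the function name `link_to_parents` and the lineage values (`RealmA`, `FamilyA`, and so on) are hypothetical placeholders.

```python
# Ranks ordered from broadest to narrowest, matching the schema properties.
RANKS = [
    "realm", "subrealm", "kingdom", "subkingdom", "phylum", "subphylum",
    "class", "subclass", "order", "suborder", "family", "subfamily",
    "genus", "subgenus",
]


def link_to_parents(lineage):
    """Map each populated (rank, name) to the nearest populated rank above it."""
    parents = {}
    previous = None  # last populated rank seen while walking down the lineage
    for rank in RANKS:
        name = lineage.get(rank)
        if not name:
            continue  # rank is missing for this species, so skip it
        if previous is not None:
            parents[(rank, name)] = previous
        previous = (rank, name)
    return parents


# Hypothetical lineage with no subranks populated between family and genus.
example = {"realm": "RealmA", "kingdom": "KingdomA",
           "family": "FamilyA", "subfamily": None, "genus": "GenusA"}
for child, parent in link_to_parents(example).items():
    print(child, "->", parent)
```

In this sketch the genus links directly to the family because the subfamily is missing, so no rank is left without a parent when an intermediate level is absent.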

In addition to the exemplar virus chosen for each species, the VMR also notes additional isolates for each species within the ICTV database.


### dcid Generation
A ‘bio/’ prefix was attached to all dcids in this import. Each line in each input file is treated as its own unique Virus or VirusIsolate. When multiple lines generate the same dcid for a Virus, VirusIsolate, or VirusGenomeSegment, an error message reporting the non-unique dcid is printed.

#### Virus
Dcids were generated by converting the Virus’s species name to pascal case (i.e. bio/<Species>).

#### VirusIsolate
A VirusIsolate dcid was generated by appending information unique to the isolate to the end of the Virus dcid. If the isolate had a designation, the designation was converted to pascal case and appended (i.e. bio/<Species><IsolateDesignation>). If there was no isolate designation, the GenBank Accession number was appended when the isolate had exactly one (i.e. bio/<Species><GenBankAccession>). If multiple GenBank Accession numbers were associated with the isolate, they were joined with ‘_’s (i.e. bio/<Species><GenBankAccession1>_<GenBankAccession2>). If both the isolate designation and the GenBank Accession were missing, the word ‘Isolate’ was appended to the pascal case species name (i.e. bio/<Species>Isolate).

Note: This resulted in collisions for four VirusIsolates. These errors were recorded in the [format_virus_metadata_resource.log](https://github.com/datacommonsorg/data/new/master/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log) file.

#### VirusGenomeSegment
The GenBank Accession number for a VirusGenomeSegment was tacked onto the corresponding VirusIsolate dcid to generate a unique VirusGenomeSegment dcid (i.e. <VirusIsolate_dcid><GenBankAccession>).
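
The dcid rules above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the import's actual code: the helper names `pascal_case`, `virus_dcid`, `isolate_dcid`, and `segment_dcid` are hypothetical, `ACC1` and `ACC2` are placeholder accession numbers, and the sketch skips both the illegal-character replacement and the collision handling.

```python
def pascal_case(text):
    """Capitalize each whitespace-separated word and join without spaces."""
    return "".join(w[0].upper() + w[1:].lower() for w in text.split())


def virus_dcid(species):
    # bio/<Species>
    return "bio/" + pascal_case(species)


def isolate_dcid(species, designation=None, accessions=()):
    base = virus_dcid(species)
    if designation:
        # bio/<Species><IsolateDesignation>
        return base + pascal_case(designation)
    if accessions:
        # one accession is appended as-is; several are joined with '_'
        return base + "_".join(accessions)
    # neither a designation nor an accession is available
    return base + "Isolate"


def segment_dcid(isolate, accession):
    # <VirusIsolate_dcid><GenBankAccession>
    return isolate + accession


print(virus_dcid("Potato virus Y"))                               # bio/PotatoVirusY
print(isolate_dcid("Potato virus Y", accessions=["X97895"]))      # bio/PotatoVirusYX97895
print(isolate_dcid("Potato virus Y", accessions=["ACC1", "ACC2"]))  # bio/PotatoVirusYACC1_ACC2
print(isolate_dcid("Potato virus Y"))                             # bio/PotatoVirusYIsolate
```

Checking the designation first and the accessions second keeps the fallback order explicit, which makes it easier to see why two isolates of the same species with the same designation would collide.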

#### Illegal Characters
Only ASCII characters are allowed in dcids. Additionally, the following characters, which are illegal in dcids, were replaced as specified below:

| Illegal Character | Replacement Character |
| ----------------- | --------------------- |
| : | _ |
| ; | _ |
| <space> | (removed) |
| [ | ( |
| ] | ) |
| - | _ |
| – | _ |
| ‘ | _ |
| # | (removed) |
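
A replacement table like this maps naturally onto `str.translate`. The sketch below is one possible way to apply it, not the import's own implementation: the names `REPLACEMENTS` and `sanitize_dcid` are hypothetical, and it assumes that an empty replacement cell means the character is removed.

```python
# Character-by-character replacements taken from the table above.
REPLACEMENTS = str.maketrans({
    ":": "_",
    ";": "_",
    " ": "",        # assumption: an empty replacement means "remove"
    "[": "(",
    "]": ")",
    "-": "_",
    "\u2013": "_",  # en dash
    "\u2018": "_",  # left single quotation mark
    "#": "",        # assumption: an empty replacement means "remove"
})


def sanitize_dcid(dcid):
    """Apply the replacements and reject any remaining non-ASCII characters."""
    cleaned = dcid.translate(REPLACEMENTS)
    if not cleaned.isascii():
        raise ValueError("dcid contains non-ASCII characters: " + cleaned)
    return cleaned


print(sanitize_dcid("bio/Abc[1]:x-y #2"))  # bio/Abc(1)_x_y2
```

Raising an error for leftover non-ASCII characters, instead of silently dropping them, surfaces names that need a manual decision.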


### License

The data is published under the Creative Commons Attribution-ShareAlike 4.0 International license [(CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

### Dataset Documentation and Relevant Links

- Documentation can be found in one of the Excel sheets of a dataset downloaded from ICTV.
- Taxonomy Browser User Interface: https://ictv.global/taxonomy

## About the Import

### Artifacts

#### New Schema

Classes, properties, and enumerations that were added in this import to represent the data.

* Classes
    * Virus, VirusIsolate, VirusGenomeSegment
* Properties
    * Virus: proposalForLastChange, taxonHistoryURL, versionOfLastChange, virusGenomeComposition, virusHost, virusLastTaxonomicChange, virusSource, virusRealm, virusSubrealm, virusKingdom, virusSubkingdom, virusPhylum, virusSubphylum, virusClass, virusSubclass, virusOrder, virusSuborder, virusFamily, virusSubfamily, virusGenus, virusSubgenus, virusSpecies
    * VirusIsolate: genomeCoverage, isExemplarVirusIsolate, ofVirusSpecies, virusIsolateDesignation
    * VirusGenomeSegment: genomeSegmentOf
* Enumerations
    * GenomeCoverageEnum, VirusGenomeCompositionEnum, VirusHostEnum, VirusSourceEnum
* Enumerations Generated Via Script
    * VirusRealmEnum, VirusSubrealmEnum, VirusKingdomEnum, VirusSubkingdomEnum, VirusPhylumEnum, VirusSubphylumEnum, VirusClassEnum, VirusSubclassEnum, VirusOrderEnum, VirusSuborderEnum, VirusFamilyEnum, VirusSubfamilyEnum, VirusGenusEnum, VirusSubgenusEnum

#### Scripts

##### Bash Scripts

- [download.sh](scripts/download.sh) downloads the most recent release of the ICTV Master Species List and Virus Metadata Resource.
- [run.sh](scripts/run.sh) creates the new viral taxonomy enum schema and converts the data into formatted CSVs for importing viruses, virus isolates, and viral genome segments into the knowledge graph.
- [tests.sh](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.

##### Python Scripts

- [create_virus_taxonomic_ranking_enums.py](scripts/create_virus_taxonomic_ranking_enums.py) creates the viral taxonomy enum mcf file from the Virus Metadata Resource file.
- [format_virus_master_species_list.py](scripts/format_virus_master_species_list.py) parses the raw Master Species List xlsx file into the virus csv file.
- [format_virus_metadata_resource.py](scripts/format_virus_metadata_resource.py) parses the raw Virus Metadata Resource file into the virus isolates and viral genome segments csv files.

#### tMCFs

- [VirusSpecies.tmcf](tMCFs/VirusSpecies.tmcf) contains the tmcf mapping to the csv of viruses.
- [VirusIsolates.tmcf](tMCFs/VirusIsolates.tmcf) contains the tmcf mapping to the csv of virus isolates.
- [VirusGenomeSegments.tmcf](tMCFs/VirusGenomeSegments.tmcf) contains the tmcf mapping to the csv of viral genome segments.

#### Log Files

- [format_virus_metadata_resource.log](logs/format_virus_metadata_resource.log) is the log file from the script that converts the Virus Metadata Resource into formatted CSV files.

### Import Procedure

Download the most recent versions of the Master Species List and Virus Metadata Resource from ICTV by running:

```bash
sh download.sh
```

Generate the enumeration schema MCF, which represents virus taxonomic ranks, and the formatted CSV files by running:

```bash
sh run.sh
```

### Tests

The first step of `tests.sh` downloads the Data Commons `java -jar` import tool and stores it in a `tmp` directory. This assumes that the user has the Java Runtime Environment (JRE) installed. The tool is described in the Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/), and its releases can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our csv + tmcf import. It evaluates whether all schema used in the import is present in the graph and whether all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings upon failure. Please note that empty tokens for some columns are expected, as this reflects the original data. The import creates the Virus nodes that are then referenced within it, which resolves any concern about missing reference warnings that the test reports for these node types.

To run tests:

```bash
sh tests.sh
```

This will generate an output file with the results of the tests for each csv + tmcf pair.

@@ -0,0 +1,5 @@
Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/BetachrysovirusMagnaporthis_VietNam_MoCV1-B
Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/OrthoflavivirusAroaense_BeAn4073_AF013366
Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/OrthobornavirusCaenophidiae_CHC_224_BK014571
Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/UgandanCassavaBrownStreakVirus_UG_FJ185044
Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/PotatoVirusY_N_X97895
@@ -0,0 +1,175 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Author: Samantha Piekos
Date: 02/21/2024
Name: create_virus_taxonomic_ranking_enums.py
Description: Creates hierarchical viral taxonomy enum schema from the
ICTV Virus Metadata Resource.
@file_input: ICTV Virus Metadata Resource .xlsx file
@file_output: formatted .mcf files for viral taxonomy enum schema
"""

# load environment
import pandas as pd
import sys


# declare universal variables
HEADER = [
    'sort',
    'isolateSort',
    'realm',
    'subrealm',
    'kingdom',
    'subkingdom',
    'phylum',
    'subphylum',
    'class',
    'subclass',
    'order',
    'suborder',
    'family',
    'subfamily',
    'genus',
    'subgenus',
    'species',
    'isExemplar',
    'name',
    'abbreviation',
    'isolateDesignation',
    'genBankAccession',
    'refSeqAccession',
    'genomeCoverage',
    'genomeComposition',
    'hostSource',
    'host',
    'source',
    'dcid',
    'isolate_dcid',
    'isolate_name'
]

LIST_DROP = [
    'sort',
    'isolateSort',
    'species',
    'isExemplar',
    'name',
    'abbreviation',
    'isolateDesignation',
    'genBankAccession',
    'refSeqAccession',
    'genomeCoverage',
    'genomeComposition',
    'hostSource',
    'host',
    'source',
    'dcid',
    'isolate_dcid',
    'isolate_name'
]


# declare functions
def pascalcase(s):
    # capitalize each whitespace-separated word and join without spaces
    list_words = s.split()
    converted = "".join(word[0].upper() + word[1:].lower() for word in list_words)
    return converted


def check_for_illegal_charc(s):
    # report any dcid that contains a character not allowed in dcids
    list_illegal = ["'", "–", "*", ">", "<", "@", "]", "[", "|", ":", ";", " "]
    if any(x in s for x in list_illegal):
        print('Error! dcid contains illegal characters!', s)


def initiate_enum_dict():
    # create one empty dict per taxonomic rank enum (e.g. VirusRealmEnum)
    d = {}
    list_levels = [i for i in HEADER if i not in LIST_DROP]
    for item in list_levels:
        enum_name = 'Virus' + item.capitalize() + 'Enum'
        d[enum_name] = {}
    return d


def add_enums_to_dicts(key, value, d):
    if value == value:  # False only for NaN, i.e. a missing value
        enum = 'Virus' + key + 'Enum'
        dcid = 'Virus' + key + pascalcase(value)
        check_for_illegal_charc(dcid)
        d[enum][value] = dcid
    return d


def add_item_to_enums(df):
    list_levels = [i for i in HEADER if i not in LIST_DROP]
    dict_of_dicts = initiate_enum_dict()
    dict_specialization = {}  # keep track of previous top level
    for index, row in df.iterrows():
        last_level_dcid = False  # initiate empty value for tracking specialization
        for item in list_levels:
            level = item.capitalize()
            if row[item] != row[item]:  # skip missing (NaN) ranks
                continue
            dict_of_dicts = add_enums_to_dicts(level, row[item], dict_of_dicts)
            if last_level_dcid:  # track specialization if relevant
                dcid = 'Virus' + level + pascalcase(row[item])
                dict_specialization[dcid] = last_level_dcid
            last_level_dcid = 'Virus' + level + pascalcase(row[item])  # update top level
    return dict_of_dicts, dict_specialization


def write_individual_entries_to_file(w, enum, d, dict_specialization):
    # write one MCF node per enum member, linking it to its parent rank if known
    for key, value in d.items():
        w.write('Node: dcid:' + value + '\n')
        w.write('name: "' + key + '"\n')
        w.write('typeOf: dcs:' + enum + '\n')
        if value in dict_specialization:
            w.write('specializationOf: dcs:' + dict_specialization[value] + '\n\n')
        else:
            w.write('\n')
    return w


def write_dict_to_file(w, enum, d, dict_specialization):
    # write the enum class definition followed by its members
    w.write('# ' + enum + '\n')
    w.write('Node: dcid:' + enum + '\n')
    w.write('name: "' + enum + '"\n')
    w.write('typeOf: schema:Class\n')
    w.write('subClassOf: schema:Enumeration\n\n')
    w = write_individual_entries_to_file(w, enum, d, dict_specialization)
    w.write('\n')
    return w


def generate_enums_mcf(f, w):
    # read the first sheet, apply the expected column names, then drop the
    # non-rank columns and the original header row
    df = pd.read_excel(f, names=HEADER, header=None, sheet_name=0)
    df = df.drop(LIST_DROP, axis=1).drop(0, axis=0)
    dict_of_dicts, dict_specialization = add_item_to_enums(df)
    # use a context manager so the output file is flushed and closed
    with open(w, mode='w') as out:
        out.write('# Schema generated by create_virus_taxonomic_ranking_enums.py\n\n')
        for key, value in dict_of_dicts.items():
            write_dict_to_file(out, key, value, dict_specialization)


def main():
    file_input = sys.argv[1]
    file_output = sys.argv[2]

    generate_enums_mcf(file_input, file_output)


if __name__ == '__main__':
    main()
30 changes: 30 additions & 0 deletions scripts/biomedical/ICTV_Taxonomy/scripts/download.sh
@@ -0,0 +1,30 @@
#!/bin/bash

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Author: Samantha Piekos
# Date: 02/26/2024
# Name: download
# Description: This file downloads the most recent version of the ICTV Master
# Species List and Virus Metadata Resource and prepares it for processing.


# make input directory
mkdir -p input; cd input

# download ICTV data
curl -o ICTV_Virus_Species_List.xlsx https://ictv.global/msl/current
curl -o ICTV_Virus_Metadata_Resource.xlsx https://ictv.global/vmr/current