ICTV Import (datacommonsorg#834)

* Create README.md Note: need to add Notes and Caveats as well as commands for running and testing scripts. Also need to add tests and testing files
* Add VirusMasterSpeciesList.tmcf
* Add tmcf files
* Update title
* Add VMR dataset description
* Mention script formatting taxonomic ranking enums
* format schema list
* update new enumerations lists
* update new schema summary formatting
* update new schema overview formatting
* Add create_virus_taxonomic_ranking_enums.py
* Add formatting scripts
* Update format_virus_metadata_resource.py Removes error generated in two dcids by removing whitespace
* Add log file
* Update README.md Add notes and caveats and dcid generation segments. Also add the commands to run data cleaning scripts.
* Create download.sh
* Update command to run download.sh
* update illegal characters subsection
* fix formatting error
* Add header
* Add header
* add header
* Add header
* update header
* Update header
* Update scripts
* Delete log file
* Update script
* Update create_virus_taxonomic_ranking_enums.py update script to handle new v38 master species list input file
* Update format_virus_master_species_list.py update script to accommodate new v38 release
* Update format_virus_master_species_list.py correct file_output description in the header
* Update format_virus_metadata_resource.py update script to accommodate v38
* Add run.sh
* Update download.sh
* Update format_virus_metadata_resource.log update log from running script for v38
* Update README.md
* Update run.sh change taxonomic rank enum schema to be generated from the virus metadata file
* Update create_virus_taxonomic_ranking_enums.py update so that the virus taxonomic schema is generated from the virus metadata resource file
* Update format_virus_metadata_resource.py fix new enum added in v38
* Update execution bash files update download.sh, run.sh, and tests.sh scripts that download, format+clean, and test the import files
* Update README.md
* Rename VirusMasterSpeciesList.tmcf to VirusSpecies.tmcf
* Rename VirusGenomeSegment.tmcf to VirusGenomeSegments.tmcf
* Rename VirusTaxonomy.tmcf to VirusIsolates.tmcf
* Update tmcf links in README.md
* Update bash scripts filepaths in README.md
* Update filepaths in README.md
* Update README.md table of contents
* Update README.md table of contents
* Update README.md Table of Contents
* Update README.md
* Add line creating CSVs directory
* Update create_virus_taxonomic_ranking_enums.py remove trailing \ in comma separated lists
* Update format_virus_master_species_list.py remove trailing '\' from comma separated lists and directories
* Update format_virus_metadata_resource.py remove trailing '\' from comma separated lists and directories
* Update tests.sh add extra step to download the data commons java test tool
* Update README.md update tests subsection description
* Update README.md update tests subsection description to add assumptions
* Update README.md fix typo

---------

Co-authored-by: Prashanth R <[email protected]>
shamimansari1988 · Mar 21, 2024 · 484b905 · 484b905
1 parent c3311b7
commit 484b905
Showing 11 changed files with 1,003 additions and 0 deletions.
diff --git a/scripts/biomedical/ICTV_Taxonomy/README.md b/scripts/biomedical/ICTV_Taxonomy/README.md
@@ -0,0 +1,161 @@
+
+# Importing Master Species List and Virus Metadata Resource from the International Committee on Taxonomy of Viruses (ICTV)
+
+## Table of Contents
+
+1. [About the Dataset](#about-the-dataset)
+    1. [Download URLs](#download-urls)
+    2. [Overview](#overview)
+    3. [Notes and Caveats](#notes-and-caveats)
+    4. [dcid Generation](#dcid-generation)
+       1. [Virus](#virus)
+       2. [VirusIsolate](#virusisolate)
+       3. [VirusGenomeSegment](#virusgenomesegment)
+       4. [Illegal Characters](#illegal-characters)
+    5. [License](#license)
+    6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
+2. [About the Import](#about-the-import)
+    1. [Artifacts](#artifacts)
+       1. [New Schema](#new-schema)
+       2. [Scripts](#scripts)
+       3. [tMCFs](#tmcfs)
+       4. [Log Files](#log-files)
+    2. [Import Procedure](#import-procedure)
+    3. [Tests](#tests)
+
+
+## About the Dataset
+“The [International Committee on Taxonomy of Viruses (ICTV)](https://ictv.global/) authorizes and organizes the taxonomic classification of and the nomenclatures for viruses. The ICTV has developed a universal taxonomic scheme for viruses, and thus has the means to appropriately describe, name, and classify every virus that affects living organisms. The members of the International Committee on Taxonomy of Viruses are considered expert virologists. The ICTV was formed from and is governed by the Virology Division of the International Union of Microbiological Societies. Detailed work, such as delimiting the boundaries of species within a family, typically is performed by study groups of experts in the families.” Description from [Wikipedia](https://en.wikipedia.org/wiki/International_Committee_on_Taxonomy_of_Viruses).
+
+The ICTV Master Species List is curated by virology experts, who have established over 100 international study groups. These groups organize discussions on emerging taxonomic issues in their field, oversee the submission of proposals for new taxonomy, and prepare or revise the relevant chapter(s) in ICTV reports. ICTV is open to proposals for taxonomic changes from any individual; in practice, however, proposals are usually submitted by members of the relevant study groups.
+
+The ICTV chooses an exemplar virus for each species and the Virus Metadata Resource provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source.
+
+### Download URLs
+
+The release history and the most recent release of the Master Species List can be found [here](https://ictv.global/msl).
+
+The release history and the most recent release of the Virus Metadata Resource can be found [here](https://ictv.global/vmr).
+
+
+### Overview
+
+This directory stores all scripts used to import data on viruses and virus isolates from the ICTV. This includes the Master Species List, which contains the full viral taxonomy (realm -> species) and information on the genomic composition and taxonomic history for all species. The import also includes the Virus Metadata Resource, which contains information regarding the exemplar isolates for each species selected by the ICTV and additional virus isolates within the ICTV dataset.
+
+
+### Notes and Caveats
+Viruses are not considered alive and are therefore not classified under “The Tree of Life”. They instead have their own taxonomic classification system, described here. However, the viral classification system mirrors “The Tree of Life” by copying its Kingdom -> Phylum -> Class -> Order -> Family -> Genus -> Species hierarchy, while adding a top level called Realm and a sublevel under each rank. This similarity in naming can lead to confusion between the two classification systems. In particular, datasets may include species of viruses alongside species of bacteria, archaea, or animals without distinction. To mitigate this potential confusion, viruses have their own distinct schema, which they do not share with non-viral biological entities.
+
+Not all levels of the viral classification are currently in use. As of release 37, Subrealm, Subkingdom, and Subclass are not in use. These classifications are defined in the schema in case they are used in future releases. In addition, each species has a classification defined for each of the main ranks (Realm, Kingdom, Phylum, Class, Order, Family, Genus, and Species), but classifications are missing for some or all of the sub-ranks (Subkingdom, Subphylum, Subclass, Suborder, Subfamily, and Subgenus). To account for this, each entry references its parent at the next main rank in addition to its parent sub-rank.
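One way to picture the handling of missing sub-ranks is to link every populated rank to the closest populated rank above it, which is similar in spirit to the specialization tracking in `create_virus_taxonomic_ranking_enums.py`. The following is a minimal sketch under stated assumptions: a row is a plain dict keyed by rank name, and `RANKS`, `is_missing`, and `nearest_parents` are illustrative names that do not appear in the import scripts.

```python
import math

# Rank order from broadest to narrowest, following the schema property list.
RANKS = [
    "realm", "subrealm", "kingdom", "subkingdom", "phylum", "subphylum",
    "class", "subclass", "order", "suborder", "family", "subfamily",
    "genus", "subgenus", "species",
]


def is_missing(value):
    """Treat None, empty strings, and float NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def nearest_parents(row):
    """Map each populated rank to the closest populated rank above it."""
    parents = {}
    last_seen = None
    for rank in RANKS:
        if is_missing(row.get(rank)):
            continue  # skip ranks that are not populated for this row
        if last_seen is not None:
            parents[rank] = last_seen
        last_seen = rank
    return parents


# Illustrative row in which no sub-ranks are populated.
example = {
    "realm": "Riboviria",
    "kingdom": "Orthornavirae",
    "genus": "Orthoflavivirus",
}
print(nearest_parents(example))
```

In this sketch a missing sub-rank is simply skipped, so the genus links straight to the nearest populated rank above it.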
+
+“The ICTV chooses an exemplar virus for each species and the VMR provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source.” Additional isolates for each species within the ICTV database are also noted.
+
+
+### dcid Generation
+A ‘bio/’ prefix was attached to all dcids in this import. Each line in each input file is considered its own unique Virus or VirusIsolate. When multiple lines generate the same dcid for a Virus, VirusIsolate, or VirusGenomeSegment, an error message is printed stating the non-unique dcid generated for the given entity.
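The uniqueness check described above can be sketched as follows. This is a minimal illustration rather than the code used by the formatting scripts; `find_duplicate_dcids` is a hypothetical helper and the dcids are made-up examples.

```python
from collections import Counter


def find_duplicate_dcids(dcids):
    """Return the dcids that occur more than once, in first-seen order."""
    counts = Counter(dcids)
    return [dcid for dcid in counts if counts[dcid] > 1]


# Hypothetical dcids; the second and third entries collide.
dcids = ["bio/PotatoVirusY", "bio/PotatoVirusYN", "bio/PotatoVirusYN"]
for dcid in find_duplicate_dcids(dcids):
    print("Non-unique dcid generated:", dcid)
```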
+
+#### Virus
+Dcids were generated by converting the Virus’s species name to pascal case (i.e. bio/<Species>).
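As an illustration of the rule above, the sketch below reuses the `pascalcase` helper defined in `create_virus_taxonomic_ranking_enums.py`. Whether the formatting scripts use exactly the same helper is an assumption, and `virus_dcid` is a hypothetical name.

```python
def pascalcase(s):
    """Capitalize each whitespace-separated word and join without spaces."""
    return "".join(word[0].upper() + word[1:].lower() for word in s.split())


def virus_dcid(species):
    """Build a Virus dcid of the form bio/<Species>."""
    return "bio/" + pascalcase(species)


print(virus_dcid("Potato virus Y"))  # bio/PotatoVirusY
```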
+
+#### VirusIsolate
+Unique information regarding the VirusIsolate was appended to the Virus dcid to generate a unique VirusIsolate dcid. When the isolate had a designation, the designation was converted to pascal case and appended (i.e. bio/<Species><IsolateDesignation>). When no isolate designation was given, the GenBank Accession number was used if the isolate had exactly one (i.e. bio/<Species><GenBankAccession>). When multiple GenBank Accession numbers were associated with a virus isolate, they were chained with ‘_’s to create the dcid (i.e. bio/<Species><GenBankAccession1>_<GenBankAccession2>). When both the isolate designation and the GenBank Accession were missing, the word ‘Isolate’ was appended to the pascal case species name (i.e. bio/<Species>Isolate).
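The fallback order described above can be sketched as a single function. This is an illustration only: `isolate_dcid` and its parameters are hypothetical names, the accession values are invented, illegal-character replacement is left out, and the `pascalcase` variant here keeps the case of the remaining letters, which is an assumption and differs from the helper in the enum script.

```python
def pascalcase(s):
    """Uppercase the first letter of each word and join without spaces.

    Assumption: the remaining letters keep their original case.
    """
    return "".join(word[0].upper() + word[1:] for word in s.split())


def isolate_dcid(species, designation=None, accessions=()):
    """Build a VirusIsolate dcid following the documented fallback order."""
    base = "bio/" + pascalcase(species)
    if designation:
        # 1. Prefer the isolate designation.
        return base + pascalcase(designation)
    if accessions:
        # 2. Otherwise use the GenBank accession(s), chained with '_'.
        return base + "_".join(accessions)
    # 3. Nothing else is available: append the word 'Isolate'.
    return base + "Isolate"


print(isolate_dcid("Potato virus Y", designation="N"))
print(isolate_dcid("Potato virus Y", accessions=["X97895"]))
print(isolate_dcid("Potato virus Y", accessions=["AB000001", "AB000002"]))
print(isolate_dcid("Potato virus Y"))
```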
+
+Note: This resulted in dcid collisions for five VirusIsolates, which were differentiated by adding additional information to the dcid. These collisions were recorded in the [format_virus_metadata_resource.log](https://github.com/datacommonsorg/data/new/master/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log) file.
+
+#### VirusGenomeSegment
+The GenBank Accession number for a VirusGenomeSegment was tacked onto the corresponding VirusIsolate dcid to generate a unique VirusGenomeSegment dcid (i.e. <VirusIsolate_dcid><GenBankAccession>).
+
+#### Illegal Characters
+Only ASCII characters are allowed in dcids. Additionally, a number of characters that are illegal in a dcid were replaced as specified below:
+
+| Illegal Character | Replacement Character |
+| ----------------- | --------------------- |
+| : | _ |
+| ; | _ |
+| <space>   |   |
+| [ | ( |
+| ] | ) |
+| - | _ |
+| – | _ |
+| ‘ | _ |
+| # |   |
+
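A replacement table like the one above maps naturally onto `str.translate`. The sketch below is an illustration under stated assumptions: rows with an empty replacement cell are read as "remove the character", and `REPLACEMENTS` and `clean_dcid_text` are hypothetical names that do not come from the import scripts.

```python
# Character replacements from the table above.
# Assumption: an empty replacement cell means the character is removed.
REPLACEMENTS = {
    ":": "_",
    ";": "_",
    " ": None,
    "[": "(",
    "]": ")",
    "-": "_",
    "\u2013": "_",  # en dash
    "\u2018": "_",  # left single quotation mark
    "#": None,
}
TRANSLATION_TABLE = str.maketrans(REPLACEMENTS)


def clean_dcid_text(text):
    """Apply the replacements and reject any remaining non-ASCII text."""
    cleaned = text.translate(TRANSLATION_TABLE)
    if not cleaned.isascii():
        raise ValueError("dcid contains non-ASCII characters: " + cleaned)
    return cleaned


print(clean_dcid_text("CHC-224 [a]#1"))  # CHC_224(a)1
```

Building the table once with `str.maketrans` keeps each replacement a single pass over the string, and the ASCII check surfaces characters the table does not cover.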
+
+### License
+
+The data is published under the Creative Commons Attribution ShareAlike 4.0 International [(CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
+
+### Dataset Documentation and Relevant Links
+
+- Documentation can be found in one of the Excel sheets in a dataset downloaded from ICTV.
+- Taxonomy Browser User Interface: https://ictv.global/taxonomy
+
+## About the Import
+
+### Artifacts
+
+#### New Schema
+
+Classes, properties, and enumerations that were added in this import to represent the data.
+
+* Classes
+    * Virus, VirusIsolate, VirusGenomeSegment
+* Properties
+    * Virus: proposalForLastChange, taxonHistoryURL, versionOfLastChange, virusGenomeComposition, virusHost, virusLastTaxonomicChange, virusSource, virusRealm, virusSubrealm, virusKingdom, virusSubkingdom, virusPhylum, virusSubphylum, virusClass, virusSubclass, virusOrder, virusSuborder, virusFamily, virusSubfamily, virusGenus, virusSubgenus, virusSpecies
+    * VirusIsolate: genomeCoverage, isExemplarVirusIsolate, ofVirusSpecies, virusIsolateDesignation
+    * VirusGenomeSegment: genomeSegmentOf
+* Enumerations
+    * GenomeCoverageEnum, VirusGenomeCompositionEnum, VirusHostEnum, VirusSourceEnum
+* Enumerations Generated Via Script
+    * VirusRealmEnum, VirusSubrealmEnum, VirusKingdomEnum, VirusSubkingdomEnum, VirusPhylumEnum, VirusSubphylumEnum, VirusClassEnum, VirusSubclassEnum, VirusOrderEnum, VirusSuborderEnum, VirusFamilyEnum, VirusSubfamilyEnum, VirusGenusEnum, VirusSubgenusEnum
+
+#### Scripts 
+
+##### Bash Scripts
+
+- [download.sh](scripts/download.sh) downloads the most recent release of the ICTV Master Species List and Virus Metadata Resource.
+- [run.sh](scripts/run.sh) creates the new viral taxonomy enums and converts the data into formatted CSVs for importing data on viruses, virus isolates, and viral genome segments into the knowledge graph.
+- [tests.sh](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.
+
+##### Python Scripts
+
+- [create_virus_taxonomic_ranking_enums.py](scripts/create_virus_taxonomic_ranking_enums.py) creates the viral taxonomy enum mcf file from the Virus Metadata Resource file.
+- [format_virus_master_species_list.py](scripts/format_virus_master_species_list.py) parses the raw Master Species List xlsx file into a virus csv file.
+- [format_virus_metadata_resource.py](scripts/format_virus_metadata_resource.py) parses the raw Virus Metadata Resource file into virus isolates and viral genome segments csv files.
+
+#### tMCFs
+
+- [VirusSpecies.tmcf](tMCFs/VirusSpecies.tmcf) contains the tmcf mapping to the csv of viruses.
+- [VirusIsolates.tmcf](tMCFs/VirusIsolates.tmcf) contains the tmcf mapping to the csv of virus isolates.
+- [VirusGenomeSegments.tmcf](tMCFs/VirusGenomeSegments.tmcf) contains the tmcf mapping to the csv of viral genome segments.
+
+#### Log Files
+
+- [format_virus_metadata_resource.log](logs/format_virus_metadata_resource.log) log file from script converting the Virus Metadata Resource into formatted CSV file.
+
+### Import Procedure
+
+Download the most recent versions of the Master Species List and Virus Metadata Resource from ICTV by running:
+
+```bash
+sh download.sh
+```
+
+Generate the enumeration schema MCF, which represents virus taxonomic ranks, and the formatted CSV files by running:
+
+```bash
+sh run.sh
+```
+
+### Tests
+
+The first step of `tests.sh` downloads Data Commons's `java -jar` import tool, storing it in a `tmp` directory. This assumes that the user has the Java Runtime Environment (JRE) installed. This tool is described in the Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/). The releases of the tool can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our csv + tmcf import. It evaluates whether all schema used in the import is present in the graph and whether all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings upon failure. Please note that empty tokens for some columns are expected, as this reflects the original data. This import creates the Virus nodes that are then referenced within it, which resolves any concern about missing reference warnings for these node types raised by the test.
+
+To run tests:
+
+```bash
+sh tests.sh
+```
+
+This will generate an output file with the test results for each csv + tmcf pair.
+
diff --git a/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log b/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log
@@ -0,0 +1,5 @@
+Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/BetachrysovirusMagnaporthis_VietNam_MoCV1-B
+Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/OrthoflavivirusAroaense_BeAn4073_AF013366
+Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/OrthobornavirusCaenophidiae_CHC_224_BK014571
+Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/UgandanCassavaBrownStreakVirus_UG_FJ185044
+Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/PotatoVirusY_N_X97895
diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/create_virus_taxonomic_ranking_enums.py b/scripts/biomedical/ICTV_Taxonomy/scripts/create_virus_taxonomic_ranking_enums.py
@@ -0,0 +1,141 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Author: Samantha Piekos
+Date: 02/21/2024
+Name: create_virus_taxonomic_ranking_enums.py
+Description: Creates hierarchical viral taxonomy enum schema from the
+ICTV Virus Metadata Resource.
+@file_input: ICTV Virus Metadata Resource .xlsx file
+@file_output: formatted .mcf files for viral taxonomy enum schema
+"""
+
+# load environment
+import pandas as pd
+import sys
+
+# declare universal variables
+HEADER = [
+    'sort', 'isolateSort', 'realm', 'subrealm', 'kingdom', 'subkingdom',
+    'phylum', 'subphylum', 'class', 'subclass', 'order', 'suborder', 'family',
+    'subfamily', 'genus', 'subgenus', 'species', 'isExemplar', 'name',
+    'abbreviation', 'isolateDesignation', 'genBankAccession', 'refSeqAccession',
+    'genomeCoverage', 'genomeComposition', 'hostSource', 'host', 'source',
+    'dcid', 'isolate_dcid', 'isolate_name'
+]
+
+LIST_DROP = [
+    'sort', 'isolateSort', 'species', 'isExemplar', 'name', 'abbreviation',
+    'isolateDesignation', 'genBankAccession', 'refSeqAccession',
+    'genomeCoverage', 'genomeComposition', 'hostSource', 'host', 'source',
+    'dcid', 'isolate_dcid', 'isolate_name'
+]
+
+
+# declare functions
+def pascalcase(s):
+    list_words = s.split()
+    converted = "".join(
+        word[0].upper() + word[1:].lower() for word in list_words)
+    return converted
+
+
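+def check_for_illegal_charc(s):
+    # each character needs its own list entry; adjacent string literals
+    # without a comma are silently concatenated into a single string
+    list_illegal = [
+        "'", "–", "*", ">", "<", "@", "]", "[", "|", ":", ";", " "
+    ]
+    if any(x in s for x in list_illegal):
+        print('Error! dcid contains illegal characters!', s)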
+
+
+def initiate_enum_dict():
+    d = {}
+    list_levels = [i for i in HEADER if i not in LIST_DROP]
+    for item in list_levels:
+        enum_name = 'Virus' + item.capitalize() + 'Enum'
+        d[enum_name] = {}
+    return d
+
+
+def add_enums_to_dicts(key, value, d):
+    # NaN is the only value not equal to itself, so this skips missing cells
+    if value == value:
+        enum = 'Virus' + key + 'Enum'
+        dcid = 'Virus' + key + pascalcase(value)
+        check_for_illegal_charc(dcid)
+        d[enum][value] = dcid
+    return d
+
+
+def add_item_to_enums(df):
+    list_levels = [i for i in HEADER if i not in LIST_DROP]
+    dict_of_dicts = initiate_enum_dict()
+    dict_specialization = {}  # keep track of previous top level
+    for index, row in df.iterrows():
+        last_level_dcid = False  # initiate empty value for tracking specialization
+        for item in list_levels:
+            level = item.capitalize()
+            if row[item] != row[item]:
+                continue
+            dict_of_dicts = add_enums_to_dicts(level, row[item], dict_of_dicts)
+            if last_level_dcid:  # track specialization if relevant
+                dcid = 'Virus' + level + pascalcase(row[item])
+                dict_specialization[dcid] = last_level_dcid
+            last_level_dcid = 'Virus' + level + pascalcase(
+                row[item])  # update top level
+    return dict_of_dicts, dict_specialization
+
+
+def write_individual_entries_to_file(w, enum, d, dict_specialization):
+    for key, value in d.items():
+        w.write('Node: dcid:' + value + '\n')
+        w.write('name: "' + key + '"\n')
+        w.write('typeOf: dcs:' + enum + '\n')
+        if value in dict_specialization:
+            w.write('specializationOf: dcs:' + dict_specialization[value] +
+                    '\n\n')
+        else:
+            w.write('\n')
+    return w
+
+
+def write_dict_to_file(w, enum, d, dict_specialization):
+    w.write('# ' + enum + '\n')
+    w.write('Node: dcid:' + enum + '\n')
+    w.write('name: "' + enum + '"\n')
+    w.write('typeOf: schema:Class\n')
+    w.write('subClassOf: schema:Enumeration\n\n')
+    w = write_individual_entries_to_file(w, enum, d, dict_specialization)
+    w.write('\n')
+    return w
+
+
+def generate_enums_mcf(f, w):
+    df = pd.read_excel(f, names=HEADER, header=None, sheet_name=0)
+    df = df.drop(LIST_DROP, axis=1).drop(0, axis=0)
+    dict_of_dicts, dict_specialization = add_item_to_enums(df)
+    # use a context manager so the output file is flushed and closed
+    with open(w, mode='w') as out:
+        out.write(
+            '# Schema generated by create_virus_taxonomic_ranking_enums.py\n\n')
+        for key, value in dict_of_dicts.items():
+            write_dict_to_file(out, key, value, dict_specialization)
+
+
+def main():
+    file_input = sys.argv[1]
+    file_output = sys.argv[2]
+
+    generate_enums_mcf(file_input, file_output)
+
+
+if __name__ == '__main__':
+    main()
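To make the output of the script above concrete, the following self-contained sketch writes one enum class and one member to an in-memory buffer using the same MCF line layout as `write_dict_to_file` and `write_individual_entries_to_file`. `write_enum_mcf` is an illustrative stand-in for those two functions, and the taxon names are example values rather than output copied from a real run.

```python
import io


def write_enum_mcf(out, enum, members, specialization):
    """Write one enum class and its members in the MCF layout used above."""
    out.write(f"# {enum}\n")
    out.write(f"Node: dcid:{enum}\n")
    out.write(f'name: "{enum}"\n')
    out.write("typeOf: schema:Class\n")
    out.write("subClassOf: schema:Enumeration\n\n")
    for name, dcid in members.items():
        out.write(f"Node: dcid:{dcid}\n")
        out.write(f'name: "{name}"\n')
        out.write(f"typeOf: dcs:{enum}\n")
        if dcid in specialization:
            # link the member to its parent rank
            out.write(f"specializationOf: dcs:{specialization[dcid]}\n")
        out.write("\n")


buffer = io.StringIO()
write_enum_mcf(
    buffer,
    "VirusKingdomEnum",
    {"Orthornavirae": "VirusKingdomOrthornavirae"},
    {"VirusKingdomOrthornavirae": "VirusRealmRiboviria"},
)
print(buffer.getvalue())
```

Writing to an `io.StringIO` buffer makes it possible to inspect the generated MCF without reading an Excel file or touching the file system.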
diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/download.sh b/scripts/biomedical/ICTV_Taxonomy/scripts/download.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Author: Samantha Piekos
+# Date: 02/26/2024
+# Name: download
+# Description: This file downloads the most recent version of the ICTV Master
+# Species List and Virus Metadata Resource and prepares it for processing.
+
+
+# make input directory
+mkdir -p input; cd input
+
+# download ICTV data
+curl -o ICTV_Virus_Species_List.xlsx https://ictv.global/msl/current
+curl -o ICTV_Virus_Metadata_Resource.xlsx https://ictv.global/vmr/current