diff --git a/scripts/biomedical/ICTV_Taxonomy/README.md b/scripts/biomedical/ICTV_Taxonomy/README.md new file mode 100644 index 0000000000..14374ca4d9 --- /dev/null +++ b/scripts/biomedical/ICTV_Taxonomy/README.md @@ -0,0 +1,161 @@ + +# Importing Master Species List and Virus Metadata Resource from the International Committee on Taxonomy of Viruses (ICTV) + +## Table of Contents + +1. [About the Dataset](#about-the-dataset) + 1. [Download URLs](#download-urls) + 2. [Overview](#overview) + 3. [Notes and Caveats](#notes-and-caveats) + 4. [dcid Generation](#dcid-generation) + 1. [Virus](#virus) + 2. [VirusIsolate](#virusisolate) + 3. [VirusGenomeSegment](#virusgenomesegment) + 4. [Illegal Characters](#illegal-characters) + 5. [License](#license) + 6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links) +2. [About the Import](#about-the-import) + 1. [Artifacts](#artifacts) + 1. [New Schema](#new-schema) + 2. [Scripts](#scripts) + 3. [tMCFs](#tmcfs) + 4. [Log Files](#log-files) + 2. [Import Procedure](#import-procedure) + 3. [Tests](#tests) + + +## About the Dataset +“The [International Committee on Taxonomy of Viruses (ICTV)](https://ictv.global/) authorizes and organizes the taxonomic classification of and the nomenclatures for viruses. The ICTV has developed a universal taxonomic scheme for viruses, and thus has the means to appropriately describe, name, and classify every virus that affects living organisms. The members of the International Committee on Taxonomy of Viruses are considered expert virologists. The ICTV was formed from and is governed by the Virology Division of the International Union of Microbiological Societies. Detailed work, such as delimiting the boundaries of species within a family, typically is performed by study groups of experts in the families.” Description from [Wikipedia](https://en.wikipedia.org/wiki/International_Committee_on_Taxonomy_of_Viruses). 
+ +The ICTV Master Species List is curated by virology experts, who have established over 100 international study groups. These groups organize discussions on emerging taxonomic issues in their field, oversee the submission of proposals for new taxonomy, and prepare or revise the relevant chapter(s) in ICTV reports. ICTV is open to submissions of proposals for taxonomic changes from any individual; however, in practice proposals are usually submitted by members of the relevant study groups. + +The ICTV chooses an exemplar virus for each species and the Virus Metadata Resource provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source. + +### Download URLs + +The release history and the most recent release of the Master Species List can be found [here](https://ictv.global/msl). + +The release history and the most recent release of the Virus Metadata Resource can be found [here](https://ictv.global/vmr). + + +### Overview + +This directory stores all scripts used to import data on viruses and virus isolates from the ICTV. This covers the Master Species List, which contains the full viral taxonomy (realm -> species) and information on the genomic composition and taxonomic history for all species. The import also covers the Virus Metadata Resource, which provides information on the exemplar isolates selected by the ICTV for each species and on additional virus isolates within the ICTV dataset. + + +### Notes and Caveats +Viruses are not considered alive and are therefore not classified under “The Tree of Life”. They instead have their own taxonomic classification system described here. 
However, the viral classification system mirrors “The Tree of Life” by copying its Kingdom -> Phylum -> Class -> Order -> Family -> Genus -> Species hierarchical classes, while adding a level above called Realm and sublevels under each one. This similarity in naming can lead to confusion between the two classification systems. In particular, datasets may include species of viruses without distinction alongside species of bacteria, archaea, or animals. To mitigate this potential confusion, Viruses have their own distinct schema, which they do not share with non-viral biological entities. + +Not all levels of the viral classification are currently in use. As of release 37, Subrealm, Subkingdom, and Subclass are not in use. These classifications are defined here in the schema in case they are used in future releases. In addition, for each species there is a classification defined for each of the main classes (Realm, Kingdom, Phylum, Class, Order, Family, Genus, and Species); however, classifications are missing for some or all of the subclasses (Subkingdom, Subphylum, Subclass, Suborder, Subfamily, and Subgenus). To account for this, references will be made to the parent of the next main class in addition to the parent subclass. + +“The ICTV chooses an exemplar virus for each species and the VMR provides a list of these exemplars. An exemplar virus serves as an example of a well-characterized virus isolate of that species and includes the GenBank accession number for the genomic sequence of the isolate as well as the virus name, isolate designation, suggested abbreviation, genome composition, and host source.” Additional isolates for each species within the ICTV database are also noted. + + +### dcid Generation +A ‘bio/’ prefix was attached to all dcids in this import. Each line in each input file is considered its own unique Virus or VirusIsolate. 
In cases where multiple lines generate the same dcid for a Virus, VirusIsolate, or VirusGenomeSegment, an error message is printed stating the non-unique dcid generated for the given entity. + +#### Virus +Dcids were generated by converting the Virus’s species name to pascal case (i.e. bio/). + +#### VirusIsolate +Unique information regarding the VirusIsolate was added to the end of the Virus dcid to generate a unique VirusIsolate dcid. If the isolate had a designation, this was converted to pascal case and used as the dcid (i.e. bio/). If no isolate designation was indicated, the GenBank Accession Number was used to generate the dcid if there was exactly one for that isolate (i.e. bio/). If multiple GenBank Accession numbers were associated with a virus isolate, these were daisy chained with ‘_’s to create the dcid for the VirusIsolate (i.e. bio/_). If both the isolate designation and the GenBank Accession for a VirusIsolate are missing, the word ‘Isolate’ was added to the pascal case name of the species to create the VirusIsolate dcid (i.e. bio/Isolate). + +Note: This resulted in collisions for five VirusIsolates. These errors were recorded in the [format_virus_metadata_resource.log](https://github.com/datacommonsorg/data/new/master/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log) file. + +#### VirusGenomeSegment +The GenBank Accession number for a VirusGenomeSegment was appended to the corresponding VirusIsolate dcid to generate a unique VirusGenomeSegment dcid (i.e. ). + +#### Illegal Characters +Only ASCII characters are allowed to be used in dcids. 
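The VirusIsolate fallback order described under dcid Generation can be sketched in a few lines of Python. This is a simplified illustration, not the import script itself: the function names (`pascal_case`, `is_missing`, `sketch_isolate_dcid`) are hypothetical, the `_` separator between the Virus dcid and the isolate information is an assumption based on the dcids recorded in the log file, and the sketch skips the illegal-character replacement and ASCII conversion steps.

```python
import math


def pascal_case(text):
    """Upper-case the first letter of each whitespace-separated word and join the words."""
    return "".join(word[0].upper() + word[1:] for word in text.split())


def is_missing(value):
    """Treat None, NaN (how pandas reads blank cells), and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def sketch_isolate_dcid(species, designation=None, genbank=None):
    """Hypothetical helper mirroring the fallback order in the prose above."""
    virus_dcid = "bio/" + pascal_case(species)
    # 1. Prefer the isolate designation when one is given.
    if not is_missing(designation):
        return virus_dcid + "_" + pascal_case(str(designation))
    # 2. Otherwise use the GenBank accession(s); several are chained with '_'.
    #    Entries may look like 'SegmentName: Accession', so keep the text after ':'.
    if not is_missing(genbank):
        accessions = [part.split(":")[-1].strip() for part in str(genbank).split(";")]
        return virus_dcid + "_" + "_".join(a for a in accessions if a)
    # 3. With neither available, append the word 'Isolate' to the Virus dcid.
    return virus_dcid + "Isolate"
```

Calling `sketch_isolate_dcid("Potato virus Y", designation="N")` follows the first branch, while omitting both optional arguments falls through to the `Isolate` suffix. Keeping the three cases in one function makes the precedence explicit, which is the part of the scheme most likely to cause collisions when two rows share a designation.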
Additionally, a number of characters that are illegal to include in the dcid were replaced in place with the characters specified below: + +| Illegal Character | Replacement Character | +| ----------------- | --------------------- | +| : | _ | +| ; | _ | +| | | +| [ | ( | +| ] | ) | +| - | _ | +| – | _ | +| ‘ | _ | +| # | | + + +### License + +The data is published under the Creative Commons Attribution ShareAlike 4.0 International [(CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/). + +### Dataset Documentation and Relevant Links + +- Documentation can be found in one of the Excel sheets in a downloaded dataset from ICTV. +- Taxonomy Browser User Interface: https://ictv.global/taxonomy + +## About the Import + +### Artifacts + +#### New Schema + +Classes, properties, and enumerations that were added in this import to represent the data. + +* Classes + * Virus, VirusIsolate, VirusGenomeSegment +* Properties + * Virus: proposalForLastChange, taxonHistoryURL, versionOfLastChange, virusGenomeComposition, virusHost, virusLastTaxonomicChange, virusSource, virusRealm, virusSubrealm, virusKingdom, virusSubkingdom, virusPhylum, virusSubphylum, virusClass, virusSubclass, virusOrder, virusSuborder, virusFamily, virusSubfamily, virusGenus, virusSubgenus, virusSpecies + * VirusIsolate: genomeCoverage, isExemplarVirusIsolate, ofVirusSpecies, virusIsolateDesignation + * VirusGenomeSegment: genomeSegmentOf +* Enumerations + * GenomeCoverageEnum, VirusGenomeCompositionEnum, VirusHostEnum, VirusSourceEnum +* Enumerations Generated Via Script + * VirusRealmEnum, VirusSubrealmEnum, VirusKingdomEnum, VirusSubkingdomEnum, VirusPhylumEnum, VirusSubphylumEnum, VirusClassEnum, VirusSubclassEnum, VirusOrderEnum, VirusSuborderEnum, VirusFamilyEnum, VirusSubfamilyEnum, VirusGenusEnum, VirusSubgenusEnum + +#### Scripts + +##### Bash Scripts + +- [download.sh](scripts/download.sh) downloads the most recent release of the ICTV Master Species List and Virus Metadata 
Resource. +- [run.sh](scripts/run.sh) creates the new viral taxonomy enums and converts the data into formatted CSVs for import of data on viruses, virus isolates, and viral genome segments into the knowledge graph. +- [tests.sh](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting. + +##### Python Scripts + +- [create_virus_taxonomic_ranking_enums.py](scripts/create_virus_taxonomic_ranking_enums.py) creates the viral taxonomy enum mcf file from the Virus Metadata Resource file. +- [format_virus_master_species_list.py](scripts/format_virus_master_species_list.py) parses the raw Master Species List xlsx file into the virus csv file. +- [format_virus_metadata_resource.py](scripts/format_virus_metadata_resource.py) parses the raw Virus Metadata Resource file into the virus isolates and viral genome segments csv files. + +#### tMCFs + +- [VirusSpecies.tmcf](tMCFs/VirusSpecies.tmcf) contains the tmcf mapping to the csv of viruses. +- [VirusIsolates.tmcf](tMCFs/VirusIsolates.tmcf) contains the tmcf mapping to the csv of virus isolates. +- [VirusGenomeSegments.tmcf](tMCFs/VirusGenomeSegments.tmcf) contains the tmcf mapping to the csv of viral genome segments. + +#### Log Files + +- [format_virus_metadata_resource.log](logs/format_virus_metadata_resource.log) is the log file from the script converting the Virus Metadata Resource into formatted CSV files. + +### Import Procedure + +Download the most recent versions of the Master Species List and Virus Metadata Resource from ICTV by running: + +```bash +sh download.sh +``` + +Generate the enumeration schema MCF, which represents virus taxonomic ranks, and the formatted CSV files by running: + +```bash +sh run.sh +``` + +### Tests + +The first step of `tests.sh` is to download Data Commons's java -jar import tool, storing it in a `tmp` directory. This assumes that the user has a Java Runtime Environment (JRE) installed. This tool is described in Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/). 
The releases of the tool can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our csv + tmcf import. It evaluates whether all schema used in the import is present in the graph and whether all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings upon failure. Please note that empty tokens for some columns are expected as this reflects the original data. The imports create the Virus nodes that are then referenced within this import. This resolves any concern about missing reference warnings concerning these node types by the test. + +To run tests: + +```bash +sh tests.sh +``` + +This will generate an output file for the results of the tests on each csv + tmcf pair. + diff --git a/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log b/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log new file mode 100644 index 0000000000..80d063f948 --- /dev/null +++ b/scripts/biomedical/ICTV_Taxonomy/logs/format_virus_metadata_resource.log @@ -0,0 +1,5 @@ +Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/BetachrysovirusMagnaporthis_VietNam_MoCV1-B +Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/OrthoflavivirusAroaense_BeAn4073_AF013366 +Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/OrthobornavirusCaenophidiae_CHC_224_BK014571 +Non-unique VirusIsolate dcid generated! Added additional info to differentiate: bio/UgandanCassavaBrownStreakVirus_UG_FJ185044 +Non-unique VirusIsolate dcid generated! 
Added additional info to differentiate: bio/PotatoVirusY_N_X97895 diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/create_virus_taxonomic_ranking_enums.py b/scripts/biomedical/ICTV_Taxonomy/scripts/create_virus_taxonomic_ranking_enums.py new file mode 100644 index 0000000000..1924621cf6 --- /dev/null +++ b/scripts/biomedical/ICTV_Taxonomy/scripts/create_virus_taxonomic_ranking_enums.py @@ -0,0 +1,141 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Author: Samantha Piekos +Date: 02/21/2024 +Name: create_virus_taxonomic_ranking_enums.py +Description: Creates hierarchical viral taxonomy enum schema from the +ICTV Virus Metadata Resource. 
+@file_input: ICTV Virus Metadata Resource .xlsx file +@file_output: formatted .mcf files for viral taxonomy enum schema +""" + +# load environment +import pandas as pd +import sys + +# declare universal variables +HEADER = [ + 'sort', 'isolateSort', 'realm', 'subrealm', 'kingdom', 'subkingdom', + 'phylum', 'subphylum', 'class', 'subclass', 'order', 'suborder', 'family', + 'subfamily', 'genus', 'subgenus', 'species', 'isExemplar', 'name', + 'abbreviation', 'isolateDesignation', 'genBankAccession', 'refSeqAccession', + 'genomeCoverage', 'genomeComposition', 'hostSource', 'host', 'source', + 'dcid', 'isolate_dcid', 'isolate_name' +] + +LIST_DROP = [ + 'sort', 'isolateSort', 'species', 'isExemplar', 'name', 'abbreviation', + 'isolateDesignation', 'genBankAccession', 'refSeqAccession', + 'genomeCoverage', 'genomeComposition', 'hostSource', 'host', 'source', + 'dcid', 'isolate_dcid', 'isolate_name' +] + + +# declare functions +def pascalcase(s): + list_words = s.split() + converted = "".join( + word[0].upper() + word[1:].lower() for word in list_words) + return converted + + +def check_for_illegal_charc(s): + list_illegal = ["'", "–", "*", + ">", "<", "@", "]", "[", "|", ":", ";", + " "] + if any([x in s for x in list_illegal]): + print('Error! 
dcid contains illegal characters!', s) + + +def initiate_enum_dict(): + d = {} + list_levels = [i for i in HEADER if i not in LIST_DROP] + for item in list_levels: + enum_name = 'Virus' + item.capitalize() + 'Enum' + d[enum_name] = {} + return d + + +def add_enums_to_dicts(key, value, d): + if value == value: + enum = 'Virus' + key + 'Enum' + dcid = 'Virus' + key + pascalcase(value) + check_for_illegal_charc(dcid) + d[enum][value] = dcid + return d + + +def add_item_to_enums(df): + list_levels = [i for i in HEADER if i not in LIST_DROP] + dict_of_dicts = initiate_enum_dict() + dict_specialization = {} # keep track of previous top level + for index, row in df.iterrows(): + last_level_dcid = False # initiate empty value for tracking specialization + for item in list_levels: + level = item.capitalize() + if row[item] != row[item]: + continue + dict_of_dicts = add_enums_to_dicts(level, row[item], dict_of_dicts) + if last_level_dcid: # track specialization if relevant + dcid = 'Virus' + level + pascalcase(row[item]) + dict_specialization[dcid] = last_level_dcid + last_level_dcid = 'Virus' + level + pascalcase( + row[item]) # update top level + return dict_of_dicts, dict_specialization + + +def write_individual_entries_to_file(w, enum, d, dict_specialization): + for key, value in d.items(): + w.write('Node: dcid:' + value + '\n') + w.write('name: "' + key + '"\n') + w.write('typeOf: dcs:' + enum + '\n') + if value in dict_specialization: + w.write('specializationOf: dcs:' + dict_specialization[value] + + '\n\n') + else: + w.write('\n') + return w + + +def write_dict_to_file(w, enum, d, dict_specialization): + w.write('# ' + enum + '\n') + w.write('Node: dcid:' + enum + '\n') + w.write('name: "' + enum + '"\n') + w.write('typeOf: schema:Class\n') + w.write('subClassOf: schema:Enumeration\n\n') + w = write_individual_entries_to_file(w, enum, d, dict_specialization) + w.write('\n') + return w + + +def generate_enums_mcf(f, w): + df = pd.read_excel(f, names=HEADER, 
header=None, sheet_name=0) + df = df.drop(LIST_DROP, axis=1).drop(0, axis=0) + dict_of_dicts, dict_specialization = add_item_to_enums(df) + w = open(w, mode='w') + w.write('# Schema generated by create_virus_taxonomic_ranking_enums.py\n\n') + for key, value in dict_of_dicts.items(): + w = write_dict_to_file(w, key, value, dict_specialization) + + +def main(): + file_input = sys.argv[1] + file_output = sys.argv[2] + + generate_enums_mcf(file_input, file_output) + + +if __name__ == '__main__': + main() diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/download.sh b/scripts/biomedical/ICTV_Taxonomy/scripts/download.sh new file mode 100644 index 0000000000..672164fcd4 --- /dev/null +++ b/scripts/biomedical/ICTV_Taxonomy/scripts/download.sh @@ -0,0 +1,30 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Author: Samantha Piekos +Date: 02/26/2024 +Name: download +Description: This file downloads the most recent version of the ICTV Master +Species List and Virus Metadata Resource and prepares it for processing. 
+""" + +#!/bin/bash + + +# make input directory +mkdir -p input; cd input + +# download ICTV data +curl -o ICTV_Virus_Species_List.xlsx https://ictv.global/msl/current +curl -o ICTV_Virus_Metadata_Resource.xlsx https://ictv.global/vmr/current diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/format_virus_master_species_list.py b/scripts/biomedical/ICTV_Taxonomy/scripts/format_virus_master_species_list.py new file mode 100644 index 0000000000..0445d557ec --- /dev/null +++ b/scripts/biomedical/ICTV_Taxonomy/scripts/format_virus_master_species_list.py @@ -0,0 +1,159 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Author: Samantha Piekos +Date: 02/21/2024 +Name: format_virus_master_species_list.py +Description: Formats ICTV Master Species List into a csv format for import +into Data Commons. This includes converting genome composition and last +change made to corresponding enums. Dcids were formatted by converting the +viral species name to pascalcase and adding the prefix 'bio/'. The viral +taxonomy is encoded in enum format. 
+@file_input: ICTV Master Species List .xlsx file +@file_output: formatted csv format of Virus nodes +""" + +# load environment +import pandas as pd +import sys + +# declare universal variables +DICT_CHANGE_ENUM = { + 'abolished': 'dcs:VirusLastTaxonomicChangeAbolished', + 'demoted': 'dcs:VirusLastTaxonomicChangeDemoted', + 'merged': 'dcs:VirusLastTaxonomicChangeMerged', + 'moved': 'dcs:VirusLastTaxonomicChangeMoved', + 'new': 'dcs:VirusLastTaxonomicChangeNew', + 'promoted': 'dcs:VirusLastTaxonomicChangePromoted', + 'removed as type species': 'dcs:VirusLastTaxonomicChangeRemoved', + 'renamed': 'dcs:VirusLastTaxonomicChangeRenamed', + 'split': 'dcs:VirusLastTaxonomicChangeSplit' +} + +DICT_GC = { + 'dsDNA': + 'dcs:VirusGenomeCompositionDoubleStrandedDNA', + 'ssDNA': + 'dcs:VirusGenomeCompositionSingleStrandedDNA', + 'ssDNA(-)': + 'dcs:VirusGenomeCompositionSingleStrandedDNANegative', + 'ssDNA(+)': + 'dcs:VirusGenomeCompositionSingleStrandedDNAPositive', + 'ssDNA(+/-)': + 'dcs:VirusGenomeCompositionSingleStrandedDNA', + 'dsDNA-RT': + 'dcs:VirusGenomeCompositionDoubleStrandedDNAReverseTranscription', + 'ssRNA-RT': + 'dcs:VirusGenomeCompositionSingleStrandedRNAReverseTranscription', + 'dsRNA': + 'dcs:VirusGenomeCompositionDoubleStrandedRNA', + 'ssRNA': + 'dcs:VirusGenomeCompositionSingleStrandedRNA', + 'ssRNA(-)': + 'dcs:VirusGenomeCompositionSingleStrandedRNANegative', + 'ssRNA(+)': + 'dcs:VirusGenomeCompositionSingleStrandedRNAPositive', + 'ssRNA(+/-)': + 'dcs:VirusGenomeCompositionSingleStrandedRNA' +} + +HEADER = [ + 'sort', 'realm', 'subrealm', 'kingdom', 'subkingdom', 'phylum', 'subphylum', + 'class', 'subclass', 'order', 'suborder', 'family', 'subfamily', 'genus', + 'subgenus', 'species', 'genomeComposition', 'lastChange', + 'lastChangeVersion', 'proposalForLastChange', 'taxonHistoryURL', 'dcid' +] + +LIST_TAXONOMIC_LEVELS = [ + 'realm', 'subrealm', 'kingdom', 'subkingdom', 'phylum', 'subphylum', + 'class', 'subclass', 'order', 'suborder', 'family', 'subfamily', 
'genus', + 'subgenus' +] + + +# declare functions +def pascalcase(s): + list_words = s.split() + converted = "".join( + word[0].upper() + word[1:].lower() for word in list_words) + return converted + + +def check_for_illegal_charc(s): + list_illegal = ["'", "–", "*", + ">", "<", "@", "]", "[", "|", ":", ";", + " "] + if any([x in s for x in list_illegal]): + print('Error! dcid contains illegal characters!', s) + + +def format_taxonomic_rank_properties(df, index, row): + for rank in LIST_TAXONOMIC_LEVELS: + if row[rank] == row[rank]: + enum = 'dcs:Virus' + rank.upper()[0] + rank.lower( + )[1:] + pascalcase(row[rank]) + df.loc[index, rank] = enum + return df + + +def convert_gc_to_enum(gc): + list_enum = [] + list_gc = gc.split(';') + for item in list_gc: + item = item.strip() + enum = DICT_GC[item] + list_enum.append(enum) + return (',').join(list_enum) + + +def convert_change_to_enum(change): + list_enum = [] + change = change.lower() + list_changes = change.split(',')[:-1] + for item in list_changes: + enum = DICT_CHANGE_ENUM[item] + list_enum.append(enum) + return (',').join(list_enum) + + +def clean_df(df): + for index, row in df.iterrows(): + dcid = 'bio/' + pascalcase(row['species']) + check_for_illegal_charc(dcid) + df = format_taxonomic_rank_properties(df, index, row) + df.loc[index, 'dcid'] = dcid + df.loc[index, 'genomeComposition'] = convert_gc_to_enum( + row['genomeComposition']) + df.loc[index, 'lastChange'] = convert_change_to_enum(row['lastChange']) + df.loc[index, + 'taxonHistoryURL'] = row['taxonHistoryURL'].replace('ICTVonline=', '') + return df + + +def clean_file(f, w): + df = pd.read_excel(f, names=HEADER, header=None, sheet_name=1) + df = df.drop('sort', axis=1).drop(0, axis=0) + df = clean_df(df) + df.to_csv(w, index=False) + + +def main(): + file_input = sys.argv[1] + file_output = sys.argv[2] + + clean_file(file_input, file_output) + + +if __name__ == '__main__': + main() diff --git 
a/scripts/biomedical/ICTV_Taxonomy/scripts/format_virus_metadata_resource.py b/scripts/biomedical/ICTV_Taxonomy/scripts/format_virus_metadata_resource.py new file mode 100644 index 0000000000..b0294423bc --- /dev/null +++ b/scripts/biomedical/ICTV_Taxonomy/scripts/format_virus_metadata_resource.py @@ -0,0 +1,370 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Author: Samantha Piekos +Date: 02/21/2024 +Name: format_virus_metadata_resource.py +Description: Formats ICTV Virus Metadata Resource into two csv files - +one specific to VirusIsolates and the other VirusGenomeSegment for import +into Data Commons. This includes converting genome composition, genome +coverage, viral host, and viral source to corresponding enums. Virus, +VirusIsolate and VirusGenomeSegment dcids were formatted by converting +the names into pascal case and adding the prefix 'bio/'. The viral taxonomy +is encoded in enum format and found within Virus nodes. Whether an isolate +is an exemplar isolate or not was encoded into a boolean as a value for the +property 'isExemplar'. 
+@file_input: ICTV Virus Metadata Resource .xlsx file +@file_output: formatted csv format of VirusIsolate and VirusGenomeSegment + nodes +""" + +# set up environment +import pandas as pd +import sys +import unidecode + +# declare universal variables +DICT_COVERAGE = { + 'coding-complete genome': 'dcs:GenomeCoverageCompleteGenome', + 'complete genome': 'dcs:GenomeCoverageCompleteGenome', + 'complete coding genome': 'dcs:GenomeCoverageCompleteCodingGenome', + 'no entry in genbank': 'dcs:GenomeCoverageNoEntryInGenBank', + 'partial genome': 'dcs:GenomeCoveragePartialGenome' +} + +DICT_GC = { + 'dsDNA': + 'dcs:VirusGenomeCompositionDoubleStrandedDNA', + 'ssDNA': + 'dcs:VirusGenomeCompositionSingleStrandedDNA', + 'ssDNA(-)': + 'dcs:VirusGenomeCompositionSingleStrandedDNANegative', + 'ssDNA(+)': + 'dcs:VirusGenomeCompositionSingleStrandedDNAPositive', + 'ssDNA(+/-)': + 'dcs:VirusGenomeCompositionSingleStrandedDNA', + 'dsDNA-RT': + 'dcs:VirusGenomeCompositionDoubleStrandedDNAReverseTranscription', + 'ssRNA-RT': + 'dcs:VirusGenomeCompositionSingleStrandedRNAReverseTranscription', + 'dsRNA': + 'dcs:VirusGenomeCompositionDoubleStrandedRNA', + 'ssRNA': + 'dcs:VirusGenomeCompositionSingleStrandedRNA', + 'ssRNA(-)': + 'dcs:VirusGenomeCompositionSingleStrandedRNANegative', + 'ssRNA(+)': + 'dcs:VirusGenomeCompositionSingleStrandedRNAPositive', + 'ssRNA(+/-)': + 'dcs:VirusGenomeCompositionSingleStrandedRNA' +} + +DICT_HOST = { + 'algae': 'dcs:VirusHostAlgae', + 'archaea': 'dcs:VirusHostArchaea', + 'bacteria': 'dcs:VirusHostBacteria', + 'fungi': 'dcs:VirusHostFungi', + 'invertebrates': 'dcs:VirusHostInvertebrates', + 'plants': 'dcs:VirusHostPlants', + 'protists': 'dcs:VirusHostProtists', + 'vertebrates': 'dcs:VirusHostVertebrates' +} + +DICT_SOURCE = { + 'freshwater': 'dcs:VirusSourceFreshwater', + 'invertebrates': 'dcs:VirusSourceInvertebrates', + 'marine': 'dcs:VirusSourceMarine', + 'phytobiome': 'dcs:VirusSourcePhytobiome', + 'plants': 'dcs:VirusSourcePlants', + 'protists': 
'dcs:VirusSourceProtists', + 'sewage': 'dcs:VirusSourceSewage', + 'soil': 'dcs:VirusSourceSoil' +} + +HEADER = [ + 'sort', 'isolateSort', 'realm', 'subrealm', 'kingdom', 'subkingdom', + 'phylum', 'subphylum', 'class', 'subclass', 'order', 'suborder', 'family', + 'subfamily', 'genus', 'subgenus', 'species', 'isExemplar', 'name', + 'abbreviation', 'isolateDesignation', 'genBankAccession', 'refSeqAccession', + 'genomeCoverage', 'genomeComposition', 'hostSource', 'host', 'source', + 'dcid', 'isolate_dcid', 'isolate_name' +] + +HEADER_2 = [ + 'dcid', 'name', 'genBankAccession', 'genomeSegmentOf', 'refSeqAccession' +] + +LIST_TAXONOMIC_LEVELS = [ + 'realm', 'subrealm', 'kingdom', 'subkingdom', 'phylum', 'subphylum', + 'class', 'subclass', 'order', 'suborder', 'family', 'subfamily', 'genus', + 'subgenus' +] + + +# declare functions +# declare functions +def pascalcase(s): + list_words = s.split() + converted = "".join(word[0].upper() + word[1:] for word in list_words) + return converted + + +def check_for_illegal_charc(s): + list_illegal = [ + "'", "#", "–", "*", + ">", "<", "@", "]", "[", "|", ":", ";", " " + ] + if any([x in s for x in list_illegal]): + print('Error! 
dcid contains illegal characters!', s)
+
+
+def format_list(s):
+    if s != s:
+        return s
+    list_items = []
+    s = str(s)
+    list_s = s.split(';')
+    for item in list_s:
+        list_items.append(item.strip())
+    return (',').join(list_items)
+
+
+def format_taxonomic_rank_properties(df, index, row):
+    for rank in LIST_TAXONOMIC_LEVELS:
+        if row[rank] == row[rank]:
+            enum = 'dcs:Virus' + rank.upper()[0] + rank.lower(
+            )[1:] + pascalcase(row[rank])
+            df.loc[index, rank] = enum
+    return df
+
+
+def convert_gc_to_enum(gc):
+    list_enum = []
+    list_gc = gc.split(';')
+    for item in list_gc:
+        item = item.strip()
+        enum = DICT_GC[item]
+        list_enum.append(enum)
+    return (',').join(list_enum)
+
+
+def convert_coverage_to_enum(cov):
+    return DICT_COVERAGE[cov.lower()]
+
+
+def convert_type_to_boolean(t):
+    if t == 'E':
+        return True
+    if t == 'A':
+        return False
+    print('Error! Not an expected isolate type! Expected E or A, but got', t,
+          '.')
+
+
+def convert_source_to_enum(source):
+    source = source[:-4]
+    return DICT_SOURCE[source]
+
+
+def convert_host_to_enum(host):
+    list_enum = []
+    list_host = host.split(',')
+    for item in list_host:
+        item = item.strip()
+        enum = DICT_HOST[item]
+        list_enum.append(enum)
+    return (',').join(list_enum)
+
+
+def handle_genBank_missing_exception(n, virus_dcid, virus_name):
+    if n != n:
+        dcid = virus_dcid + 'Isolate'
+        name = virus_name + ' Isolate'
+        return dcid, name
+    n = str(n)
+    if ';' in n:
+        n = n.split(';')[0]
+    dcid = virus_dcid + pascalcase(n)
+    dcid = dcid.replace("'", "")
+    dcid = dcid.replace('–', '-')
+    name = virus_name + n
+    return dcid, name
+
+
+def handle_genBank_components_exception(genBank, virus_dcid, virus_name):
+    dcid = virus_dcid
+    name = virus_name
+    list_genBank = genBank.split(';')
+    for item in list_genBank:
+        if ':' in item:
+            n, gb = item.split(':')
+            dcid = virus_dcid + '_' + gb.strip()
+            name = virus_name + gb
+        else:
+            dcid = virus_dcid + '_' + item.strip()
+            name = virus_name + item
+    return dcid, name
+
+
+def format_isolate_designation_for_dcid(des):
+    des = str(des)
+    des = des.replace(':', '_')
+    des = des.replace(';', '_')
+    des = des.replace('[', '(')
+    des = des.replace(']', ')')
+    des = des.replace('-', '_')
+    des = des.replace('–', '_')
+    des = des.replace("'", '')
+    des = des.replace('#', '')
+    return des
+
+
+def verify_isolate_dcid_uniqueness(dcid, list_isolate_dcids, genBank,
+                                   virus_abrv):
+    if dcid in list_isolate_dcids:
+        if ';' in genBank:
+            dcid = dcid + '_' + virus_abrv
+        else:
+            dcid = dcid + '_' + genBank
+        print(
+            'Non-unique VirusIsolate dcid generated! Added additional info to differentiate:',
+            dcid)
+    list_isolate_dcids.append(dcid)
+    return dcid, list_isolate_dcids
+
+
+def declare_isolate_dcid(n, genBank, virus_dcid, virus_name, virus_abrv,
+                         isolate_designation, list_isolate_dcids):
+    if isolate_designation == isolate_designation:
+        des = format_isolate_designation_for_dcid(isolate_designation)
+        dcid = virus_dcid + '_' + pascalcase(des)
+        name = virus_name + ' strain ' + str(isolate_designation)
+    elif genBank != genBank:
+        dcid, name = handle_genBank_missing_exception(n, virus_dcid, virus_name)
+    elif ':' in genBank or ';' in genBank:
+        dcid, name = handle_genBank_components_exception(
+            genBank, virus_dcid, virus_name)
+    else:
+        dcid = virus_dcid + '_' + genBank
+        name = virus_name + ' ' + genBank
+    dcid = dcid.replace(' ', '')
+    dcid = unidecode.unidecode(dcid)
+    dcid, list_isolate_dcids = verify_isolate_dcid_uniqueness(
+        dcid, list_isolate_dcids, genBank, virus_abrv)
+    return dcid, name, list_isolate_dcids
+
+
+def make_refSeq_dict(refSeq):
+    d = {}
+    list_refSeq = refSeq.split(';')
+    for item in list_refSeq:
+        if ':' in item:
+            name, rs = item.split(':')
+            d[name.strip()] = rs.strip()
+    return d
+
+
+def handle_genome_segments(df_segment, virus_dcid, virus_name, isolate_dcid,
+                           genBank, refSeq):
+    dict_refSeq = {}
+    list_genBank = genBank.split(';')
+    if refSeq == refSeq:
+        dict_refSeq = make_refSeq_dict(refSeq)
+    for item in list_genBank:
+        d = {
+            'dcid': [],
+            'name': [],
+            'genBankAccession': [],
+            'genomeSegmentOf': [],
+            'refSeqAccession': []
+        }
+        if ':' not in item:
+            continue
+        name, gb = item.split(':')
+        name = name.strip()
+        gb = gb.strip()
+        d['dcid'].append(virus_dcid + gb)
+        check_for_illegal_charc(virus_dcid + gb)
+        d['name'].append(virus_name + ' Segment ' + name)
+        d['genBankAccession'].append(gb)
+        d['genomeSegmentOf'].append('dcid:' + isolate_dcid)
+        if name in dict_refSeq:
+            d['refSeqAccession'].append(dict_refSeq[name])
+        else:
+            d['refSeqAccession'].append('')
+        df_new_row = pd.DataFrame.from_dict(d, orient='columns')
+        df_segment = pd.concat([df_segment, df_new_row], ignore_index=True)
+    return df_segment
+
+
+def clean_df(df, df_segment):
+    list_isolate_dcids = []
+    for index, row in df.iterrows():
+        dcid = 'bio/' + pascalcase(row['species'])
+        check_for_illegal_charc(dcid)
+        df.loc[index, 'dcid'] = dcid
+        df = format_taxonomic_rank_properties(df, index, row)
+        isolate_dcid, isolate_name, list_isolate_dcids = declare_isolate_dcid(
+            row['name'], row['genBankAccession'], dcid, row['species'],
+            row['abbreviation'], row['isolateDesignation'], list_isolate_dcids)
+        check_for_illegal_charc(isolate_dcid)
+        df.loc[index, 'isolate_dcid'] = isolate_dcid
+        df.loc[index, 'isolate_name'] = isolate_name
+        df.loc[index, 'genomeComposition'] = convert_gc_to_enum(
+            row['genomeComposition'])
+        df.loc[index, 'genomeCoverage'] = convert_coverage_to_enum(
+            row['genomeCoverage'])
+        df.loc[index, 'isExemplar'] = convert_type_to_boolean(row['isExemplar'])
+        df.loc[index, 'name'] = format_list(row['name'])
+        df.loc[index, 'abbreviation'] = format_list(row['abbreviation'])
+        df.loc[index,
+               'isolateDesignation'] = format_list(row['isolateDesignation'])
+        genBank = row['genBankAccession']
+        if genBank == genBank and ':' in genBank:
+            df_segment = handle_genome_segments(df_segment, dcid, row['name'],
+                                                isolate_dcid, genBank,
+                                                row['refSeqAccession'])
+            df.loc[index, 'genBankAccession'] = ''
+            df.loc[index, 'refSeqAccession'] = ''
+        elif genBank == genBank and ';' in genBank:
+            df.loc[index, 'genBankAccession'] = format_list(genBank)
+            df.loc[index,
+                   'refSeqAccession'] = format_list(row['refSeqAccession'])
+        if '(S)' in row['hostSource']:
+            df.loc[index, 'source'] = convert_source_to_enum(row['hostSource'])
+        else:
+            df.loc[index, 'host'] = convert_host_to_enum(row['hostSource'])
+    return df, df_segment
+
+
+def clean_file(f, w, w_2):
+    df = pd.read_excel(f, names=HEADER, header=None, sheet_name=0)
+    df = df.drop(0, axis=0)
+    df_segment = pd.DataFrame([], columns=HEADER_2)
+    df, df_segment = clean_df(df, df_segment)
+    df = df.drop(['sort', 'isolateSort', 'hostSource'], axis=1)
+    df.to_csv(w, index=False)
+    df_segment.to_csv(w_2, index=False)
+
+
+def main():
+    file_input = sys.argv[1]
+    file_output_1 = sys.argv[2]
+    file_output_2 = sys.argv[3]
+
+    clean_file(file_input, file_output_1, file_output_2)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/run.sh b/scripts/biomedical/ICTV_Taxonomy/scripts/run.sh
new file mode 100644
index 0000000000..796a9cb975
--- /dev/null
+++ b/scripts/biomedical/ICTV_Taxonomy/scripts/run.sh
@@ -0,0 +1,34 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Author: Samantha Piekos
+# Date: 02/26/2024
+# Name: download
+# Description: This file runs the python scripts to generate the viral taxonomy
+# enum mcf file and the csv files for Viruses, Virus Isolates, and Virus Genome
+# Segments from the ICTV Master Species List and the Virus Metadata Files.
+#
+
+# !/bin/bash
+
+# make CSV directory to which to output cleaned csv
+mkdir -p CSVs
+
+# Command to Generate Taxonomic Rank Enum Schema
+python3 scripts/create_virus_taxonomic_ranking_enums.py input/ICTV_Virus_Metadata_Resource.xlsx ICTV_schema_taxonomic_ranking_enum.mcf
+
+# Commands to Run Scripts to Generate Cleaned CSV Files
+python3 scripts/format_virus_master_species_list.py input/ICTV_Virus_Species_List.xlsx CSVs/VirusSpecies.csv
+
+python3 scripts/format_virus_metadata_resource.py input/ICTV_Virus_Metadata_Resource.xlsx CSVs/VirusIsolates.csv CSVs/VirusGenomeSegments.csv > format_virus_metadata_resource.log
diff --git a/scripts/biomedical/ICTV_Taxonomy/scripts/tests.sh b/scripts/biomedical/ICTV_Taxonomy/scripts/tests.sh
new file mode 100644
index 0000000000..5117c9eb81
--- /dev/null
+++ b/scripts/biomedical/ICTV_Taxonomy/scripts/tests.sh
@@ -0,0 +1,37 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Author: Samantha Piekos
+# Date: 03/05/2024
+# Name: tests
+# Description: This file runs the Data Commons Java tool to run standard
+# tests on tmcf + CSV pairs for the ICTV data import.
+#
+
+#!/bin/bash
+
+# download data commons java test tool version 0.1-alpha.1k
+mkdir -p tmp; cd tmp
+wget https://github.com/datacommonsorg/import/releases/download/0.1-alpha.1k/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar
+cd ..
+
+# run tests
+java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCF/VirusSpecies.tmcf CSVs/VirusSpecies.csv ICTV*.mcf
+mv dc_generated species
+
+java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCF/VirusIsolates.tmcf CSVs/VirusIsolates.csv ICTV*.mcf
+mv dc_generated virus_isolates
+
+java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCF/VirusGenomeSegments.tmcf CSVs/VirusGenomeSegments.csv ICTV*.mcf
+mv dc_generated genome_segments
diff --git a/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusGenomeSegments.tmcf b/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusGenomeSegments.tmcf
new file mode 100644
index 0000000000..d9288e2ce7
--- /dev/null
+++ b/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusGenomeSegments.tmcf
@@ -0,0 +1,7 @@
+Node: E:VirusGenomeSegments->E1
+typeOf: dcs:VirusGenomeSegment
+dcid: C:VirusGenomeSegments->dcid
+name: C:VirusGenomeSegments->name
+genBankAccession: C:VirusGenomeSegments->genBankAccession
+genomeSegmentOf: C:VirusGenomeSegments->genomeSegmentOf
+refSeqAccession: C:VirusGenomeSegments->refSeqAccession
diff --git a/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusIsolates.tmcf b/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusIsolates.tmcf
new file mode 100644
index 0000000000..26d13211cd
--- /dev/null
+++ b/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusIsolates.tmcf
@@ -0,0 +1,35 @@
+Node: E:VirusTaxonomy->E1
+typeOf: dcs:Virus
+dcid: C:VirusTaxonomy->dcid
+name: C:VirusTaxonomy->species
+abbreviation: C:VirusTaxonomy->abbreviation
+alternateName: C:VirusTaxonomy->name
+virusClass: C:VirusTaxonomy->class
+virusFamily: C:VirusTaxonomy->family
+virusGenomeComposition: C:VirusTaxonomy->genomeComposition
+virusGenus: C:VirusTaxonomy->genus
+virusHost: C:VirusTaxonomy->host
+virusKingdom: C:VirusTaxonomy->kingdom
+virusOrder: C:VirusTaxonomy->order
+virusPhylum: C:VirusTaxonomy->phylum
+virusRealm: C:VirusTaxonomy->realm
+virusSource: C:VirusTaxonomy->source
+virusSpecies: C:VirusTaxonomy->species
+virusSubclass: C:VirusTaxonomy->subclass
+virusSubfamily: C:VirusTaxonomy->subfamily
+virusSubgenus: C:VirusTaxonomy->subgenus
+virusSubkingdom: C:VirusTaxonomy->subkingdom
+virusSuborder: C:VirusTaxonomy->suborder
+virusSubphylum: C:VirusTaxonomy->subphylum
+virusSubrealm: C:VirusTaxonomy->subrealm
+
+Node: E:VirusTaxonomy->E2
+typeOf: dcs:VirusIsolate
+dcid: C:VirusTaxonomy->isolate_dcid
+name: C:VirusTaxonomy->isolate_name
+genBankAccession: C:VirusTaxonomy->genBankAccession
+genomeCoverage: C:VirusTaxonomy->genomeCoverage
+isExemplarVirusIsolate: C:VirusTaxonomy->isExemplar
+ofVirusSpecies: E:VirusTaxonomy->E1
+refSeqAccession: C:VirusTaxonomy->refSeqAccession
+virusIsolateDesignation: C:VirusTaxonomy->isolateDesignation
diff --git a/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusSpecies.tmcf b/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusSpecies.tmcf
new file mode 100644
index 0000000000..623dc759ed
--- /dev/null
+++ b/scripts/biomedical/ICTV_Taxonomy/tMCF/VirusSpecies.tmcf
@@ -0,0 +1,24 @@
+Node: E:VirusSpecies->E1
+typeOf: dcs:Virus
+dcid: C:VirusSpecies->dcid
+name: C:VirusSpecies->species
+proposalForLastChange: C:VirusSpecies->proposalForLastChange
+taxonHistoryURL: C:VirusSpecies->taxonHistoryURL
+versionOfLastChange: C:VirusSpecies->lastChangeVersion
+virusClass: C:VirusSpecies->class
+virusFamily: C:VirusSpecies->family
+virusGenomeComposition: C:VirusSpecies->genomeComposition
+virusGenus: C:VirusSpecies->genus
+virusKingdom: C:VirusSpecies->kingdom
+virusLastTaxonomicChange: C:VirusSpecies->lastChange
+virusOrder: C:VirusSpecies->order
+virusPhylum: C:VirusSpecies->phylum
+virusRealm: C:VirusSpecies->realm
+virusSpecies: C:VirusSpecies->species
+virusSubclass: C:VirusSpecies->subclass
+virusSubfamily: C:VirusSpecies->subfamily
+virusSubgenus: C:VirusSpecies->subgenus
+virusSubkingdom: C:VirusSpecies->subkingdom
+virusSuborder: C:VirusSpecies->suborder
+virusSubphylum: C:VirusSpecies->subphylum
+virusSubrealm: C:VirusSpecies->subrealm
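
Note on the missing-value checks in `scripts/format_virus_metadata_resource.py`: functions such as `format_list` test `s != s`, which is true only for NaN (the value pandas assigns to empty spreadsheet cells). The sketch below shows the same list-formatting step with that check spelled out. It is an illustration only, not the code in this PR; the helper name `is_missing` is hypothetical, and the sketch assumes missing cells arrive as float NaN.

```python
import math


def is_missing(value):
    """Return True for float NaN (hypothetical helper, not part of the PR)."""
    return isinstance(value, float) and math.isnan(value)


def format_list(value):
    """Turn a ';'-separated cell into a ','-separated string.

    Missing values are passed through unchanged, mirroring the
    `if s != s: return s` guard in the original function.
    """
    if is_missing(value):
        return value
    return ','.join(item.strip() for item in str(value).split(';'))
```

For example, `format_list('MN908947; NC_045512')` gives `'MN908947,NC_045512'`, while a NaN cell is returned untouched so it remains an empty value when the cleaned CSV is written.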