diff --git a/.gitignore b/.gitignore index f4453e3..2b2d2b5 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,6 @@ edirect share paper/paper.html paper/paper.doc +# pixi environments +.pixi +*.egg-info diff --git a/paper/mra.md b/paper/mra.md index 19ca283..d3ca356 100644 --- a/paper/mra.md +++ b/paper/mra.md @@ -47,7 +47,9 @@ Kalamari also contains a custom taxonomy and software for downloading and format ## Announcement -Public Health laboratories sequence microbial pathogens daily for genomic epidemiology, i.e., to track pathogen spread [@armstrong2019pathogen]. +Public Health laboratories sequence microbial pathogens daily for many applications including genomic epidemiology [@armstrong2019pathogen], +species identification [@lindsey2023rapid], +and metagenomic analysis [@huang2017metagenomics]. Relevant databases exist such as RefSeq [@o2016reference] or The Genome Taxonomy Database (GTDB) [@parks2022gtdb]. However, due to their so comprehensive nature, they are disadvantageous for our specific purposes: @@ -64,7 +66,7 @@ All chromosomes and plasmids are complete, i.e., no contig breaks, and obtained from trusted sources, e.g., FDA-ARGOS [@sichtig2019fda] or the NCTC 3000 collection [@dicks2023nctc3000], or provided and reviewed by a CDC subject matter expert. We obtained the list of plasmids from the Mob-Suite project [@robertsonMobsuite] -and clustered them at 97% average nucleotide identity (ANI) [@lindsey2023rapid]. +and clustered them at 97% average nucleotide identity using edlb_ani_mummer v1 with default options [@lindsey2023rapid]. For each cluster, the taxonomy identifier was raised to the lowest common tier of taxonomy. For example, if a cluster of plasmids were identified in both _Escherichia coli_ and _Salmonella enterica_, then taxonomy identifiers for all the plasmids in the cluster were changed to their common family, _Enterobacteriaceae_. 
As a result, any taxonomic signature from these plasmids diff --git a/src/provenance/README.md b/src/provenance/README.md index f5e1acf..9abc9af 100644 --- a/src/provenance/README.md +++ b/src/provenance/README.md @@ -57,6 +57,16 @@ _NOTE_ This command probably would have saved me time if I turned it into a batc datasets summary genome accession GCF_000006945.2 --report sequence --as-json-lines | dataformat tsv genome-seq --fields genbank-seq-acc ``` +Check the assemblies spreadsheet to see whether the genome is retrospectively part of the NCBI references + +```bash +zcat assembly_summary_genbank.txt.gz | perl -F'\t' -lane 'print if($F[11] eq "Complete Genome" || $.==1);' > tmp.tsv +mv tmp.tsv assembly-complete.tsv +bash quick-ncbi-ref-check.sh Caulobacter vibrioides CP001340 +# If found, add to ncbi_ref.acc.more +# If not found, see if there is a different reference to add +``` + ## Translate Translate certain entries into CDC, NCTC3000, FDA, or the NCBI list. diff --git a/src/provenance/SME.acc b/src/provenance/SME.acc index 573562a..f350fce 100644 --- a/src/provenance/SME.acc +++ b/src/provenance/SME.acc @@ -5,6 +5,8 @@ Betacoronavirus_coronavirus MT233526 2697049 694009 UNKNOWN Borreliella_burgdorferi AE000783 139 64895 UNKNOWN Burkholderia_pseudomallei BX571965 28450 111527 UNKNOWN Burkholderia_pseudomallei BX571966 28450 111527 UNKNOWN + +# A ton of Campy and related genomes from Jim Bono Campylobacter_coli CP028187 195 194 UNKNOWN Campylobacter_concisus CP012541 199 194 NCBI-REF Campylobacter_cuniculorum CP020867 374106 194 NCBI-REF @@ -22,6 +24,9 @@ Campylobacter_sputorum CP019683 202 194 UNKNOWN Campylobacter_subantarcticus CP007772 497724 194 UNKNOWN Campylobacter_volucris CP007774 1031542 194 UNKNOWN Campylobacter_ureolyticus CP012195 827 194 UNKNOWN +Arcobacter_ellisii CP032097 913109 28196 UNKNOWN +Arcobacter_cibarius CP043857 255507 28196 UNKNOWN +Helicobacter_pylori AE000511 210 209 UNKNOWN # We trust Segre's lab for Citrobacter Citrobacter_freundii CP007557 
546 544 UNKNOWN @@ -86,8 +91,31 @@ Yersinia_massiliensis CP054048 33060 629 UNKNOWN Yersinia_pseudotuberculosis CP009712 502800 633 UNKNOWN +Salmonella_enterica_IIIa CP000880 9000014 28901 UNKNOWN +Salmonella_enterica_IIIb CP053583 9000015 28901 UNKNOWN +Salmonella_enterica_IV CP053579 59205 28901 UNKNOWN +Salmonella_enterica_IX CP054715 9000016 28901 UNKNOWN +Salmonella_enterica_VII CP053582 59208 28901 UNKNOWN +Salmonella_enterica_X CP053581 9000017 28901 UNKNOWN + + Neisseria_gonorrhoeae AE004969 485 482 UNKNOWN Neisseria_meningitidis AE002098 487 482 UNKNOWN # Gulvik Leptospira_biflexa CP000777 355278 145259 UNKNOWN +Leptospira_interrogans CP020414 173 171 UNKNOWN + +# Eukie hosts +Bos_taurus KC153975 9913 9903 UNKNOWN +Arripis_trutta AP006810 270544 163128 UNKNOWN +Brassica_oleracea JF920286 3712 3705 UNKNOWN +Gallus_gallus HQ857211 9031 9030 UNKNOWN +Humulus_lupulus NC_086845 3486 3484 UNKNOWN +Lactuca_sativa MK820672 75943 4235 UNKNOWN +Neomysis_japonica KR006340 1676841 223649 UNKNOWN +Pollachius_virens FR751399 8060 8059 UNKNOWN +Solanum_lycopersicum MF034192 4081 49274 UNKNOWN +Sus_scrofa FJ236999 9823 9822 UNKNOWN +Thunnus_alalunga AB101291 8235 8234 UNKNOWN +Vicia_faba KC189947 3906 3904 UNKNOWN diff --git a/src/provenance/chromosomes.insdc.tsv b/src/provenance/chromosomes.insdc.tsv index 227bbf5..52b3c51 100644 --- a/src/provenance/chromosomes.insdc.tsv +++ b/src/provenance/chromosomes.insdc.tsv @@ -2,15 +2,13 @@ scientificName nuccoreAcc taxid parent Acholeplasma_laidlawii CP000896 2148 2147 Acinetobacter_baumannii CP045110 509170 470 Acinetobacter_pittii CP002177 48296 909768 -Aeromonas_hydrophila CP000462 644 642 Agrobacterium_fabrum AE007869 1176649 1183400 Agrobacterium_fabrum AE007870 1176649 1183400 Alkaliphilus_transvaalensis CP000724 208226 114627 Aliivibrio_fischeri CP000021 668 511678 Aliivibrio_fischeri CP000020 668 511678 -Amycolatopsis_mediterranei CP002000 33910 1813 +Amycolatopsis_mediterranei CP002896 33910 1813 Aquifex_aeolicus 
AE000657 63363 2713 -Architeuthis_dux NC_011581 256136 34555 Arcobacter_bivalviorum CP031217 663364 2321115 Arcobacter_ellisii CP032097 913109 28196 Arcobacter_cibarius CP043857 255507 28196 @@ -26,14 +24,13 @@ Atlantibacter_hermannii CP042941 565 1903434 Bacillus_cereus AP007209 1396 86661 Bacillus_subtilis AL009126 1423 653685 Bacillus_thuringiensis AE017355 1428 86661 -Bacteroides_fragilis AP006841 817 816 -Bacteroides_thetaiotaomicron AE015928 818 816 +Bacteroides_fragilis CR626927 817 816 +Bacteroides_thetaiotaomicron CP040530 818 816 Bartonella_bacilliformis CP000524 774 773 Betacoronavirus_coronavirus MT233526 2697049 694009 -Bifidobacterium_adolenscentis CP028341 1680 1678 -Bifidobacterium_bifidum CP001840 1681 1678 -Bifidobacterium_longum AE014295 216816 1678 -Bordetella_bronchiseptica HE965806 518 517 +Bifidobacterium_bifidum CP058603 1681 1678 +Bifidobacterium_longum AP010888 216816 1678 +Bordetella_bronchiseptica LR134326 518 517 Borreliella_burgdorferi AE000783 139 64895 Bos_taurus KC153975 9913 9903 Brachybacterium_faecium CP001643 43669 43668 diff --git a/src/provenance/ncbi_ref.acc.gz b/src/provenance/ncbi_ref.acc.gz index 66a282a..038fe86 100644 Binary files a/src/provenance/ncbi_ref.acc.gz and b/src/provenance/ncbi_ref.acc.gz differ diff --git a/src/provenance/ncbi_ref.acc.more b/src/provenance/ncbi_ref.acc.more new file mode 100644 index 0000000..fb34909 --- /dev/null +++ b/src/provenance/ncbi_ref.acc.more @@ -0,0 +1,68 @@ +# I'll just cat these to ncbi_ref.acc.gz later +BA000040.2 +Buchnera_aphidicola BA000003 9 32199 UNKNOWN +Candidatus_Desulforudis_audaxviator CP000860 471827 471826 UNKNOWN +Chlamydia_pneumoniae AE001363 83558 810 UNKNOWN +Caulobacter_vibrioides CP001340 155892 75 UNKNOWN +Corynebacterium_diphtheriae CP091095 1717 1716 UNKNOWN +Corynebacterium_urealyticum AM942444 43771 1716 UNKNOWN +Cronobacter_condimenti CP012264 1163710 413496 UNKNOWN +Cronobacter_malonaticus CP006731 413503 413496 UNKNOWN +Cronobacter_sakazakii 
CP011047 28141 413496 UNKNOWN +Cronobacter_turicensis FN543093 413502 413496 UNKNOWN +Coxiella_burnetii AE016828 777 776 UNKNOWN +Cronobacter_malonaticus CP006731 413503 413496 UNKNOWN +Cronobacter_sakazakii CP011047 28141 413496 UNKNOWN +Cronobacter_turicensis FN543093 413502 413496 UNKNOWN +Deinococcus_radiodurans AE000513 1299 1298 UNKNOWN +Deinococcus_radiodurans AE001825 1299 1298 UNKNOWN +Exiguobacterium_antarcticum CP003063 132920 33986 UNKNOWN +Flavobacterium_psychrophilum AM398681 96345 237 UNKNOWN +Francisella_tularensis AJ749949 263 262 UNKNOWN +Gardnerella_vaginalis CP002104 2702 2701 UNKNOWN +Haemophilus_influenzae L42023 727 724 UNKNOWN +Halobacterium_salinarum AM774415 478009 2242 UNKNOWN +Helianthus_annuus MG770607 4232 4231 UNKNOWN +Ketogulonicigenium_vulgare CP002018 92945 92944 UNKNOWN +Klebsiella_aerogenes CP002824 548 570 UNKNOWN +Lactobacillus_acidophilus CP000033 1579 1578 UNKNOWN +Lactococcus_lactis AE005176 1358 1357 UNKNOWN +Legionella_pneumophila AE017354 446 445 UNKNOWN +Leuconostoc_citreum DQ489736 349519 33964 UNKNOWN +Lysinibacillus_sphaericus CP000817 444177 1421 UNKNOWN +Mesorhizobium_ciceri CP002447 39645 68287 UNKNOWN +Methylobacterium CP000943 426117 2615210 UNKNOWN +Methylobacterium_radiotolerans CP001001 31998 407 UNKNOWN +Micrococcus_luteus CP001628 1270 1269 UNKNOWN +Morganella_morganii_morganii CP004345 180434 582 UNKNOWN +Mycobacterium_leprae AL450380 1769 1763 UNKNOWN +Mycoplasma_mycoides BX293980 2102 656088 UNKNOWN +Pantoea_ananatis CP001875 553 53335 UNKNOWN +Parabacteroides_distasonis CP000140 823 375288 UNKNOWN +Prochlorococcus_marinus AE017126 1219 1218 UNKNOWN +Proteus_mirabilis CP004022 584 583 UNKNOWN +Pseudomonas_aeruginosa AE004091 287 136841 UNKNOWN +Pseudomonas_putida AP013070 390235 303 UNKNOWN +Pyrobaculum_neutrophilum CP001014 70771 2276 UNKNOWN +Rhodospirillum_rubrum CP000230 1085 1081 UNKNOWN +Rickettsia_prowazekii AJ235269 782 114292 UNKNOWN +Salinibacter_ruber CP000159 146919 146918 UNKNOWN 
+Salmonella_bongori FR877557 54736 590 UNKNOWN +Shewanella_halifaxensis CP000931 271098 22 UNKNOWN +Shewanella_oneidensis AE014299 70863 22 UNKNOWN +Sinorhizobium_fredii CP001389 380 663276 UNKNOWN +Sinorhizobium_medicae CP000738 110321 28105 UNKNOWN +Sinorhizobium_meliloti AL591688 382 28105 UNKNOWN +Staphylococcus_aureus CP009554 46170 1280 UNKNOWN +Staphylococcus_epidermidis AE015929 1282 1279 UNKNOWN +Streptococcus_agalactiae AE009948 1311 1301 UNKNOWN +Streptococcus_mitis FN568063 28037 1301 UNKNOWN +Streptococcus_mutans AE014133 1309 1301 UNKNOWN +Streptococcus_pneumoniae CP000936 487214 1313 UNKNOWN +Streptococcus_pyogenes AE004092 1314 1301 UNKNOWN +Streptococcus_sanguinis CP000387 1305 1301 UNKNOWN +Thermotoga_maritima AE000512 2336 2335 UNKNOWN +Thermus_thermophilus AP008226 274 270 UNKNOWN +Xanthomonas_campestris AE008922 339 338 UNKNOWN +Xylella_fastidiosa CP000941 405440 2371 UNKNOWN + diff --git a/src/provenance/quick-ncbi-ref-check.sh b/src/provenance/quick-ncbi-ref-check.sh new file mode 100644 index 0000000..56f61f2 --- /dev/null +++ b/src/provenance/quick-ncbi-ref-check.sh @@ -0,0 +1,25 @@ +#!/bin/bash +set -e +set -o pipefail +#set -x + +#which datasets dataformat + +genus=$1 +species=$2 +expected=$3 +spreadsheet=assembly-complete.tsv + +acc=($(grep "$genus" "$spreadsheet" | grep "$species" | cut -f 1 | tr '\n' ' ')) +chunk_size=100 + +for ((i=0; i<${#acc[@]}; i+=chunk_size)); do + chunk=("${acc[@]:i:chunk_size}") + accs=$(echo "$chunk" | tr '\n' ',' | sed 's/,$//') + # Join chunk into a comma-separated list + accs=$(IFS=,; echo "${chunk[*]}") + echo "ACCESSIONS: $accs" + datasets summary genome accession $accs --report sequence --as-json-lines | \ + dataformat tsv genome-seq --fields accession,genbank-seq-acc | \ + grep -m 1 "$expected" || true +done \ No newline at end of file diff --git a/src/provenance/sources.tsv b/src/provenance/sources.tsv index 223e124..54f9386 100644 --- a/src/provenance/sources.tsv +++ b/src/provenance/sources.tsv @@ -1,19 +1,17 
@@ scientificName nuccoreAcc taxid parent source Acholeplasma_laidlawii CP000896 2148 2147 NCBI-REF Acinetobacter_baumannii CP045110 509170 470 NCBI-REF -Acinetobacter_pittii CP002177 48296 909768 UNKNOWN -Aeromonas_hydrophila CP000462 644 642 UNKNOWN +Acinetobacter_pittii CP002177 48296 909768 NCBI-REF Agrobacterium_fabrum AE007869 1176649 1183400 NCBI-REF Agrobacterium_fabrum AE007870 1176649 1183400 NCBI-REF Alkaliphilus_transvaalensis CP000724 208226 114627 NCBI-REF Aliivibrio_fischeri CP000021 668 511678 NCBI-REF Aliivibrio_fischeri CP000020 668 511678 NCBI-REF -Amycolatopsis_mediterranei CP002000 33910 1813 UNKNOWN +Amycolatopsis_mediterranei CP002896 33910 1813 NCBI-REF Aquifex_aeolicus AE000657 63363 2713 NCBI-REF -Architeuthis_dux NC_011581 256136 34555 UNKNOWN Arcobacter_bivalviorum CP031217 663364 2321115 NCBI-REF -Arcobacter_ellisii CP032097 913109 28196 UNKNOWN -Arcobacter_cibarius CP043857 255507 28196 UNKNOWN +Arcobacter_ellisii CP032097 913109 28196 SME +Arcobacter_cibarius CP043857 255507 28196 SME Arcobacter_halophilus CP031218 197482 28196 NCBI-REF Arcobacter_molluscorum CP032098 1032072 28196 NCBI-REF Arcobacter_mytili CP031219 603050 28196 NCBI-REF @@ -21,25 +19,24 @@ Arcobacter_skirrowii CP032099 28200 28196 NCBI-REF Arcobacter_suis CP032100 1278212 28196 NCBI-REF Arcobacter_thereius CP035926 544718 28196 NCBI-REF Arcobacter_trophiarum CP031367 708186 28196 NCBI-REF -Arripis_trutta AP006810 270544 163128 UNKNOWN +Arripis_trutta AP006810 270544 163128 SME Atlantibacter_hermannii CP042941 565 1903434 NCBI-GEN:CDC Bacillus_cereus AP007209 1396 86661 SME Bacillus_subtilis AL009126 1423 653685 SME Bacillus_thuringiensis AE017355 1428 86661 SME -Bacteroides_fragilis AP006841 817 816 UNKNOWN -Bacteroides_thetaiotaomicron AE015928 818 816 UNKNOWN +Bacteroides_fragilis CR626927 817 816 NCBI-REF +Bacteroides_thetaiotaomicron CP040530 818 816 NCBI-REF Bartonella_bacilliformis CP000524 774 773 NCBI-REF Betacoronavirus_coronavirus MT233526 2697049 694009 
SME -Bifidobacterium_adolenscentis CP028341 1680 1678 UNKNOWN -Bifidobacterium_bifidum CP001840 1681 1678 UNKNOWN -Bifidobacterium_longum AE014295 216816 1678 UNKNOWN -Bordetella_bronchiseptica HE965806 518 517 UNKNOWN +Bifidobacterium_bifidum CP058603 1681 1678 NCBI-REF +Bifidobacterium_longum AP010888 216816 1678 NCBI-REF +Bordetella_bronchiseptica LR134326 518 517 NCTC3000 Borreliella_burgdorferi AE000783 139 64895 SME -Bos_taurus KC153975 9913 9903 UNKNOWN +Bos_taurus KC153975 9913 9903 SME Brachybacterium_faecium CP001643 43669 43668 UNKNOWN -Bradyrhizobium_diazoefficiens BA000040 1355477 374 UNKNOWN -Brassica_oleracea JF920286 3712 3705 UNKNOWN -Buchnera_aphidicola BA000003 9 32199 UNKNOWN +Bradyrhizobium_diazoefficiens BA000040 1355477 374 NCBI-REF +Brassica_oleracea JF920286 3712 3705 SME +Buchnera_aphidicola BA000003 9 32199 NCBI-REF Burkholderia_pseudomallei BX571965 28450 111527 SME Burkholderia_pseudomallei BX571966 28450 111527 SME Campylobacter_avium CP022347 522484 522485 NCBI-REF @@ -60,10 +57,10 @@ Campylobacter_sputorum CP019683 202 194 SME Campylobacter_subantarcticus CP007772 497724 194 SME Campylobacter_volucris CP007774 1031542 194 SME Campylobacter_ureolyticus CP012195 827 194 SME -Candidatus_Desulforudis_audaxviator CP000860 471827 471826 UNKNOWN +Candidatus_Desulforudis_audaxviator CP000860 471827 471826 NCBI-REF Candidatus_Korarchaeum_cryptofilum CP000968 374847 498846 NCBI-REF -Chlamydia_pneumoniae AE001363 83558 810 UNKNOWN -Caulobacter_vibrioides CP001340 155892 75 UNKNOWN +Chlamydia_pneumoniae AE001363 83558 810 NCBI-REF +Caulobacter_vibrioides CP001340 155892 75 NCBI-REF Chlamydomonas_reinhardtii AF008237 3055 3052 UNKNOWN Chlamydia_trachomatis AE001273 813 810 NCBI-REF Chlorobaculum_tepidum AE006470 1097 256319 NCBI-REF @@ -78,17 +75,17 @@ Clostridium_baratii CP014204 1561 1485 SME Clostridium_botulinum_groupI CP001078 9000005 1491 SME Clostridium_botulinum_groupII CP000939 9000004 1491 SME Clostridium_butyricum CP013239 1492 1485 
SME -Corynebacterium_diphtheriae CP091095 1717 1716 UNKNOWN +Corynebacterium_diphtheriae CP091095 1717 1716 NCBI-REF Clostridium_perfringens CP000312 1502 1485 SME -Corynebacterium_urealyticum AM942444 43771 1716 UNKNOWN +Corynebacterium_urealyticum AM942444 43771 1716 NCBI-REF Corynebacterium_glutamicum BA000036 1718 1716 NCBI-REF -Cronobacter_condimenti CP012264 1163710 413496 UNKNOWN -Coxiella_burnetii AE016828 777 776 UNKNOWN +Cronobacter_condimenti CP012264 1163710 413496 NCBI-REF +Coxiella_burnetii AE016828 777 776 NCBI-REF Cronobacter_dublinensis_dublinensis CP012266 413498 413497 NCBI-REF -Cronobacter_malonaticus CP006731 413503 413496 UNKNOWN +Cronobacter_malonaticus CP006731 413503 413496 NCBI-REF Cronobacter_muytjensii CP012268 413501 413496 NCBI-REF -Cronobacter_sakazakii CP011047 28141 413496 UNKNOWN -Cronobacter_turicensis FN543093 413502 413496 UNKNOWN +Cronobacter_sakazakii CP011047 28141 413496 NCBI-REF +Cronobacter_turicensis FN543093 413502 413496 NCBI-REF Cronobacter_universalis CP012257 535744 413496 NCBI-REF Cryptosporidium_parvum CM000430 353152 5806 SME Cryptosporidium_parvum CM000429 353152 5806 SME @@ -110,8 +107,8 @@ Cupriavidus_taiwanensis CU633750 164546 106589 NCBI-REF Cupriavidus_taiwanensis CU633749 164546 106589 NCBI-REF Cyanothece CP000806 43989 2546360 NCBI-REF Cyanothece CP000807 43989 2546360 NCBI-REF -Deinococcus_radiodurans AE000513 1299 1298 UNKNOWN -Deinococcus_radiodurans AE001825 1299 1298 UNKNOWN +Deinococcus_radiodurans AE000513 1299 1298 NCBI-REF +Deinococcus_radiodurans AE001825 1299 1298 NCBI-REF Delftia_acidovorans CP000884 80866 80865 NCBI-REF Desulfovibrio_vulgaris AE017285 881 872 UNKNOWN Dictyoglomus_turgidum CP001251 513050 13 NCBI-REF @@ -121,40 +118,40 @@ Enterococcus_faecium CP003583 1352 1350 SME Escherichia_albertii CP024282 208962 561 SME Escherichia_coli CP027582 562 561 SME Escherichia_fergusonii CP042945 564 561 SME -Exiguobacterium_antarcticum CP003063 132920 33986 UNKNOWN +Exiguobacterium_antarcticum 
CP003063 132920 33986 NCBI-REF Finegoldia_magna AP008971 1260 150022 SME Francisella_philomiragia CP063138 28110 262 NCBI-REF -Flavobacterium_psychrophilum AM398681 96345 237 UNKNOWN -Francisella_tularensis AJ749949 263 262 UNKNOWN +Flavobacterium_psychrophilum AM398681 96345 237 NCBI-REF +Francisella_tularensis AJ749949 263 262 NCBI-REF Fusobacterium_nucleatum CP028101 851 848 SME -Gallus_gallus HQ857211 9031 9030 UNKNOWN -Gardnerella_vaginalis CP002104 2702 2701 UNKNOWN +Gallus_gallus HQ857211 9031 9030 SME +Gardnerella_vaginalis CP002104 2702 2701 NCBI-REF Geobacter_sulfurreducens AE017180 35554 28231 NCBI-REF Gloeobacter_violaceus BA000045 33072 33071 NCBI-REF Grimontia_hollisae CP035690 673 246861 NCBI-GEN:CDC Grimontia_hollisae CP035691 673 246861 NCBI-GEN:CDC -Haemophilus_influenzae L42023 727 724 UNKNOWN +Haemophilus_influenzae L42023 727 724 NCBI-REF Haemophilus_somnus CP000947 228400 731 NCBI-REF -Halobacterium_salinarum AM774415 478009 2242 UNKNOWN -Helianthus_annuus MG770607 4232 4231 UNKNOWN -Helicobacter_pylori AE000511 210 209 UNKNOWN +Halobacterium_salinarum AM774415 478009 2242 NCBI-REF +Helianthus_annuus MG770607 4232 4231 NCBI-REF +Helicobacter_pylori AE000511 210 209 SME Heliobacterium_modesticaldum CP000930 35701 2697 NCBI-REF -Humulus_lupulus NC_086845 3486 3484 UNKNOWN +Humulus_lupulus NC_086845 3486 3484 SME Homo_sapiens J01415 9606 9605 NCBI-REF -Ketogulonicigenium_vulgare CP002018 92945 92944 UNKNOWN -Klebsiella_aerogenes CP002824 548 570 UNKNOWN +Ketogulonicigenium_vulgare CP002018 92945 92944 NCBI-REF +Klebsiella_aerogenes CP002824 548 570 NCBI-REF Klebsiella_pneumoniae CP003200 573 570 NCBI-REF -Lactobacillus_acidophilus CP000033 1579 1578 UNKNOWN +Lactobacillus_acidophilus CP000033 1579 1578 NCBI-REF Lactobacillus_plantarum AL935263 1590 1578 UNKNOWN Lactobacillus_paracasei CP000423 1597 655183 UNKNOWN -Lactococcus_lactis AE005176 1358 1357 UNKNOWN +Lactococcus_lactis AE005176 1358 1357 NCBI-REF Lactobacillus_salivarius CP000233 1624 
1578 UNKNOWN -Lactuca_sativa MK820672 75943 4235 UNKNOWN -Legionella_pneumophila AE017354 446 445 UNKNOWN +Lactuca_sativa MK820672 75943 4235 SME +Legionella_pneumophila AE017354 446 445 NCBI-REF Leptospira_biflexa CP000777 355278 145259 SME Leptothrix_cholodnii CP001013 34029 88 NCBI-REF -Leptospira_interrogans CP020414 173 171 FDA-ARGOS -Leuconostoc_citreum DQ489736 349519 33964 UNKNOWN +Leptospira_interrogans CP020414 173 171 SME +Leuconostoc_citreum DQ489736 349519 33964 NCBI-REF Listeria_grayi LR134483 1641 1637 NCTC3000 Listeria_innocua CP045743 1642 1637 SME Listeria_ivanovii FR687253 1638 1637 SME @@ -165,83 +162,83 @@ Listeria_monocytogenes_III CP054039 9000002 1639 SME Listeria_monocytogenes_IV CP054041 9000003 1639 SME Listeria_seeligeri FN557490 1640 1637 SME Listeria_welshimeri LT906444 1643 1637 NCTC3000 -Lysinibacillus_sphaericus CP000817 444177 1421 UNKNOWN +Lysinibacillus_sphaericus CP000817 444177 1421 NCBI-REF Mesoplasma_florum AE017263 2151 46239 NCBI-REF -Mesorhizobium_ciceri CP002447 39645 68287 UNKNOWN -Methylobacterium CP000943 426117 2615210 UNKNOWN -Methylobacterium_radiotolerans CP001001 31998 407 UNKNOWN -Micrococcus_luteus CP001628 1270 1269 UNKNOWN +Mesorhizobium_ciceri CP002447 39645 68287 NCBI-REF +Methylobacterium CP000943 426117 2615210 NCBI-REF +Methylobacterium_radiotolerans CP001001 31998 407 NCBI-REF +Micrococcus_luteus CP001628 1270 1269 NCBI-REF Moorella_thermoacetica CP012370 1525 44260 NCBI-REF -Morganella_morganii_morganii CP004345 180434 582 UNKNOWN +Morganella_morganii_morganii CP004345 180434 582 NCBI-REF Mycobacterium_abscessus CU458896 36809 670516 NCBI-REF -Mycobacterium_leprae AL450380 1769 1763 UNKNOWN +Mycobacterium_leprae AL450380 1769 1763 NCBI-REF Mycobacterium_smegmatis CP000480 1772 1866885 UNKNOWN Mycobacterium_tuberculosis AL123456 1773 77643 NCBI-REF -Mycoplasma_mycoides BX293980 2102 656088 UNKNOWN +Mycoplasma_mycoides BX293980 2102 656088 NCBI-REF Mycoplasma_pneumoniae U00089 2104 2093 UNKNOWN 
Neisseria_gonorrhoeae AE004969 485 482 SME Neisseria_meningitidis AE002098 487 482 SME -Neomysis_japonica KR006340 1676841 223649 UNKNOWN +Neomysis_japonica KR006340 1676841 223649 SME Ochrobactrum_anthropi CP000758 529 528 UNKNOWN Ochrobactrum_anthropi CP000759 529 528 UNKNOWN -Pantoea_ananatis CP001875 553 53335 UNKNOWN -Parabacteroides_distasonis CP000140 823 375288 UNKNOWN +Pantoea_ananatis CP001875 553 53335 NCBI-REF +Parabacteroides_distasonis CP000140 823 375288 NCBI-REF Photobacterium_damselae CP046752 38293 657 NCBI-GEN:CDC Photobacterium_damselae CP046751 38293 657 NCBI-GEN:CDC -Pollachius_virens FR751399 8060 8059 UNKNOWN +Pollachius_virens FR751399 8060 8059 SME Polynucleobacter_necessarius LT615228 576610 44013 NCBI-REF -Prochlorococcus_marinus AE017126 1219 1218 UNKNOWN -Proteus_mirabilis CP004022 584 583 UNKNOWN -Pseudomonas_aeruginosa AE004091 287 136841 UNKNOWN -Pseudomonas_putida AP013070 390235 303 UNKNOWN -Pyrobaculum_neutrophilum CP001014 70771 2276 UNKNOWN +Prochlorococcus_marinus AE017126 1219 1218 NCBI-REF +Proteus_mirabilis CP004022 584 583 NCBI-REF +Pseudomonas_aeruginosa AE004091 287 136841 NCBI-REF +Pseudomonas_putida AP013070 390235 303 NCBI-REF +Pyrobaculum_neutrophilum CP001014 70771 2276 NCBI-REF Pseudomonas_syringae_group_genomosp._3 AE016853 251701 136849 NCBI-REF Rhodobacter_sphaeroides CP000144 1063 1060 UNKNOWN Rhodobacter_sphaeroides CP000143 1063 1060 UNKNOWN Rhodopirellula_baltica BX119912 265606 265488 NCBI-REF -Rhodospirillum_rubrum CP000230 1085 1081 UNKNOWN +Rhodospirillum_rubrum CP000230 1085 1081 NCBI-REF Ruminococcus_sp. 
CP039381 1263 541000 NCBI-REF -Rickettsia_prowazekii AJ235269 782 114292 UNKNOWN -Salinibacter_ruber CP000159 146919 146918 UNKNOWN -Salmonella_bongori FR877557 54736 590 UNKNOWN +Rickettsia_prowazekii AJ235269 782 114292 NCBI-REF +Salinibacter_ruber CP000159 146919 146918 NCBI-REF +Salmonella_bongori FR877557 54736 590 NCBI-REF Salmonella_enterica_IIa CP053411 9000010 28901 NCBI-GEN:Centers for Disease Control and Prevention Salmonella_enterica_IIb LR134141 9000011 28901 NCTC3000 -Salmonella_enterica_IIIa CP000880 9000014 28901 UNKNOWN -Salmonella_enterica_IIIb CP053583 9000015 28901 UNKNOWN +Salmonella_enterica_IIIa CP000880 9000014 28901 SME +Salmonella_enterica_IIIb CP053583 9000015 28901 SME Salmonella_enterica_I AE006468 59201 28901 NCBI-REF -Salmonella_enterica_IV CP053579 59205 28901 UNKNOWN -Salmonella_enterica_IX CP054715 9000016 28901 UNKNOWN -Salmonella_enterica_VII CP053582 59208 28901 UNKNOWN +Salmonella_enterica_IV CP053579 59205 28901 SME +Salmonella_enterica_IX CP054715 9000016 28901 SME +Salmonella_enterica_VII CP053582 59208 28901 SME Salmonella_enterica_VIII CP053318 9000009 28901 NCBI-GEN:EDLB-CDC Salmonella_enterica_VI CP053406 59207 28901 NCBI-GEN:Centers for Disease Control and Prevention -Salmonella_enterica_X CP053581 9000017 28901 UNKNOWN +Salmonella_enterica_X CP053581 9000017 28901 SME Serratia_liquefaciens CP006252 614 613 NCBI-REF -Shewanella_halifaxensis CP000931 271098 22 UNKNOWN -Shewanella_oneidensis AE014299 70863 22 UNKNOWN +Shewanella_halifaxensis CP000931 271098 22 NCBI-REF +Shewanella_oneidensis AE014299 70863 22 NCBI-REF Shewanella_woodyi CP000961 60961 22 NCBI-REF Shimwellia_blattae CP001560 563 1335483 UNKNOWN -Sinorhizobium_fredii CP001389 380 663276 UNKNOWN -Sinorhizobium_medicae CP000738 110321 28105 UNKNOWN -Sinorhizobium_meliloti AL591688 382 28105 UNKNOWN -Solanum_lycopersicum MF034192 4081 49274 UNKNOWN -Staphylococcus_aureus CP009554 46170 1280 UNKNOWN -Staphylococcus_epidermidis AE015929 1282 1279 UNKNOWN 
-Streptococcus_agalactiae AE009948 1311 1301 UNKNOWN -Streptococcus_mitis FN568063 28037 1301 UNKNOWN -Streptococcus_mutans AE014133 1309 1301 UNKNOWN -Streptococcus_pneumoniae CP000936 487214 1313 UNKNOWN -Streptococcus_pyogenes AE004092 1314 1301 UNKNOWN -Streptococcus_sanguinis CP000387 1305 1301 UNKNOWN -Sus_scrofa FJ236999 9823 9822 UNKNOWN +Sinorhizobium_fredii CP001389 380 663276 NCBI-REF +Sinorhizobium_medicae CP000738 110321 28105 NCBI-REF +Sinorhizobium_meliloti AL591688 382 28105 NCBI-REF +Solanum_lycopersicum MF034192 4081 49274 SME +Staphylococcus_aureus CP009554 46170 1280 NCBI-REF +Staphylococcus_epidermidis AE015929 1282 1279 NCBI-REF +Streptococcus_agalactiae AE009948 1311 1301 NCBI-REF +Streptococcus_mitis FN568063 28037 1301 NCBI-REF +Streptococcus_mutans AE014133 1309 1301 NCBI-REF +Streptococcus_pneumoniae CP000936 487214 1313 NCBI-REF +Streptococcus_pyogenes AE004092 1314 1301 NCBI-REF +Streptococcus_sanguinis CP000387 1305 1301 NCBI-REF +Sus_scrofa FJ236999 9823 9822 SME Streptococcus_suis FM252032 1307 1301 NCBI-REF Synechococcus CP000951 32049 2626047 UNKNOWN Thermanaerovibrio_acidaminovorans CP001818 81462 81461 NCBI-REF Thermoanaerobacter_pseudethanolicus CP000924 496866 1754 NCBI-REF Thermodesulfovibrio_yellowstonii CP001147 28262 28261 NCBI-REF Thermosynechococcus_elongatus BA000039 146786 146785 NCBI-REF -Thermotoga_maritima AE000512 2336 2335 UNKNOWN -Thermus_thermophilus AP008226 274 270 UNKNOWN -Thunnus_alalunga AB101291 8235 8234 UNKNOWN +Thermotoga_maritima AE000512 2336 2335 NCBI-REF +Thermus_thermophilus AP008226 274 270 NCBI-REF +Thunnus_alalunga AB101291 8235 8234 SME Treponema_denticola AE017226 158 157 NCBI-REF Ureaplasma_parvum CP000942 134821 2129 NCBI-REF Vibrio_alginolyticus CP035699 663 717610 NCBI-GEN:CDC @@ -270,9 +267,9 @@ Vibrio_parahaemolyticus CP046828 670 717610 NCBI-GEN:CDC Vibrio_parahaemolyticus CP046827 670 717610 NCBI-GEN:CDC Vibrio_vulnificus CP046832 672 662 NCBI-GEN:CDC Vibrio_vulnificus CP046833 672 662 
NCBI-GEN:CDC -Vicia_faba KC189947 3906 3904 UNKNOWN -Xanthomonas_campestris AE008922 339 338 UNKNOWN -Xylella_fastidiosa CP000941 405440 2371 UNKNOWN +Vicia_faba KC189947 3906 3904 SME +Xanthomonas_campestris AE008922 339 338 NCBI-REF +Xylella_fastidiosa CP000941 405440 2371 NCBI-REF Yersinia_aldovae CP009781 29483 629 NCBI-REF Yersinia_bercovieri CP054044 634 629 SME Yersinia_enterocolitica CP002246 630 629 SME diff --git a/src/provenance/unknown.tsv b/src/provenance/unknown.tsv index f4dd02a..fe872a9 100644 --- a/src/provenance/unknown.tsv +++ b/src/provenance/unknown.tsv @@ -1,108 +1,14 @@ -Acinetobacter_pittii CP002177 48296 909768 UNKNOWN -Aeromonas_hydrophila CP000462 644 642 UNKNOWN -Amycolatopsis_mediterranei CP002000 33910 1813 UNKNOWN -Architeuthis_dux NC_011581 256136 34555 UNKNOWN -Arcobacter_ellisii CP032097 913109 28196 UNKNOWN -Arcobacter_cibarius CP043857 255507 28196 UNKNOWN -Arripis_trutta AP006810 270544 163128 UNKNOWN -Bacteroides_fragilis AP006841 817 816 UNKNOWN -Bacteroides_thetaiotaomicron AE015928 818 816 UNKNOWN -Bifidobacterium_adolenscentis CP028341 1680 1678 UNKNOWN -Bifidobacterium_bifidum CP001840 1681 1678 UNKNOWN -Bifidobacterium_longum AE014295 216816 1678 UNKNOWN -Bordetella_bronchiseptica HE965806 518 517 UNKNOWN -Bos_taurus KC153975 9913 9903 UNKNOWN Brachybacterium_faecium CP001643 43669 43668 UNKNOWN -Bradyrhizobium_diazoefficiens BA000040 1355477 374 UNKNOWN -Brassica_oleracea JF920286 3712 3705 UNKNOWN -Buchnera_aphidicola BA000003 9 32199 UNKNOWN -Candidatus_Desulforudis_audaxviator CP000860 471827 471826 UNKNOWN -Chlamydia_pneumoniae AE001363 83558 810 UNKNOWN -Caulobacter_vibrioides CP001340 155892 75 UNKNOWN Chlamydomonas_reinhardtii AF008237 3055 3052 UNKNOWN -Corynebacterium_diphtheriae CP091095 1717 1716 UNKNOWN -Corynebacterium_urealyticum AM942444 43771 1716 UNKNOWN -Cronobacter_condimenti CP012264 1163710 413496 UNKNOWN -Coxiella_burnetii AE016828 777 776 UNKNOWN -Cronobacter_malonaticus CP006731 413503 413496 
UNKNOWN -Cronobacter_sakazakii CP011047 28141 413496 UNKNOWN -Cronobacter_turicensis FN543093 413502 413496 UNKNOWN -Deinococcus_radiodurans AE000513 1299 1298 UNKNOWN -Deinococcus_radiodurans AE001825 1299 1298 UNKNOWN Desulfovibrio_vulgaris AE017285 881 872 UNKNOWN -Exiguobacterium_antarcticum CP003063 132920 33986 UNKNOWN -Flavobacterium_psychrophilum AM398681 96345 237 UNKNOWN -Francisella_tularensis AJ749949 263 262 UNKNOWN -Gallus_gallus HQ857211 9031 9030 UNKNOWN -Gardnerella_vaginalis CP002104 2702 2701 UNKNOWN -Haemophilus_influenzae L42023 727 724 UNKNOWN -Halobacterium_salinarum AM774415 478009 2242 UNKNOWN -Helianthus_annuus MG770607 4232 4231 UNKNOWN -Helicobacter_pylori AE000511 210 209 UNKNOWN -Humulus_lupulus NC_086845 3486 3484 UNKNOWN -Ketogulonicigenium_vulgare CP002018 92945 92944 UNKNOWN -Klebsiella_aerogenes CP002824 548 570 UNKNOWN -Lactobacillus_acidophilus CP000033 1579 1578 UNKNOWN Lactobacillus_plantarum AL935263 1590 1578 UNKNOWN Lactobacillus_paracasei CP000423 1597 655183 UNKNOWN -Lactococcus_lactis AE005176 1358 1357 UNKNOWN Lactobacillus_salivarius CP000233 1624 1578 UNKNOWN -Lactuca_sativa MK820672 75943 4235 UNKNOWN -Legionella_pneumophila AE017354 446 445 UNKNOWN -Leuconostoc_citreum DQ489736 349519 33964 UNKNOWN -Lysinibacillus_sphaericus CP000817 444177 1421 UNKNOWN -Mesorhizobium_ciceri CP002447 39645 68287 UNKNOWN -Methylobacterium CP000943 426117 2615210 UNKNOWN -Methylobacterium_radiotolerans CP001001 31998 407 UNKNOWN -Micrococcus_luteus CP001628 1270 1269 UNKNOWN -Morganella_morganii_morganii CP004345 180434 582 UNKNOWN -Mycobacterium_leprae AL450380 1769 1763 UNKNOWN Mycobacterium_smegmatis CP000480 1772 1866885 UNKNOWN -Mycoplasma_mycoides BX293980 2102 656088 UNKNOWN Mycoplasma_pneumoniae U00089 2104 2093 UNKNOWN -Neomysis_japonica KR006340 1676841 223649 UNKNOWN Ochrobactrum_anthropi CP000758 529 528 UNKNOWN Ochrobactrum_anthropi CP000759 529 528 UNKNOWN -Pantoea_ananatis CP001875 553 53335 UNKNOWN 
-Parabacteroides_distasonis CP000140 823 375288 UNKNOWN -Pollachius_virens FR751399 8060 8059 UNKNOWN -Prochlorococcus_marinus AE017126 1219 1218 UNKNOWN -Proteus_mirabilis CP004022 584 583 UNKNOWN -Pseudomonas_aeruginosa AE004091 287 136841 UNKNOWN -Pseudomonas_putida AP013070 390235 303 UNKNOWN -Pyrobaculum_neutrophilum CP001014 70771 2276 UNKNOWN Rhodobacter_sphaeroides CP000144 1063 1060 UNKNOWN Rhodobacter_sphaeroides CP000143 1063 1060 UNKNOWN -Rhodospirillum_rubrum CP000230 1085 1081 UNKNOWN -Rickettsia_prowazekii AJ235269 782 114292 UNKNOWN -Salinibacter_ruber CP000159 146919 146918 UNKNOWN -Salmonella_bongori FR877557 54736 590 UNKNOWN -Salmonella_enterica_IIIa CP000880 9000014 28901 UNKNOWN -Salmonella_enterica_IIIb CP053583 9000015 28901 UNKNOWN -Salmonella_enterica_IV CP053579 59205 28901 UNKNOWN -Salmonella_enterica_IX CP054715 9000016 28901 UNKNOWN -Salmonella_enterica_VII CP053582 59208 28901 UNKNOWN -Salmonella_enterica_X CP053581 9000017 28901 UNKNOWN -Shewanella_halifaxensis CP000931 271098 22 UNKNOWN -Shewanella_oneidensis AE014299 70863 22 UNKNOWN Shimwellia_blattae CP001560 563 1335483 UNKNOWN -Sinorhizobium_fredii CP001389 380 663276 UNKNOWN -Sinorhizobium_medicae CP000738 110321 28105 UNKNOWN -Sinorhizobium_meliloti AL591688 382 28105 UNKNOWN -Solanum_lycopersicum MF034192 4081 49274 UNKNOWN -Staphylococcus_aureus CP009554 46170 1280 UNKNOWN -Staphylococcus_epidermidis AE015929 1282 1279 UNKNOWN -Streptococcus_agalactiae AE009948 1311 1301 UNKNOWN -Streptococcus_mitis FN568063 28037 1301 UNKNOWN -Streptococcus_mutans AE014133 1309 1301 UNKNOWN -Streptococcus_pneumoniae CP000936 487214 1313 UNKNOWN -Streptococcus_pyogenes AE004092 1314 1301 UNKNOWN -Streptococcus_sanguinis CP000387 1305 1301 UNKNOWN -Sus_scrofa FJ236999 9823 9822 UNKNOWN Synechococcus CP000951 32049 2626047 UNKNOWN -Thermotoga_maritima AE000512 2336 2335 UNKNOWN -Thermus_thermophilus AP008226 274 270 UNKNOWN -Thunnus_alalunga AB101291 8235 8234 UNKNOWN -Vicia_faba KC189947 
3906 3904 UNKNOWN -Xanthomonas_campestris AE008922 339 338 UNKNOWN -Xylella_fastidiosa CP000941 405440 2371 UNKNOWN
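
Review note on `quick-ncbi-ref-check.sh`: the script walks the accession array in chunks of `chunk_size` and joins each chunk with commas before querying. Below is a minimal sketch of that chunk-and-join step in isolation, assuming bash 4 or later. The helper name `join_chunks` and the placeholder items are hypothetical and are not part of this patch; the sketch makes no calls to `datasets` or `dataformat`.

```bash
#!/bin/bash
# Sketch only: chunked iteration over a bash array, joined with commas.
set -e
set -o pipefail

# join_chunks CHUNK_SIZE ITEM...
# Hypothetical helper (not in the repository). Prints one comma-separated
# line per chunk of at most CHUNK_SIZE items.
join_chunks() {
  local chunk_size=$1
  shift
  local items=("$@")
  local i
  # ${#items[@]} is the number of elements in the array;
  # ${#items} would be the string length of the first element only.
  for ((i = 0; i < ${#items[@]}; i += chunk_size)); do
    local chunk=("${items[@]:i:chunk_size}")
    # Setting IFS inside a subshell leaves the caller's IFS untouched.
    (IFS=,; echo "${chunk[*]}")
  done
}

# Placeholder items, not real accessions.
join_chunks 2 ACC_A ACC_B ACC_C
```

With a chunk size of 2 and three items, the call prints `ACC_A,ACC_B` and then `ACC_C`. Each printed line could then be passed as the comma-separated accession argument that the script already builds in `accs`.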