Commit

another round of provenance
lskatz committed Nov 4, 2024
1 parent f7728af commit 24be76d
Showing 10 changed files with 236 additions and 200 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@ edirect
share
paper/paper.html
paper/paper.doc
# pixi environments
.pixi
*.egg-info
6 changes: 4 additions & 2 deletions paper/mra.md
@@ -47,7 +49,9 @@ Kalamari also contains a custom taxonomy and software for downloading and format

## Announcement

Public Health laboratories sequence microbial pathogens daily for genomic epidemiology, i.e., to track pathogen spread [@armstrong2019pathogen].
Public Health laboratories sequence microbial pathogens daily for many applications including genomic epidemiology [@armstrong2019pathogen],
species identification [@lindsey2023rapid],
and metagenomic analysis [@huang2017metagenomics].
Relevant databases exist such as RefSeq [@o2016reference] or The Genome Taxonomy Database (GTDB) [@parks2022gtdb].
However, due to their comprehensive nature,
they are disadvantageous for our specific purposes:
@@ -64,7 +66,7 @@ All chromosomes and plasmids are complete, i.e., no contig breaks,
and obtained from trusted sources, e.g., FDA-ARGOS [@sichtig2019fda] or the NCTC 3000 collection [@dicks2023nctc3000], or provided and reviewed by a CDC subject matter expert.

We obtained the list of plasmids from the Mob-Suite project [@robertsonMobsuite]
and clustered them at 97% average nucleotide identity (ANI) [@lindsey2023rapid].
and clustered them at 97% average nucleotide identity using edlb_ani_mummer v1 with default options [@lindsey2023rapid].
For each cluster, the taxonomy identifier was raised to the lowest common tier of taxonomy.
For example, if a cluster of plasmids were identified in both _Escherichia coli_ and _Salmonella enterica_, then taxonomy identifiers for all the plasmids in the cluster were changed to their common family, _Enterobacteriaceae_.
As a result, any taxonomic signature from these plasmids
10 changes: 10 additions & 0 deletions src/provenance/README.md
@@ -57,6 +57,16 @@ _NOTE_ This command probably would have saved me time if I turned it into a batc
datasets summary genome accession GCF_000006945.2 --report sequence --as-json-lines | dataformat tsv genome-seq --fields genbank-seq-acc
```

Check the assemblies spreadsheet to see whether or not a genome is retrospectively part of the NCBI references.

```bash
zcat assembly_summary_genbank.txt.gz | perl -F'\t' -lane 'print if($F[11] eq "Complete Genome" || $.==1);' > tmp.tsv
mv tmp.tsv assembly-complete.tsv
bash quick-ncbi-ref-check.sh Caulobacter vibrioides CP001340
# If found, add to ncbi_ref.acc.more
# If not found, see if there is a different reference to add
```
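
The filter step above keeps the first line plus every row whose twelfth tab-separated column is `Complete Genome`. A minimal sketch of the same idea with `awk`, run on made-up rows, may help when checking the column index. The helper names `filter_complete` and `mock_row` and the `GCA_MOCK*` accessions are hypothetical and not part of this repository.

```bash
#!/bin/bash
# Sketch only: keep line 1 plus rows whose 12th tab-separated field
# is exactly "Complete Genome". filter_complete is a hypothetical helper.
filter_complete() {
  awk -F'\t' 'NR==1 || $12=="Complete Genome"'
}

# Build a mock 12-column row; fields 2-11 are "." placeholders.
# $1 = value for field 1, $2 = value for field 12
mock_row() {
  awk -v acc="$1" -v level="$2" 'BEGIN {
    row = acc
    for (i = 2; i <= 11; i++) row = row "\t."
    print row "\t" level
  }'
}

# Made-up input: a header row, one complete genome, one scaffold-level assembly
{
  mock_row "assembly_accession" "assembly_level"
  mock_row "GCA_MOCK1" "Complete Genome"
  mock_row "GCA_MOCK2" "Scaffold"
} | filter_complete | cut -f 1
```

The scaffold-level row is dropped, while the first line always passes through regardless of its content.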

## Translate

Translate certain entries into CDC, NCTC3000, FDA, or the NCBI list.
28 changes: 28 additions & 0 deletions src/provenance/SME.acc
@@ -5,6 +5,8 @@ Betacoronavirus_coronavirus MT233526 2697049 694009 UNKNOWN
Borreliella_burgdorferi AE000783 139 64895 UNKNOWN
Burkholderia_pseudomallei BX571965 28450 111527 UNKNOWN
Burkholderia_pseudomallei BX571966 28450 111527 UNKNOWN

# A ton of Campy and related genomes from Jim Bono
Campylobacter_coli CP028187 195 194 UNKNOWN
Campylobacter_concisus CP012541 199 194 NCBI-REF
Campylobacter_cuniculorum CP020867 374106 194 NCBI-REF
@@ -22,6 +24,9 @@ Campylobacter_sputorum CP019683 202 194 UNKNOWN
Campylobacter_subantarcticus CP007772 497724 194 UNKNOWN
Campylobacter_volucris CP007774 1031542 194 UNKNOWN
Campylobacter_ureolyticus CP012195 827 194 UNKNOWN
Arcobacter_ellisii CP032097 913109 28196 UNKNOWN
Arcobacter_cibarius CP043857 255507 28196 UNKNOWN
Helicobacter_pylori AE000511 210 209 UNKNOWN

# We trust Segre's lab for Citrobacter
Citrobacter_freundii CP007557 546 544 UNKNOWN
@@ -86,8 +91,31 @@ Yersinia_massiliensis CP054048 33060 629 UNKNOWN
Yersinia_pseudotuberculosis CP009712 502800 633 UNKNOWN


Salmonella_enterica_IIIa CP000880 9000014 28901 UNKNOWN
Salmonella_enterica_IIIb CP053583 9000015 28901 UNKNOWN
Salmonella_enterica_IV CP053579 59205 28901 UNKNOWN
Salmonella_enterica_IX CP054715 9000016 28901 UNKNOWN
Salmonella_enterica_VII CP053582 59208 28901 UNKNOWN
Salmonella_enterica_X CP053581 9000017 28901 UNKNOWN


Neisseria_gonorrhoeae AE004969 485 482 UNKNOWN
Neisseria_meningitidis AE002098 487 482 UNKNOWN

# Gulvik
Leptospira_biflexa CP000777 355278 145259 UNKNOWN
Leptospira_interrogans CP020414 173 171 UNKNOWN

# Eukie hosts
Bos_taurus KC153975 9913 9903 UNKNOWN
Arripis_trutta AP006810 270544 163128 UNKNOWN
Brassica_oleracea JF920286 3712 3705 UNKNOWN
Gallus_gallus HQ857211 9031 9030 UNKNOWN
Humulus_lupulus NC_086845 3486 3484 UNKNOWN
Lactuca_sativa MK820672 75943 4235 UNKNOWN
Neomysis_japonica KR006340 1676841 223649 UNKNOWN
Pollachius_virens FR751399 8060 8059 UNKNOWN
Solanum_lycopersicum MF034192 4081 49274 UNKNOWN
Sus_scrofa FJ236999 9823 9822 UNKNOWN
Thunnus_alalunga AB101291 8235 8234 UNKNOWN
Vicia_faba KC189947 3906 3904 UNKNOWN
15 changes: 6 additions & 9 deletions src/provenance/chromosomes.insdc.tsv
@@ -2,15 +2,13 @@ scientificName nuccoreAcc taxid parent
Acholeplasma_laidlawii CP000896 2148 2147
Acinetobacter_baumannii CP045110 509170 470
Acinetobacter_pittii CP002177 48296 909768
Aeromonas_hydrophila CP000462 644 642
Agrobacterium_fabrum AE007869 1176649 1183400
Agrobacterium_fabrum AE007870 1176649 1183400
Alkaliphilus_transvaalensis CP000724 208226 114627
Aliivibrio_fischeri CP000021 668 511678
Aliivibrio_fischeri CP000020 668 511678
Amycolatopsis_mediterranei CP002000 33910 1813
Amycolatopsis_mediterranei CP002896 33910 1813
Aquifex_aeolicus AE000657 63363 2713
Architeuthis_dux NC_011581 256136 34555
Arcobacter_bivalviorum CP031217 663364 2321115
Arcobacter_ellisii CP032097 913109 28196
Arcobacter_cibarius CP043857 255507 28196
@@ -26,14 +24,13 @@ Atlantibacter_hermannii CP042941 565 1903434
Bacillus_cereus AP007209 1396 86661
Bacillus_subtilis AL009126 1423 653685
Bacillus_thuringiensis AE017355 1428 86661
Bacteroides_fragilis AP006841 817 816
Bacteroides_thetaiotaomicron AE015928 818 816
Bacteroides_fragilis CR626927 817 816
Bacteroides_thetaiotaomicron CP040530 818 816
Bartonella_bacilliformis CP000524 774 773
Betacoronavirus_coronavirus MT233526 2697049 694009
Bifidobacterium_adolenscentis CP028341 1680 1678
Bifidobacterium_bifidum CP001840 1681 1678
Bifidobacterium_longum AE014295 216816 1678
Bordetella_bronchiseptica HE965806 518 517
Bifidobacterium_bifidum CP058603 1681 1678
Bifidobacterium_longum AP010888 216816 1678
Bordetella_bronchiseptica LR134326 518 517
Borreliella_burgdorferi AE000783 139 64895
Bos_taurus KC153975 9913 9903
Brachybacterium_faecium CP001643 43669 43668
Binary file modified src/provenance/ncbi_ref.acc.gz
Binary file not shown.
68 changes: 68 additions & 0 deletions src/provenance/ncbi_ref.acc.more
@@ -0,0 +1,68 @@
# I'll just cat these to ncbi_ref.acc.gz later
BA000040.2
Buchnera_aphidicola BA000003 9 32199 UNKNOWN
Candidatus_Desulforudis_audaxviator CP000860 471827 471826 UNKNOWN
Chlamydia_pneumoniae AE001363 83558 810 UNKNOWN
Caulobacter_vibrioides CP001340 155892 75 UNKNOWN
Corynebacterium_diphtheriae CP091095 1717 1716 UNKNOWN
Corynebacterium_urealyticum AM942444 43771 1716 UNKNOWN
Cronobacter_condimenti CP012264 1163710 413496 UNKNOWN
Cronobacter_malonaticus CP006731 413503 413496 UNKNOWN
Cronobacter_sakazakii CP011047 28141 413496 UNKNOWN
Cronobacter_turicensis FN543093 413502 413496 UNKNOWN
Coxiella_burnetii AE016828 777 776 UNKNOWN
Cronobacter_malonaticus CP006731 413503 413496 UNKNOWN
Cronobacter_sakazakii CP011047 28141 413496 UNKNOWN
Cronobacter_turicensis FN543093 413502 413496 UNKNOWN
Deinococcus_radiodurans AE000513 1299 1298 UNKNOWN
Deinococcus_radiodurans AE001825 1299 1298 UNKNOWN
Exiguobacterium_antarcticum CP003063 132920 33986 UNKNOWN
Flavobacterium_psychrophilum AM398681 96345 237 UNKNOWN
Francisella_tularensis AJ749949 263 262 UNKNOWN
Gardnerella_vaginalis CP002104 2702 2701 UNKNOWN
Haemophilus_influenzae L42023 727 724 UNKNOWN
Halobacterium_salinarum AM774415 478009 2242 UNKNOWN
Helianthus_annuus MG770607 4232 4231 UNKNOWN
Ketogulonicigenium_vulgare CP002018 92945 92944 UNKNOWN
Klebsiella_aerogenes CP002824 548 570 UNKNOWN
Lactobacillus_acidophilus CP000033 1579 1578 UNKNOWN
Lactococcus_lactis AE005176 1358 1357 UNKNOWN
Legionella_pneumophila AE017354 446 445 UNKNOWN
Leuconostoc_citreum DQ489736 349519 33964 UNKNOWN
Lysinibacillus_sphaericus CP000817 444177 1421 UNKNOWN
Mesorhizobium_ciceri CP002447 39645 68287 UNKNOWN
Methylobacterium CP000943 426117 2615210 UNKNOWN
Methylobacterium_radiotolerans CP001001 31998 407 UNKNOWN
Micrococcus_luteus CP001628 1270 1269 UNKNOWN
Morganella_morganii_morganii CP004345 180434 582 UNKNOWN
Mycobacterium_leprae AL450380 1769 1763 UNKNOWN
Mycoplasma_mycoides BX293980 2102 656088 UNKNOWN
Pantoea_ananatis CP001875 553 53335 UNKNOWN
Parabacteroides_distasonis CP000140 823 375288 UNKNOWN
Prochlorococcus_marinus AE017126 1219 1218 UNKNOWN
Proteus_mirabilis CP004022 584 583 UNKNOWN
Pseudomonas_aeruginosa AE004091 287 136841 UNKNOWN
Pseudomonas_putida AP013070 390235 303 UNKNOWN
Pyrobaculum_neutrophilum CP001014 70771 2276 UNKNOWN
Rhodospirillum_rubrum CP000230 1085 1081 UNKNOWN
Rickettsia_prowazekii AJ235269 782 114292 UNKNOWN
Salinibacter_ruber CP000159 146919 146918 UNKNOWN
Salmonella_bongori FR877557 54736 590 UNKNOWN
Shewanella_halifaxensis CP000931 271098 22 UNKNOWN
Shewanella_oneidensis AE014299 70863 22 UNKNOWN
Sinorhizobium_fredii CP001389 380 663276 UNKNOWN
Sinorhizobium_medicae CP000738 110321 28105 UNKNOWN
Sinorhizobium_meliloti AL591688 382 28105 UNKNOWN
Staphylococcus_aureus CP009554 46170 1280 UNKNOWN
Staphylococcus_epidermidis AE015929 1282 1279 UNKNOWN
Streptococcus_agalactiae AE009948 1311 1301 UNKNOWN
Streptococcus_mitis FN568063 28037 1301 UNKNOWN
Streptococcus_mutans AE014133 1309 1301 UNKNOWN
Streptococcus_pneumoniae CP000936 487214 1313 UNKNOWN
Streptococcus_pyogenes AE004092 1314 1301 UNKNOWN
Streptococcus_sanguinis CP000387 1305 1301 UNKNOWN
Thermotoga_maritima AE000512 2336 2335 UNKNOWN
Thermus_thermophilus AP008226 274 270 UNKNOWN
Xanthomonas_campestris AE008922 339 338 UNKNOWN
Xylella_fastidiosa CP000941 405440 2371 UNKNOWN

25 changes: 25 additions & 0 deletions src/provenance/quick-ncbi-ref-check.sh
@@ -0,0 +1,25 @@
#!/bin/bash
set -e
set -o pipefail
#set -x

#which datasets dataformat

genus=$1
species=$2
expected=$3
spreadsheet=assembly-complete.tsv

# Assembly accessions (column 1) of rows that mention both genus and species
acc=($(grep "$genus" "$spreadsheet" | grep "$species" | cut -f 1))
chunk_size=100

# ${#acc[@]} is the number of array elements; ${#acc} would only be
# the string length of the first element
for ((i=0; i<${#acc[@]}; i+=chunk_size)); do
  chunk=("${acc[@]:i:chunk_size}")
  # Join chunk into a comma-separated list
  accs=$(IFS=,; echo "${chunk[*]}")
  echo "ACCESSIONS: $accs"
  datasets summary genome accession "$accs" --report sequence --as-json-lines | \
    dataformat tsv genome-seq --fields accession,genbank-seq-acc | \
    grep -m 1 "$expected" || true
done
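
The loop in this script walks an accession array in fixed-size chunks and joins each chunk with commas. A minimal sketch of that pattern in isolation follows; `join_chunks` is a hypothetical helper and the `ACC_*` values are placeholders, not real accessions. Note that in bash `${#items[@]}` is the element count, whereas `${#items}` is the string length of the first element.

```bash
#!/bin/bash
# Sketch only: print an argument list as comma-joined chunks, one per line.
# join_chunks is a hypothetical helper, not part of this repository.
join_chunks() {
  local chunk_size=$1
  shift
  local items=("$@")
  local i chunk
  for ((i = 0; i < ${#items[@]}; i += chunk_size)); do
    # Slice out up to chunk_size elements starting at index i
    chunk=("${items[@]:i:chunk_size}")
    # IFS is changed only inside the subshell, so the caller's IFS is untouched
    (IFS=,; echo "${chunk[*]}")
  done
}

# Placeholder accessions, two per chunk
join_chunks 2 ACC_1 ACC_2 ACC_3 ACC_4 ACC_5
```

The last chunk may be shorter than `chunk_size`, and an empty argument list produces no output.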