provenance
lskatz committed Oct 22, 2024
1 parent b841e88 commit fb1155d
Showing 19 changed files with 120,553 additions and 0 deletions.
69 changes: 69 additions & 0 deletions src/provenance/README.md
@@ -0,0 +1,69 @@
# Provenance

We need to record the provenance of each entry, as requested by reviewer 2.

## Convert to INSDC notation

```bash
module load Entrez
cat chromosomes.tsv | tail -n +2 | sed 's/ /_/g' | xargs -L 1 -P 2 bash -c 'insdc=$(esearch -db nuccore -query $1 | elink -target nuccore -name nuccore_nuccore_rsgb | efetch -format acc); if [ ! "$insdc" ]; then insdc="$1.1"; fi; insdc=${insdc%.*}; echo -e "$0\t$insdc\t$2\t$3"' > chromosomes.insdc.tsv
```
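The fallback and version-stripping steps buried in the one-liner above are easier to follow in isolation. Here is a minimal sketch, assuming a hypothetical helper named `normalize_acc` (it is not part of this repository). It makes no network calls and only mirrors the `if [ ! "$insdc" ]` and `${insdc%.*}` steps; the accessions passed in are illustrative inputs.

```bash
# Hypothetical helper mirroring the one-liner's post-processing.
# $1: the accession that was queried
# $2: the INSDC accession returned by Entrez (may be empty)
normalize_acc() {
  na_query=$1
  na_insdc=$2
  # No INSDC counterpart found: fall back to the query plus a dummy version
  if [ -z "$na_insdc" ]; then
    na_insdc="${na_query}.1"
  fi
  # Strip the trailing version suffix, e.g. AL111168.1 -> AL111168
  printf '%s\n' "${na_insdc%.*}"
}

# Illustrative calls: a hit from Entrez, then an empty result
normalize_acc NC_002163 AL111168.1
normalize_acc AL111168 ""
```

Both calls print `AL111168`, which is why the fallback appends `.1` before stripping: an unversioned query passes through unchanged.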

## Get the sources

Get sources of chromosomes

```bash
module load Entrez
grep -v taxid chromosomes.insdc.tsv | tail -n +2 | cut -f 2 | xargs -I {} sh -c 'esearch -db nuccore -query "{}" | efetch -format xml | xtract -pattern Bioseq-set -element Textseq-id_accession -block Auth-list_affil -element Affil_std_affil' > ncbi_general.tsv
```

Get the NCTC collection

```bash
esearch -db bioproject -query PRJEB6403 | elink -target assembly | elink -target nuccore | efetch -format acc > nctc3000.acc
```

Get the FDA-ARGOS collection

```bash
esearch -db bioproject -query PRJNA231221 | elink -target assembly | elink -target nuccore | efetch -format acc > fda-argos.acc
```

Get the NCBI reference genomes list: I went to <https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=1&reference_only=true>
and downloaded the whole list (38603 genomes at the time)
as a spreadsheet, `ncbi_reference_genomes.txt`.
I then converted it with `dos2unix` and changed the extension to `.tsv`.

Then convert the spreadsheet to a list of accessions.

```bash
cut -f 1 ncbi_reference_genomes.tsv | xargs -P 1 -n 1 bash -c 'esearch -db assembly -query $0 | elink -target nuccore | efetch -format acc' > ncbi.acc
```

Convert the NCBI references into INSDC contigs.
⚠ This took about three days to finish because of all the API calls, and it produced a raw 89M file.
Sorting and gzipping reduced it to 18M.

```bash
tail -n +2 ncbi_reference_genomes.tsv | cut -f 1 | xargs -n 1 -P 1 bash -c 'insdc=$(esearch -db assembly -query $0 | elink -target nuccore -name assembly_nuccore_insdc | efetch -format accn | tr "\n" "\t"); echo -e "$0\t$insdc";' | tee ncbi_ref.acc
sort ncbi_ref.acc | gzip -c9 > ncbi_ref.acc.gz && \
rm -v ncbi_ref.acc
```

_NOTE_ The following command would probably have saved me time if I had turned it into a batch query, so I'm jotting it down.

```bash
datasets summary genome accession GCF_000006945.2 --report sequence --as-json-lines | dataformat tsv genome-seq --fields genbank-seq-acc
```
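One way the batch query mentioned in the note could look is sketched below. This is an untested assumption, not what was run for this commit: the helper names `make_accession_list` and `batch_genbank_accessions` and the output file names are hypothetical, and the sketch assumes the installed `datasets` CLI accepts `--inputfile` (check `datasets summary genome accession --help` first).

```bash
# Hypothetical helper: pull the assembly accessions (column 1) out of the
# downloaded spreadsheet, skipping the header row and dropping duplicates.
make_accession_list() {
  tail -n +2 "$1" | cut -f 1 | sort -u
}

# Hypothetical wrapper: one batched request instead of one call per genome.
# Assumption: this datasets version supports --inputfile for accession lists.
batch_genbank_accessions() {
  datasets summary genome accession --inputfile "$1" \
    --report sequence --as-json-lines |
    dataformat tsv genome-seq --fields genbank-seq-acc
}

# Usage (needs network access and the NCBI datasets/dataformat tools):
#   make_accession_list ncbi_reference_genomes.tsv > assemblies.txt
#   batch_genbank_accessions assemblies.txt > ncbi_ref.batch.tsv
```

Only the list-building step runs offline; the wrapper is defined but not called, so nothing is sent to NCBI until it is invoked explicitly.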

## Translate

Translate certain entries into CDC, NCTC3000, FDA, or the NCBI list,
and record the result in a sources field.

Add the sources field to `chromosomes.tsv` based on the Entrez searches above.

```bash
cat chromosomes.tsv | perl provenance.pl > sources.tsv
```
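The logic of `provenance.pl` is not shown here, so the following is only a rough sketch of the kind of lookup the translation step implies: check an accession against each collection list and fall back to `UNKNOWN`. The function name `label_source` and the `LABEL=file` argument convention are hypothetical, and the labels in the usage comment are assumptions rather than the script's actual output.

```bash
# Hypothetical lookup: print a source label for one versionless accession.
# $1: accession, e.g. AL111168
# remaining arguments: LABEL=file pairs, checked in the order given
label_source() {
  ls_acc=$1
  shift
  for ls_pair in "$@"; do
    ls_label=${ls_pair%%=*}   # text before the first '='
    ls_file=${ls_pair#*=}     # text after the first '='
    # Strip version suffixes from the list, then require an exact line match
    if sed 's/\.[0-9]*$//' "$ls_file" | grep -Fxq "$ls_acc"; then
      printf '%s\n' "$ls_label"
      return 0
    fi
  done
  # No list contained the accession
  printf 'UNKNOWN\n'
}

# Usage (label names are assumptions):
#   label_source AL111168 NCTC3000=nctc3000.acc FDA-ARGOS=fda-argos.acc
```

The sketch expects one accession per line, as in `nctc3000.acc` and `fda-argos.acc`. The tab-separated `ncbi_ref.acc.gz` would need to be decompressed and split into one accession per line first.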
86 changes: 86 additions & 0 deletions src/provenance/SME.acc
@@ -0,0 +1,86 @@
Bacillus_cereus AP007209 1396 86661 UNKNOWN
Bacillus_subtilis AL009126 1423 653685 NCBI-REF
Bacillus_thuringiensis AE017355 1428 86661 UNKNOWN
Betacoronavirus_coronavirus MT233526 2697049 694009 UNKNOWN
Borreliella_burgdorferi AE000783 139 64895 UNKNOWN
Burkholderia_pseudomallei BX571965 28450 111527 UNKNOWN
Burkholderia_pseudomallei BX571966 28450 111527 UNKNOWN
Campylobacter_coli CP028187 195 194 UNKNOWN
Campylobacter_concisus CP012541 199 194 NCBI-REF
Campylobacter_cuniculorum CP020867 374106 194 NCBI-REF
Campylobacter_fetus CP006833 196 194 UNKNOWN
Campylobacter_gracilis CP012196 824 194 NCBI-REF
Campylobacter_helveticus CP020478 28898 194 NCBI-REF
Campylobacter_hyointestinalis_hyointestinalis CP015575 91352 198 UNKNOWN
Campylobacter_hyointestinalis_lawsonii CP015575 91353 198 UNKNOWN
Campylobacter_insulaenigrae CP007770 260714 194 UNKNOWN
Campylobacter_lanienae CP015578 75658 194 NCBI-REF
Campylobacter_jejuni_jejuni AL111168 32022 197 NCBI-REF
Campylobacter_lari CP000932 201 194 UNKNOWN
Campylobacter_peloridis CP007766 488546 194 NCBI-REF
Campylobacter_sputorum CP019683 202 194 UNKNOWN
Campylobacter_subantarcticus CP007772 497724 194 UNKNOWN
Campylobacter_volucris CP007774 1031542 194 UNKNOWN
Campylobacter_ureolyticus CP012195 827 194 UNKNOWN

# We trust Segre's lab for Citrobacter
Citrobacter_freundii CP007557 546 544 UNKNOWN
# We trust Parkhill's sequencing lab for Clostridioides
Clostridioides_difficile AM180355 1496 1870884 UNKNOWN

Clostridium_acetobutylicum AE001437 1488 1485 UNKNOWN
Clostridium_argentinense CP014176 29341 1485 NCBI-REF
Clostridium_baratii CP014204 1561 1485 NCBI-REF
Clostridium_botulinum_groupI CP001078 9000005 1491 UNKNOWN
Clostridium_botulinum_groupII CP000939 9000004 1491 UNKNOWN
Clostridium_butyricum CP013239 1492 1485 UNKNOWN
Clostridium_perfringens CP000312 1502 1485 UNKNOWN

# WDPB
Cryptosporidium_parvum CM000430 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000429 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000431 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000432 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000433 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000434 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000435 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_parvum CM000436 353152 5806 NCBI-GEN:University of Minnesota
Cryptosporidium_hominis LN877948 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877947 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877950 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877949 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877951 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877952 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877953 237895 5806 UNKNOWN
Cryptosporidium_hominis LN877954 237895 5806 UNKNOWN

# ATCC
Enterobacter_cloacae CP001918 550 354276 UNKNOWN
# Claire Fraser's lab
Enterococcus_faecalis AE016830 1351 1350 UNKNOWN
# Enterococcus_faecium CP003583 1352 1350 UNKNOWN

# Rebecca Lindsey
Escherichia_albertii CP024282 208962 561 NCBI-GEN:CDC
Escherichia_coli CP027582 562 561 NCBI-GEN:CDC
Escherichia_fergusonii CP042945 564 561 NCBI-GEN:CDC

# ATCC
Finegoldia_magna AP008971 1260 150022 UNKNOWN
Fusobacterium_nucleatum CP028101 851 848 UNKNOWN

Listeria_innocua CP045743 1642 1637 UNKNOWN
Listeria_ivanovii FR687253 1638 1637 UNKNOWN
Listeria_monocytogenes_I CP054040 9000000 1639 UNKNOWN
Listeria_monocytogenes_II CP054042 9000001 1639 UNKNOWN
Listeria_monocytogenes_III CP054039 9000002 1639 UNKNOWN
Listeria_monocytogenes_IV CP054041 9000003 1639 UNKNOWN
Listeria_seeligeri FN557490 1640 1637 UNKNOWN

Yersinia_bercovieri CP054044 634 629 UNKNOWN
Yersinia_enterocolitica CP002246 630 629 UNKNOWN
Yersinia_enterocolitica AM286415 630 629 UNKNOWN
Yersinia_intermedia CP009801 631 629 UNKNOWN
Yersinia_kristensenii CP054049 631 629 UNKNOWN
Yersinia_massiliensis CP054048 33060 629 UNKNOWN
Yersinia_pseudotuberculosis CP009712 502800 633 UNKNOWN