Commit

some minor fixes; author affiliations; code examples
lskatz committed May 20, 2024
1 parent a79a5f6 commit b0f42fd
Showing 2 changed files with 51 additions and 10 deletions.
10 changes: 10 additions & 0 deletions paper/paper.bib
@@ -128,3 +128,13 @@ @article{lindsey2023rapid
year={2023},
publisher={Frontiers Media SA}
}

@article{ondov2016mash,
title={Mash: fast genome and metagenome distance estimation using MinHash},
author={Ondov, Brian D and Treangen, Todd J and Melsted, P{\'a}ll and Mallonee, Adam B and Bergman, Nicholas H and Koren, Sergey and Phillippy, Adam M},
journal={Genome biology},
volume={17},
pages={1--14},
year={2016},
publisher={Springer}
}
51 changes: 41 additions & 10 deletions paper/paper.md
@@ -15,19 +15,23 @@ authors:
- name: Rebecca Lindsey
affiliation: 1
- name: Ana Lauer
affiliation: 3
- name: Monica S. Im
affiliation: 1
- name: Grant Williams
affiliation: 1
- name: Jessica Halpin
affiliation: 1
- name: Gerardo A. Gómez
affiliation: 3
- name: Katie Roache
affiliation: 3
- name: Zuzana Kucerova
affiliation: 1
- name: Cheryl L. Tarr
affiliation: 1
- name: Andrew Page
affiliation: 4
- name: Henk C. Den Bakker
affiliation: 2
- name: Heather A. Carleton
@@ -37,6 +41,10 @@ affiliations:
name: Enteric Diseases Laboratory Branch (EDLB), Centers for Disease Control and Prevention, Atlanta, GA, USA
- index: 2
name: Center for Food Safety, University of Georgia, Griffin, GA, USA
- index: 3
name: At the time of this work, Enteric Diseases Laboratory Branch (EDLB), Centers for Disease Control and Prevention, Atlanta, GA, USA
- index: 4
name: At the time of this work, Quadram Institute, Norwich, UK
date: "May 17, 2024"
bibliography: paper.bib
output:
@@ -81,25 +89,23 @@ GenBank accessions, custom taxonomy, and software to utilize the accessions and
### accessions

The genomes in Kalamari are not housed in the repo itself.
Instead, NCBI accessions are in a file describing chromosomes, and another for plasmids.
Each of these files follows a tab-separated values (tsv) custom format.
Instead, NCBI accessions are in a tab-separated values (tsv) file describing chromosomes, and another tsv for plasmids.
The tsv files have a header line with the following columns: `scientificName` (genus and species), `nuccoreAcc` (GenBank accession), `taxid` (NCBI or Kalamari Taxonomy ID), and `parent` (the parent taxonomy ID).
Most genomes in the database are bacterial pathogens or related organisms.
All chromosomes and plasmids must be complete, i.e., no contig breaks,
and they come from trusted sources, e.g., FDA-ARGOS [@sichtig2019fda] or the NCTC 3000 collection [@dicks2023nctc3000], or our own subject matter experts at CDC vouch for them.
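
As a minimal sketch of how such a tsv file might be read, the following looks up columns by header name rather than by position. The two data rows are hypothetical placeholders, not real Kalamari entries; only the four column names come from the description above.

```bash
# Build a small sample file in the described format.
# ASSUMPTION: names, accessions, and taxids below are placeholders for illustration.
tsv=$(mktemp)
printf 'scientificName\tnuccoreAcc\ttaxid\tparent\n'        >  "$tsv"
printf 'Genus speciesA\tACC_000001.1\t1000001\t1000000\n'   >> "$tsv"
printf 'Genus speciesB\tACC_000002.1\t1000002\t1000000\n'   >> "$tsv"

# Map header names to column numbers on line 1,
# then print the accession and taxid of every data row.
accessions=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  { print $(col["nuccoreAcc"]), $(col["taxid"]) }
' "$tsv")

printf '%s\n' "$accessions"
rm -f "$tsv"
```

Reading columns by name keeps the sketch working even if the column order in the file were to change.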

However, there are some pathogen exceptions such as SARS-CoV-2 and _Cryptosporidium_.
Additionally, there are several host organisms. The animal hosts include chicken, human, and squid. The plant hosts include fava beans, tomato, and cabbage.
However, there are some viral pathogen exceptions such as SARS-CoV-2, protist exceptions such as _Cryptosporidium_, and several host organisms. The animal hosts include but are not limited to chicken, human, and squid. The plant hosts include fava beans, tomato, and cabbage.
Most hosts are very large in size and so only the mitochondrial genomes are included as markers.
Also, because the number of possible hosts for foodborne infections is so large,
only a select few are included to represent many other possibilities.
For example, tomato was chosen to represent the family of tomatoes, potatoes, eggplant, and tobacco;
tuna was selected to represent a variety of fish species.

We also obtained the list of plasmids from the MobSuite project [@robertsonMobsuite].
We also obtained the list of plasmids from the Mob-Suite project [@robertsonMobsuite].
We clustered them at 97% average nucleotide identity (ANI) [@lindsey2023rapid].
For each cluster, the taxonomy identifier was raised to the lowest common tier of taxonomy.
For example, if a cluster of plasmids were identified by both _Escherichia coli_ and _Salmonella enterica_, then all taxonomy identifiers for the plasmids were changed to their common family, Enterobacteriaceae.
For example, if a cluster of plasmids were identified by both _Escherichia coli_ and _Salmonella enterica_, then taxonomy identifiers for all the plasmids in the cluster were changed to their common family, Enterobacteriaceae.
As a result, any metagenomic signature from these plasmids
is both specific enough to the target taxon and general enough to help avoid any misidentifications.
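
The idea of raising a cluster to its lowest common taxonomic tier can be sketched as a lowest-common-ancestor lookup over a child-to-parent table. This is an illustration only, not the Kalamari implementation: the `lca` function is hypothetical, and the lineage table is a simplified excerpt whose taxids (562 for _Escherichia coli_, 28901 for _Salmonella enterica_, 543 for Enterobacteriaceae) are intended to match NCBI but should be treated as illustrative.

```bash
# Simplified child<TAB>parent table (illustrative excerpt, intermediate ranks omitted).
tax=$(mktemp)
printf '%s\t%s\n' \
  562   561 \
  561   543 \
  28901 590 \
  590   543 \
  543   91347 \
  91347 1236 > "$tax"

# lca TAXID_A TAXID_B TABLE: print the lowest taxid that is an ancestor of both.
# Hypothetical helper written for this sketch.
lca() {
  awk -F'\t' -v a="$1" -v b="$2" '
    { parent[$1] = $2 }
    END {
      # Mark a and all of its ancestors.
      for (t = a; t != ""; t = parent[t]) {
        seen[t] = 1
        if (parent[t] == t) break   # guard against a self-parented root
      }
      # Walk up from b until a marked taxid is reached.
      for (t = b; t != ""; t = parent[t]) {
        if (t in seen) { print t; exit }
        if (parent[t] == t) break
      }
    }
  ' "$3"
}

common=$(lca 562 28901 "$tax")
echo "$common"
```

For a cluster with more than two members, the same function could be applied repeatedly, feeding each result in as one of the two inputs for the next plasmid.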

@@ -126,17 +132,42 @@ This can result in a much smaller directory size and hypothetically faster downs
## Example Usage

Kalamari can be used where most metagenomic analyses are used.
Most commonly, we use Kalamari as the source data for Kraken1 [@wood2014kraken] or Kraken2 [@wood2019improved].
Most commonly, we use Kalamari to customize databases for Kraken1 [@wood2014kraken] or Kraken2 [@wood2019improved].
Building the Kraken database has been implemented in `buildKraken1.sh`
and in `buildKraken2.sh`.
However, other descriptions for building databases such as for BLAST
or Mash can be found in the documentation [@camacho2009blast+].
However, other descriptions for building databases such as for BLAST [@camacho2009blast+]
or Mash [@ondov2016mash] can be found in the documentation.

For genomes, a metagenomic database is useful for quality control because
For singular genomes, a metagenomic database is useful for quality control because
a user can have a null hypothesis that the sample is a metagenomic sample with a singular taxon.
An alternate hypothesis of contamination can be supported when conflicting taxa are detected by the database.
Therefore, a data scientist could use Kalamari as a way to detect contamination.
For metagenomes, the database is useful as intended, to detect which taxa are present in a sample.
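
The single-taxon null hypothesis could be checked roughly as follows, assuming a Kraken-style tab-separated report with six columns (percent of reads, clade reads, direct reads, rank code, taxid, name). The report rows and the 5% threshold are made up for illustration; a real cutoff would need to be chosen for the data at hand.

```bash
# ASSUMPTION: a made-up report in a Kraken-style six-column layout.
report=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  90.00 9000 9000 S 562   'Escherichia coli' \
  8.00  800  800  S 28901 'Salmonella enterica' \
  2.00  200  200  U 0     'unclassified' > "$report"

# Hypothetical threshold: ignore species below this percent of reads.
min_pct=5

# Count species-level rows (rank code "S") at or above the threshold.
n_species=$(awk -F'\t' -v min="$min_pct" '
  $4 == "S" && ($1 + 0) >= (min + 0) { n++ }
  END { print n + 0 }
' "$report")

# More than one well-supported species conflicts with the single-taxon null hypothesis.
if [ "$n_species" -gt 1 ]; then
  verdict="possible contamination"
else
  verdict="consistent with a single taxon"
fi
echo "$verdict"
rm -f "$report"
```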

A more concise example is shown below.

```bash
# Understand where the output files are
KALAMARI_VER=$(perl bin/downloadKalamari.pl --version)
OUTDIR="kalamari/share/kalamari-$KALAMARI_VER"

# after installing Kalamari
downloadKalamari.sh
# => files are now in $OUTDIR/kalamari
buildTaxonomy.sh
# => files are now in $OUTDIR/taxonomy
filterTaxonomy.sh
# => files are now in $OUTDIR/taxonomy/filtered

# Make sure Kraken1 is loaded into the environment before this step
buildKraken1.sh
# => files are now in $OUTDIR/kalamari-kraken

# Unload Kraken1 and load Kraken2 into the environment before this step
buildKraken2.sh
# => files are now in $OUTDIR/kalamari-kraken2
```

## Acknowledgements

This work was made possible through support from the Advanced Molecular Detection (AMD) Initiative at the Centers for Disease Control and Prevention.