Address JOSS reviewers suggestions for documentation and testing
tongzhouxu committed Nov 29, 2024
1 parent a3ac0d4 commit 4256230
Showing 36 changed files with 1,844 additions and 1,401 deletions.
4 changes: 2 additions & 2 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
mashpit/test/* linguist-vendored
test/* linguist-vendored
paper/* linguist-vendored
jquery.js linguist-vendored=false
docs/* linguist-vendored
8 changes: 4 additions & 4 deletions .github/workflows/python-app.yml
@@ -11,9 +11,9 @@ jobs:
python-version: ["3.8", "3.9", "3.10"]

steps:
- uses: actions/checkout@main
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
@@ -30,11 +30,11 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install .
python -m pip install .
- name: Install pytest
run: |
pip install pytest
- name: Run tests
working-directory: mashpit/test
working-directory: test
run: |
pytest test.py
68 changes: 50 additions & 18 deletions .github/workflows/python-publish.yml
@@ -1,4 +1,4 @@
# This workflow will upload a Python Package using Twine when a release is created
# This workflow will upload a Python Package to PyPI when a release is created
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
@@ -16,23 +16,55 @@ permissions:
contents: read

jobs:
deploy:
release-build:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: "3.8"

- name: Build release distributions
run: |
# NOTE: put your own distribution build steps here.
python -m pip install build
python -m build
- name: Upload distributions
uses: actions/upload-artifact@v4
with:
name: release-dists
path: dist/

pypi-publish:
runs-on: ubuntu-latest
needs:
- release-build
permissions:
# IMPORTANT: this permission is mandatory for trusted publishing
id-token: write

# Dedicated environments with protections for publishing are strongly recommended.
# For more information, see: https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#deployment-protection-rules
environment:
name: pypi
# OPTIONAL: uncomment and update to include your PyPI project URL in the deployment status:
# url: https://pypi.org/p/YOURPROJECT
#
# ALTERNATIVE: if your GitHub Release name is the PyPI project version string
# ALTERNATIVE: exactly, uncomment the following line instead:
# url: https://pypi.org/project/YOURPROJECT/${{ github.event.release.name }}

steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: '3.8'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
- name: Build package
run: python -m build
- name: Publish package
uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
with:
user: ${{ secrets.PYPI_USERNAME }}
password: ${{ secrets.PYPI_API_TOKEN }}
- name: Retrieve release distributions
uses: actions/download-artifact@v4
with:
name: release-dists
path: dist/

- name: Publish release distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
with:
packages-dir: dist/
9 changes: 6 additions & 3 deletions .gitignore
@@ -2,10 +2,13 @@
.idea/
.vscode/
/__pycache__/
mashpit/__pycache__/
mashpit/test/__pycache__/
src/__pycache__/
test/__pycache__/
build/
dist/
*.egg-info/
uploads/
MANIFEST.in
README_files/
README.html
paper/paper.pdf
paper/paper.docx
13 changes: 0 additions & 13 deletions .travis.yml

This file was deleted.

77 changes: 62 additions & 15 deletions README.md
@@ -3,56 +3,98 @@
![unittest](https://github.com/tongzhouxu/mashpit/actions/workflows/python-app.yml/badge.svg)
[![License: GPL v2](https://img.shields.io/badge/License-GPL_v2-blue.svg)](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
[![PyPI release](https://img.shields.io/pypi/v/mashpit)](https://pypi.python.org/pypi/mashpit/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Create a database of mash signatures and find the most similar genomes to a target sample. To contribute, please see the contributing guidelines [here](CONTRIBUTING.md).
Create a database of mash signatures and find the most similar genomes to a target sample. To contribute, please see the contributing guidelines [here](CONTRIBUTING.md).<br><br>Mashpit is a fast and lightweight genomic epidemiology platform for querying large-scale pathogen datasets locally. It enables users to:
- Build Custom Databases: Create species-specific databases using NCBI Pathogen Detection data or user-provided genomes.
- Perform Rapid Queries: Query assemblies against databases and receive results sorted by Mash distance.
- Integrate Metadata: Include epidemiological metadata such as isolation date, geography, and host information in results.
- Generate Phylogenetic Trees: Construct trees based on Mash distances for quick visualization of genomic relationships.

Mashpit is optimized for local use, requiring minimal computational resources while maintaining high performance, making it ideal for sensitive data or limited infrastructure.

## Installation
### Option 1. Install with Conda/Mamba (Recommended)
### Option 1. Install with Conda/Mamba (recommended)
```
conda create -n mashpit -c conda-forge -c bioconda 'mashpit=0.9.6'
conda create -n mashpit -c conda-forge -c bioconda 'mashpit=0.9.8'
conda activate mashpit
```
### Option 2. Install with pip
#### 1. Dependency: Install NCBI datasets
#### 1. Dependency: Install NCBI datasets (for Linux)
```
curl -o datasets 'https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets'
chmod +x datasets
export PATH=$PATH:$PWD
export PATH=$PATH:$PATH_TO_NCBI_DATASETS
```

#### 2. Install mashpit using pip:
```
pip install mashpit
```
#### Or git clone from github:
#### Or build from source:
```
git clone https://github.com/tongzhouxu/mashpit.git
cd mashpit
pip install .
python -m pip install .
```

## Mashpit Database

A mashpit database is a directory containing:
A mashpit database is a folder containing:
- `$DB_NAME.db`
- `$DB_NAME.sig`

Mashpit database can be built using:

1. A taxonomic name
1. A taxonomic name (e.g. *Salmonella enterica*)<br>
A standard database is a collection of representative genomes from each cluster on [Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens). By default, mashpit downloads the latest Pathogen Detection version for the specified species and finds the centroid of each SNP cluster (SNP tree).
2. BioSample accessions
2. BioSample accessions<br>
A custom database is a collection of genomes based on a provided BioSample accession list.
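
The idea of picking one representative (the "centroid") per SNP cluster can be illustrated with a small sketch. This is not mashpit's own code: the function name `pick_centroid` and the toy distance matrix are hypothetical, and the sketch simply assumes that the centroid is the genome with the smallest summed pairwise distance to the other members of its cluster.

```python
from typing import List, Sequence


def pick_centroid(names: Sequence[str], dist: List[List[float]]) -> str:
    """Return the name whose summed distance to all cluster members is smallest.

    `dist` is a symmetric matrix where dist[i][j] is the distance between
    names[i] and names[j] (hypothetical input, not a mashpit data structure).
    """
    if not names or len(names) != len(dist):
        raise ValueError("names and dist must be non-empty and the same length")
    best_index = 0
    best_total = sum(dist[0])
    for i in range(1, len(names)):
        total = sum(dist[i])
        if total < best_total:  # ties keep the earliest genome
            best_index, best_total = i, total
    return names[best_index]


# Toy cluster of three genomes with made-up pairwise distances.
cluster_names = ["genome_A", "genome_B", "genome_C"]
cluster_dist = [
    [0.0, 0.1, 0.4],
    [0.1, 0.0, 0.2],
    [0.4, 0.2, 0.0],
]
centroid = pick_centroid(cluster_names, cluster_dist)
```

Here `genome_B` has the smallest row sum, so it would stand in for the whole cluster in the database.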

## Quick start
Here we use a small *Listeria innocua* Pathogen Detection dataset as an example.
#### 1. Build a *Listeria innocua* database:
PDG000000091.9 was versioned on 2022-07-29
```
mashpit build taxon mashpit_listeria_innocua --species Listeria_innocua --pd_version PDG000000091.9
```
#### 2. Download a *Listeria innocua* assembly:
```
datasets download genome accession GCA_022617975.1
```
#### 3. Unzip the ncbi_dataset.zip file:
```
unzip ncbi_dataset.zip
```
#### 4. Run mashpit query:
```
mashpit query ncbi_dataset/data/GCA_022617975.1/GCA_022617975.1_PDT001269761.1_genomic.fna mashpit_listeria_innocua
```
Here, `ncbi_dataset/data/GCA_022617975.1/GCA_022617975.1_PDT001269761.1_genomic.fna` is the downloaded genome assembly and `mashpit_listeria_innocua` is the database folder.

#### Output:
##### 1. `GCA_022617975_output.csv`: the query result table sorted by the similarity_score, which is the sourmash Jaccard similarity ranging from 0 to 1<br>
Part of an example output table:

biosample_acc | ... | PDS_acc | asm_acc | similarity_score | SNP_tree_link |
--- | --- |--- |--- |--- |--- |
SAMN24804945 | ... | PDS000111028.1|GCA_022617975.1|1|...|
SAMEA8998150 | ... | PDS000111027.1|GCA_021238725.1|0.927|...|

In this example, the SNP cluster PDS000111028.1 is the most similar to the query genome GCA_022617975.1 with a similarity score of 1. The SNP cluster PDS000111027.1 is the second most similar with a similarity score of 0.927.
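
To make the similarity score more concrete, the following toy sketch computes an exact Jaccard similarity between the k-mer sets of two short sequences. It is only an illustration: sourmash estimates this value from MinHash sketches rather than from full k-mer sets, and the helper names `kmers` and `jaccard`, the sequences, and `k=3` are all assumptions chosen for the example.

```python
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity |a & b| / |a | b|; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy sequences; real genomes and sourmash use much larger k and sketches.
query = kmers("ACGTACGT", 3)
other = kmers("ACGTTTTT", 3)

same_score = jaccard(query, query)    # identical k-mer sets
mixed_score = jaccard(query, other)   # 2 shared k-mers out of 6 distinct
```

A score of 1 therefore means the two k-mer sets are identical, and lower scores mean fewer shared k-mers.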

##### 2. `GCA_022617975_tree.newick` and `GCA_022617975_tree.png`: the sourmash distance based tree
##### 3. `mashpit-$date.log`: mashpit query log file
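
The output table can also be inspected programmatically. The sketch below is a minimal example that assumes the CSV header uses the column names shown in the example table (`biosample_acc`, `PDS_acc`, `similarity_score`); the function name `top_hits` is hypothetical and the embedded two-row CSV stands in for a real output file.

```python
import csv
import io
from typing import Iterable, List, Tuple


def top_hits(handle: Iterable[str], n: int = 5) -> List[Tuple[str, str, float]]:
    """Return the n rows with the highest similarity_score.

    Assumes the columns biosample_acc, PDS_acc and similarity_score exist,
    as in the example table; adjust the names if your file differs.
    """
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row["similarity_score"]), reverse=True)
    return [
        (row["biosample_acc"], row["PDS_acc"], float(row["similarity_score"]))
        for row in rows[:n]
    ]


# Stand-in for open("GCA_022617975_output.csv", newline="").
example = io.StringIO(
    "biosample_acc,PDS_acc,asm_acc,similarity_score\n"
    "SAMEA8998150,PDS000111027.1,GCA_021238725.1,0.927\n"
    "SAMN24804945,PDS000111028.1,GCA_022617975.1,1\n"
)
hits = top_hits(example, n=1)
```

Sorting inside the function means the result does not depend on the row order of the file.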


## Usage

### 1. Build a mashpit database
```
usage: mashpit build [-h] [--quiet] [--number NUMBER] [--ksize KSIZE] [--species SPECIES] [--email EMAIL] [--key KEY] [--pd_version PD_VERSION] [--list LIST] {taxon,accession} name
positional arguments:
{taxon,accession} mashpit database type.
{taxon,accession} mashpit database type
name mashpit database name
optional arguments:
@@ -72,7 +114,7 @@ optional arguments:
mashpit build taxon salmonella --species Salmonella
```

Note: Supported species names can be found in this [list](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/)
#### Note: Supported species names can be found in this [list](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/).

### 2. Query against a mashpit database
```
@@ -94,7 +136,7 @@ optional arguments:
```
mashpit query sample.fasta path/to/database
```
### Optional: Update the database
### Optional: update the database (e.g. update a taxon database to the latest NCBI Pathogen Detection version)
```
usage: mashpit update [-h] [--metadata METADATA] [--quiet] database name
@@ -111,8 +153,13 @@ optional arguments:
```
mashpit update path/to/database salmonella
```
### Webserver
A local host webserver can be started using:
### Optional: webserver
A local host webserver to run the query and visualize the output can be started using:
```
mashpit webserver
```
After running this command, a GUI interface will be deployed at 127.0.0.1:8080. Visit the link in your browser to start using the webserver. The webserver allows users to upload a query sample and select a database to query against. The results will be displayed in a table and a tree. A screenshot of the webserver is shown below:

![screenshot](docs/img/mashpit_webserver.pdf)

Note that a pre-built database is required to run the webserver. The database can be built using the `mashpit build` command.
Binary file added docs/img/mashpit_webserver.pdf
Binary file not shown.
3 changes: 0 additions & 3 deletions mashpit/__init__.py

This file was deleted.
