Snakemake workflow: `mashup-phage`

A Snakemake workflow for louvain community partitioning of mash distances

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog.

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) mashup-phagesitory and its DOI (see above).

TODO for workflow implementation

Our general aims are to:

Follow the best practices regarding snakemake workflow folder structures: https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#distribution-and-reproducibility
Use conda environments for automated software installation: https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
Use snakemake wrappers for any steps where they are available: https://snakemake-wrappers.readthedocs.io/
Enable automatic inclusion of the workflow in the snakemake workflow catalog (this should work out-of-the-box, standardized usage can be enabled later): https://snakemake.github.io/snakemake-workflow-catalog/?rules=true
Use the example data provided for automated testing (continuous integration) of workflow functionality in a .test/ folder and using GitHub Actions.

MASHUP workflow step-by-step

The aim of this tool is to ascertain whether MASH distances, in combination with community partitioning (e.g. Louvain or Leiden), could be used to rapidly determine strains, species and genera of bacteriophages.

A systematic evaluation of varying distance, hashes and p-value thresholds is currently underway, so the defaults are the best guess from assessment of correlation between these parameters and ANI or VIRIDIC measures of similarity.

1. Download dataset

TODO: implement rule for data download

The Inphared refseq genomes dataset is a subset of genome files and is much smaller than the _excluding_refseq file that I normally use:

wget https://millardlab-inphared.s3.climb.ac.uk/2Jul2023_refseq_genomes.fa.gz    
gunzip 2Jul2023_refseq_genomes.fa.gz

TODO: create minimal testing data set

2. Data processing with Mash

First, create sketches for records in inphared dataset, then create a file of calculated distances

Variables include
-k <int>: kmer-size
-s <int>: sketch size - number of non-redundant min-hashes
-i: Sketch individual sequences, rather than whole files, e.g. for multi-fastas of single-chromosome genomes or pair-wise gene comparisons
-p <int>: number of threads to use for processing
-d <num>: maximum distance to report

mash sketch -i -k 15 -s 25000 -p 24 2Jul2023_refseq_genomes.fa
mash dist -i -d 0.3 -p 24 2Jul2023_refseq_genomes.fa.msh 2Jul2023_refseq_genomes.fa.msh > 2Jul2023.refseq.d0.3.k15.s25000.tsv

TODO: implement rule for mash sketch
TODO: implement rule for mash dist
TODO: set up automatic testing via GitHub workflow

3. MASHUP

See accompanying code and requirements in

TODO: flesh out actual todos for this step, based on scripts in the web app

Explanation of process:

Input mash tsv file is parsed, removing self-matches where the distance will be 0, i.e. binA=binB

Matching hashes converted to a ratio

Dataframe used to create a networkx object with edge attributes of distance, matching hashes and p-value

Edges dropped from network according to user defined thresholds of distance, matching hashes and p-value

Filtered networkx object passed to Louvain paritioning

Structured outputs produced

NOTE There are options to produce a python rendered network and dendrogram. These should be disabled as they take a massive amount of time to produce. The network output can be visualised in cytoscape.

Input

MASH distance file: 2Jul2023.refseq.d0.3.k15.s25000.tsv

User parameters:
Input file path: location and name of input mash distance file
Sample size: 0 for everything, else limit to (n) sequence records
Distance threshold: maximum distance to use for network edges
P-value threshold: maximum p-value to use for network edges
Hashes threshold: minimum hashes ratio to use for network edges

Output

Output folders are created automatically, named by date and time of run

passed_to_networkx.tsv: Checkpoint dataframe
community_counts_table.csv: Number of member genomes for each community
community_members_table.csv: Comma-separated list of accession numbers for each community
network.xml: graphxml file of networkx object following filtering - can be visualised in cytoscape
full_louvain_partition.json: json output of louvain paritioning

4. Downstream analysis (not coded)

Extract accession numbers for a random community

Download records from GenBank as multi-fasta file for those accessions (E-utilities API)

Process fasta record names so only accession numbers remain: .[0-9].*

Use VIRIDIC to assess whether the communities defined fall within or outside of current genus demarcation criteria.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.github/workflows		.github/workflows
.template		.template
.test		.test
config		config
workflow		workflow
.gitignore		.gitignore
.snakemake-workflow-catalog.yml		.snakemake-workflow-catalog.yml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snakemake workflow: `mashup-phage`

Usage

TODO for workflow implementation

MASHUP workflow step-by-step

1. Download dataset

2. Data processing with Mash

3. MASHUP

Input

Output

4. Downstream analysis (not coded)

About

Releases

Packages

Contributors 2

Languages

License

ICTV-VBEG/mashup-phage

Folders and files

Latest commit

History

Repository files navigation

Snakemake workflow: mashup-phage

Usage

TODO for workflow implementation

MASHUP workflow step-by-step

1. Download dataset

2. Data processing with Mash

3. MASHUP

Input

Output

4. Downstream analysis (not coded)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Snakemake workflow: `mashup-phage`

Packages