A Snakemake workflow for louvain community partitioning of mash distances
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) mashup-phagesitory and its DOI (see above).
Our general aims are to:
- Follow the best practices regarding snakemake workflow folder structures: https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#distribution-and-reproducibility
- Use conda environments for automated software installation: https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
- Use snakemake wrappers for any steps where they are available: https://snakemake-wrappers.readthedocs.io/
- Enable automatic inclusion of the workflow in the snakemake workflow catalog (this should work out-of-the-box, standardized usage can be enabled later): https://snakemake.github.io/snakemake-workflow-catalog/?rules=true
- Use the example data provided for automated testing (continuous integration) of workflow functionality in a
.test/
folder and using GitHub Actions.
The aim of this tool is to ascertain whether MASH distances, in combination with community partitioning (e.g. Louvain or Leiden), could be used to rapidly determine strains, species and genera of bacteriophages.
A systematic evaluation of varying distance, hashes and p-value thresholds is currently underway, so the defaults are the best guess from assessment of correlation between these parameters and ANI or VIRIDIC measures of similarity.
- TODO: implement rule for data download
The Inphared refseq genomes dataset is a subset of genome files and is much smaller than the _excluding_refseq file that I normally use:
wget https://millardlab-inphared.s3.climb.ac.uk/2Jul2023_refseq_genomes.fa.gz
gunzip 2Jul2023_refseq_genomes.fa.gz
- TODO: create minimal testing data set
2. Data processing with Mash
First, create sketches for records in inphared dataset, then create a file of calculated distances
Variables include
-k <int>: kmer-size
-s <int>: sketch size - number of non-redundant min-hashes
-i: Sketch individual sequences, rather than whole files, e.g. for multi-fastas of single-chromosome genomes or pair-wise gene comparisons
-p <int>: number of threads to use for processing
-d <num>: maximum distance to report
mash sketch -i -k 15 -s 25000 -p 24 2Jul2023_refseq_genomes.fa
mash dist -i -d 0.3 -p 24 2Jul2023_refseq_genomes.fa.msh 2Jul2023_refseq_genomes.fa.msh > 2Jul2023.refseq.d0.3.k15.s25000.tsv
- TODO: implement rule for
mash sketch
- TODO: implement rule for
mash dist
- TODO: set up automatic testing via GitHub workflow
See accompanying code and requirements in
- TODO: flesh out actual todos for this step, based on scripts in the web app
Explanation of process:
- Input mash tsv file is parsed, removing self-matches where the distance will be 0, i.e. binA=binB
- Matching hashes converted to a ratio
- Dataframe used to create a networkx object with edge attributes of distance, matching hashes and p-value
- Edges dropped from network according to user defined thresholds of distance, matching hashes and p-value
- Filtered networkx object passed to Louvain paritioning
- Structured outputs produced
NOTE There are options to produce a python rendered network and dendrogram. These should be disabled as they take a massive amount of time to produce. The network output can be visualised in cytoscape.
MASH distance file: 2Jul2023.refseq.d0.3.k15.s25000.tsv
User parameters:
Input file path: location and name of input mash distance file
Sample size: 0 for everything, else limit to (n) sequence records
Distance threshold: maximum distance to use for network edges
P-value threshold: maximum p-value to use for network edges
Hashes threshold: minimum hashes ratio to use for network edges
Output folders are created automatically, named by date and time of run
passed_to_networkx.tsv: Checkpoint dataframe
community_counts_table.csv: Number of member genomes for each community
community_members_table.csv: Comma-separated list of accession numbers for each community
network.xml: graphxml file of networkx object following filtering - can be visualised in cytoscape
full_louvain_partition.json: json output of louvain paritioning
- Extract accession numbers for a random community
- Download records from GenBank as multi-fasta file for those accessions (E-utilities API)
- Process fasta record names so only accession numbers remain: .[0-9].*
- Use VIRIDIC to assess whether the communities defined fall within or outside of current genus demarcation criteria.