Release 0.9.1 #23

Merged: 15 commits, Sep 4, 2024
1 change: 1 addition & 0 deletions .github/workflows/build-docs.yml
@@ -6,6 +6,7 @@ on:
paths:
- ".github/workflows/build-docs.yml"
- "docs/**"
- "mkdocs.yml"
# Cancel if a newer run is started
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
21 changes: 9 additions & 12 deletions .github/workflows/unit-tests.yml
@@ -2,19 +2,16 @@ name: Tests

on:
push:
branches: ["main"]
paths:
- ".github/workflows/unit-tests.yml"
- "tests/**"
- "mess/**"
- "setup.py"
branches:
- main
paths-ignore:
- "docs/**"
- "*.md"
pull_request:
branches: ["main"]
paths:
- ".github/workflows/unit-tests.yml"
- "tests/**"
- "mess/**"
- "setup.py"
paths-ignore:
- "docs/**"
- "*.md"


permissions:
contents: read
36 changes: 24 additions & 12 deletions CITATION.cff
@@ -8,22 +8,34 @@ authors:
- given-names: Farid
family-names: Chaabane
email: [email protected]
orcid: 'https://orcid.org/0009-0007-9322-1281'
affiliation: University Hospital of Lausanne
orcid: "https://orcid.org/0009-0007-9322-1281"
affiliation: >-
Institute of Microbiology, Lausanne University
Hospital and University of Lausanne, Lausanne,
Switzerland
- given-names: Trestan
family-names: Pillonel
email: [email protected]
orcid: 'https://orcid.org/0000-0002-5725-7929'
affiliation: University Hospital of Lausanne
- orcid: 'https://orcid.org/0000-0003-0550-8981'
given-names: Claire
orcid: "https://orcid.org/0000-0002-5725-7929"
affiliation: >-
Institute of Microbiology, Lausanne University
Hospital and University of Lausanne, Lausanne,
Switzerland
- given-names: Claire
family-names: Bertelli
email: [email protected]
affiliation: University Hospital of Lausanne
repository-code: 'https://github.com/metagenlab/MeSS'
url: 'https://metagenlab.github.io/MeSS/'
orcid: "https://orcid.org/0000-0003-0550-8981"
affiliation: >-
Institute of Microbiology, Lausanne University
Hospital and University of Lausanne, Lausanne,
Switzerland
identifiers:
- type: doi
value: 10.5281/zenodo.13365501
description: zenodo software
repository-code: "https://github.com/metagenlab/MeSS"
url: "https://metagenlab.github.io/MeSS/"
abstract: >-
Snakemake pipeline for simulating shotgun metagenomic
samples
Snakemake pipeline for simulating shotgun metagenomic samples
license: MIT
version: 0.9.0
version: 0.9.1
2 changes: 1 addition & 1 deletion Dockerfile
@@ -4,7 +4,7 @@ LABEL org.opencontainers.image.description="Snakemake pipeline for simulating sh
LABEL org.opencontainers.image.licenses=MIT

USER root
ENV APT_PKGS="squashfuse fuse2fs gocryptfs"
ENV APT_PKGS="squashfuse fuse2fs gocryptfs procps"
RUN apt-get update \
&& apt-get install -y --no-install-recommends ${APT_PKGS} \
&& apt-get clean \
108 changes: 100 additions & 8 deletions README.md
@@ -1,28 +1,120 @@
# Welcome to MeSS !
# Metagenomic Sequence Simulator (MeSS)

[![](https://img.shields.io/static/v1?label=CLI&message=Snaketool&color=blueviolet)](https://github.com/beardymcjohnface/Snaketool)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![version](https://img.shields.io/conda/v/bioconda/mess?label=version&color=blue)](http://bioconda.github.io/recipes/mess/README.html)
[![license](https://img.shields.io/github/license/metagenlab/mess.svg)](https://github.com/metagenlab/MeSS/blob/main/LICENSE)
[![version](https://img.shields.io/conda/vn/bioconda/mess?color=blue)](http://bioconda.github.io/recipes/mess/README.html)
[![downloads](https://img.shields.io/conda/dn/bioconda/mess.svg)](https://anaconda.org/bioconda/mess)

[![tests](https://github.com/metagenlab/MeSS/actions/workflows/unit-tests.yml/badge.svg)](https://github.com/metagenlab/MeSS/actions/workflows/unit-tests.yml)
[![docs](https://github.com/metagenlab/MeSS/actions/workflows/build-docs.yml/badge.svg)](https://github.com/metagenlab/MeSS/actions/workflows/build-docs.yml)
[![docker](https://github.com/metagenlab/MeSS/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/metagenlab/MeSS/actions/workflows/docker-publish.yml)

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13365501.svg)](https://zenodo.org/doi/10.5281/zenodo.13365501)


The Metagenomic Sequence Simulator (MeSS) is a [Snakemake](https://github.com/snakemake/snakemake) pipeline, implemented using [Snaketool](https://github.com/beardymcjohnface/Snaketool), for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.

## Overview
## :memo: Overview

MeSS takes as input NCBI taxa or local genome assemblies to generate either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS optionally generates BAM alignment files and taxonomic + sequence abundances in [CAMI format](https://github.com/bioboxes/rfc/blob/master/data-format/profiling.mkd).

``` mermaid
%%{init: {'theme':'forest'}}%%
flowchart LR
input["samples.tsv
or
samples/*.tsv"] --> taxons

subgraph genome_download["genome download"]
dlchoice{download ?}
taxons["taxons or
accessions"] --> dlchoice
dlchoice -->|yes| assembly_finder
dlchoice -->|no| fasta
assembly_finder --> fasta
end

input --> distchoice
subgraph community_design["community design"]
distchoice{draw distribution ?}
distchoice -->|yes| dist["distribution
(lognormal, even)"]
dist --> abundances
distchoice -->|no| reads
distchoice -->|no| bases
distchoice -->|no| abundances
depth["coverage depth"]
reads --> depth
bases --> depth
abundances["abundances
(sequence, taxonomic)"] --> depth
end
fasta --> simulator
depth --> simulator

simulator["read simulator
(art_illumina, pbsim3...)"]
simulator --> bam
simulator --> fastq
simulator --> CAMI-profile

MeSS takes as input NCBI taxa or local genome assemblies to generate either long (PacBio or ONT) or short (illumina) reads. In addition to reads, MeSS optionally generates bam alignment files and taxonomic profiles in [bioboxes format](https://github.com/bioboxes/rfc).
%% colors
style genome_download color:black
style community_design color:black
classDef red fill:#faeaea,color:#fff,stroke:#333;
classDef blue fill:#eaecfa,color:#fff,stroke:#333;
class genome_download blue
class community_design red
```
## :books: Documentation

More details can be found in the [documentation](https://metagenlab.github.io/MeSS/)

![overview](docs/images/workflow.svg)
## :zap: Quick start
### Installation

## Installation
#### Mamba

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/mess/README.html)

```sh
mamba create -n mess mess
```

## Usage
#### Docker

```sh
docker pull ghcr.io/metagenlab/mess:latest
```

#### From source

```sh
git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS
```

### Usage


#### Download and simulate

Using the following file [minimal_test.tsv](https://github.com/metagenlab/MeSS/blob/main/mess/test_data/minimal_test.tsv)

```sh
mess run -i minimal_test.tsv
```

#### Simulate from local fasta

Download the [fasta directory](https://github.com/metagenlab/MeSS/tree/main/mess/test_data/fastas) and [table](https://github.com/metagenlab/MeSS/blob/main/mess/test_data/simulate_test.tsv)

```sh
mess simulate -i simulate_test.tsv --fasta fasta
```

## :sos: Help

More details on command-line options in the [doc](https://metagenlab.github.io/MeSS/commands/)

![`mess -h`](docs/images/mess-help.svg)
2 changes: 1 addition & 1 deletion mess/mess.VERSION
@@ -1 +1 @@
0.9.0
0.9.1
5 changes: 5 additions & 0 deletions mess/workflow/rules/preflight/functions.smk
@@ -7,6 +7,11 @@ from Bio import SeqIO
import random


wildcard_constraints:
sample="[^/]+",
contig="[^/]+",
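Review note: Snakemake wildcards default to the regex `.+`, so a wildcard can absorb path separators; the `[^/]+` constraints added here restrict `sample` and `contig` to a single path segment. The sketch below illustrates the difference with plain `re`. The two-level `sample/contig.sam` layout and the `match_path` helper are simplified, hypothetical stand-ins, not the pipeline's actual paths or code.

```python
import re


def match_path(path: str, wildcard_regex: str):
    """Match a simplified '<sample>/<contig>.sam' layout (hypothetical).

    wildcard_regex is applied to both wildcards, mimicking how a
    global constraint applies to every rule that uses them.
    """
    pattern = re.compile(
        rf"(?P<sample>{wildcard_regex})/(?P<contig>{wildcard_regex})\.sam"
    )
    m = pattern.fullmatch(path)
    return m.groupdict() if m else None


# Unconstrained wildcards (default '.+') swallow the extra directory level.
loose = match_path("s1/extra/c1.sam", r".+")

# Constrained wildcards ('[^/]+') reject the ambiguous path outright.
strict = match_path("s1/extra/c1.sam", r"[^/]+")

# A well-formed path still matches under the constraint.
ok = match_path("s1/c1.sam", r"[^/]+")
```

With the constraint in place, an unexpected nested path fails to match instead of being silently assigned to the wrong wildcard.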


def list_reads(wildcards):
if PAIRED:
reads = expand(
2 changes: 1 addition & 1 deletion mess/workflow/rules/processing/coverages.smk
@@ -20,7 +20,7 @@ checkpoint calculate_genome_coverages:
df=os.path.join(dir.out.base, "replicates.tsv"),
asm=get_asm_summary,
output:
os.path.join(dir.out.processing, "coverages.tsv"),
os.path.join(dir.out.base, "coverages.tsv"),
params:
fa=FASTA,
dist=DIST,
2 changes: 1 addition & 1 deletion mess/workflow/rules/processing/fastas.smk
@@ -46,7 +46,7 @@ if FASTA and not ASM_SUMMARY:
checkpoint split_contigs:
input:
fa=list_fastas,
cov=os.path.join(dir.out.processing, "coverages.tsv"),
cov=os.path.join(dir.out.base, "coverages.tsv"),
output:
tsv=os.path.join(dir.out.processing, "cov.tsv"),
dir=directory(os.path.join(dir.out.processing, "split")),
49 changes: 45 additions & 4 deletions mess/workflow/rules/processing/reads.smk
@@ -2,7 +2,7 @@ fastq_dir = dir.out.long
sam_in = os.path.join(dir.out.bam, "{sample}", "{fasta}", "{contig}.sam")
if SEQ_TECH == "illumina":
fastq_dir = dir.out.short
sam_in = os.path.join(fastq_dir, "{sample}", "{fasta}", "{contig}.sam")
sam_in = os.path.join(fastq_dir, "{sample}", "{fasta}", "{contig}.fixed")

fastq = os.path.join(fastq_dir, "{sample}", "{fasta}", "{contig}.fq")
fastq_gz = temp(os.path.join(fastq_dir, "{sample}", "{fasta}", "{contig}.fq.gz"))
@@ -151,6 +151,29 @@ if BAM:
"""


rule fix_art_sam:
"""
rule to replace SAM cigar string with read length + M
Fixes truncated art_illumina SAM files with some genomes
"""
input:
os.path.join(fastq_dir, "{sample}", "{fasta}", "{contig}.sam"),
output:
temp(os.path.join(fastq_dir, "{sample}", "{fasta}", "{contig}.fixed")),
resources:
mem_mb=config.resources.sml.mem,
mem=str(config.resources.sml.mem) + "MB",
time=config.resources.sml.time,
params:
maxlen=MEAN_LEN,
shell:
"""
awk 'BEGIN {{OFS="\t"}} {{ if ($1 ~ /^@/) {{ print $0 }} \\
else {{ $6 = "{params.maxlen}M"; print $0 }} }}' \\
{input} > {output}
"""
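Review note: the awk one-liner above passes header lines (starting with `@`) through unchanged and overwrites column 6 (the CIGAR field) of every alignment record with `<maxlen>M`; the doubled braces are Snakemake's escaping for literal `{` and `}`. A rough Python equivalent is sketched below for readability. The `fix_cigar` name is hypothetical, and unlike awk's default whitespace splitting this sketch splits strictly on tabs and returns the line without a trailing newline.

```python
def fix_cigar(sam_line: str, read_len: int) -> str:
    """Return a SAM line with its CIGAR field replaced by '<read_len>M'.

    Header lines (starting with '@') are returned unchanged.
    Assumes a tab-delimited alignment record with at least 6 columns.
    """
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    fields[5] = f"{read_len}M"  # column 6 in SAM is the CIGAR string
    return "\t".join(fields)


# Illustrative records (made up for this sketch, not taken from the PR).
header = "@SQ\tSN:contig1\tLN:1000"
record = "read1\t0\tcontig1\t1\t99\t100M50S\t*\t0\t0\tACGT\tIIII"

fixed_header = fix_cigar(header, 150)
fixed_record = fix_cigar(record, 150)
```

Note that this forces every read to a full-length match, so any soft clipping or indels reported by the simulator are discarded; that appears to be the intended trade-off of the rule.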


rule convert_sam_to_bam:
input:
sam_in,
@@ -171,7 +194,7 @@ rule convert_sam_to_bam:
containers.bioconvert
shell:
"""
bioconvert {input} {output} -t {threads} 2> {log}
bioconvert sam2bam {input} {output} -t {threads} 2> {log}
"""


@@ -243,7 +266,7 @@ rule sort_bams:
containers.bioconvert
shell:
"""
samtools sort -@ {threads} {input} -o {output} 2> {log}
samtools sort -@ {threads} {input} -o {output} 2> {log}
"""


@@ -265,7 +288,7 @@ rule get_bam_coverage:
containers.bioconvert
shell:
"""
samtools coverage {input} > {output}
samtools coverage {input} | tee {output} {log}
"""


@@ -274,6 +297,7 @@ rule get_tax_profile:
cov=os.path.join(dir.out.bam, "{sample}.txt"),
tax=get_cov_table,
output:
counts=os.path.join(dir.out.tax, "{sample}.tsv"),
seq_abundance=temp(os.path.join(dir.out.tax, "{sample}_seq.tsv")),
tax_abundance=temp(os.path.join(dir.out.tax, "{sample}_tax.tsv")),
resources:
@@ -288,6 +312,23 @@ rule get_tax_profile:
cov_df = pd.read_csv(input.cov, sep="\t")
cov_df.rename(columns={"#rname": "contig"}, inplace=True)
merge_df = tax_df.merge(cov_df)
merge_df[
[
"samplename",
"fasta",
"contig",
"tax_id",
"startpos",
"endpos",
"numreads",
"covbases",
"coverage",
"cov_sim",
"meandepth",
"meanbaseq",
"meanmapq",
]
].to_csv(output.counts, sep="\t", index=False)
for col in ["numreads", "meandepth"]:
if col == "numreads":
out = output.seq_abundance
2 changes: 1 addition & 1 deletion mess/workflow/rules/simulate/short_reads.smk
@@ -16,7 +16,7 @@ if PAIRED:


if BAM:
art_args += "-sam"
art_args += "-sam -M"


sam_out = temp(os.path.join(dir.out.short, "{sample}", "{fasta}", "{contig}.txt"))
2 changes: 1 addition & 1 deletion setup.py
@@ -57,7 +57,7 @@ def get_data_files():
"pyyaml>=6.0.1",
"pandas>=2.2.1",
"biopython>=1.83",
"rich-click>=1.7.4",
"rich-click>=1.8.3",
],
entry_points={"console_scripts": ["mess=mess.__main__:main"]},
include_package_data=True,