Added rotate in input table and changed default sdm #34

Merged
merged 13 commits into from
Oct 25, 2024
2 changes: 2 additions & 0 deletions .github/workflows/docker-publish.yml
@@ -9,6 +9,8 @@ on:
- "tests/**"
- "mess/**"
- "setup.py"
- "!.github/workflows/build-docs.yml"
- "!mkdocs.yml"
- "!docs/**"
- "!README.md"

51 changes: 35 additions & 16 deletions .github/workflows/unit-tests.yml
@@ -2,7 +2,8 @@ name: Tests

on:
push:
branches: [main]
branches:
- main
paths:
- ".github/workflows/unit-tests.yml"
- "tests/**"
@@ -11,6 +12,17 @@ on:
- "!.github/workflows/build-docs.yml"
- "!docs/**"
- "!mkdocs.yml"
- "!README.md"
pull_request:
paths:
- ".github/workflows/unit-tests.yml"
- "tests/**"
- "mess/**"
- "setup.py"
- "!.github/workflows/build-docs.yml"
- "!docs/**"
- "!mkdocs.yml"
- "!README.md"

permissions:
contents: read
@@ -25,33 +37,40 @@ jobs:
shell: bash -el {0}

strategy:
fail-fast: false
matrix:
os: [ubuntu-latest]
python-version: ["3.12"]

steps:
- uses: "actions/checkout@v4"
- uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Setup conda env
- name: Setup MeSS environment
uses: conda-incubator/setup-miniconda@v3
with:
auto-update-conda: true
miniforge-version: "latest"
miniforge-variant: Mambaforge
use-mamba: true
mamba-version: "*"
channels: conda-forge,bioconda,defaults
channel-priority: strict
miniforge-version: latest
activate-environment: mess
python-version: ${{ matrix.python-version }}
auto-activate-base: false
auto-update-conda: true

- name: Setup apt dependencies
run: |
sudo add-apt-repository -y ppa:apptainer/ppa
sudo apt-get update
sudo apt install -y squashfuse fuse2fs gocryptfs apptainer

- name: Disable apparmor namespace restrictions for apptainer
run: |
sudo sh -c 'echo kernel.apparmor_restrict_unprivileged_userns=0 \
>/etc/sysctl.d/90-disable-userns-restrictions.conf'
sudo sysctl -p /etc/sysctl.d/90-disable-userns-restrictions.conf

- name: Install MeSS and pytest-cov
run: |
pip install -e .
pip install pytest coverage

- name: "Test and generate coverage report on ${{ matrix.os }} for Python ${{ matrix.python-version }}"
- name: Run tests on ${{ matrix.os }} for python ${{ matrix.python-version }}
run: |
python -m pip install --upgrade pip
python -m pip install pytest coverage
python -m pip install .
coverage run -m pytest
147 changes: 93 additions & 54 deletions README.md
@@ -12,34 +12,33 @@

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13365501.svg)](https://zenodo.org/doi/10.5281/zenodo.13365501)


The Metagenomic Sequence Simulator (MeSS) is a [Snakemake](https://github.com/snakemake/snakemake) pipeline, implemented using [Snaketool](https://github.com/beardymcjohnface/Snaketool), for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.

## :mag: Overview

MeSS takes NCBI taxa or local genome assemblies as input and generates either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS optionally generates BAM alignment files and taxonomic and sequence abundances in [CAMI format](https://github.com/bioboxes/rfc/blob/master/data-format/profiling.mkd).

``` mermaid
```mermaid
%%{init: {'theme':'forest'}}%%
flowchart LR
input["samples.tsv
or
input["samples.tsv
or
samples/*.tsv"] --> taxons

subgraph genome_download["genome download"]
dlchoice{download ?}
taxons["taxons or
accessions"] --> dlchoice
dlchoice -->|yes| assembly_finder
dlchoice -->|no| fasta
dlchoice -->|no| fasta
assembly_finder --> fasta
end
style genome_download color:#15161a

input --> distchoice
subgraph community_design["`**community design**`"]
distchoice{draw distribution ?}
distchoice -->|yes| dist["distribution
distchoice -->|yes| dist["distribution
(lognormal, even)"]
dist --> abundances
distchoice -->|no| reads
@@ -48,16 +47,16 @@ distchoice -->|no| abundances
depth["coverage depth"]
reads --> depth
bases --> depth
abundances["abundances
(sequence, taxonomic)"] --> depth
abundances["abundances
(sequence, taxonomic)"] --> depth
end
style community_design color:#15161a
style community_design color:#15161a

fasta --> simulator
depth --> simulator

simulator["read simulator
simulator["read simulator
(art_illumina, pbsim3...)"]
simulator --> bam
simulator --> fastq
@@ -70,98 +69,138 @@ class genome_download blue

class community_design red
```
## :books: Documentation

## :books: Documentation

More details can be found in the [documentation](https://metagenlab.github.io/MeSS/)

## :zap: Quick start
## :zap: Quick start

### :gear: Installation
Mamba

- Conda ([Miniforge](https://github.com/conda-forge/miniforge))

```sh
mamba create -n mess mess
conda create -n mess mess
```

Docker
- Docker

```sh
docker pull ghcr.io/metagenlab/mess:latest
```

From source
- From source

```sh
git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS
```

### :page_facing_up: Usage

#### :arrow_right: Input

Let's simulate two metagenomic samples with the following taxa and read counts in `samples.tsv`:
| sample | taxon | reads |
| --- | --- | --- |
| sample1 | 487 | 174840 |
| sample1 | 727 | 90679 |
| sample1 | 729 | 13129 |
| sample2 | 28132 | 147863 |
| sample2 | 199 | 147545 |
| sample2 | 729 | 131300 |
| sample | taxon | reads |
| --- | --- | --- |
| sample1 | 487 | 174840 |
| sample1 | 727 | 90679 |
| sample1 | 729 | 13129 |
| sample2 | 28132 | 147863 |
| sample2 | 199 | 147545 |
| sample2 | 729 | 131300 |
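
As a rough illustration of how this input table is laid out, the sketch below parses a tab-separated `samples.tsv` with Python's standard `csv` module and totals the requested reads per sample. It is not part of MeSS: the function name `reads_per_sample` is hypothetical, and the table is inlined only so the example runs on its own.

```python
import csv
import io
from collections import defaultdict

# Inlined copy of the samples.tsv table above (tab-separated), for illustration only.
SAMPLES_TSV = (
    "sample\ttaxon\treads\n"
    "sample1\t487\t174840\n"
    "sample1\t727\t90679\n"
    "sample1\t729\t13129\n"
    "sample2\t28132\t147863\n"
    "sample2\t199\t147545\n"
    "sample2\t729\t131300\n"
)


def reads_per_sample(handle):
    """Sum the `reads` column for each distinct value of the `sample` column.

    Hypothetical helper, not a MeSS function.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["sample"]] += int(row["reads"])
    return dict(totals)


if __name__ == "__main__":
    for sample, total in reads_per_sample(io.StringIO(SAMPLES_TSV)).items():
        print(sample, total)
```

To check a real file, the same function can be given an open file handle, for example `reads_per_sample(open("samples.tsv", newline=""))`.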

#### :rocket: Command
Let's run MeSS (using apptainer as the software deployment method) !

```sh
mess run -i samples.tsv --sdm apptainer
mess run -i samples.tsv
```

> [!IMPORTANT]
> [Apptainer](https://apptainer.org/) is the default and recommended dependency deployment method for maximum reproducibility. To use conda instead, specify `--sdm conda`.

#### :card_index_dividers: Outputs

- Downloaded genomes in `mess_out/assembly_finder/download`

```sh
📦mess_out
┣ 📂assembly_finder
┃ ┣ 📂download
┃ ┃ ┣ 📂GCF_000144405.1
┃ ┃ ┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_001298465.1
┃ ┃ ┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_016127215.1
┃ ┃ ┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_020736045.1
┃ ┃ ┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┃ ┃ ┣ 📂GCF_022869645.1
┃ ┃ ┃ ┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz
┃ ┃ ┗ 📜.snakemake_timestamp
┣ 📂fastq
┃ ┣ 📜sample1_R1.fq.gz
┃ ┣ 📜sample1_R2.fq.gz
┃ ┣ 📜sample2_R1.fq.gz
┃ ┗ 📜sample2_R2.fq.gz
┣ 📜config.yaml
┣ 📜coverages.tsv
┗ 📜mess.log
┣ 📂GCF_000144405.1
┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┣ 📂GCF_001298465.1
┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┣ 📂GCF_016127215.1
┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┣ 📂GCF_020736045.1
┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┣ 📂GCF_022869645.1
┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz
```

Outputs are described in more detail [here](https://metagenlab.github.io/MeSS/guide/output/)
- Simulated reads in `mess_out/fastq`

```sh
┣ 📜sample1_R1.fq.gz
┣ 📜sample1_R2.fq.gz
┣ 📜sample2_R1.fq.gz
┗ 📜sample2_R2.fq.gz
```

> [!TIP]
> By default, `mess` outputs paired Illumina reads with the Hiseq25k error profile. Other outputs and error profiles are described [here](https://metagenlab.github.io/MeSS/guide/output/) and [here](https://metagenlab.github.io/MeSS/tutorials/seqtech/).

#### :bar_chart: Resources usage

On average, using `samples.tsv` (see [table](#arrow_right-input)), MeSS runs in under 2min, while using around 1.8GB of physical RAM
Using [`samples.tsv`](#arrow_right-input), `mess` runs in under 2 min while using around 1.8 GB of physical RAM.

| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| ------- | --------- | --------- | -------- | --------- | ---- | ----------------------- | -------- | -------- | ------ | -------- | --------- | ------ | ------ |
| 1 | fe/03c2bc | 62286 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:41:15.820 | 1m 50s | 1m 50s | 111.5% | 1.8 GB | 9 GB | 3.5 GB | 2.4 GB |
| 1 | ff/0d03b1 | 73355 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:55:12.903 | 1m 52s | 1m 52s | 112.6% | 1.7 GB | 8.8 GB | 3.5 GB | 2.4 GB |
| 1 | 07/d352bf | 83576 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:57:30.600 | 1m 50s | 1m 50s | 113.2% | 1.7 GB | 8.9 GB | 3.5 GB | 2.4 GB |

> [!NOTE]
> Average resources usage measured 3 times with one CPU (using [nextflow](https://github.com/nextflow-io/nextflow), excluding dependency deployment time).

More details in the [resource usage documentation](https://metagenlab.github.io/MeSS/benchmarks/resource-usage/)

## :fire: Features

> [!NOTE]
> Average resources usage measured 3 times with one CPU (within a [nextflow](https://github.com/nextflow-io/nextflow) process)
### :dna: Multi sequencing technology choice

- Illumina

```sh
mess test --tech illumina
```

- Nanopore

```sh
mess test --tech nanopore
```

> Resource usage was measured excluding dependency deployment time (conda env creation or container pulling)
- PacBio

```sh
mess test --tech pacbio
```

### :white_check_mark: BAMs and taxonomic profiles

More details on resource usage in the [documentation](https://metagenlab.github.io/MeSS/benchmarks/resource-usage/)
```sh
mess test --bam
```

### :o: Circular genomes

```sh
mess test --rotate 3
```


## :sos: Help

More details on command-line options in the [doc](https://metagenlab.github.io/MeSS/commands/)
All command-line options are described [here](https://metagenlab.github.io/MeSS/commands/)

![`mess -h`](docs/images/mess-help.svg)
6 changes: 3 additions & 3 deletions mess/util.py
@@ -193,8 +193,8 @@ def common_options(func):
),
click.option(
"--sdm",
type=click.Choice(["conda", "apptainer"]),
default="conda",
type=click.Choice(["apptainer", "conda"]),
default="apptainer",
help="Software deployment method",
show_default=True,
),
@@ -249,7 +249,7 @@ def sim_options(func):
),
click.option(
"--rotate",
help="Number of times to shuffle genome start for circular assemblies (2 or more for circular)",
help="Number of times to shuffle genome start for circular assemblies (2 or more to circularize)",
type=int,
default=1,
show_default=True,
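
The help text above describes `--rotate` as shuffling the start position of circular assemblies. A minimal sketch of that idea follows, assuming a rotation simply moves a prefix of the sequence to its end. `rotate_sequence` and `random_rotations` are hypothetical helpers written for illustration; they are not functions from MeSS, and the workflow may implement rotation differently.

```python
import random


def rotate_sequence(seq: str, start: int) -> str:
    """Return `seq` rewritten to begin at index `start`, wrapping around the end.

    Hypothetical helper, not a MeSS function.
    """
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def random_rotations(seq: str, n: int, seed: int = 0) -> list[str]:
    """Draw `n` random start positions and return one rotated copy per draw."""
    if not seq:
        return [seq] * n
    rng = random.Random(seed)  # seeded so repeated runs give the same rotations
    return [rotate_sequence(seq, rng.randrange(len(seq))) for _ in range(n)]


if __name__ == "__main__":
    for rotated in random_rotations("ACGTTGCA", 3):
        print(rotated)
```

Every rotated copy keeps the same length and base composition as the input; only the position at which the linear representation starts changes.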
3 changes: 3 additions & 0 deletions mess/workflow/Snakefile
@@ -102,6 +102,9 @@ include: os.path.join("rules", "processing", "coverages.smk")

# fasta processing options
ROTATE = config.args.rotate
CIRCULAR = is_circular()


include: os.path.join("rules", "processing", "fastas.smk")

