Skip to content

Commit

Permalink
updated docs
Browse files Browse the repository at this point in the history
  • Loading branch information
farchaab committed May 15, 2024
1 parent f181561 commit d642f51
Show file tree
Hide file tree
Showing 13 changed files with 419 additions and 17 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ The Metagenomic Sequence Simulator (MeSS) is a [Snakemake](https://github.com/sn

Simulate sequencing runs for both long and short reads

[:octicons-arrow-right-24: Tutorials](tutorials/index.md#sequencing-technologies)
[:octicons-arrow-right-24: Tutorials](tutorials/seqtech/index.md)

- :material-speedometer:{ .lg .middle } __Fast and scalable__

Expand Down
Empty file removed docs/tutorials/illumina.md
Empty file.
9 changes: 2 additions & 7 deletions docs/tutorials/index.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,10 @@
# Tutorials
Collection of tutorials to showcase `mess` usage examples.

## Taxonomic profiles as input
`mess` uses taxonomic IDs and sequence abundances as input for read simulation, making it compatible with most popular taxonomic profilers. In this tutorial, we used the output of [bracken](https://github.com/jenniferlu717/Bracken) as input for `mess`.
## [Taxonomic profiles as input](profilers/index.md)

## Sequencing technologies
`mess` supports long and short read simulations, and below are some tutorials detailing the parameters to simulate illumina ,Nanopore or PacBio runs.
### Illumina
## [Sequencing technologies](seqtech/index.md)

### Nanopore

### PacBio


Empty file removed docs/tutorials/nanopore.md
Empty file.
Empty file removed docs/tutorials/pacbio.md
Empty file.
50 changes: 47 additions & 3 deletions docs/tutorials/bracken.md → docs/tutorials/profilers/bracken.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/

### Format table

We can use [csvtk](https://github.com/shenwei356/csvtk) to extract and rename the taxonomy_id and new_est_reads columns required for our `mess` input table.
We can use [csvtk](https://github.com/shenwei356/csvtk) to extract the required columns (taxonomy_id and new_est_reads) for the `mess` input table.

```sh
csvtk cut -t -f taxonomy_id,new_est_reads bracken.txt \
Expand Down Expand Up @@ -68,7 +68,7 @@ cat updated.tsv \
For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.

```sh
csvtk head -t -n 10 clean.tsv
csvtk head -t -n 10 clean.tsv > subsample.tsv
```

??? example "subsample.tsv"
Expand All @@ -92,6 +92,8 @@ Now that we cleaned up and subsampled bracken's taxonomic profile we can use it
By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md).
)
For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.

Using the `--bam` flag yields an alignment file and a ground truth taxonomic profile.
```sh
mess run -i subsample.tsv --bam --threads 10
```
Expand All @@ -106,6 +108,48 @@ mess_out/fastq/subsample_R1.fq.gz FASTQ DNA 8,326,992 1,249,048,800 1
mess_out/fastq/subsample_R2.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 97.31 89.64 26.51 50.76
```

- [x] Total read count correspond to the one requested in the input (8327248 - 8326992 = 256 reads difference)
- [x] Total read count correspond to the one requested in the input (256 reads difference)

### Taxonomic profile
Bracken composition

| taxon | reads | seq_abundance |
| ------ | ------- | -------------------- |
| 562 | 1449813 | 0.17410469821482438 |
| 1396 | 1288824 | 0.15477190063271803 |
| 1280 | 1212241 | 0.14557522485219607 |
| 1076 | 824296 | 0.09898780485461704 |
| 83332 | 699935 | 0.08405357928573762 |
| 148447 | 656706 | 0.07886230841209485 |
| 132919 | 622314 | 0.07473225248005103 |
| 272131 | 590648 | 0.07092955559868037 |
| 303 | 515978 | 0.06196260757455525 |
| 32046 | 466493 | 0.056020068094525345 |


Simulated composition

I calculated sequence abundances from true read counts in `logs/coverage/subsample.tsv`, and added each genomes taxid at species level (with `taxonkit reformat --show-lineage-taxids`).

| taxon | reads | seq_abundance |
| ------ | ------- | ------------------ |
| 562 | 1449799 | 0.1741083695048584 |
| 1396 | 1288786 | 0.1547720953736955 |
| 1280 | 1212160 | 0.1455699729265982 |
| 1076 | 824275 | 0.0989883261566721 |
| 1773 | 699930 | 0.0840555629211604 |
| 148447 | 656686 | 0.0788623310794582 |
| 132919 | 622291 | 0.0747317879013213 |
| 272131 | 590622 | 0.0709286138379861 |
| 303 | 515970 | 0.0619635517843658 |
| 32046 | 466473 | 0.0560193885138835 |

I used a wilcoxon test to compare abundances
```python
>>> from scipy.stats import wilcoxon
>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"])
>>> wilcoxon(delta)
WilcoxonResult(statistic=24.0, pvalue=0.76953125)
```

- [x] Non significant difference in sequence abundance between the simulated and real sample (mean delta = ${1.47}^{-06}$, $p=0.77$)
12 changes: 12 additions & 0 deletions docs/tutorials/profilers/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Taxonomic profiles as input
`mess` uses taxonomic IDs and read counts as input for read simulation, making it compatible with most popular taxonomic profilers like:

* [Kraken2](kraken2.md)
* [Bracken](bracken.md)

## Requirements
Install:

* [csvtk (>=0.30.0)](https://github.com/shenwei356/csvtk?tab=readme-ov-file#installation)
* [taxonkit](https://github.com/shenwei356/taxonkit?tab=readme-ov-file#installation)
* [taxpasta](https://taxpasta.readthedocs.io/en/latest/#install)
167 changes: 167 additions & 0 deletions docs/tutorials/profilers/kraken2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
## Input

### Download an example

We can use a kraken report [from bracken's sample data](https://github.com/jenniferlu717/Bracken/blob/master/sample_data/sample_output_bracken.report)

```sh
curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_bracken.report -o kreport.txt
```

### Format report

We can use [taxpasta](https://github.com/taxprofiler/taxpasta) to convert kraken reports (and others) to a format compatible with `mess`

```sh
taxpasta standardise --profiler kraken2 --output sample.tsv kreport.txt
```
### Sort and filter taxids

Sort by descending read count, filter zero counts and unclassified taxids with [csvtk](https://github.com/shenwei356/csvtk)

```sh
cat sample.tsv \
| csvtk filter2 -t -f '$taxonomy_id > 0 && $count > 0' \
| csvtk rename -t -f 1,2 -n taxon,reads \
| csvtk sort -t -k reads:nr > filtered.tsv
```

### Update taxids

Update taxids and add their lineage with [taxonkit](https://github.com/shenwei356/taxonkit)

```sh
csvtk cut -t -U -f taxon filtered.tsv \
| taxonkit lineage --show-status-code 2> tax.log \
| csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv
```
Checking the tax.log (as of may 2024)

```sh
$ cat tax.log
17:07:59.845 [WARN] taxid 86662 was merged into 1405
17:07:59.845 [WARN] taxid 62928 was merged into 418699
17:07:59.845 [WARN] taxid 1172 was merged into 264691
```

Join updated taxids with read counts

```sh
csvtk join -t filtered.tsv lineage.tsv -f 1 \
| csvtk cut -t -f updated_taxid,reads \
| csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv
```
### Replace unclassified taxids

!!! failure
Some taxids have no genome data available !

For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively.

We can replace unclassified species with other species from the same genus.

We also add the nb column to tell mess to download one genome per taxon.

```sh
cat updated.tsv \
| csvtk replace -t -f taxon -p "60481" -r "211586" \
| csvtk replace -t -f taxon -p "164757" -r "83332" \
| csvtk replace -t -f taxon -p "189918" -r "557599" \
> clean.tsv
```

### Subsample

For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.

```sh
csvtk head -t -n 10 clean.tsv > subsample.tsv
```

??? example "subsample.tsv"
| taxon | reads |
| ------ | ------- |
| 562 | 1449813 |
| 1396 | 1288824 |
| 1280 | 1212241 |
| 1076 | 824296 |
| 83332 | 699935 |
| 148447 | 656706 |
| 132919 | 622314 |
| 272131 | 590648 |
| 303 | 515978 |
| 32046 | 466493 |

## Simulate

Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for `mess run`.

By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md).
)
For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.

Using the `--bam` flag yields an alignment file and a ground truth taxonomic profile.
```sh
mess run -i subsample.tsv --bam --threads 10
```

## Output
### fastq
```sh
$ seqkit stats mess_out/fastq/*.fq.gz --all
processed files: 2 / 2 [======================================] ETA: 0s. done
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) AvgQual GC(%)
mess_out/fastq/subsample_R1.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 98.01 91.67 27.79 50.76
mess_out/fastq/subsample_R2.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 97.31 89.64 26.51 50.76
```

- [x] Total read count correspond to the one requested in the input (256 reads difference)

### Taxonomic profile
Kraken2 composition

| taxon | reads | seq_abundance |
| ------ | ------- | -------------------- |
| 562 | 1449813 | 0.17410469821482438 |
| 1396 | 1288824 | 0.15477190063271803 |
| 1280 | 1212241 | 0.14557522485219607 |
| 1076 | 824296 | 0.09898780485461704 |
| 83332 | 699935 | 0.08405357928573762 |
| 148447 | 656706 | 0.07886230841209485 |
| 132919 | 622314 | 0.07473225248005103 |
| 272131 | 590648 | 0.07092955559868037 |
| 303 | 515978 | 0.06196260757455525 |
| 32046 | 466493 | 0.056020068094525345 |


Simulated composition

I calculated sequence abundances from true read counts in `logs/coverage/subsample.tsv`, and added each genomes taxid at species level (with `taxonkit reformat --show-lineage-taxids`).

| taxon | reads | seq_abundance |
| ------ | ------- | ------------------ |
| 562 | 1449799 | 0.1741083695048584 |
| 1396 | 1288786 | 0.1547720953736955 |
| 1280 | 1212160 | 0.1455699729265982 |
| 1076 | 824275 | 0.0989883261566721 |
| 1773 | 699930 | 0.0840555629211604 |
| 148447 | 656686 | 0.0788623310794582 |
| 132919 | 622291 | 0.0747317879013213 |
| 272131 | 590622 | 0.0709286138379861 |
| 303 | 515970 | 0.0619635517843658 |
| 32046 | 466473 | 0.0560193885138835 |

I used a wilcoxon test to compare abundances
```python
>>> from scipy.stats import wilcoxon
>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"])
>>> wilcoxon(delta)
WilcoxonResult(statistic=24.0, pvalue=0.76953125)
```

- [x] Non significant difference in sequence abundance between the simulated and real sample (mean delta = ${1.47}^{-06}$, $p=0.77$)





88 changes: 88 additions & 0 deletions docs/tutorials/seqtech/illumina.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# ART
`art_illumina` is the default illumina read simulator in `mess` and can generate reads with built-in or custom error profiles.

## Built-in error profile

from `art_illumina` help:
```sh
GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
HS10 - HiSeq 1000 (100bp), HS20 - HiSeq 2000 (100bp), HS25 - HiSeq 2500 (125bp, 150bp)
HSXn - HiSeqX PCR free (150bp), HSXt - HiSeqX TruSeq (150bp), MinS - MiniSeq TruSeq (50bp)
MSv1 - MiSeq v1 (250bp), MSv3 - MiSeq v3 (250bp), NS50 - NextSeq500 v2 (75bp)
```
By default, `mess` uses the Hiseq 2500 error model to generate paired 150 bp read lengths with a mean and standard deviation fragment length of 200 and 10 respectively.

Built-in error profile, mean read length, mean and standard deviation fragment length can be changed with `--error`, `--mean-length`, `--frag-len` and `--frag-sd` respectively.


### HiSeq 2000

```sh
mess test \
--tech illumina \
--error HS20 \
--mean-len 100 \
--frag-len 200 \
--frag-sd 10 \
--output hs20
```

### Miseq v3

```sh
mess test \
--tech illumina \
--error MSv3 \
--mean-len 250 \
--frag-len 500 \
--frag-sd 10 \
--output msv3
```

## Custom error profiles
If you want to use updated error profiles, your best bet is to use custom profiles, which can be generated using [art_illumina_profiles](https://github.com/hzi-bifo/art_illumina_profiles).

!!! tip
`art_illumina_profiles` needs alignments from reference data like [Genome in a bottle (GIAB)](https://www.nist.gov/programs-projects/genome-bottle). You can find bam and bam.bai in [NCBI's reference samples ftp ](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/)



### GIAB hiseq 300x

Download forward and reverse error profiles
```sh
curl -O https://raw.githubusercontent.com/hzi-bifo/art_illumina_profiles/master/example_profile/RMNISTHS_30xdownsample.1.txt \
-O https://raw.githubusercontent.com/hzi-bifo/art_illumina_profiles/master/example_profile/RMNISTHS_30xdownsample.2.txt
```
Run test command with custom profile (maximum read length is 148bp)

```sh
mess test \
--tech illumina \
--custom-err RMNISTHS_30xdownsample \
--mean-len 148 \
--frag-len 300 \
--frag-sd 10 \
--output giab_hs300x
```

### MBARC-26
`CAMISIM` also provides custom pre-built error profiles like the [MBARC-26 error profile](https://www.nature.com/articles/sdata201681), which can be used by `mess`.

Download forward and reverse error profiles
```sh
curl -O https://raw.githubusercontent.com/CAMI-challenge/CAMISIM/master/tools/art_illumina-2.3.6/profiles/ART_MBARC-26_HiSeq_R1.txt \
-O https://raw.githubusercontent.com/CAMI-challenge/CAMISIM/master/tools/art_illumina-2.3.6/profiles/ART_MBARC-26_HiSeq_R2.txt
```

Run test command with custom profile

```sh
mess test \
--tech illumina \
--custom-err ART_MBARC-26_HiSeq_R \
--mean-len 150 \
--frag-len 270 \
--frag-sd 27 \
--output mbarc26
```
8 changes: 8 additions & 0 deletions docs/tutorials/seqtech/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Sequencing technologies
Below are some tutorials detailing the parameters to use for short and long read simulation

## [Illumina](illumina.md)

## [Nanopore](nanopore.md)

## [PacBio](pacbio.md)
Loading

0 comments on commit d642f51

Please sign in to comment.