updated docs

metagenlab · May 15, 2024 · d642f51 · d642f51
1 parent f181561
commit d642f51
Show file tree

Hide file tree

Showing 13 changed files with 419 additions and 17 deletions.
diff --git a/docs/index.md b/docs/index.md
@@ -31,7 +31,7 @@ The Metagenomic Sequence Simulator (MeSS) is a [Snakemake](https://github.com/sn
 
     Simulate sequencing runs for both long and short reads
 
-    [:octicons-arrow-right-24: Tutorials](tutorials/index.md#sequencing-technologies)
+    [:octicons-arrow-right-24: Tutorials](tutorials/seqtech/index.md)
 
 -   :material-speedometer:{ .lg .middle } __Fast and scalable__
 

diff --git a/docs/tutorials/illumina.md b/docs/tutorials/illumina.md
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
@@ -1,15 +1,10 @@
 # Tutorials
 Collection of tutorials to showcase `mess` usage examples.
 
-## Taxonomic profiles as input
-`mess` uses taxonomic IDs and sequence abundances as input for read simulation, making it compatible with most popular taxonomic profilers. In this tutorial, we used the output of [bracken](https://github.com/jenniferlu717/Bracken) as input for `mess`.
+## [Taxonomic profiles as input](profilers/index.md)
 
-## Sequencing technologies
-`mess` supports long and short read simulations, and below are some tutorials detailing the parameters to simulate illumina ,Nanopore or PacBio runs. 
-### Illumina
+## [Sequencing technologies](seqtech/index.md)
 
-### Nanopore
 
-### PacBio
 
 
diff --git a/docs/tutorials/nanopore.md b/docs/tutorials/nanopore.md
diff --git a/docs/tutorials/pacbio.md b/docs/tutorials/pacbio.md
diff --git a/docs/tutorials/bracken.md → docs/tutorials/profilers/bracken.md b/docs/tutorials/bracken.md → docs/tutorials/profilers/bracken.md
@@ -10,7 +10,7 @@ curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/
 
 ### Format table
 
-We can use [csvtk](https://github.com/shenwei356/csvtk) to extract and rename the taxonomy_id and new_est_reads columns required for our `mess` input table.
+We can use [csvtk](https://github.com/shenwei356/csvtk) to extract the required columns (taxonomy_id and new_est_reads) for the `mess` input table.
 
 ```sh
 csvtk cut -t -f taxonomy_id,new_est_reads bracken.txt \
@@ -68,7 +68,7 @@ cat updated.tsv \
 For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.
 
 ```sh
-csvtk head -t -n 10 clean.tsv 
+csvtk head -t -n 10 clean.tsv > subsample.tsv
 ```
 
 ??? example "subsample.tsv"
@@ -92,6 +92,8 @@ Now that we cleaned up and subsampled bracken's taxonomic profile we can use it
 By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md).
 )
 For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.
+
+Using the `--bam` flag yields an alignment file and a ground truth taxonomic profile. 
 ```sh
 mess run -i subsample.tsv --bam --threads 10
 ```
@@ -106,6 +108,48 @@ mess_out/fastq/subsample_R1.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      1
 mess_out/fastq/subsample_R2.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   97.31   89.64    26.51  50.76
 ```
 
-- [x] Total read count correspond to the one requested in the input (8327248 - 8326992 = 256 reads difference) 
+- [x] Total read count correspond to the one requested in the input (256 reads difference) 
 
 ### Taxonomic profile
+Bracken composition
+
+| taxon  | reads   | seq_abundance        |
+| ------ | ------- | -------------------- |
+| 562    | 1449813 | 0.17410469821482438  |
+| 1396   | 1288824 | 0.15477190063271803  |
+| 1280   | 1212241 | 0.14557522485219607  |
+| 1076   | 824296  | 0.09898780485461704  |
+| 83332  | 699935  | 0.08405357928573762  |
+| 148447 | 656706  | 0.07886230841209485  |
+| 132919 | 622314  | 0.07473225248005103  |
+| 272131 | 590648  | 0.07092955559868037  |
+| 303    | 515978  | 0.06196260757455525  |
+| 32046  | 466493  | 0.056020068094525345 |
+
+
+Simulated composition 
+
+I calculated sequence abundances from true read counts in `logs/coverage/subsample.tsv`, and added each genomes taxid at species level (with `taxonkit reformat --show-lineage-taxids`).
+
+| taxon  | reads   | seq_abundance      |
+| ------ | ------- | ------------------ |
+| 562    | 1449799 | 0.1741083695048584 |
+| 1396   | 1288786 | 0.1547720953736955 |
+| 1280   | 1212160 | 0.1455699729265982 |
+| 1076   | 824275  | 0.0989883261566721 |
+| 1773   | 699930  | 0.0840555629211604 |
+| 148447 | 656686  | 0.0788623310794582 |
+| 132919 | 622291  | 0.0747317879013213 |
+| 272131 | 590622  | 0.0709286138379861 |
+| 303    | 515970  | 0.0619635517843658 |
+| 32046  | 466473  | 0.0560193885138835 |
+
+I used a wilcoxon test to compare abundances
+```python
+>>> from scipy.stats import wilcoxon
+>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"])
+>>> wilcoxon(delta)
+WilcoxonResult(statistic=24.0, pvalue=0.76953125)
+```
+
+- [x] Non significant difference in sequence abundance between the simulated and real sample (mean delta = ${1.47}^{-06}$, $p=0.77$) 
diff --git a/docs/tutorials/profilers/index.md b/docs/tutorials/profilers/index.md
@@ -0,0 +1,12 @@
+# Taxonomic profiles as input
+`mess` uses taxonomic IDs and read counts as input for read simulation, making it compatible with most popular taxonomic profilers like:
+
+* [Kraken2](kraken2.md)
+* [Bracken](bracken.md) 
+
+## Requirements
+Install:
+
+* [csvtk (>=0.30.0)](https://github.com/shenwei356/csvtk?tab=readme-ov-file#installation)
+* [taxonkit](https://github.com/shenwei356/taxonkit?tab=readme-ov-file#installation)
+* [taxpasta](https://taxpasta.readthedocs.io/en/latest/#install)
diff --git a/docs/tutorials/profilers/kraken2.md b/docs/tutorials/profilers/kraken2.md
@@ -0,0 +1,167 @@
+## Input
+
+### Download an example
+
+We can use a kraken report [from bracken's sample data](https://github.com/jenniferlu717/Bracken/blob/master/sample_data/sample_output_bracken.report)
+
+```sh
+curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_bracken.report -o kreport.txt
+```
+
+### Format report
+
+We can use [taxpasta](https://github.com/taxprofiler/taxpasta) to convert kraken reports (and others) to a format compatible with `mess`
+
+```sh
+taxpasta standardise --profiler kraken2 --output sample.tsv kreport.txt
+```
+### Sort and filter taxids
+
+Sort by descending read count, filter zero counts and unclassified taxids with [csvtk](https://github.com/shenwei356/csvtk)
+
+```sh
+cat sample.tsv \
+    | csvtk filter2 -t -f '$taxonomy_id > 0 && $count > 0' \
+    | csvtk rename -t -f 1,2 -n taxon,reads \
+    | csvtk sort -t -k reads:nr > filtered.tsv
+```
+
+### Update taxids
+
+Update taxids and add their lineage with [taxonkit](https://github.com/shenwei356/taxonkit)
+
+```sh
+csvtk cut -t -U -f taxon filtered.tsv \
+    | taxonkit lineage --show-status-code 2> tax.log \
+    | csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv
+```
+Checking the tax.log (as of may 2024)
+
+```sh
+$ cat tax.log
+17:07:59.845 [WARN] taxid 86662 was merged into 1405
+17:07:59.845 [WARN] taxid 62928 was merged into 418699
+17:07:59.845 [WARN] taxid 1172 was merged into 264691
+```
+
+Join updated taxids with read counts
+
+```sh
+csvtk join -t filtered.tsv lineage.tsv -f 1 \
+    | csvtk cut -t -f updated_taxid,reads \
+    | csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv
+```
+### Replace unclassified taxids
+
+!!! failure
+    Some taxids have no genome data available !
+
+    For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively. 
+
+    We can replace unclassified species with other species from the same genus. 
+
+    We also add the nb column to tell mess to download one genome per taxon.
+
+```sh
+cat updated.tsv \
+    | csvtk replace -t -f taxon -p "60481" -r "211586" \
+    | csvtk replace -t -f taxon -p "164757" -r "83332" \
+    | csvtk replace -t -f taxon -p "189918" -r "557599" \
+    > clean.tsv
+```
+
+### Subsample
+
+For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.
+
+```sh
+csvtk head -t -n 10 clean.tsv > subsample.tsv
+```
+
+??? example "subsample.tsv"
+    | taxon  | reads   |
+    | ------ | ------- |
+    | 562    | 1449813 |
+    | 1396   | 1288824 |
+    | 1280   | 1212241 |
+    | 1076   | 824296  |
+    | 83332  | 699935  |
+    | 148447 | 656706  |
+    | 132919 | 622314  |
+    | 272131 | 590648  |
+    | 303    | 515978  |
+    | 32046  | 466493  |
+
+## Simulate
+
+Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for `mess run`.
+
+By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md).
+)
+For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.
+
+Using the `--bam` flag yields an alignment file and a ground truth taxonomic profile. 
+```sh
+mess run -i subsample.tsv --bam --threads 10
+```
+
+## Output
+### fastq
+```sh
+$ seqkit stats mess_out/fastq/*.fq.gz --all
+processed files:  2 / 2 [======================================] ETA: 0s. done
+file                               format  type   num_seqs        sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  AvgQual  GC(%)
+mess_out/fastq/subsample_R1.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   98.01   91.67    27.79  50.76
+mess_out/fastq/subsample_R2.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   97.31   89.64    26.51  50.76
+```
+
+- [x] Total read count correspond to the one requested in the input (256 reads difference) 
+
+### Taxonomic profile
+Kraken2 composition
+
+| taxon  | reads   | seq_abundance        |
+| ------ | ------- | -------------------- |
+| 562    | 1449813 | 0.17410469821482438  |
+| 1396   | 1288824 | 0.15477190063271803  |
+| 1280   | 1212241 | 0.14557522485219607  |
+| 1076   | 824296  | 0.09898780485461704  |
+| 83332  | 699935  | 0.08405357928573762  |
+| 148447 | 656706  | 0.07886230841209485  |
+| 132919 | 622314  | 0.07473225248005103  |
+| 272131 | 590648  | 0.07092955559868037  |
+| 303    | 515978  | 0.06196260757455525  |
+| 32046  | 466493  | 0.056020068094525345 |
+
+
+Simulated composition 
+
+I calculated sequence abundances from true read counts in `logs/coverage/subsample.tsv`, and added each genomes taxid at species level (with `taxonkit reformat --show-lineage-taxids`).
+
+| taxon  | reads   | seq_abundance      |
+| ------ | ------- | ------------------ |
+| 562    | 1449799 | 0.1741083695048584 |
+| 1396   | 1288786 | 0.1547720953736955 |
+| 1280   | 1212160 | 0.1455699729265982 |
+| 1076   | 824275  | 0.0989883261566721 |
+| 1773   | 699930  | 0.0840555629211604 |
+| 148447 | 656686  | 0.0788623310794582 |
+| 132919 | 622291  | 0.0747317879013213 |
+| 272131 | 590622  | 0.0709286138379861 |
+| 303    | 515970  | 0.0619635517843658 |
+| 32046  | 466473  | 0.0560193885138835 |
+
+I used a wilcoxon test to compare abundances
+```python
+>>> from scipy.stats import wilcoxon
+>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"])
+>>> wilcoxon(delta)
+WilcoxonResult(statistic=24.0, pvalue=0.76953125)
+```
+
+- [x] Non significant difference in sequence abundance between the simulated and real sample (mean delta = ${1.47}^{-06}$, $p=0.77$) 
+
+
+
+
+
diff --git a/docs/tutorials/seqtech/illumina.md b/docs/tutorials/seqtech/illumina.md
@@ -0,0 +1,88 @@
+# ART
+`art_illumina` is the default illumina read simulator in `mess` and can generate reads with built-in or custom error profiles.
+
+## Built-in error profile
+
+from `art_illumina` help:
+```sh
+GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
+HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)
+HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
+MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)
+```
+By default, `mess` uses the Hiseq 2500 error model to generate paired 150 bp read lengths with a mean and standard deviation fragment length of 200 and 10 respectively.
+
+Built-in error profile, mean read length, mean and standard deviation fragment length can be changed with `--error`, `--mean-length`, `--frag-len` and `--frag-sd` respectively.
+
+
+### HiSeq 2000
+
+```sh
+mess test \
+    --tech illumina \
+    --error HS20 \
+    --mean-len 100 \
+    --frag-len 200 \
+    --frag-sd 10 \
+    --output hs20
+```
+
+### Miseq v3
+
+```sh
+mess test \
+    --tech illumina \
+    --error MSv3 \
+    --mean-len 250 \
+    --frag-len 500 \
+    --frag-sd 10 \
+    --output msv3
+```
+
+## Custom error profiles
+If you want to use updated error profiles, your best bet is to use custom profiles, which can be generated using [art_illumina_profiles](https://github.com/hzi-bifo/art_illumina_profiles). 
+
+!!! tip
+    `art_illumina_profiles` needs alignments from reference data like [Genome in a bottle (GIAB)](https://www.nist.gov/programs-projects/genome-bottle). You can find bam and bam.bai in [NCBI's reference samples ftp ](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/)
+
+
+
+### GIAB hiseq 300x
+
+Download forward and reverse error profiles
+```sh
+curl -O https://raw.githubusercontent.com/hzi-bifo/art_illumina_profiles/master/example_profile/RMNISTHS_30xdownsample.1.txt \
+    -O https://raw.githubusercontent.com/hzi-bifo/art_illumina_profiles/master/example_profile/RMNISTHS_30xdownsample.2.txt
+```
+Run test command with custom profile (maximum read length is 148bp)
+
+```sh
+mess test \
+    --tech illumina \
+    --custom-err RMNISTHS_30xdownsample \
+    --mean-len 148 \
+    --frag-len 300 \
+    --frag-sd 10 \
+    --output giab_hs300x
+```
+
+### MBARC-26
+`CAMISIM` also provides custom pre-built error profiles like the [MBARC-26 error profile](https://www.nature.com/articles/sdata201681), which can be used by `mess`.
+
+Download forward and reverse error profiles
+```sh
+curl -O https://raw.githubusercontent.com/CAMI-challenge/CAMISIM/master/tools/art_illumina-2.3.6/profiles/ART_MBARC-26_HiSeq_R1.txt \
+    -O https://raw.githubusercontent.com/CAMI-challenge/CAMISIM/master/tools/art_illumina-2.3.6/profiles/ART_MBARC-26_HiSeq_R2.txt
+```
+
+Run test command with custom profile
+
+```sh
+mess test \
+    --tech illumina \
+    --custom-err ART_MBARC-26_HiSeq_R \
+    --mean-len 150 \
+    --frag-len 270 \
+    --frag-sd 27 \
+    --output mbarc26
+```
diff --git a/docs/tutorials/seqtech/index.md b/docs/tutorials/seqtech/index.md
@@ -0,0 +1,8 @@
+# Sequencing technologies
+Below are some tutorials detailing the parameters to use for short and long read simulation 
+
+## [Illumina](illumina.md)
+
+## [Nanopore](nanopore.md)
+
+## [PacBio](pacbio.md)