-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
13 changed files
with
419 additions
and
17 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,10 @@ | ||
# Tutorials | ||
Collection of tutorials to showcase `mess` usage examples. | ||
|
||
## Taxonomic profiles as input | ||
`mess` uses taxonomic IDs and sequence abundances as input for read simulation, making it compatible with most popular taxonomic profilers. In this tutorial, we used the output of [bracken](https://github.com/jenniferlu717/Bracken) as input for `mess`. | ||
## [Taxonomic profiles as input](profilers/index.md) | ||
|
||
## Sequencing technologies | ||
`mess` supports long and short read simulations, and below are some tutorials detailing the parameters to simulate illumina ,Nanopore or PacBio runs. | ||
### Illumina | ||
## [Sequencing technologies](seqtech/index.md) | ||
|
||
### Nanopore | ||
|
||
### PacBio | ||
|
||
|
Empty file.
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# Taxonomic profiles as input | ||
`mess` uses taxonomic IDs and read counts as input for read simulation, making it compatible with most popular taxonomic profilers like: | ||
|
||
* [Kraken2](kraken2.md) | ||
* [Bracken](bracken.md) | ||
|
||
## Requirements | ||
Install: | ||
|
||
* [csvtk (>=0.30.0)](https://github.com/shenwei356/csvtk?tab=readme-ov-file#installation) | ||
* [taxonkit](https://github.com/shenwei356/taxonkit?tab=readme-ov-file#installation) | ||
* [taxpasta](https://taxpasta.readthedocs.io/en/latest/#install) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
## Input | ||
|
||
### Download an example | ||
|
||
We can use a kraken report [from bracken's sample data](https://github.com/jenniferlu717/Bracken/blob/master/sample_data/sample_output_bracken.report) | ||
|
||
```sh | ||
curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_bracken.report -o kreport.txt | ||
``` | ||
|
||
### Format report | ||
|
||
We can use [taxpasta](https://github.com/taxprofiler/taxpasta) to convert kraken reports (and others) to a format compatible with `mess` | ||
|
||
```sh | ||
taxpasta standardise --profiler kraken2 --output sample.tsv kreport.txt | ||
``` | ||
### Sort and filter taxids | ||
|
||
Sort by descending read count, filter zero counts and unclassified taxids with [csvtk](https://github.com/shenwei356/csvtk) | ||
|
||
```sh | ||
cat sample.tsv \ | ||
| csvtk filter2 -t -f '$taxonomy_id > 0 && $count > 0' \ | ||
| csvtk rename -t -f 1,2 -n taxon,reads \ | ||
| csvtk sort -t -k reads:nr > filtered.tsv | ||
``` | ||
|
||
### Update taxids | ||
|
||
Update taxids and add their lineage with [taxonkit](https://github.com/shenwei356/taxonkit) | ||
|
||
```sh | ||
csvtk cut -t -U -f taxon filtered.tsv \ | ||
| taxonkit lineage --show-status-code 2> tax.log \ | ||
| csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv | ||
``` | ||
Checking the tax.log (as of may 2024) | ||
|
||
```sh | ||
$ cat tax.log | ||
17:07:59.845 [WARN] taxid 86662 was merged into 1405 | ||
17:07:59.845 [WARN] taxid 62928 was merged into 418699 | ||
17:07:59.845 [WARN] taxid 1172 was merged into 264691 | ||
``` | ||
|
||
Join updated taxids with read counts | ||
|
||
```sh | ||
csvtk join -t filtered.tsv lineage.tsv -f 1 \ | ||
| csvtk cut -t -f updated_taxid,reads \ | ||
| csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv | ||
``` | ||
### Replace unclassified taxids | ||
|
||
!!! failure | ||
Some taxids have no genome data available ! | ||
|
||
For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively. | ||
|
||
We can replace unclassified species with other species from the same genus. | ||
|
||
We also add the nb column to tell mess to download one genome per taxon. | ||
|
||
```sh | ||
cat updated.tsv \ | ||
| csvtk replace -t -f taxon -p "60481" -r "211586" \ | ||
| csvtk replace -t -f taxon -p "164757" -r "83332" \ | ||
| csvtk replace -t -f taxon -p "189918" -r "557599" \ | ||
> clean.tsv | ||
``` | ||
|
||
### Subsample | ||
|
||
For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation. | ||
|
||
```sh | ||
csvtk head -t -n 10 clean.tsv > subsample.tsv | ||
``` | ||
|
||
??? example "subsample.tsv" | ||
| taxon | reads | | ||
| ------ | ------- | | ||
| 562 | 1449813 | | ||
| 1396 | 1288824 | | ||
| 1280 | 1212241 | | ||
| 1076 | 824296 | | ||
| 83332 | 699935 | | ||
| 148447 | 656706 | | ||
| 132919 | 622314 | | ||
| 272131 | 590648 | | ||
| 303 | 515978 | | ||
| 32046 | 466493 | | ||
|
||
## Simulate | ||
|
||
Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for `mess run`. | ||
|
||
By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md). | ||
) | ||
For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile. | ||
|
||
Using the `--bam` flag yields an alignment file and a ground truth taxonomic profile. | ||
```sh | ||
mess run -i subsample.tsv --bam --threads 10 | ||
``` | ||
|
||
## Output | ||
### fastq | ||
```sh | ||
$ seqkit stats mess_out/fastq/*.fq.gz --all | ||
processed files: 2 / 2 [======================================] ETA: 0s. done | ||
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) AvgQual GC(%) | ||
mess_out/fastq/subsample_R1.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 98.01 91.67 27.79 50.76 | ||
mess_out/fastq/subsample_R2.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 97.31 89.64 26.51 50.76 | ||
``` | ||
|
||
- [x] Total read count correspond to the one requested in the input (256 reads difference) | ||
|
||
### Taxonomic profile | ||
Kraken2 composition | ||
|
||
| taxon | reads | seq_abundance | | ||
| ------ | ------- | -------------------- | | ||
| 562 | 1449813 | 0.17410469821482438 | | ||
| 1396 | 1288824 | 0.15477190063271803 | | ||
| 1280 | 1212241 | 0.14557522485219607 | | ||
| 1076 | 824296 | 0.09898780485461704 | | ||
| 83332 | 699935 | 0.08405357928573762 | | ||
| 148447 | 656706 | 0.07886230841209485 | | ||
| 132919 | 622314 | 0.07473225248005103 | | ||
| 272131 | 590648 | 0.07092955559868037 | | ||
| 303 | 515978 | 0.06196260757455525 | | ||
| 32046 | 466493 | 0.056020068094525345 | | ||
|
||
|
||
Simulated composition | ||
|
||
I calculated sequence abundances from true read counts in `logs/coverage/subsample.tsv`, and added each genomes taxid at species level (with `taxonkit reformat --show-lineage-taxids`). | ||
|
||
| taxon | reads | seq_abundance | | ||
| ------ | ------- | ------------------ | | ||
| 562 | 1449799 | 0.1741083695048584 | | ||
| 1396 | 1288786 | 0.1547720953736955 | | ||
| 1280 | 1212160 | 0.1455699729265982 | | ||
| 1076 | 824275 | 0.0989883261566721 | | ||
| 1773 | 699930 | 0.0840555629211604 | | ||
| 148447 | 656686 | 0.0788623310794582 | | ||
| 132919 | 622291 | 0.0747317879013213 | | ||
| 272131 | 590622 | 0.0709286138379861 | | ||
| 303 | 515970 | 0.0619635517843658 | | ||
| 32046 | 466473 | 0.0560193885138835 | | ||
|
||
I used a wilcoxon test to compare abundances | ||
```python | ||
>>> from scipy.stats import wilcoxon | ||
>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"]) | ||
>>> wilcoxon(delta) | ||
WilcoxonResult(statistic=24.0, pvalue=0.76953125) | ||
``` | ||
|
||
- [x] Non significant difference in sequence abundance between the simulated and real sample (mean delta = ${1.47}^{-06}$, $p=0.77$) | ||
|
||
|
||
|
||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
# ART | ||
`art_illumina` is the default illumina read simulator in `mess` and can generate reads with built-in or custom error profiles. | ||
|
||
## Built-in error profile | ||
|
||
from `art_illumina` help: | ||
```sh | ||
GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp) | ||
HS10 - HiSeq 1000 (100bp), HS20 - HiSeq 2000 (100bp), HS25 - HiSeq 2500 (125bp, 150bp) | ||
HSXn - HiSeqX PCR free (150bp), HSXt - HiSeqX TruSeq (150bp), MinS - MiniSeq TruSeq (50bp) | ||
MSv1 - MiSeq v1 (250bp), MSv3 - MiSeq v3 (250bp), NS50 - NextSeq500 v2 (75bp) | ||
``` | ||
By default, `mess` uses the Hiseq 2500 error model to generate paired 150 bp read lengths with a mean and standard deviation fragment length of 200 and 10 respectively. | ||
|
||
Built-in error profile, mean read length, mean and standard deviation fragment length can be changed with `--error`, `--mean-length`, `--frag-len` and `--frag-sd` respectively. | ||
|
||
|
||
### HiSeq 2000 | ||
|
||
```sh | ||
mess test \ | ||
--tech illumina \ | ||
--error HS20 \ | ||
--mean-len 100 \ | ||
--frag-len 200 \ | ||
--frag-sd 10 \ | ||
--output hs20 | ||
``` | ||
|
||
### Miseq v3 | ||
|
||
```sh | ||
mess test \ | ||
--tech illumina \ | ||
--error MSv3 \ | ||
--mean-len 250 \ | ||
--frag-len 500 \ | ||
--frag-sd 10 \ | ||
--output msv3 | ||
``` | ||
|
||
## Custom error profiles | ||
If you want to use updated error profiles, your best bet is to use custom profiles, which can be generated using [art_illumina_profiles](https://github.com/hzi-bifo/art_illumina_profiles). | ||
|
||
!!! tip | ||
`art_illumina_profiles` needs alignments from reference data like [Genome in a bottle (GIAB)](https://www.nist.gov/programs-projects/genome-bottle). You can find bam and bam.bai in [NCBI's reference samples ftp ](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/) | ||
|
||
|
||
|
||
### GIAB hiseq 300x | ||
|
||
Download forward and reverse error profiles | ||
```sh | ||
curl -O https://raw.githubusercontent.com/hzi-bifo/art_illumina_profiles/master/example_profile/RMNISTHS_30xdownsample.1.txt \ | ||
-O https://raw.githubusercontent.com/hzi-bifo/art_illumina_profiles/master/example_profile/RMNISTHS_30xdownsample.2.txt | ||
``` | ||
Run test command with custom profile (maximum read length is 148bp) | ||
|
||
```sh | ||
mess test \ | ||
--tech illumina \ | ||
--custom-err RMNISTHS_30xdownsample \ | ||
--mean-len 148 \ | ||
--frag-len 300 \ | ||
--frag-sd 10 \ | ||
--output giab_hs300x | ||
``` | ||
|
||
### MBARC-26 | ||
`CAMISIM` also provides custom pre-built error profiles like the [MBARC-26 error profile](https://www.nature.com/articles/sdata201681), which can be used by `mess`. | ||
|
||
Download forward and reverse error profiles | ||
```sh | ||
curl -O https://raw.githubusercontent.com/CAMI-challenge/CAMISIM/master/tools/art_illumina-2.3.6/profiles/ART_MBARC-26_HiSeq_R1.txt \ | ||
-O https://raw.githubusercontent.com/CAMI-challenge/CAMISIM/master/tools/art_illumina-2.3.6/profiles/ART_MBARC-26_HiSeq_R2.txt | ||
``` | ||
|
||
Run test command with custom profile | ||
|
||
```sh | ||
mess test \ | ||
--tech illumina \ | ||
--custom-err ART_MBARC-26_HiSeq_R \ | ||
--mean-len 150 \ | ||
--frag-len 270 \ | ||
--frag-sd 27 \ | ||
--output mbarc26 | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# Sequencing technologies | ||
Below are some tutorials detailing the parameters to use for short and long read simulation | ||
|
||
## [Illumina](illumina.md) | ||
|
||
## [Nanopore](nanopore.md) | ||
|
||
## [PacBio](pacbio.md) |
Oops, something went wrong.