From f181561c79bdf8bf79e519e63e4ea8bbb8a2c9cc Mon Sep 17 00:00:00 2001 From: farchaab Date: Wed, 8 May 2024 20:30:50 +0200 Subject: [PATCH] updated docs --- docs/commands/hmp-template.md | 13 +++ docs/getting_started/index.md | 2 +- docs/guide/output.md | 115 +++++++++++++++++-- docs/guide/simulate/fq-processing.md | 0 docs/guide/simulate/index.md | 3 +- docs/guide/simulate/reads.md | 109 +++++++++++------- docs/tutorials/bracken.md | 110 ++++++++++++++++++ docs/tutorials/index.md | 11 +- docs/tutorials/test.md | 162 --------------------------- mkdocs.yml | 4 +- 10 files changed, 314 insertions(+), 215 deletions(-) delete mode 100644 docs/guide/simulate/fq-processing.md delete mode 100644 docs/tutorials/test.md diff --git a/docs/commands/hmp-template.md b/docs/commands/hmp-template.md index e69de29..ec2d938 100644 --- a/docs/commands/hmp-template.md +++ b/docs/commands/hmp-template.md @@ -0,0 +1,13 @@ +## Help + +![`mess hmp-template -h`](../images/mess-hmp-help.svg) + +## Example +### Gut samples +```sh +mess hmp-template --site gut -o gut +``` +### Specific sample from buccal mucosa +```sh +mess hmp-template --site buccal_mucosa --sample SRS013506 -o SRS013506 +``` diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md index cca115a..6893efb 100644 --- a/docs/getting_started/index.md +++ b/docs/getting_started/index.md @@ -12,7 +12,7 @@ mess test --bam ### Output ```sh hl_lines="3-7 15-24 28-30" -📂test_out +📂mess_out ┣ 📂assembly_finder ┃ ┣ 📂download ┃ ┃ ┣ 📂GCF_000418345.1 diff --git a/docs/guide/output.md b/docs/guide/output.md index 9569476..5a79b9f 100644 --- a/docs/guide/output.md +++ b/docs/guide/output.md @@ -1,9 +1,4 @@ -## Command -```sh -mess test --bam -``` -## Output -### Directory structure +## Directory structure ```sh hl_lines="3-7 15-24 28-30" 📂mess_out ┣ 📂assembly_finder @@ -37,4 +32,110 @@ mess test --bam ┃ ┗ 📜sample2_profile.txt ┣ 📜config.yaml ┗ 📜mess.log -``` \ No newline at end of file +``` + +## Coverage table + +Check genome size (total_sequence_length), coverage depth (cov_sim) and other stats for both samples in cov.tsv + +=== "sample1" + | fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance | + | --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- | + | GCF_000013425.1 | 2821361 | 1 | 940.454 | 282136.1 | 1.0 | 0.1 | 1.0 | + +=== "sample2" + | fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance | + | --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- | + | GCF_003812505.1 | 2257431 | 3 | 752.477 | 225743.1 | 1.0 | 0.1 | 1.0 | + +## Reads + +Let's check the fastq content using [`seqkit stats`](https://bioinf.shenwei.me/seqkit/usage/#stats). + +```sh +$ seqkit stats --all test_bam_out/fastq/*.fq.gz +file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%) +fastq/sample1_R1.fq.gz FASTQ DNA 940 141,000 150 150 150 150 150 150 0 150 1 97.99 91.72 27.68 32.89 +fastq/sample1_R2.fq.gz FASTQ DNA 940 141,000 150 150 150 150 150 150 0 150 1 97.35 89.7 26.52 33.19 +fastq/sample2_R1.fq.gz FASTQ DNA 752 112,800 150 150 150 150 150 150 0 150 1 98.02 91.75 27.78 31.38 +fastq/sample2_R2.fq.gz FASTQ DNA 752 112,800 150 150 150 150 150 150 0 150 1 97.27 89.7 26.51 31.52 +``` + +- [x] 90% of reads are Q30 +- [x] Requested amount of sequences were simulated (940 and 752 reads for sample1 and sample2 respectively) +- [x] Both genomes are covered at 0.1x (divide sum_len by genome size) + + +## Alignments + +Using the `--bam` flag, sorted and indexed BAM files are generated, which can be quality controlled with [qualimap](http://qualimap.conesalab.org/). Qualimap reports for each sample can be aggregated using [multiQC ](https://multiqc.info/). + +### Qualimap +```sh +cd mess_out/bam +for bam in *.bam; do qualimap bamqc -bam $bam; done +``` +### MultiQC +```sh +multiqc . +``` + +##### General stats + + +| Sample | QualiMap_mqc-generalstats-qualimap-avg_gc | QualiMap_mqc-generalstats-qualimap-median_insert_size | QualiMap_mqc-generalstats-qualimap-1_x_pc | QualiMap_mqc-generalstats-qualimap-5_x_pc | QualiMap_mqc-generalstats-qualimap-10_x_pc | QualiMap_mqc-generalstats-qualimap-30_x_pc | QualiMap_mqc-generalstats-qualimap-50_x_pc | QualiMap_mqc-generalstats-qualimap-median_coverage | QualiMap_mqc-generalstats-qualimap-mean_coverage | QualiMap_mqc-generalstats-qualimap-general_error_rate | QualiMap_mqc-generalstats-qualimap-percentage_aligned | QualiMap_mqc-generalstats-qualimap-mapped_reads | QualiMap_mqc-generalstats-qualimap-total_reads | +| ------------- | ----------------------------------------- | ----------------------------------------------------- | ----------------------------------------- | ----------------------------------------- | ------------------------------------------ | ------------------------------------------ | ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------ | ----------------------------------------------------- | ----------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------- | +| sample1 | | | | | | | | | 0.1 | 0.0 | 100.0 | 1880 | 1880 | +| sample1_stats | 33.48098434004474 | 199 | 6.449724087062946 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | | | | | | +| sample2 | | | | | | | | | 0.0999 | 0.0 | 100.0 | 1504 | 1504 | +| sample2_stats | 31.822164948453608 | 199 | 6.446044198028644 | 0.0003543851395679425 | 0.0 | 0.0 | 0.0 | 0 | | | | | | + +##### qualimap bamqc genome stats + +| Sample | total_reads | mapped_reads | mapped_bases | sequenced_bases | duplication_rate | mean_insert_size | median_insert_size | mean_mapping_quality | gc_percentage | general_error_rate | homopolymer_indels | mean_coverage | percentage_aligned | +| ------- | ----------- | ------------ | ------------ | --------------- | ---------------- | ---------------- | ------------------ | -------------------- | ------------- | ------------------ | ------------------ | ------------- | ------------------ | +| sample1 | 1880 | 1880 | 281999 | 281998 | 0.05 | 199.8862 | 199.0 | 91.0798 | 32.96 | 0.0 | 0.0 | 0.1 | 100.0 | +| sample2 | 1504 | 1504 | 225601 | 225600 | 0.0 | 199.3763 | 199.0 | 83.9774 | 31.39 | 0.0 | 0.0 | 0.0999 | 100.0 | + +- [x] 0% general error rate and homopolymer indels +- [x] At least 83 mapping quality with 100% read alignment +- [x] Average coverage depth values are of 0.1x + +!!! warning + Only 6% of both genome locations are covered at 1x. This is to be expected as our requested coverage depth is at 0.1x. If we want to cover a larger fraction of the genomes, we need to simulate more reads. + +## Taxonomic profiles + +=== "sample1" + ```sh + @SampleID:sample1 + @Version:0.10.0 + @Ranks:superkingdom|phylum|class|order|family|genus|species|strain + @TaxonomyID: + @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE + 2 superkingdom 2 Bacteria 100.000000000000000 + 1239 phylum 2|1239 Bacteria|Bacillota 100.000000000000000 + 91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 100.000000000000000 + 1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 100.000000000000000 + 90964 family 2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 100.000000000000000 + 1279 genus 2|1239|91061|1385|90964|1279 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 100.000000000000000 + 1280 species 2|1239|91061|1385|90964|1279|1280 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus 100.000000000000000 + 93061 strain 2|1239|91061|1385|90964|1279|1280|93061 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus|Staphylococcus aureus subsp. aureus NCTC 8325 100.000000000000000 + ``` + +=== "sample2" + ```sh + @SampleID:sample2 + @Version:0.10.0 + @Ranks:superkingdom|phylum|class|order|family|genus|species|strain + @TaxonomyID: + @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE + 2 superkingdom 2 Bacteria 100.000000000000000 + 1239 phylum 2|1239 Bacteria|Bacillota 100.000000000000000 + 91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 100.000000000000000 + 1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 100.000000000000000 + 90964 family 2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 100.000000000000000 + 1279 genus 2|1239|91061|1385|90964|1279 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 100.000000000000000 + 1290 species 2|1239|91061|1385|90964|1279|1290 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus hominis 100.000000000000000 + ``` +- [x] Samples have the correct taxonomic profile in biobox format \ No newline at end of file diff --git a/docs/guide/simulate/fq-processing.md b/docs/guide/simulate/fq-processing.md deleted file mode 100644 index e69de29..0000000 diff --git a/docs/guide/simulate/index.md b/docs/guide/simulate/index.md index e0f8064..863b8f5 100644 --- a/docs/guide/simulate/index.md +++ b/docs/guide/simulate/index.md @@ -8,8 +8,7 @@ After fetching genomes in the [download step](../download.md), the pipeline proc ## [b) Get replicates and calculate coverage](coverage.md) -## [c) Read simulation](reads.md) +## [c) Read and alignments processing](reads.md) -## [d) Reads post-processing](fq-processing.md) diff --git a/docs/guide/simulate/reads.md b/docs/guide/simulate/reads.md index e3968c8..118d078 100644 --- a/docs/guide/simulate/reads.md +++ b/docs/guide/simulate/reads.md @@ -1,56 +1,89 @@ -`mess` feeds each [processed genome](fa-processing.md) with its corresponding [coverage depth](coverage.md) to [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art) or [pbsim3](https://github.com/yukiteruono/pbsim3) to generate fastq files and alignment files (optionally). +`mess` feeds [split fastas](fa-processing.md) and their [coverage depths](coverage.md) to [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art) or [pbsim3](https://github.com/yukiteruono/pbsim3) for read simulation as shown below. ``` mermaid flowchart TB + %%file types fa[fasta] fq[fastq] + fqgz[fastq.gz] sam[sam] - bam[bam] - bai[bam.bai] maf[maf] + bam[bam] + bai[bam
bam.bai] cov[coverage.txt] - tax[taxonomic - profile] - fqgz[fastq.gz] - + tax[biobox
taxonomic
profile] + + %%tools art(art_illumina) + pbsim(pbsim3) cat(cat) pigz(pigz) - pbsim(pbsim3) - bioconvert{bioconvert} - sam2bam(sam2bam) - maf2sam(maf2sam) - seqkit{seqkit} - shufle(shuffle) - replace(replace) - samtools{samtools} - sortidenx(sort & index) - coverage(coverage) - taxonkit(taxonkit - profile2cami) + sam2bam(bioconvert
sam2bam) + maf2sam(bioconvert
maf2sam) + shuffle(seqkit
shuffle) + replace(seqkit
replace) + merge(samtools
merge) + sortindex(samtools
sort & index) + coverage(samtools
coverage) + taxonkit(taxonkit
profile2cami) + %%workflow fa --> art fa --> pbsim + fa2tax[genome taxids] + subgraph simulators + art + pbsim + end art --> fq + art ==>|contigs sam| sam pbsim --> fq - art --> sam - pbsim --> maf - - maf --> bioconvert - sam --> bioconvert - sam2bam --> bam - bam --> samtools - samtools --> sortindex - sortindex --> bai - samtools --> coverage - coverage --> cov - cov --> taxonkit - taxonkit --> tax - bioconvert --> sam2bam - bioconvert --> maf2sam + pbsim ==>|no header contigs
maf| sed(sed) + sed ==>|add contig header| maf + pbsim -.-> ccs + + fq -->|contigs
fastq| pigz + pigz -->|contigs
fastq.gz| cat + cat -->|samples fastq| shuffle + shuffle -->|shuffled fastq| replace + replace -->|anonymized fastq| fqgz + + maf ==> maf2sam + maf2sam ==> sam + sam ==> sam2bam + sam2bam ==> bam + bam ==>|contigs bam| merge + merge ==>|samples bam| sortindex + sortindex ==>|sorted bam| coverage + fa2tax ==> cov + coverage ==>|contigs
coverages| cov + cov ==>|taxids
coverages| taxonkit + + ccs[ccs.sam] -.-> sam2bam + bam -.-> pbccs(pbccs) + pbccs -.->|hifi
fastq.gz
contigs| cat + + taxonkit ==> tax + sortindex ==>|sorted
indexed
bam| bai - fq --> cat - cat --> pigz - pigz --> fqgz -``` \ No newline at end of file + subgraph output + fqgz + bai + tax + end +``` + +>:material-arrow-right: Path for fastq files + +>:material-arrow-right-bold: Path for alignments (sam, bam, maf...) + +>**---**> Path for PacBio hifi fastq + +## Reads processing +Simulators output uncompressed reads for every contig, which are then compressed with pigz and concatenated with cat. +Concatenated fastq are then shuffled and anonymized with seqkit shuffle and seqkit replace respectively. + +## Alignments processing +Simulators can optionally output alignment files, usually in SAM format, which are then converted to sorted and indexed BAM. +In some case, alignment files are in other formats and need some processing to be correctly conterted to BAM. For example, pbsim3 output MAF alignment files with no sequence ID. `mess` which will add sequence IDs, and convert MAF to SAM and finally BAM using bioconvert. \ No newline at end of file diff --git a/docs/tutorials/bracken.md b/docs/tutorials/bracken.md index e8a1829..51a4977 100644 --- a/docs/tutorials/bracken.md +++ b/docs/tutorials/bracken.md @@ -1 +1,111 @@ ## Input + +### Download an example + +We can use a bracken report [from bracken's sample data](https://github.com/jenniferlu717/Bracken/blob/master/sample_data/sample_output_species_abundance.txt). + +```sh +curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_species_abundance.txt -o bracken.txt +``` + +### Format table + +We can use [csvtk](https://github.com/shenwei356/csvtk) to extract and rename the taxonomy_id and new_est_reads columns required for our `mess` input table. + +```sh +csvtk cut -t -f taxonomy_id,new_est_reads bracken.txt \ + | csvtk rename -t -f 1,2 -n taxon,reads \ + | csvtk -t sort -k reads:nr > bracken.tsv +``` + +### Update taxids + +Use taxonkit to update removed/merged taxids + +```sh +csvtk cut -t -U -f taxon bracken.tsv \ + | taxonkit lineage --show-status-code 2> tax.log \ + | csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv +``` + +Checking the tax.log (as of may 2024) + +```sh +$ cat tax.log +11:03:42.688 [WARN] taxid 1172 was merged into 264691 +11:03:42.688 [WARN] taxid 86662 was merged into 1405 +``` + +Get the updated taxids, with taxid count + +```sh +csvtk join -t bracken.tsv lineage.tsv -f 1 \ + | csvtk cut -t -f updated_taxid,reads \ + | csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv +``` + +### Replace unclassified taxids + +!!! failure + Some taxids have no genome data available ! + + For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively. + + We can replace unclassified species with other species from the same genus. + + We also add the nb column to tell mess to download one genome per taxon. + +```sh +cat updated.tsv \ + | csvtk replace -t -f taxon -p "60481" -r "211586" \ + | csvtk replace -t -f taxon -p "164757" -r "83332" \ + | csvtk replace -t -f taxon -p "189918" -r "557599" \ + > clean.tsv +``` + +### Subsample + +For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation. + +```sh +csvtk head -t -n 10 clean.tsv +``` + +??? example "subsample.tsv" + | taxon | reads | + | ------ | ------- | + | 562 | 1449813 | + | 1396 | 1288824 | + | 1280 | 1212241 | + | 1076 | 824296 | + | 83332 | 699935 | + | 148447 | 656706 | + | 132919 | 622314 | + | 272131 | 590648 | + | 303 | 515978 | + | 32046 | 466493 | + +## Simulate + +Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for `mess run`. + +By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md). +) +For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile. +```sh +mess run -i subsample.tsv --bam --threads 10 +``` + +## Output +### fastq +```sh +$ seqkit stats mess_out/fastq/*.fq.gz --all +processed files: 2 / 2 [======================================] ETA: 0s. done +file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) AvgQual GC(%) +mess_out/fastq/subsample_R1.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 98.01 91.67 27.79 50.76 +mess_out/fastq/subsample_R2.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 97.31 89.64 26.51 50.76 +``` + +- [x] Total read count correspond to the one requested in the input (8327248 - 8326992 = 256 reads difference) + +### Taxonomic profile diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 3642826..232c4e0 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -1,10 +1,15 @@ # Tutorials +Collection of tutorials to showcase `mess` usage examples. -## Run the test command +## Taxonomic profiles as input +`mess` uses taxonomic IDs and sequence abundances as input for read simulation, making it compatible with most popular taxonomic profilers. In this tutorial, we used the output of [bracken](https://github.com/jenniferlu717/Bracken) as input for `mess`. ## Sequencing technologies +`mess` supports long and short read simulations, and below are some tutorials detailing the parameters to simulate illumina ,Nanopore or PacBio runs. +### Illumina + ### Nanopore + ### PacBio -## Simulate starting from a taxonomic profile -### Bracken + diff --git a/docs/tutorials/test.md b/docs/tutorials/test.md deleted file mode 100644 index 5d179cd..0000000 --- a/docs/tutorials/test.md +++ /dev/null @@ -1,162 +0,0 @@ -By default, `mess test` generates paired, 150 bp, illumina reads with [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art), for two samples, each containing one genome at 0.1x coverage. - -## Input - -The minimal test is comprised of two samples: sample1 and sample2. - -- sample1 contains _Staphylococcus aureus_ reads at 0.1x coverage depth -- sample2 contains _Staphylococcus hominis_ reads at 0.1x coverage depth - - -``` title="minimal_test.tsv" ---8<-- "mess/test_data/minimal_test.tsv" -``` - -## Command - -```sh -mess test --bam -o test_bam_out -``` - -## Output - -### Directory - -This how the output directory should look like (excluding logs and benchmarks) - -```sh -📂test_bam_out - ┣ 📂assembly_finder - ┃ ┣ 📂download - ┃ ┃ ┣ 📂GCF_000013425.1 - ┃ ┃ ┃ ┗ 📜GCF_000013425.1_ASM1342v1_genomic.fna.gz - ┃ ┃ ┣ 📂GCF_003812505.1 - ┃ ┃ ┃ ┗ 📜GCF_003812505.1_ASM381250v1_genomic.fna.gz - ┃ ┣ 📜assembly_summary.tsv - ┃ ┣ 📜config.yaml - ┃ ┣ 📜sequence_report.tsv - ┃ ┗ 📜taxonomy.tsv - ┣ 📂bam - ┃ ┣ 📜sample1.bam - ┃ ┣ 📜sample1.bam.bai - ┃ ┣ 📜sample2.bam - ┃ ┗ 📜sample2.bam.bai - ┣ 📂fastq - ┃ ┣ 📜sample1_R1.fq.gz - ┃ ┣ 📜sample1_R2.fq.gz - ┃ ┣ 📜sample2_R1.fq.gz - ┃ ┗ 📜sample2_R2.fq.gz - ┣ 📂tax - ┃ ┣ 📜sample1_profile.txt - ┃ ┗ 📜sample2_profile.txt - ┣ 📜config.yaml - ┗ 📜cov.tsv -``` - -### Cov.tsv - -Check genome size (total_sequence_length), coverage depth (cov_sim) and other stats for both samples in cov.tsv - -=== "sample1" - | fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance | - | --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- | - | GCF_000013425.1 | 2821361 | 1 | 940.454 | 282136.1 | 1.0 | 0.1 | 1.0 | - -=== "sample2" - | fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance | - | --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- | - | GCF_003812505.1 | 2257431 | 3 | 752.477 | 225743.1 | 1.0 | 0.1 | 1.0 | - -### Fastq - -Let's check the fastq content using [`seqkit stats`](https://bioinf.shenwei.me/seqkit/usage/#stats). - -```sh -$ seqkit stats --all test_bam_out/fastq/*.fq.gz -file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%) -fastq/sample1_R1.fq.gz FASTQ DNA 940 141,000 150 150 150 150 150 150 0 150 1 97.99 91.72 27.68 32.89 -fastq/sample1_R2.fq.gz FASTQ DNA 940 141,000 150 150 150 150 150 150 0 150 1 97.35 89.7 26.52 33.19 -fastq/sample2_R1.fq.gz FASTQ DNA 752 112,800 150 150 150 150 150 150 0 150 1 98.02 91.75 27.78 31.38 -fastq/sample2_R2.fq.gz FASTQ DNA 752 112,800 150 150 150 150 150 150 0 150 1 97.27 89.7 26.51 31.52 -``` - -- [x] 90% of reads are Q30 -- [x] Requested amount of sequences were simulated (940 and 752 reads for sample1 and sample2 respectively) -- [x] Both genomes are covered at 0.1x (divide sum_len by genome size) - - -### BAM - -Using the `--bam` flag, sorted and indexed BAM files are generated, which can be quality controlled with [qualimap](http://qualimap.conesalab.org/). Qualimap reports for each sample can be aggregated using [multiQC ](https://multiqc.info/). - -#### Qualimap -```sh -cd test_bam_out/bam -for bam in *.bam; do qualimap bamqc -bam $bam; done -``` -#### MultiQC -```sh -multiqc . -``` - -##### General stats - - -| Sample | QualiMap_mqc-generalstats-qualimap-avg_gc | QualiMap_mqc-generalstats-qualimap-median_insert_size | QualiMap_mqc-generalstats-qualimap-1_x_pc | QualiMap_mqc-generalstats-qualimap-5_x_pc | QualiMap_mqc-generalstats-qualimap-10_x_pc | QualiMap_mqc-generalstats-qualimap-30_x_pc | QualiMap_mqc-generalstats-qualimap-50_x_pc | QualiMap_mqc-generalstats-qualimap-median_coverage | QualiMap_mqc-generalstats-qualimap-mean_coverage | QualiMap_mqc-generalstats-qualimap-general_error_rate | QualiMap_mqc-generalstats-qualimap-percentage_aligned | QualiMap_mqc-generalstats-qualimap-mapped_reads | QualiMap_mqc-generalstats-qualimap-total_reads | -| ------------- | ----------------------------------------- | ----------------------------------------------------- | ----------------------------------------- | ----------------------------------------- | ------------------------------------------ | ------------------------------------------ | ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------ | ----------------------------------------------------- | ----------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------- | -| sample1 | | | | | | | | | 0.1 | 0.0 | 100.0 | 1880 | 1880 | -| sample1_stats | 33.48098434004474 | 199 | 6.449724087062946 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | | | | | | -| sample2 | | | | | | | | | 0.0999 | 0.0 | 100.0 | 1504 | 1504 | -| sample2_stats | 31.822164948453608 | 199 | 6.446044198028644 | 0.0003543851395679425 | 0.0 | 0.0 | 0.0 | 0 | | | | | | - -##### qualimap bamqc genome stats - -| Sample | total_reads | mapped_reads | mapped_bases | sequenced_bases | duplication_rate | mean_insert_size | median_insert_size | mean_mapping_quality | gc_percentage | general_error_rate | homopolymer_indels | mean_coverage | percentage_aligned | -| ------- | ----------- | ------------ | ------------ | --------------- | ---------------- | ---------------- | ------------------ | -------------------- | ------------- | ------------------ | ------------------ | ------------- | ------------------ | -| sample1 | 1880 | 1880 | 281999 | 281998 | 0.05 | 199.8862 | 199.0 | 91.0798 | 32.96 | 0.0 | 0.0 | 0.1 | 100.0 | -| sample2 | 1504 | 1504 | 225601 | 225600 | 0.0 | 199.3763 | 199.0 | 83.9774 | 31.39 | 0.0 | 0.0 | 0.0999 | 100.0 | - -- [x] 0% general error rate and homopolymer indels -- [x] At least 83 mapping quality with 100% read alignment -- [x] Average coverage depth values are of 0.1x - -!!! warning - Only 6% of both genome locations are covered at 1x. This is to be expected as our requested coverage depth is at 0.1x. If we want to cover a larger fraction of the genomes, we need to simulate more reads. - -### tax profiles - - - -=== "sample1" - ```sh - @SampleID:sample1 - @Version:0.10.0 - @Ranks:superkingdom|phylum|class|order|family|genus|species|strain - @TaxonomyID: - @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE - 2 superkingdom 2 Bacteria 100.000000000000000 - 1239 phylum 2|1239 Bacteria|Bacillota 100.000000000000000 - 91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 100.000000000000000 - 1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 100.000000000000000 - 90964 family 2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 100.000000000000000 - 1279 genus 2|1239|91061|1385|90964|1279 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 100.000000000000000 - 1280 species 2|1239|91061|1385|90964|1279|1280 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus 100.000000000000000 - 93061 strain 2|1239|91061|1385|90964|1279|1280|93061 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus|Staphylococcus aureus subsp. aureus NCTC 8325 100.000000000000000 - ``` - -=== "sample2" - ```sh - @SampleID:sample2 - @Version:0.10.0 - @Ranks:superkingdom|phylum|class|order|family|genus|species|strain - @TaxonomyID: - @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE - 2 superkingdom 2 Bacteria 100.000000000000000 - 1239 phylum 2|1239 Bacteria|Bacillota 100.000000000000000 - 91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 100.000000000000000 - 1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 100.000000000000000 - 90964 family 2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 100.000000000000000 - 1279 genus 2|1239|91061|1385|90964|1279 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 100.000000000000000 - 1290 species 2|1239|91061|1385|90964|1279|1290 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus hominis 100.000000000000000 - ``` -- [x] Samples have the correct taxonomic profile in biobox format \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index a038225..ec05875 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -77,8 +77,8 @@ nav: - guide/simulate/index.md - a) Fasta processing: guide/simulate/fa-processing.md - b) Calculate coverage: guide/simulate/coverage.md - - c) Read simulation and post-processing: guide/simulate/reads.md - - 4) Output: guide/output.md + - c) Read simulation and processing: guide/simulate/reads.md + - 4) Examine outputs: guide/output.md - Commands: - commands/index.md - download: commands/download.md