updated docs

metagenlab · May 8, 2024 · f181561 · f181561
1 parent 9056e10
commit f181561
Show file tree

Hide file tree

Showing 10 changed files with 314 additions and 215 deletions.
diff --git a/docs/commands/hmp-template.md b/docs/commands/hmp-template.md
@@ -0,0 +1,13 @@
+## Help
+
+![`mess hmp-template -h`](../images/mess-hmp-help.svg)
+
+## Example
+### Gut samples
+```sh
+mess hmp-template --site gut -o gut
+```
+### Specific sample from buccal mucosa
+```sh
+mess hmp-template --site buccal_mucosa --sample SRS013506 -o SRS013506
+```
diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md
@@ -12,7 +12,7 @@ mess test --bam
 
 ### Output
 ```sh hl_lines="3-7 15-24 28-30"
-📂test_out
+📂mess_out
  ┣ 📂assembly_finder
  ┃ ┣ 📂download
  ┃ ┃ ┣ 📂GCF_000418345.1

diff --git a/docs/guide/output.md b/docs/guide/output.md
@@ -1,9 +1,4 @@
-## Command
-```sh
-mess test --bam 
-```
-## Output
-### Directory structure
+## Directory structure
 ```sh hl_lines="3-7 15-24 28-30"
 📂mess_out
  ┣ 📂assembly_finder
@@ -37,4 +32,110 @@ mess test --bam
  ┃ ┗ 📜sample2_profile.txt
  ┣ 📜config.yaml
  ┗ 📜mess.log
-```
+```
+
+## Coverage table
+
+Check genome size (total_sequence_length), coverage depth (cov_sim) and other stats for both samples in cov.tsv
+
+=== "sample1"
+    | fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance |
+    | --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- |
+    | GCF_000013425.1 | 2821361 | 1 | 940.454 | 282136.1 | 1.0 | 0.1 | 1.0 |
+
+=== "sample2"
+    | fasta           | total_sequence_length | number_of_contigs | reads   | bases    | seq_abundance | cov_sim | tax_abundance |
+    | --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- |
+    | GCF_003812505.1 | 2257431               | 3                 | 752.477 | 225743.1 | 1.0           | 0.1     | 1.0           |
+
+## Reads
+
+Let's check the fastq content using [`seqkit stats`](https://bioinf.shenwei.me/seqkit/usage/#stats).
+
+```sh
+$ seqkit stats --all test_bam_out/fastq/*.fq.gz
+file                    format  type  num_seqs  sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  N50_num  Q20(%)  Q30(%)  AvgQual  GC(%)
+fastq/sample1_R1.fq.gz  FASTQ   DNA        940  141,000      150      150      150  150  150  150        0  150        1   97.99   91.72    27.68  32.89
+fastq/sample1_R2.fq.gz  FASTQ   DNA        940  141,000      150      150      150  150  150  150        0  150        1   97.35    89.7    26.52  33.19
+fastq/sample2_R1.fq.gz  FASTQ   DNA        752  112,800      150      150      150  150  150  150        0  150        1   98.02   91.75    27.78  31.38
+fastq/sample2_R2.fq.gz  FASTQ   DNA        752  112,800      150      150      150  150  150  150        0  150        1   97.27    89.7    26.51  31.52
+```
+
+- [x] 90% of reads are Q30
+- [x] Requested amount of sequences were simulated (940 and 752 reads for sample1 and sample2 respectively)
+- [x] Both genomes are covered at 0.1x (divide sum_len by genome size)
+
+
+## Alignments
+
+Using the `--bam` flag, sorted and indexed BAM files are generated, which can be quality controlled with [qualimap](http://qualimap.conesalab.org/). Qualimap reports for each sample can be aggregated using [multiQC ](https://multiqc.info/).
+
+### Qualimap
+```sh
+cd mess_out/bam 
+for bam in *.bam; do qualimap bamqc -bam $bam; done
+```
+### MultiQC
+```sh
+multiqc .
+```
+
+##### General stats
+
+
+| Sample        | QualiMap_mqc-generalstats-qualimap-avg_gc | QualiMap_mqc-generalstats-qualimap-median_insert_size | QualiMap_mqc-generalstats-qualimap-1_x_pc | QualiMap_mqc-generalstats-qualimap-5_x_pc | QualiMap_mqc-generalstats-qualimap-10_x_pc | QualiMap_mqc-generalstats-qualimap-30_x_pc | QualiMap_mqc-generalstats-qualimap-50_x_pc | QualiMap_mqc-generalstats-qualimap-median_coverage | QualiMap_mqc-generalstats-qualimap-mean_coverage | QualiMap_mqc-generalstats-qualimap-general_error_rate | QualiMap_mqc-generalstats-qualimap-percentage_aligned | QualiMap_mqc-generalstats-qualimap-mapped_reads | QualiMap_mqc-generalstats-qualimap-total_reads |
+| ------------- | ----------------------------------------- | ----------------------------------------------------- | ----------------------------------------- | ----------------------------------------- | ------------------------------------------ | ------------------------------------------ | ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------ | ----------------------------------------------------- | ----------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------- |
+| sample1       |                                           |                                                       |                                           |                                           |                                            |                                            |                                            |                                                    | 0.1                                              | 0.0                                                   | 100.0                                                 | 1880                                            | 1880                                           |
+| sample1_stats | 33.48098434004474                         | 199                                                   | 6.449724087062946                         | 0.0                                       | 0.0                                        | 0.0                                        | 0.0                                        | 0                                                  |                                                  |                                                       |                                                       |                                                 |                                                |
+| sample2       |                                           |                                                       |                                           |                                           |                                            |                                            |                                            |                                                    | 0.0999                                           | 0.0                                                   | 100.0                                                 | 1504                                            | 1504                                           |
+| sample2_stats | 31.822164948453608                        | 199                                                   | 6.446044198028644                         | 0.0003543851395679425                     | 0.0                                        | 0.0                                        | 0.0                                        | 0                                                  |                                                  |                                                       |                                                       |                                                 |                                                |
+
+##### qualimap bamqc genome stats
+
+| Sample  | total_reads | mapped_reads | mapped_bases | sequenced_bases | duplication_rate | mean_insert_size | median_insert_size | mean_mapping_quality | gc_percentage | general_error_rate | homopolymer_indels | mean_coverage | percentage_aligned |
+| ------- | ----------- | ------------ | ------------ | --------------- | ---------------- | ---------------- | ------------------ | -------------------- | ------------- | ------------------ | ------------------ | ------------- | ------------------ |
+| sample1 | 1880        | 1880         | 281999       | 281998          | 0.05             | 199.8862         | 199.0              | 91.0798              | 32.96         | 0.0                | 0.0                | 0.1           | 100.0              |
+| sample2 | 1504        | 1504         | 225601       | 225600          | 0.0              | 199.3763         | 199.0              | 83.9774              | 31.39         | 0.0                | 0.0                | 0.0999        | 100.0              |
+
+- [x] 0% general error rate and homopolymer indels
+- [x] At least 83 mapping quality with 100% read alignment
+- [x] Average coverage depth values are of 0.1x
+
+!!! warning
+    Only 6% of both genome locations are covered at 1x. This is to be expected as our requested coverage depth is at 0.1x. If we want to cover a larger fraction of the genomes, we need to simulate more reads.
+
+## Taxonomic profiles
+
+=== "sample1"
+    ```sh
+    @SampleID:sample1
+    @Version:0.10.0
+    @Ranks:superkingdom|phylum|class|order|family|genus|species|strain
+    @TaxonomyID:
+    @@TAXID	RANK	TAXPATH	TAXPATHSN	PERCENTAGE
+    2	superkingdom	2	Bacteria	100.000000000000000
+    1239	phylum	2|1239	Bacteria|Bacillota	100.000000000000000
+    91061	class	2|1239|91061	Bacteria|Bacillota|Bacilli	100.000000000000000
+    1385	order	2|1239|91061|1385	Bacteria|Bacillota|Bacilli|Bacillales	100.000000000000000
+    90964	family	2|1239|91061|1385|90964	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae	100.000000000000000
+    1279	genus	2|1239|91061|1385|90964|1279	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus	100.000000000000000
+    1280	species	2|1239|91061|1385|90964|1279|1280	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus	100.000000000000000
+    93061	strain	2|1239|91061|1385|90964|1279|1280|93061	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus|Staphylococcus aureus subsp. aureus NCTC 8325	100.000000000000000
+    ```
+
+=== "sample2"
+    ```sh
+    @SampleID:sample2
+    @Version:0.10.0
+    @Ranks:superkingdom|phylum|class|order|family|genus|species|strain
+    @TaxonomyID:
+    @@TAXID	RANK	TAXPATH	TAXPATHSN	PERCENTAGE
+    2	superkingdom	2	Bacteria	100.000000000000000
+    1239	phylum	2|1239	Bacteria|Bacillota	100.000000000000000
+    91061	class	2|1239|91061	Bacteria|Bacillota|Bacilli	100.000000000000000
+    1385	order	2|1239|91061|1385	Bacteria|Bacillota|Bacilli|Bacillales	100.000000000000000
+    90964	family	2|1239|91061|1385|90964	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae	100.000000000000000
+    1279	genus	2|1239|91061|1385|90964|1279	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus	100.000000000000000
+    1290	species	2|1239|91061|1385|90964|1279|1290	Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus hominis	100.000000000000000
+    ```
+- [x] Samples have the correct taxonomic profile in biobox format
diff --git a/docs/guide/simulate/fq-processing.md b/docs/guide/simulate/fq-processing.md
diff --git a/docs/guide/simulate/index.md b/docs/guide/simulate/index.md
@@ -8,8 +8,7 @@ After fetching genomes in the [download step](../download.md), the pipeline proc
 
 ## [b) Get replicates and calculate coverage](coverage.md)
 
-## [c) Read simulation](reads.md)
+## [c) Read and alignments processing](reads.md)
 
-## [d) Reads post-processing](fq-processing.md)
 
 
diff --git a/docs/guide/simulate/reads.md b/docs/guide/simulate/reads.md
@@ -1,56 +1,89 @@
-`mess` feeds each [processed genome](fa-processing.md) with its corresponding [coverage depth](coverage.md) to [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art) or [pbsim3](https://github.com/yukiteruono/pbsim3) to generate fastq files and alignment files (optionally).
+`mess` feeds [split fastas](fa-processing.md) and their [coverage depths](coverage.md) to [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art) or [pbsim3](https://github.com/yukiteruono/pbsim3) for read simulation as shown below.
 
 
 ``` mermaid
 flowchart TB
+  %%file types
   fa[fasta]
   fq[fastq]
+  fqgz[fastq.gz]
   sam[sam]
-  bam[bam]
-  bai[bam.bai]
   maf[maf]
+  bam[bam]
+  bai[bam <br/>bam.bai]
   cov[coverage.txt]
-  tax[taxonomic 
-  profile]
-  fqgz[fastq.gz]
-  
+  tax[biobox<br/>taxonomic<br/>profile]
+
+  %%tools
   art(art_illumina)
+  pbsim(pbsim3)
   cat(cat)
   pigz(pigz)
-  pbsim(pbsim3)
-  bioconvert{bioconvert}
-  sam2bam(sam2bam)
-  maf2sam(maf2sam)
-  seqkit{seqkit}
-  shufle(shuffle)
-  replace(replace)
-  samtools{samtools}
-  sortidenx(sort & index)
-  coverage(coverage)
-  taxonkit(taxonkit 
-  profile2cami)
+  sam2bam(bioconvert <br/>sam2bam)
+  maf2sam(bioconvert <br/>maf2sam)
+  shuffle(seqkit <br/>shuffle)
+  replace(seqkit <br/>replace)
+  merge(samtools <br/>merge)
+  sortindex(samtools <br/>sort & index)
+  coverage(samtools <br/>coverage)
+  taxonkit(taxonkit <br/>profile2cami)
 
+  %%workflow 
   fa --> art
   fa --> pbsim
+  fa2tax[genome taxids]
+  subgraph simulators
+    art
+    pbsim
+  end 
   art --> fq
+  art ==>|contigs sam| sam
   pbsim --> fq
-  art --> sam
-  pbsim --> maf
-
-  maf --> bioconvert
-  sam --> bioconvert
-  sam2bam --> bam
-  bam --> samtools
-  samtools --> sortindex
-  sortindex --> bai
-  samtools --> coverage
-  coverage --> cov
-  cov --> taxonkit
-  taxonkit --> tax
-  bioconvert --> sam2bam
-  bioconvert --> maf2sam
+  pbsim ==>|no header contigs <br/>maf| sed(sed)
+  sed ==>|add contig header| maf
+  pbsim -.-> ccs
+
+  fq -->|contigs<br/>fastq| pigz
+  pigz -->|contigs<br/>fastq.gz| cat
+  cat -->|samples fastq| shuffle
+  shuffle -->|shuffled fastq| replace
+  replace -->|anonymized fastq| fqgz
+
+  maf ==> maf2sam
+  maf2sam ==> sam
+  sam ==> sam2bam
+  sam2bam ==> bam
+  bam ==>|contigs bam| merge
+  merge ==>|samples bam| sortindex
+  sortindex ==>|sorted bam| coverage
+  fa2tax ==> cov
+  coverage ==>|contigs<br/>coverages| cov
+  cov ==>|taxids<br/>coverages| taxonkit
+
+  ccs[ccs.sam] -.-> sam2bam
+  bam -.-> pbccs(pbccs)
+  pbccs -.->|hifi <br/>fastq.gz<br/>contigs| cat
+
+  taxonkit ==> tax
+  sortindex ==>|sorted<br/>indexed<br/>bam| bai  
   
-  fq --> cat
-  cat --> pigz
-  pigz --> fqgz
-```
+  subgraph output
+    fqgz
+    bai
+    tax 
+  end
+```
+
+>:material-arrow-right: Path for fastq files
+
+>:material-arrow-right-bold: Path for alignments (sam, bam, maf...)
+
+>**---**> Path for PacBio hifi fastq
+
+## Reads processing
+Simulators output uncompressed reads for every contig, which are then compressed with pigz and concatenated with cat.
+Concatenated fastq are then shuffled and anonymized with seqkit shuffle and seqkit replace respectively. 
+
+## Alignments processing
+Simulators can optionally output alignment files, usually in SAM format, which are then converted to sorted and indexed BAM.
+In some case, alignment files are in other formats and need some processing to be correctly conterted to BAM. For example, pbsim3 output MAF alignment files with no sequence ID. `mess` which will add sequence IDs, and convert MAF to SAM and finally BAM using bioconvert. 
diff --git a/docs/tutorials/bracken.md b/docs/tutorials/bracken.md
@@ -1 +1,111 @@
 ## Input
+
+### Download an example
+
+We can use a bracken report [from bracken's sample data](https://github.com/jenniferlu717/Bracken/blob/master/sample_data/sample_output_species_abundance.txt).
+
+```sh
+curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_species_abundance.txt -o bracken.txt
+```
+
+### Format table
+
+We can use [csvtk](https://github.com/shenwei356/csvtk) to extract and rename the taxonomy_id and new_est_reads columns required for our `mess` input table.
+
+```sh
+csvtk cut -t -f taxonomy_id,new_est_reads bracken.txt \
+    | csvtk rename -t -f 1,2 -n taxon,reads \
+    | csvtk -t sort -k reads:nr > bracken.tsv
+```
+
+### Update taxids
+
+Use taxonkit to update removed/merged taxids
+
+```sh
+csvtk cut -t -U -f taxon bracken.tsv \
+    | taxonkit lineage --show-status-code 2> tax.log \
+    | csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv
+```
+
+Checking the tax.log (as of may 2024)
+
+```sh
+$ cat tax.log
+11:03:42.688 [WARN] taxid 1172 was merged into 264691
+11:03:42.688 [WARN] taxid 86662 was merged into 1405
+```
+
+Get the updated taxids, with taxid count
+
+```sh
+csvtk join -t bracken.tsv lineage.tsv -f 1 \
+    | csvtk cut -t -f updated_taxid,reads \
+    | csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv
+```
+
+### Replace unclassified taxids
+
+!!! failure
+    Some taxids have no genome data available !
+
+    For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively. 
+
+    We can replace unclassified species with other species from the same genus. 
+
+    We also add the nb column to tell mess to download one genome per taxon.
+
+```sh
+cat updated.tsv \
+    | csvtk replace -t -f taxon -p "60481" -r "211586" \
+    | csvtk replace -t -f taxon -p "164757" -r "83332" \
+    | csvtk replace -t -f taxon -p "189918" -r "557599" \
+    > clean.tsv
+```
+
+### Subsample
+
+For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.
+
+```sh
+csvtk head -t -n 10 clean.tsv 
+```
+
+??? example "subsample.tsv"
+    | taxon  | reads   |
+    | ------ | ------- |
+    | 562    | 1449813 |
+    | 1396   | 1288824 |
+    | 1280   | 1212241 |
+    | 1076   | 824296  |
+    | 83332  | 699935  |
+    | 148447 | 656706  |
+    | 132919 | 622314  |
+    | 272131 | 590648  |
+    | 303    | 515978  |
+    | 32046  | 466493  |
+
+## Simulate
+
+Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for `mess run`.
+
+By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md).
+)
+For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.
+```sh
+mess run -i subsample.tsv --bam --threads 10
+```
+
+## Output
+### fastq
+```sh
+$ seqkit stats mess_out/fastq/*.fq.gz --all
+processed files:  2 / 2 [======================================] ETA: 0s. done
+file                               format  type   num_seqs        sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  AvgQual  GC(%)
+mess_out/fastq/subsample_R1.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   98.01   91.67    27.79  50.76
+mess_out/fastq/subsample_R2.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   97.31   89.64    26.51  50.76
+```
+
+- [x] Total read count correspond to the one requested in the input (8327248 - 8326992 = 256 reads difference) 
+
+### Taxonomic profile