Skip to content

Commit

Permalink
updated docs
Browse files Browse the repository at this point in the history
  • Loading branch information
farchaab committed May 8, 2024
1 parent 9056e10 commit f181561
Show file tree
Hide file tree
Showing 10 changed files with 314 additions and 215 deletions.
13 changes: 13 additions & 0 deletions docs/commands/hmp-template.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
## Help

![`mess hmp-template -h`](../images/mess-hmp-help.svg)

## Example
### Gut samples
```sh
mess hmp-template --site gut -o gut
```
### Specific sample from buccal mucosa
```sh
mess hmp-template --site buccal_mucosa --sample SRS013506 -o SRS013506
```
2 changes: 1 addition & 1 deletion docs/getting_started/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ mess test --bam

### Output
```sh hl_lines="3-7 15-24 28-30"
📂test_out
📂mess_out
┣ 📂assembly_finder
┃ ┣ 📂download
┃ ┃ ┣ 📂GCF_000418345.1
Expand Down
115 changes: 108 additions & 7 deletions docs/guide/output.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,4 @@
## Command
```sh
mess test --bam
```
## Output
### Directory structure
## Directory structure
```sh hl_lines="3-7 15-24 28-30"
📂mess_out
┣ 📂assembly_finder
Expand Down Expand Up @@ -37,4 +32,110 @@ mess test --bam
┃ ┗ 📜sample2_profile.txt
┣ 📜config.yaml
┗ 📜mess.log
```
```

## Coverage table

Check genome size (total_sequence_length), coverage depth (cov_sim) and other stats for both samples in cov.tsv

=== "sample1"
| fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance |
| --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- |
| GCF_000013425.1 | 2821361 | 1 | 940.454 | 282136.1 | 1.0 | 0.1 | 1.0 |

=== "sample2"
| fasta | total_sequence_length | number_of_contigs | reads | bases | seq_abundance | cov_sim | tax_abundance |
| --------------- | --------------------- | ----------------- | ------- | -------- | ------------- | ------- | ------------- |
| GCF_003812505.1 | 2257431 | 3 | 752.477 | 225743.1 | 1.0 | 0.1 | 1.0 |

## Reads

Let's check the fastq content using [`seqkit stats`](https://bioinf.shenwei.me/seqkit/usage/#stats).

```sh
$ seqkit stats --all test_bam_out/fastq/*.fq.gz
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%)
fastq/sample1_R1.fq.gz FASTQ DNA 940 141,000 150 150 150 150 150 150 0 150 1 97.99 91.72 27.68 32.89
fastq/sample1_R2.fq.gz FASTQ DNA 940 141,000 150 150 150 150 150 150 0 150 1 97.35 89.7 26.52 33.19
fastq/sample2_R1.fq.gz FASTQ DNA 752 112,800 150 150 150 150 150 150 0 150 1 98.02 91.75 27.78 31.38
fastq/sample2_R2.fq.gz FASTQ DNA 752 112,800 150 150 150 150 150 150 0 150 1 97.27 89.7 26.51 31.52
```

- [x] 90% of reads are Q30
- [x] Requested amount of sequences were simulated (940 and 752 reads for sample1 and sample2 respectively)
- [x] Both genomes are covered at 0.1x (divide sum_len by genome size)


## Alignments

Using the `--bam` flag, sorted and indexed BAM files are generated, which can be quality controlled with [qualimap](http://qualimap.conesalab.org/). Qualimap reports for each sample can be aggregated using [multiQC ](https://multiqc.info/).

### Qualimap
```sh
cd mess_out/bam
for bam in *.bam; do qualimap bamqc -bam $bam; done
```
### MultiQC
```sh
multiqc .
```

##### General stats


| Sample | QualiMap_mqc-generalstats-qualimap-avg_gc | QualiMap_mqc-generalstats-qualimap-median_insert_size | QualiMap_mqc-generalstats-qualimap-1_x_pc | QualiMap_mqc-generalstats-qualimap-5_x_pc | QualiMap_mqc-generalstats-qualimap-10_x_pc | QualiMap_mqc-generalstats-qualimap-30_x_pc | QualiMap_mqc-generalstats-qualimap-50_x_pc | QualiMap_mqc-generalstats-qualimap-median_coverage | QualiMap_mqc-generalstats-qualimap-mean_coverage | QualiMap_mqc-generalstats-qualimap-general_error_rate | QualiMap_mqc-generalstats-qualimap-percentage_aligned | QualiMap_mqc-generalstats-qualimap-mapped_reads | QualiMap_mqc-generalstats-qualimap-total_reads |
| ------------- | ----------------------------------------- | ----------------------------------------------------- | ----------------------------------------- | ----------------------------------------- | ------------------------------------------ | ------------------------------------------ | ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------ | ----------------------------------------------------- | ----------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------- |
| sample1 | | | | | | | | | 0.1 | 0.0 | 100.0 | 1880 | 1880 |
| sample1_stats | 33.48098434004474 | 199 | 6.449724087062946 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | | | | | |
| sample2 | | | | | | | | | 0.0999 | 0.0 | 100.0 | 1504 | 1504 |
| sample2_stats | 31.822164948453608 | 199 | 6.446044198028644 | 0.0003543851395679425 | 0.0 | 0.0 | 0.0 | 0 | | | | | |

##### qualimap bamqc genome stats

| Sample | total_reads | mapped_reads | mapped_bases | sequenced_bases | duplication_rate | mean_insert_size | median_insert_size | mean_mapping_quality | gc_percentage | general_error_rate | homopolymer_indels | mean_coverage | percentage_aligned |
| ------- | ----------- | ------------ | ------------ | --------------- | ---------------- | ---------------- | ------------------ | -------------------- | ------------- | ------------------ | ------------------ | ------------- | ------------------ |
| sample1 | 1880 | 1880 | 281999 | 281998 | 0.05 | 199.8862 | 199.0 | 91.0798 | 32.96 | 0.0 | 0.0 | 0.1 | 100.0 |
| sample2 | 1504 | 1504 | 225601 | 225600 | 0.0 | 199.3763 | 199.0 | 83.9774 | 31.39 | 0.0 | 0.0 | 0.0999 | 100.0 |

- [x] 0% general error rate and homopolymer indels
- [x] At least 83 mapping quality with 100% read alignment
- [x] Average coverage depth values are of 0.1x

!!! warning
Only 6% of both genome locations are covered at 1x. This is to be expected as our requested coverage depth is at 0.1x. If we want to cover a larger fraction of the genomes, we need to simulate more reads.

## Taxonomic profiles

=== "sample1"
```sh
@SampleID:sample1
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@TaxonomyID:
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 100.000000000000000
1239 phylum 2|1239 Bacteria|Bacillota 100.000000000000000
91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 100.000000000000000
1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 100.000000000000000
90964 family 2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 100.000000000000000
1279 genus 2|1239|91061|1385|90964|1279 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 100.000000000000000
1280 species 2|1239|91061|1385|90964|1279|1280 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus 100.000000000000000
93061 strain 2|1239|91061|1385|90964|1279|1280|93061 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus|Staphylococcus aureus subsp. aureus NCTC 8325 100.000000000000000
```

=== "sample2"
```sh
@SampleID:sample2
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@TaxonomyID:
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 100.000000000000000
1239 phylum 2|1239 Bacteria|Bacillota 100.000000000000000
91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 100.000000000000000
1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 100.000000000000000
90964 family 2|1239|91061|1385|90964 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae 100.000000000000000
1279 genus 2|1239|91061|1385|90964|1279 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 100.000000000000000
1290 species 2|1239|91061|1385|90964|1279|1290 Bacteria|Bacillota|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus hominis 100.000000000000000
```
- [x] Samples have the correct taxonomic profile in biobox format
Empty file.
3 changes: 1 addition & 2 deletions docs/guide/simulate/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,7 @@ After fetching genomes in the [download step](../download.md), the pipeline proc

## [b) Get replicates and calculate coverage](coverage.md)

## [c) Read simulation](reads.md)
## [c) Read and alignments processing](reads.md)

## [d) Reads post-processing](fq-processing.md)


109 changes: 71 additions & 38 deletions docs/guide/simulate/reads.md
Original file line number Diff line number Diff line change
@@ -1,56 +1,89 @@
`mess` feeds each [processed genome](fa-processing.md) with its corresponding [coverage depth](coverage.md) to [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art) or [pbsim3](https://github.com/yukiteruono/pbsim3) to generate fastq files and alignment files (optionally).
`mess` feeds [split fastas](fa-processing.md) and their [coverage depths](coverage.md) to [art_illumina](https://www.niehs.nih.gov/research/resources/software/biostatistics/art) or [pbsim3](https://github.com/yukiteruono/pbsim3) for read simulation as shown below.


``` mermaid
flowchart TB
%%file types
fa[fasta]
fq[fastq]
fqgz[fastq.gz]
sam[sam]
bam[bam]
bai[bam.bai]
maf[maf]
bam[bam]
bai[bam <br/>bam.bai]
cov[coverage.txt]
tax[taxonomic
profile]
fqgz[fastq.gz]
tax[biobox<br/>taxonomic<br/>profile]
%%tools
art(art_illumina)
pbsim(pbsim3)
cat(cat)
pigz(pigz)
pbsim(pbsim3)
bioconvert{bioconvert}
sam2bam(sam2bam)
maf2sam(maf2sam)
seqkit{seqkit}
shufle(shuffle)
replace(replace)
samtools{samtools}
sortidenx(sort & index)
coverage(coverage)
taxonkit(taxonkit
profile2cami)
sam2bam(bioconvert <br/>sam2bam)
maf2sam(bioconvert <br/>maf2sam)
shuffle(seqkit <br/>shuffle)
replace(seqkit <br/>replace)
merge(samtools <br/>merge)
sortindex(samtools <br/>sort & index)
coverage(samtools <br/>coverage)
taxonkit(taxonkit <br/>profile2cami)
%%workflow
fa --> art
fa --> pbsim
fa2tax[genome taxids]
subgraph simulators
art
pbsim
end
art --> fq
art ==>|contigs sam| sam
pbsim --> fq
art --> sam
pbsim --> maf
maf --> bioconvert
sam --> bioconvert
sam2bam --> bam
bam --> samtools
samtools --> sortindex
sortindex --> bai
samtools --> coverage
coverage --> cov
cov --> taxonkit
taxonkit --> tax
bioconvert --> sam2bam
bioconvert --> maf2sam
pbsim ==>|no header contigs <br/>maf| sed(sed)
sed ==>|add contig header| maf
pbsim -.-> ccs
fq -->|contigs<br/>fastq| pigz
pigz -->|contigs<br/>fastq.gz| cat
cat -->|samples fastq| shuffle
shuffle -->|shuffled fastq| replace
replace -->|anonymized fastq| fqgz
maf ==> maf2sam
maf2sam ==> sam
sam ==> sam2bam
sam2bam ==> bam
bam ==>|contigs bam| merge
merge ==>|samples bam| sortindex
sortindex ==>|sorted bam| coverage
fa2tax ==> cov
coverage ==>|contigs<br/>coverages| cov
cov ==>|taxids<br/>coverages| taxonkit
ccs[ccs.sam] -.-> sam2bam
bam -.-> pbccs(pbccs)
pbccs -.->|hifi <br/>fastq.gz<br/>contigs| cat
taxonkit ==> tax
sortindex ==>|sorted<br/>indexed<br/>bam| bai
fq --> cat
cat --> pigz
pigz --> fqgz
```
subgraph output
fqgz
bai
tax
end
```

>:material-arrow-right: Path for fastq files
>:material-arrow-right-bold: Path for alignments (sam, bam, maf...)
>**---**> Path for PacBio hifi fastq
## Reads processing
Simulators output uncompressed reads for every contig, which are then compressed with pigz and concatenated with cat.
Concatenated fastq are then shuffled and anonymized with seqkit shuffle and seqkit replace respectively.

## Alignments processing
Simulators can optionally output alignment files, usually in SAM format, which are then converted to sorted and indexed BAM.
In some case, alignment files are in other formats and need some processing to be correctly conterted to BAM. For example, pbsim3 output MAF alignment files with no sequence ID. `mess` which will add sequence IDs, and convert MAF to SAM and finally BAM using bioconvert.
110 changes: 110 additions & 0 deletions docs/tutorials/bracken.md
Original file line number Diff line number Diff line change
@@ -1 +1,111 @@
## Input

### Download an example

We can use a bracken report [from bracken's sample data](https://github.com/jenniferlu717/Bracken/blob/master/sample_data/sample_output_species_abundance.txt).

```sh
curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_species_abundance.txt -o bracken.txt
```

### Format table

We can use [csvtk](https://github.com/shenwei356/csvtk) to extract and rename the taxonomy_id and new_est_reads columns required for our `mess` input table.

```sh
csvtk cut -t -f taxonomy_id,new_est_reads bracken.txt \
| csvtk rename -t -f 1,2 -n taxon,reads \
| csvtk -t sort -k reads:nr > bracken.tsv
```

### Update taxids

Use taxonkit to update removed/merged taxids

```sh
csvtk cut -t -U -f taxon bracken.tsv \
| taxonkit lineage --show-status-code 2> tax.log \
| csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv
```

Checking the tax.log (as of may 2024)

```sh
$ cat tax.log
11:03:42.688 [WARN] taxid 1172 was merged into 264691
11:03:42.688 [WARN] taxid 86662 was merged into 1405
```

Get the updated taxids, with taxid count

```sh
csvtk join -t bracken.tsv lineage.tsv -f 1 \
| csvtk cut -t -f updated_taxid,reads \
| csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv
```

### Replace unclassified taxids

!!! failure
Some taxids have no genome data available !

For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively.

We can replace unclassified species with other species from the same genus.

We also add the nb column to tell mess to download one genome per taxon.

```sh
cat updated.tsv \
| csvtk replace -t -f taxon -p "60481" -r "211586" \
| csvtk replace -t -f taxon -p "164757" -r "83332" \
| csvtk replace -t -f taxon -p "189918" -r "557599" \
> clean.tsv
```

### Subsample

For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.

```sh
csvtk head -t -n 10 clean.tsv
```

??? example "subsample.tsv"
| taxon | reads |
| ------ | ------- |
| 562 | 1449813 |
| 1396 | 1288824 |
| 1280 | 1212241 |
| 1076 | 824296 |
| 83332 | 699935 |
| 148447 | 656706 |
| 132919 | 622314 |
| 272131 | 590648 |
| 303 | 515978 |
| 32046 | 466493 |

## Simulate

Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for `mess run`.

By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see [download guide](../guide/download.md).
)
For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.
```sh
mess run -i subsample.tsv --bam --threads 10
```

## Output
### fastq
```sh
$ seqkit stats mess_out/fastq/*.fq.gz --all
processed files: 2 / 2 [======================================] ETA: 0s. done
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) AvgQual GC(%)
mess_out/fastq/subsample_R1.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 98.01 91.67 27.79 50.76
mess_out/fastq/subsample_R2.fq.gz FASTQ DNA 8,326,992 1,249,048,800 150 150 150 150 150 150 0 150 97.31 89.64 26.51 50.76
```

- [x] Total read count correspond to the one requested in the input (8327248 - 8326992 = 256 reads difference)

### Taxonomic profile
Loading

0 comments on commit f181561

Please sign in to comment.