Review MultiQC Custom Content in Main Table #90

Closed
apeltzer opened this issue Nov 19, 2018 · 13 comments

@apeltzer
Member

I'd like to have an improved report in the end, summarizing certain important metrics in a smart way, including but not limited to the ones below. I'd rather not standardize naming too much, since different labs tend to use different names for certain metrics.

  • "Cluster Factor": (Number of duplicate reads / Total number of reads in sample)
    • Ideally, this would be summarized based on information retrieved from some modules running in the pipeline.
  • "Target Efficiency": (Captured bases / targeted bases)

...

Could you please add things that are of specific interest to you?
These here are on my ToDo list:

  • mt/Nuc ratio
  • mapped reads / % mapped reads
  • merged reads / % merged reads
  • GC content

Schmutzi / estimates for contamination
....

Any ideas what you'd like to see here @EisenRa @jfy133 @JudithNeukamm ?

@apeltzer apeltzer added this to the V2.1 "Ulm" milestone Nov 19, 2018
@EisenRa

EisenRa commented Nov 25, 2018

@apeltzer looking awesome so far!

Things that I think would be great are:

  • DamageProfiler: both C->T/G->A, and fragment length distribution.
  • In the 'General Statistics' section, % of genome with coverage >= 2, 3, 4. (Currently there is 1, 5, 10, etc.)
  • Qualimap mapping scores

@jfy133 jfy133 added the major label Dec 9, 2018
@jfy133
Member

jfy133 commented Dec 10, 2018

Not so much a metric, but it would be good to have a 'number of reads after quality filtering' column when that flag is set.

@apeltzer
Member Author

To summarize a bit what Phil said, I guess we'll need a custom plugin for MultiQC to provide some of these here.

So we should create a list of required metrics and how to compute them and have these available then.

@jfy133
Member

jfy133 commented Dec 13, 2018

For reference, these are some of the typical EAGER 1 headers. I will update the definitions (as I understand them) over the next couple of days.

  • Sample Name: Input sample name
  • # of Raw Reads prior Clip & Merge (C&M): total number of raw reads in all FASTQ files
  • # reads after C&M prior mapping: total number of reads in the input file for mapping after adapter removal (including either merged only, or all merged/high quality singletons as requested). From AdapterRemoval.settings file.
  • # of Merged Reads: Number of reads that were merged, not including clipped singletons (comment: I think redundant with below). From AdapterRemoval.settings file.
  • % Merged Reads: Percentage of reads that were merged. Calculated from AdapterRemoval.settings file.
  • # reads not attempted to map: (comment: I was never really sure what this was)
  • # mapped reads prior RMDup: Number of reads that mapped to the reference file, prior remove duplicate step. From samtools stats.
  • # of Duplicates removed: Number of exact duplicate reads removed (e.g. PCR duplicates), where two or more reads share either the same start coordinate (samtools markduplicates) or the same start and end coordinates on the reference (DeDup). From deduplication tool log.
  • Mapped Reads after RMDup: Number of reads remaining after exact duplicate removal (e.g. PCR duplicates). From DeDup log.
  • Endogenous DNA (%): Percentage of on-target mapped and de-duplicated reads over the total number of reads in the library after adapter removal and merging (on- and off-target). From AdapterRemoval and samtools logs and manual calculation.
  • Cluster Factor: Ratio of mapped reads before de-duplication to mapped reads after de-duplication. Calculated as number of reads pre-RMDup over number of reads post-RMDup. Higher values suggest a low-complexity or over-amplified library. From DeDup log and manual calculation.
  • Mean Coverage: The average number of reads covering each base across the entire reference genome after de-duplication. Also known as depth or fold coverage. Note: does not measure evenness of coverage. From qualimap output.
  • std. dev. Coverage: The standard deviation of the number of reads covering each base across the reference genome after de-duplication. From qualimap output.
  • Coverage >= 1X in %: Percentage of bases covered at least one time across the whole genome after de-duplication. Also known as breadth of coverage. From qualimap output.
  • Coverage >= 2X in %: Percentage of bases covered at least two times across the whole genome after de-duplication. Also known as breadth of coverage. From qualimap output.
  • Coverage >= 3X in %: Percentage of bases covered at least three times across the whole genome after de-duplication. Also known as breadth of coverage. From qualimap output.
  • Coverage >= 4X in %: Percentage of bases covered at least four times across the whole genome after de-duplication. Also known as breadth of coverage. From qualimap output.
  • Coverage >= 5X in %: Percentage of bases covered at least five times across the whole genome after de-duplication. Also known as breadth of coverage. From qualimap output.
  • # of reads on mitochondrium: Number of reads that aligned to the 'mitochondrium' or selected 'chromosome' or FASTA entry in the reference genome. From samtools stats [or qualimap?].
  • AVG Coverage on mitochondrium: Average number of reads covering each base across the 'mitochondrium' or selected 'chromosome' or FASTA entry in the reference genome after de-duplication.
  • MT/NUC Ratio: The ratio of the number of reads aligned to the mitochondrium or selected 'chromosome' or FASTA entry in the reference genome to the number of reads aligned to all other chromosomes or FASTA entries. From samtools stats and manual calculation.
  • DMG 1st Base 3': Frequency of G->A substitutions from reference at the 1st base of the 3' end of each de-duplicated read. This represents typical ancient DNA damage from deamination. A higher value indicates more damage, if it is higher than at the 2nd base. From DamageProfiler log.
  • DMG 2nd Base 3': Frequency of G->A substitutions from reference at the 2nd base of the 3' end of each de-duplicated read. This represents typical ancient DNA damage from deamination. A higher value indicates more damage. From DamageProfiler log.
  • DMG 1st Base 5': Frequency of C->T substitutions from reference at the 1st base of the 5' end of each de-duplicated read. This represents typical ancient DNA damage from deamination. A higher value indicates more damage, if it is higher than at the 2nd base. From DamageProfiler log.
  • DMG 2nd Base 5': Frequency of C->T substitutions from reference at the 2nd base of the 5' end of each de-duplicated read. This represents typical ancient DNA damage from deamination. A higher value indicates more damage. From DamageProfiler log.
  • average fragment length: Average length in base pairs of all aligned reads after de-duplication. From DamageProfiler log and manual calculation.
  • median fragment length: Median length in base pairs of all aligned reads after de-duplication. From DamageProfiler log and manual calculation.
  • GC content in %: Average GC content of the genome. From qualimap output.
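
The three "manual calculation" metrics in the list above can be sketched in a few lines of Python. This is only an illustration: the function and argument names are hypothetical, and the cluster factor is taken here as reads before de-duplication over reads after, so that higher values mean more duplication, as the list describes.

```python
def endogenous_dna_pct(mapped_dedup_reads: int, reads_after_cm: int) -> float:
    """Percent of post-C&M reads that mapped and survived de-duplication."""
    if reads_after_cm <= 0:
        raise ValueError("reads_after_cm must be positive")
    return 100.0 * mapped_dedup_reads / reads_after_cm


def cluster_factor(mapped_pre_rmdup: int, mapped_post_rmdup: int) -> float:
    """Mapped reads before de-duplication divided by mapped reads after it.

    Assumption: pre/post orientation, so 1.0 means no duplicates at all.
    """
    if mapped_post_rmdup <= 0:
        raise ValueError("mapped_post_rmdup must be positive")
    return mapped_pre_rmdup / mapped_post_rmdup


def mt_nuc_ratio(mt_reads: int, other_reads: int) -> float:
    """Reads on the mitochondrial entry over reads on all other entries."""
    if other_reads <= 0:
        raise ValueError("other_reads must be positive")
    return mt_reads / other_reads
```

The input counts would come from the AdapterRemoval, samtools and DeDup outputs named in the list; how they are extracted is left open here.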

@jfy133
Member

jfy133 commented Jan 9, 2019

Additional stuff from EAGER1 modules which wasn't in the original ReportTable

  • Genotyping stuff, maybe from GATK? I don't know what would be useful there, or whether there is a fast stats scanner (e.g. maybe from bcftools stats?) -> nice for pathogen stuff would be the number of multi-allelic sites ("number of multiallelic sites" field in output)

Additional stuff which has been added to EAGER2 already
FastP

  • # reads after PolyG trimming: Number of reads that were trimmed of poly-G tails. From fastp html report. [Might need to consider whether to also calculate this after C&M, which has a length filter...]

@jfy133
Member

jfy133 commented Feb 15, 2019

I can potentially supply an R/tidyverse script that converts a MultiQC JSON to a nice table. However, a couple of things to note:

  • I will still need a way to get a list of the IDs of things when I need to collapse stats across multi lane samples
  • Apparently the MultiQC JSON doesn't meet JSON standards, because the missing-data values in the module stats files vary (i.e. NA vs NaN vs null etc.), so I have to do some hacky things to clean that up. Might be worth speaking to the MultiQC crew to standardise that somehow (which would require per-module reporting of what each 'missing data' value stands for)
  • Not sure how I will deal with DeDup stats of multi-lane things.

Point two references:
rstudio/DT#496
jeroen/jsonlite#70
jeroen/jsonlite#94
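
As an illustration of the clean-up described in point two, here is a minimal Python sketch (not the R/tidyverse script mentioned above). It assumes the problem values are bare `NaN`/`Infinity` tokens and strings such as "NA"; the set of strings treated as missing and the function names are hypothetical and would need checking per module.

```python
import json

# Assumption: strings that should be read as "missing"; adjust per module.
MISSING_STRINGS = {"", "NA", "N/A", "NaN", "nan", "null"}


def _normalise(value):
    """Recursively replace missing-data strings with None."""
    if isinstance(value, dict):
        return {key: _normalise(val) for key, val in value.items()}
    if isinstance(value, list):
        return [_normalise(val) for val in value]
    if isinstance(value, str) and value.strip() in MISSING_STRINGS:
        return None
    return value


def load_lenient_json(text):
    """Parse JSON text, mapping every flavour of missing value to None.

    parse_constant handles the bare NaN, Infinity and -Infinity tokens,
    which Python's json module accepts even though strict JSON forbids them.
    """
    parsed = json.loads(text, parse_constant=lambda _token: None)
    return _normalise(parsed)
```

After this step every missing value is a plain `None`, which serialises back to a standards-compliant `null`.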

@jfy133
Member

jfy133 commented Apr 1, 2019

An extra value to consider for polyG trimming is its effect on GC content, but we need to decide how to display this (i.e. which modules to get this info from).

@jfy133
Member

jfy133 commented Apr 1, 2019

MultiQC ReportTable Requests

In all cases: either adapterremoval_config functionality (a la Qualimap),
or additional columns exported to GeneralStats which are already in
multiqc_adapter_removal.txt

Adapter Removal

  • total
  • full-length_cp
  • truncated_cp
  • retained_reads

Samtools

  • mapped_passed_pct
  • quality filtered info?

Qualimap

  • mean_coverage

@jfy133 jfy133 changed the title MultiQC Custom Content in Main Table Review MultiQC Custom Content in Main Table Dec 4, 2019
@jfy133 jfy133 self-assigned this Dec 4, 2019
@apeltzer
Member Author

apeltzer commented Dec 4, 2019

Need to update this according to the requirements we have. There have been a lot of updates on this one recently, e.g. support for mtnucratio in MultiQC, and mean coverage updates in MultiQC using QualiMap.

@jfy133
Member

jfy133 commented Dec 5, 2019

Most needed stuff is there, but no endogenous DNA. Maybe we need a small Python script and a MultiQC module? @aidaanva would you be interested?

It would just need to take a single samtools flagstat file and divide the mapped field (5th row) by the total (1st) and spit out a percent.
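
A minimal sketch of that calculation, assuming the usual `samtools flagstat` text layout (`<passed> + <failed> <description>`). Row positions can differ between samtools versions, so this version matches on the line description rather than on the 5th and 1st rows; the function names and the example numbers are made up for illustration.

```python
import re

# One flagstat line: "<QC-passed> + <QC-failed> <description>"
_FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text: str) -> dict:
    """Return the QC-passed 'total' and 'mapped' counts from flagstat text."""
    counts = {}
    for line in text.splitlines():
        match = _FLAGSTAT_LINE.match(line.strip())
        if match is None:
            continue
        passed, description = int(match.group(1)), match.group(3)
        if description.startswith("in total"):
            counts["total"] = passed
        elif description.startswith("mapped ("):
            # Does not match "primary mapped (" or "with mate mapped ..."
            counts["mapped"] = passed
    if "total" not in counts or "mapped" not in counts:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    return counts


def percent_mapped(text: str) -> float:
    """Mapped reads as a percentage of total reads (0.0 for an empty file)."""
    counts = parse_flagstat(text)
    if counts["total"] == 0:
        return 0.0
    return 100.0 * counts["mapped"] / counts["total"]


# Made-up example input, not real data.
EXAMPLE = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
250 + 0 mapped (25.00% : N/A)
"""

if __name__ == "__main__":
    print(f"{percent_mapped(EXAMPLE):.2f}")
```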

@aidaanva
Contributor

I can do it

@apeltzer
Member Author

Your script could actually run in the multiqc process step, do the division as requested, and then create some custom content for MultiQC itself :-) https://github.com/ewels/MultiQC/blob/master/docs/custom_content.md
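
For illustration, the custom content file could be written roughly like this. This is a sketch, not a confirmed recipe: the field names (`id`, `plot_type`, `pconfig`, `data`) and the `_mqc.json` file suffix follow my reading of the custom content docs linked above and should be checked against the MultiQC version in use. The `id`, column key and function names are hypothetical.

```python
import json


def endogenous_dna_mqc(sample: str, percent: float) -> dict:
    """Build a custom-content dict adding one column to General Statistics."""
    return {
        "id": "endogenous_dna_example",   # hypothetical section id
        "plot_type": "generalstats",      # assumption: per custom content docs
        "pconfig": {
            "endogenous_dna": {
                "title": "Endogenous DNA (%)",
                "min": 0,
                "max": 100,
                "format": "{:,.2f}",
            }
        },
        "data": {sample: {"endogenous_dna": percent}},
    }


def write_mqc_json(sample: str, percent: float, path: str) -> None:
    """Write the dict to a file; the name is expected to end in _mqc.json."""
    with open(path, "w") as handle:
        json.dump(endogenous_dna_mqc(sample, percent), handle, indent=2)
```

A call such as `write_mqc_json("sampleA", 25.0, "sampleA_endogenous_mqc.json")` in the working directory of the multiqc step would then leave a file for MultiQC to pick up.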

@jfy133
Member

jfy133 commented Feb 29, 2020

Endogenous DNA added. Will close for 2.1, but can reopen in the future for the next MultiQC release.

@jfy133 jfy133 closed this as completed Feb 29, 2020

4 participants