MMULT: multiple sample analysis from large-scale WGBS data


MMULT is a scalable and efficient solution for DNA methylation analysis of large-scale whole-genome bisulfite sequencing (WGBS) data. It consists of several submodules for analyzing multiple samples, either within a single biological group or across multiple groups. MMULT is particularly efficient for large cohorts (e.g., hundreds of samples).

Submodules

  1. BBGLM

BBGLM resolves DNA methylation dynamics using a beta-binomial generalized linear model. It uses the Stan library (CmdStan, https://github.com/stan-dev/cmdstan) to fit the model quickly for CpG dinucleotides at whole-genome scale. Example p-value and q-value distributions from BBGLM are shown below.

[Figure: BBGLM_pvalue_qvalue]
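To make the beta-binomial idea concrete, here is a minimal sketch of a beta-binomial log-likelihood for methylated/total read counts at one CpG. It is not BBGLM's implementation: the mean/dispersion parameterization (`mu`, `phi`) and the function names are assumptions chosen for illustration, and the actual model is fitted through Stan.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, mu, phi):
    """Log-probability of k methylated reads out of n total reads.

    mu  : mean methylation level, 0 < mu < 1 (assumed parameterization)
    phi : dispersion, 0 < phi < 1 (assumed parameterization)
    """
    # Convert mean/dispersion to the usual shape parameters alpha and beta.
    scale = (1.0 - phi) / phi
    alpha, beta = mu * scale, (1.0 - mu) * scale
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def group_loglik(counts, mu, phi):
    """Sum the log-likelihood over replicates; counts is a list of (k, n)."""
    return sum(betabinom_logpmf(k, n, mu, phi) for k, n in counts)
```

For two replicates with counts `[(8, 10), (9, 10)]`, `group_loglik` is larger for `mu=0.85` than for `mu=0.2`, which is the kind of contrast a group-level test builds on.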

Allowed options:
  -h [ --help ]                   Produce help message.
  -m [ --methfile ] arg           Methylation BED files. The BED file is 
                                  generated by `MCALL` in MOABS. Replicates in 
                                  a group are concatenated by comma `,`. 
                                  Multiple groups can be specified. For 
                                  example, `-m g1_r1.bed,g1_r2.bed -m g2_r1.bed
                                  -m g3_r1.bed,g3_r2.bed,g3_r3.bed`.
  -c [ --chrom ] arg              A specific chromosome for analysis. Can be 
                                  specified multiple times for multiple 
                                  chromosomes. The size can be encoded for a 
                                  chromosome. For example, `-c chr1:248956422 
                                  -c chr2:242193529`. The size can be used to 
                                  split a chromosome for running in small 
                                  batches. Default: all chromosomes appear in 
                                  methylation files.
  -l [ --length ] arg (=20000000) Split length of coordinates in a chromosome.
                                  This is necessary for many replicates with a 
                                  limited memory. To enable small-batch 
                                  running, size info should be specified by 
                                  `-c|--chrom`. Because the size of chr1 in 
                                  hg38 is >200 million, 1/10th (20M) can be 
                                  good to go. Default: 20000000.
  -d [ --mindepth ] arg (=1)      Minimum depth for a CpG coverage. Default: 1.
  -t [ --readthreads ] arg (=10)  Number of read threads. Default: 10.
  -b [ --batchthreads ] arg (=5)  Number of batch threads. Default: 5.
  --qval arg (=0.05)              Q-value threshold for DMC. Default: 0.05.
  --nominaldiff arg (=0.2)        Nominal methylation difference threshold for 
                                  DMC. Default: 0.2.
  --maxdistdmcs arg (=300)        Maximum distance between consecutive DMCs for
                                  DMR. Default: 300.
  --mindmc arg (=3)               Minimum number of DMCs in a DMR. Default: 3.
  -o [ --outfile ] arg            Output file.

Examples: 
  bbglm -m g1_r1.bed,g1_r2.bed -m g2_r1.bed -m g3_r1.bed,g3_r2.bed,g3_r3.bed -o output.txt

Date: 2020/08/19
Authors: Jin Li <[email protected]>
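
The interaction between `-c|--chrom` sizes and `-l|--length` can be sketched as follows. This is only an illustration of the batching idea described in the help text; the function names and the half-open, zero-based coordinate convention are assumptions, not taken from the BBGLM source.

```python
def parse_chrom(spec):
    """Parse 'chr1:248956422' into ('chr1', 248956422).

    The size is None when the spec carries no ':size' suffix.
    """
    name, sep, size = spec.partition(":")
    return name, (int(size) if sep else None)


def split_batches(size, length=20_000_000):
    """Return half-open [start, end) windows that cover [0, size)."""
    return [(start, min(start + length, size)) for start in range(0, size, length)]


name, size = parse_chrom("chr1:248956422")
batches = split_batches(size)  # default split length of 20,000,000
print(name, len(batches), batches[-1])
```

With the hg38 chr1 size and the default length, this yields 13 windows, the last one shorter than the others.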

  2. CpGCDIFEnrich

The CpGCDIFEnrich module consolidates methylation differences from individual comparisons using KL-divergence. The credible methylation difference (CDIF, https://github.com/sunnyisgalaxy/moabs) represents the methylation difference in an individual comparison between two biological conditions. For each CpG site, CpGCDIFEnrich computes the consolidated methylation difference as the KL-divergence of the sample CDIFs relative to the background distribution of CDIFs across whole-genome CpGs. The image below shows the background distribution of CDIFs, one positive CpG with a high KL-divergence value, and a scatterplot of KL-divergence against the sum of CDIFs.

[Figure: cpgcdifenrich_hist]
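The KL-divergence comparison described above can be sketched as a histogram-based calculation. This is an illustrative sketch, not the module's code: the binning range of [-1, 1], the pseudocount smoothing, and the function names are assumptions; only the idea of comparing per-CpG CDIFs to a genome-wide background in a fixed number of bins comes from the description.

```python
from math import log


def histogram(values, numbins=100, lo=-1.0, hi=1.0):
    """Count values into numbins equal-width bins over [lo, hi]."""
    counts = [0] * numbins
    width = (hi - lo) / numbins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, numbins - 1))  # clamp, so v == hi lands in the last bin
        counts[idx] += 1
    return counts


def kl_divergence(sample_counts, background_counts, pseudocount=1e-6):
    """KL(P || Q) in nats, with P from the sample and Q from the background."""
    n = len(sample_counts)
    p_total = sum(sample_counts) + pseudocount * n
    q_total = sum(background_counts) + pseudocount * n
    kld = 0.0
    for pc, qc in zip(sample_counts, background_counts):
        p = (pc + pseudocount) / p_total
        q = (qc + pseudocount) / q_total
        kld += p * log(p / q)
    return kld
```

A CpG whose CDIFs cluster far from the bulk of the background (for example, all near 0.8) gets a large value, while a CpG whose CDIFs follow the background gets a value near zero.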

Allowed options:
  -h [ --help ]                   Produce help message.
  -c [ --compfile ] arg           Comparison files. The comparison file is 
                                  generated by `MCOMP` in MOABS. For example, 
                                  `-c H001VsNL -c H002VsNL`.
  -r [ --chrom ] arg              A specific chromosome for analysis. Can be 
                                  specified multiple times for multiple 
                                  chromosomes. The size can be encoded for a 
                                  chromosome. For example, `-r chr1:248956422 
                                  -r chr2:242193529`. The size can be used to 
                                  split a chromosome for running in small 
                                  batches. Default: all chromosomes appear in 
                                  comparison files.
  -l [ --length ] arg (=20000000) Split length of coordinates in a chromosome.
                                  This is necessary for many replicates with a 
                                  limited memory. To enable small-batch 
                                  running, size info should be specified by 
                                  `-r|--chrom`. Because the size of chr1 in 
                                  hg38 is >200 million, 1/10th (20M) can be 
                                  good to go. Default: 20000000.
  -b [ --numbins ] arg (=100)     Number of bins. Default: 100.
  -t [ --numthreads ] arg (=10)   Number of threads. Default: 10.
  --kldthr arg (=0.67957)         KL-divergence threshold for a DMC. A quarter 
                                  of nats. Default: 0.67957.
  --cdifthr arg (=0.2)            CDIF threshold for a DMC. Default: 0.2.
  --maxzerocdif arg (=0.05)       Maximum percent of zero CDIFs for a DMC. A 
                                  CpG with both positive and negative CDIFs 
                                  will be ignored. A negative value will not 
                                  check zero CDIFs. Default: 5%.
  --maxdistdmcs arg (=300)        Maximum distance between consecutive DMCs for
                                  a DMR. Default: 300.
  --mindmc arg (=3)               Minimum number of DMCs in a DMR. Default: 3.
  -o [ --outfile ] arg            Output file.

Examples: 
  cpgcdifenrich -c H001VsNL -c H002VsNL -o output.txt

Date: 2020/08/19
Authors: Jin Li <[email protected]>
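
Both BBGLM and CpGCDIFEnrich expose `--maxdistdmcs` and `--mindmc` for building DMRs from DMCs. One plausible reading of these options is sketched below: consecutive DMCs are chained while the gap stays within the maximum distance, and a chain is reported only if it contains enough DMCs. The function name, the inclusive distance comparison, and the omission of any direction (hyper/hypo) check are assumptions.

```python
def group_dmcs(positions, maxdist=300, mindmc=3):
    """Group sorted DMC positions on one chromosome into DMRs.

    Returns a list of (start, end, number_of_dmcs) tuples.
    """
    regions = []
    run = []
    for pos in positions:
        # A gap larger than maxdist closes the current run.
        if run and pos - run[-1] > maxdist:
            if len(run) >= mindmc:
                regions.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    # Flush the final run.
    if len(run) >= mindmc:
        regions.append((run[0], run[-1], len(run)))
    return regions


print(group_dmcs([100, 250, 400, 1000, 1100, 5000]))
```

In this example only the first three positions form a region; the pair at 1000 and 1100 is too small, and 5000 is isolated.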

  3. VMCVMRNME

This module detects variably methylated CpGs (VMCs) and variably methylated regions (VMRs) among samples in a single biological group. A VMC is a CpG that shows low randomness (small normalized entropy, NME) together with large variation (standard deviation, sd). VMCs and VMRs offer a practical way to detect subtypes in large cohorts from methylation profiles. An example subtyping result using VMCs is shown below.

[Figure: cc_vmcvmrnme]

Allowed options:
  -h [ --help ]                   Produce help message.
  -m [ --methfile ] arg           Methylation BED files. The BED file is 
                                  generated by `MCALL` in MOABS. Replicates are
                                  concatenated by comma `,`. For example, `-m 
                                  r1.bed,r2.bed,r3.bed`.
  -c [ --chrom ] arg              A specific chromosome for analysis. Can be 
                                  specified multiple times for multiple 
                                  chromosomes. Default: all chromosomes appear 
                                  in methylation BED files.
  -o [ --outfile ] arg            Output file.
  -k [ --state ] arg (=2)         Number of discretization states. Default: 2.
  -w [ --window ] arg (=150)      Window size for genome scan. Default: 150.
  -b [ --mincpg ] arg (=3)        Minimum CpGs in a window. Default: 3.
  -d [ --mindepth ] arg (=3)      Minimum depth for a CpG coverage. Default: 3.
  -t [ --numthreads ] arg (=8)    Number of threads. Default: 8.
  -v [ --vmrmethod ] arg (=0)     VMR detection method. 0: identify VMCs first 
                                  and detect VMRs from consecutive VMCs; 1: 
                                  Genome scan method by fixed-size windows. 
                                  Default: 0.
  -s [ --sd ] arg (=0.2)          sd for VMC. Default: 0.2.
  -n [ --nme ] arg (=0.25)        NME for VMC. Default: 0.25.
  -x [ --maxdistvmcs ] arg (=300) Maximum distance between consecutive VMCs for
                                  VMR. Default: 300.
  --minsample arg (=5)            Minimum samples for a CpG. Default: 5.
  --vmcfile arg                   VMC file.
  --windowfile arg                VMR file by genome scan.

Examples: 
  vmcvmrnme -m r1.bed,r2.bed,r3.bed -o output.txt

Date: 2020/05/20
Authors: Jin Li <[email protected]>
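
The README does not spell out how NME is computed, so the following is only a sketch of the VMC criterion under explicit assumptions: each sample's methylation ratio is treated as a two-state (methylated/unmethylated) distribution, NME is the mean binary entropy across samples, and sd is the population standard deviation of the ratios. The thresholds mirror the `--sd`, `--nme`, and `--minsample` defaults; the function names and comparison directions are hypothetical.

```python
from math import log2
from statistics import pstdev


def binary_entropy(p):
    """Entropy in bits of a two-state distribution; already in [0, 1]."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p))


def is_vmc(ratios, sd_thr=0.2, nme_thr=0.25, minsample=5):
    """Flag a CpG as variably methylated from per-sample methylation ratios."""
    if len(ratios) < minsample:
        return False
    mean_nme = sum(binary_entropy(p) for p in ratios) / len(ratios)
    # Large spread across samples, but each sample is close to 0 or 1.
    return pstdev(ratios) >= sd_thr and mean_nme <= nme_thr
```

Under these assumptions, a CpG that is nearly fully methylated in some samples and nearly unmethylated in others is flagged, while a CpG sitting near 0.5 in every sample is not.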

Installation

We recommend installing MMULT via Bioconda, because Conda installs the runtime dependencies automatically:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install MMULT

The dependencies are listed below.

Software    URL
boost       https://www.boost.org
sundials    https://github.com/LLNL/sundials
rapidjson   https://github.com/Tencent/rapidjson
eigen       http://eigen.tuxfamily.org
tbb-devel   https://github.com/oneapi-src/oneTBB

Contact

Maintainer: Jin Li, [email protected]. PI: De-Qiang Sun, [email protected].
