MMULT: multiple sample analysis from large-scale WGBS data

MMULT is a scalable and efficient solution for DNA methylation analysis of large-scale whole-genome bisulfite sequencing (WGBS) data. It consists of several submodules for analyzing multiple samples within a single biological group or across multiple groups, and it is particularly efficient for large cohorts (e.g., hundreds of samples).

Submodules

BBGLM

BBGLM resolves DNA methylation dynamics using a beta-binomial generalized linear model. It uses the Stan library (https://github.com/stan-dev/cmdstan) to fit the model quickly for CpG dinucleotides at whole-genome scale. Example p-value and q-value distributions from BBGLM are shown below.
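For orientation, the beta-binomial likelihood that this kind of model builds on can be written in a few lines. This is a generic textbook sketch, not BBGLM's Stan model: the mean/precision parameterization (`mu`, `phi`) and the function names are assumptions made for illustration.

```python
import math

def betabinom_logpmf(k, n, mu, phi):
    """Log-probability of k methylated reads out of n reads at one CpG.

    mu  : mean methylation level in (0, 1)            (assumed parameterization)
    phi : precision; larger means less overdispersion (assumed parameterization)
    """
    a, b = mu * phi, (1.0 - mu) * phi  # shape parameters of the Beta prior
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den

def group_loglik(counts, mu, phi):
    """Sum the log-likelihood over replicates given as (methylated, total) pairs."""
    return sum(betabinom_logpmf(k, n, mu, phi) for k, n in counts)

# Hypothetical read counts for three replicates at a single CpG.
replicates = [(8, 10), (15, 20), (6, 9)]
print(group_loglik(replicates, mu=0.75, phi=20.0))
```

The extra precision parameter is what lets a beta-binomial absorb replicate-to-replicate variation that a plain binomial would treat as signal.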

Allowed options:
  -h [ --help ]                   Produce help message.
  -m [ --methfile ] arg           Methylation BED files. The BED file is 
                                  generated by `MCALL` in MOABS. Replicates in 
                                  a group are concatenated by comma `,`. 
                                  Multiple groups can be specified. For 
                                  example, `-m g1_r1.bed,g1_r2.bed -m g2_r1.bed
                                  -m g3_r1.bed,g3_r2.bed,g3_r3.bed`.
  -c [ --chrom ] arg              A specific chromosome for analysis. Can be 
                                  specified multiple times for multiple 
                                  chromosomes. The size can be encoded for a 
                                  chromosome. For example, `-c chr1:248956422 
                                  -c chr2:242193529`. The size can be used to 
                                  split a chromosome for running in small 
                                  batches. Default: all chromosomes appear in 
                                  methylation files.
  -l [ --length ] arg (=20000000) Split length of coordinates in a chromosome.
                                  This is necessary for many replicates with 
                                  limited memory. To enable small-batch 
                                  running, size info should be specified by 
                                  `-c|--chrom`. Because the size of chr1 in 
                                  hg38 is >200 million, about 1/10th (20M) 
                                  works well. Default: 20000000.
  -d [ --mindepth ] arg (=1)      Minimum depth for a CpG coverage. Default: 1.
  -t [ --readthreads ] arg (=10)  Number of read threads. Default: 10.
  -b [ --batchthreads ] arg (=5)  Number of batch threads. Default: 5.
  --qval arg (=0.05)              Q-value threshold for DMC. Default: 0.05.
  --nominaldiff arg (=0.2)        Nominal methylation difference threshold for 
                                  DMC. Default: 0.2.
  --maxdistdmcs arg (=300)        Maximum distance between consecutive DMCs for
                                  DMR. Default: 300.
  --mindmc arg (=3)               Minimum number of DMCs in a DMR. Default: 3.
  -o [ --outfile ] arg            Output file.

Examples: 
  bbglm -m g1_r1.bed,g1_r2.bed -m g2_r1.bed -m g3_r1.bed,g3_r2.bed,g3_r3.bed -o output.txt

Date: 2020/08/19
Authors: Jin Li <[email protected]>
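The interplay between the sizes given to `-c|--chrom` and the split length given to `-l|--length` can be pictured as cutting each chromosome into fixed-length coordinate ranges. The sketch below is an illustration only: the helper name `split_chrom` and the 0-based, half-open interval convention are assumptions, not taken from the MMULT source.

```python
def split_chrom(spec, length=20_000_000):
    """Split a 'name:size' chromosome spec into (name, start, end) batches.

    Intervals are 0-based and half-open (an assumed convention); the last
    batch is truncated at the chromosome size.
    """
    name, size_text = spec.split(":")
    size = int(size_text)
    return [(name, start, min(start + length, size))
            for start in range(0, size, length)]

batches = split_chrom("chr1:248956422")
print(len(batches))   # 13 batches for chr1 at the default split length
print(batches[-1])    # ('chr1', 240000000, 248956422)
```

A smaller split length lowers the number of CpGs held in memory per batch at the cost of more batches to schedule.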

CpGCDIFEnrich

The CpGCDIFEnrich module consolidates methylation differences from individual comparisons using KL-divergence. Credible methylation difference (CDIF, https://github.com/sunnyisgalaxy/moabs) represents the methylation difference in an individual comparison between two biological conditions. For a CpG site, CpGCDIFEnrich formulates the consolidated methylation difference as the KL-divergence of the sample CDIFs from the background distribution of CDIFs among whole-genome CpGs. The image below shows the background distribution of CDIFs, one positive CpG with a high KL-divergence value, and a scatterplot of KL-divergence against the sum of CDIFs.
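As a rough illustration of the KL-divergence idea described above, the following sketch bins CDIF values into a histogram and compares one CpG's distribution against a background. It is a minimal sketch under stated assumptions, not CpGCDIFEnrich's implementation: the CDIF range of [-1, 1], the pseudocount used to avoid empty bins, and the function names are all hypothetical.

```python
import math

def histogram(values, numbins=100, lo=-1.0, hi=1.0, pseudocount=1e-6):
    """Bin values into a normalized histogram over [lo, hi].

    The pseudocount (an assumption) keeps every bin strictly positive so the
    KL-divergence below is always finite.
    """
    counts = [pseudocount] * numbins
    width = (hi - lo) / numbins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, numbins - 1))  # clamp boundary values into range
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for two histograms over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical CDIFs: a background centered near zero and one CpG shifted up.
background = histogram([0.0, 0.01, -0.01, 0.02, -0.02, 0.5])
site = histogram([0.5, 0.55, 0.6])
print(kl_divergence(site, background))
```

A CpG whose CDIFs cluster away from the bulk of the background yields a large divergence, while a CpG that mirrors the background yields a value near zero.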

Allowed options:
  -h [ --help ]                   Produce help message.
  -c [ --compfile ] arg           Comparison files. The comparison file is 
                                  generated by `MCOMP` in MOABS. For example, 
                                  `-c H001VsNL -c H002VsNL`.
  -r [ --chrom ] arg              A specific chromosome for analysis. Can be 
                                  specified multiple times for multiple 
                                  chromosomes. The size can be encoded for a 
                                  chromosome. For example, `-r chr1:248956422 
                                  -r chr2:242193529`. The size can be used to 
                                  split a chromosome for running in small 
                                  batches. Default: all chromosomes appear in 
                                  comparison files.
  -l [ --length ] arg (=20000000) Split length of coordinates in a chromosome.
                                  This is necessary for many replicates with 
                                  limited memory. To enable small-batch 
                                  running, size info should be specified by 
                                  `-r|--chrom`. Because the size of chr1 in 
                                  hg38 is >200 million, about 1/10th (20M) 
                                  works well. Default: 20000000.
  -b [ --numbins ] arg (=100)     Number of bins. Default: 100.
  -t [ --numthreads ] arg (=10)   Number of threads. Default: 10.
  --kldthr arg (=0.67957)         KL-divergence threshold for a DMC. A quarter 
                                  of nats. Default: 0.67957.
  --cdifthr arg (=0.2)            CDIF threshold for a DMC. Default: 0.2.
  --maxzerocdif arg (=0.05)       Maximum percent of zero CDIFs for a DMC. A 
                                  CpG with both positive and negative CDIFs 
                                  will be ignored. A negative value will not 
                                  check zero CDIFs. Default: 5%.
  --maxdistdmcs arg (=300)        Maximum distance between consecutive DMCs for
                                  a DMR. Default: 300.
  --mindmc arg (=3)               Minimum number of DMCs in a DMR. Default: 3.
  -o [ --outfile ] arg            Output file.

Examples: 
  cpgcdifenrich -c H001VsNL -c H002VsNL -o output.txt

Date: 2020/08/19
Authors: Jin Li <[email protected]>
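Both `bbglm` and `cpgcdifenrich` expose `--maxdistdmcs` and `--mindmc` for turning DMCs into DMRs. The grouping these two options describe can be sketched as a single pass over sorted positions. This is an illustrative sketch, not the MMULT code: the function name `call_dmrs` is hypothetical, positions are assumed to lie on one chromosome, a gap equal to the maximum distance is assumed to still join, and the direction of the methylation difference is ignored.

```python
def call_dmrs(positions, maxdist=300, mindmc=3):
    """Group DMC positions on one chromosome into (start, end, n_dmcs) regions.

    Consecutive DMCs are chained while their gap is at most `maxdist`;
    chains with fewer than `mindmc` DMCs are discarded.
    """
    dmrs, run = [], []
    for pos in sorted(positions):
        if run and pos - run[-1] > maxdist:
            if len(run) >= mindmc:
                dmrs.append((run[0], run[-1], len(run)))
            run = []  # the gap is too large, so start a new chain
        run.append(pos)
    if len(run) >= mindmc:  # flush the final chain
        dmrs.append((run[0], run[-1], len(run)))
    return dmrs

print(call_dmrs([100, 250, 500, 2000, 2100, 5000]))  # [(100, 500, 3)]
```

In this example the pair at 2000 and 2100 is close enough to chain but falls short of the three-DMC minimum, so only the first cluster is reported.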

VMCVMRNME

This module detects variably methylated CpGs (VMCs) and variably methylated regions (VMRs) among samples in a single biological group. A VMC is a CpG that shows low randomness (small normalized entropy) together with large variation. VMCs and VMRs offer a feasible route to subtype detection in large-scale samples using methylation profiles. An example subtyping result using VMCs is shown below.

Allowed options:
  -h [ --help ]                   Produce help message.
  -m [ --methfile ] arg           Methylation BED files. The BED file is 
                                  generated by `MCALL` in MOABS. Replicates are
                                  concatenated by comma `,`. For example, `-m 
                                  r1.bed,r2.bed,r3.bed`.
  -c [ --chrom ] arg              A specific chromosome for analysis. Can be 
                                  specified multiple times for multiple 
                                  chromosomes. Default: all chromosomes appear 
                                  in methylation BED files.
  -o [ --outfile ] arg            Output file.
  -k [ --state ] arg (=2)         Number of discretization states. Default: 2.
  -w [ --window ] arg (=150)      Window size for genome scan. Default: 150.
  -b [ --mincpg ] arg (=3)        Minimum CpGs in a window. Default: 3.
  -d [ --mindepth ] arg (=3)      Minimum depth for a CpG coverage. Default: 3.
  -t [ --numthreads ] arg (=8)    Number of threads. Default: 8.
  -v [ --vmrmethod ] arg (=0)     VMR detection method. 0: identify VMCs first 
                                  and detect VMRs from consecutive VMCs; 1: 
                                  Genome scan method by fixed-size windows. 
                                  Default: 0.
  -s [ --sd ] arg (=0.2)          sd for VMC. Default: 0.2.
  -n [ --nme ] arg (=0.25)        NME for VMC. Default: 0.25.
  -x [ --maxdistvmcs ] arg (=300) Maximum distance between consecutive VMCs for
                                  VMR. Default: 300.
  --minsample arg (=5)            Minimum samples for a CpG. Default: 5.
  --vmcfile arg                   VMC file.
  --windowfile arg                VMR file by genome scan.

Examples: 
  vmcvmrnme -m r1.bed,r2.bed,r3.bed -o output.txt

Date: 2020/05/20
Authors: Jin Li <[email protected]>
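To make the VMC criterion concrete (small normalized entropy together with large variation), the sketch below applies the documented defaults of `--sd`, `--nme`, and `--minsample` to the methylation levels of one CpG. How VMCVMRNME actually computes NME is not described here, so the mean per-sample binary entropy used below is an assumption, as are the use of the population standard deviation and the function names.

```python
import math
import statistics

def binary_entropy(p):
    """Shannon entropy in bits of a methylation level p; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def is_vmc(levels, sd_thr=0.2, nme_thr=0.25, minsample=5):
    """Flag a CpG as a VMC from its per-sample methylation levels.

    Assumed reading of the criterion: the levels vary strongly across samples
    (sd >= sd_thr) while each sample is close to fully methylated or fully
    unmethylated (mean entropy <= nme_thr).
    """
    if len(levels) < minsample:
        return False  # too few samples cover this CpG
    sd = statistics.pstdev(levels)
    nme = sum(binary_entropy(p) for p in levels) / len(levels)
    return sd >= sd_thr and nme <= nme_thr

print(is_vmc([0.0, 1.0, 0.0, 1.0, 1.0, 0.0]))  # True: bimodal across samples
print(is_vmc([0.5] * 6))                       # False: no variation
```

Under this reading, a CpG that is cleanly methylated in some samples and unmethylated in others qualifies, which is the pattern that would separate subtypes.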

Installation

Installing MMULT via Bioconda is recommended because Conda installs the runtime dependencies automatically:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install MMULT

The dependencies are listed below.

Software	URL
boost	https://www.boost.org
sundials	https://github.com/LLNL/sundials
rapidjson	https://github.com/Tencent/rapidjson
eigen	http://eigen.tuxfamily.org
tbb-devel	https://github.com/oneapi-src/oneTBB

Contact

Maintainer: Jin Li, [email protected]. PI: De-Qiang Sun, [email protected].
