Copyright 2017: Alexander Gosdschan, Katarzyna Wreczycka, Bren Osberg, Ricardo Wurmus. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.
PiGx is a data processing pipeline for raw fastq read data of bisulfite experiments; it produces reports on aggregate methylation and coverage and can be used to produce information on differential methylation and segmentation. It was first developed by the Akalin group at MDC in Berlin in 2017.
The figure below provides a sketch of the process.
PiGx uses the GNU build system. To install PiGx from source, make sure that all required dependencies are installed, then unpack the latest release tarball and run the following commands:
./configure --prefix=/some/where
make install
By default the configure script expects tools to be in a directory listed in the PATH environment variable. If the tools are installed in a location that is not on the PATH, you can tell the configure script about them with variables. Run ./configure --help for a list of all variables and options.
The following tools must be available:
- fastqc
- trim_galore
- cutadapt
- bismark_genome_preparation
- deduplicate_bismark
- bismark
- bowtie2
- samtools [>=1.3]
- snakemake
- Python [>=3.5]
- pandoc
- pandoc-citeproc
- R
- methylKit [>=1.3.1]
- genomation
- GenomeInfoDb
- DT
- AnnotationHub
- rtracklayer
- rmarkdown [>=1.5]
- bookdown
All of these dependencies must be present in the environment at configuration time.
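Before running the configure script, it can help to check which command-line tools are already visible on the PATH. The following is a minimal sketch, not part of PiGx: the helper name `find_missing_tools` is hypothetical, and the check only covers executables, so R packages such as methylKit and version requirements such as samtools >= 1.3 are not verified.

```python
import shutil

# Command-line tools taken from the list above. R packages (methylKit,
# genomation, ...) are not executables and are deliberately left out.
REQUIRED_TOOLS = [
    "fastqc", "trim_galore", "cutadapt", "bismark_genome_preparation",
    "deduplicate_bismark", "bismark", "bowtie2", "samtools", "snakemake",
    "pandoc", "pandoc-citeproc", "R",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All command-line tools were found on PATH.")
```

Tools reported as missing may still be installed elsewhere; in that case their location can be passed to the configure script through the variables listed by ./configure --help.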
You can install PiGx through Guix (TODO: add details here after release).
Run the configure script to probe your environment for the tools needed by the pipeline. If you prefer not to install all packages manually, we recommend using GNU Guix. The following command spawns a sub-shell in which all dependencies are available:
guix environment -l guix.scm
To run PiGx on your experimental data, first enter the necessary parameters in the tablesheet file (see the following section), and then run the following from the terminal:
$ pigx_bs [options] tablesheet
To see all available options, use the --help option:
$ pigx_bs --help
usage: pigx_bs [-h] [-v] [-p PROGRAMS] [-c CONFIGFILE] [-s SNAKEPARAMS]
               tablesheet

PiGx BSseq Pipeline.

PiGx is a data processing pipeline for raw fastq read data of
bisulfite experiments. It produces methylation and coverage
information and can be used to produce information on differential
methylation and segmentation.

positional arguments:
  tablesheet    The tablesheet containing the basic configuration information for
                running the pipeline.

optional arguments:
  -h, --help    show this help message and exit
  -v, --version show program's version number and exit
  -p PROGRAMS, --programs PROGRAMS
                A JSON file containing the absolute paths of the required tools.
  -c CONFIGFILE, --configfile CONFIGFILE
                The config file used for calling the underlying snakemake process. By
                default the file 'config.json' is dynamically created from tablesheet
                and programs file.
  -s SNAKEPARAMS, --snakeparams SNAKEPARAMS
                Additional parameters to be passed down to snakemake, e.g.
                    --dryrun    do not execute anything
                    --forceall  re-run the whole pipeline
The input parameters specifying the desired behaviour of PiGx should be entered into the tablesheet file. When PiGx is run, the data from this file will be used to automatically generate a configuration file.
Here is an example tablesheet:
[ GENERAL PARAMETERS ]
PATHIN="in/"
PATHOUT="out/"
GENOMEPATH="genome/"
GENOME_VERSION="hg19"
bismark_args=" -N 0 -L 20 "
fastqc_args=""
trim_galore_args=""
bam_methCall_args_mincov="0"
bam_methCall_args_minqual="10"
NICE="19"
numjobs="6"
cluster_run="FALSE"
contact_email="NONE"
bismark_cores="3"
bismark_MEM="19G"
MEM_default="8G"
qname="all"
h_stack="128m"
diffmeth_cores="20"
[ SAMPLES ]
Read1,Read2,SampleID,ReadType,Treatment
PE_1.fq.gz,PE_2.fq.gz,PEsample,WGBS,0
SE_techrep1.fq.gz,,SEsample,WGBS,1
SE_techrep2.fq.gz,,SEsample_v2,WGBS,2
[ DIFFERENTIAL METHYLATION ]
0, 1
The tablesheet contains three sections:
- general parameters,
- a table with sample-specific information containing the names of the fastq files, unique sample IDs, the type of bisulfite sequencing experiment (RRBS or WGBS; only WGBS is available right now) and the treatment group used for differential methylation detection,
- treatment groups considered for differential methylation detection
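To make the three-section layout concrete, here is a minimal sketch of how such a file could be split apart. It is not PiGx's own parser: `parse_tablesheet` is a hypothetical helper, and it assumes that section headers are written in square brackets, that general parameters use the KEY="value" form, and that the other two sections are comma-separated, as in the example tablesheet.

```python
import csv


def parse_tablesheet(text):
    """Split tablesheet text into (params, samples, comparisons).

    Hypothetical helper for illustration only; not PiGx's own parser.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A header such as "[ SAMPLES ]" opens a new section.
            current = line[1:-1].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    # General parameters: KEY="value"; the surrounding quotes are removed
    # but spaces inside the quotes are preserved.
    params = {}
    for line in sections.get("GENERAL PARAMETERS", []):
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip().strip('"')

    # Samples: a comma-separated table whose first row is the header.
    samples = list(csv.DictReader(sections.get("SAMPLES", [])))

    # Differential methylation: one comma-separated list of groups per line.
    comparisons = [
        [group.strip() for group in line.split(",")]
        for line in sections.get("DIFFERENTIAL METHYLATION", [])
    ]
    return params, samples, comparisons


EXAMPLE = """\
[ GENERAL PARAMETERS ]
PATHIN="in/"
bismark_args=" -N 0 -L 20 "
[ SAMPLES ]
Read1,Read2,SampleID,ReadType,Treatment
PE_1.fq.gz,PE_2.fq.gz,PEsample,WGBS,0
SE_techrep1.fq.gz,,SEsample,WGBS,1
[ DIFFERENTIAL METHYLATION ]
0, 1
"""

if __name__ == "__main__":
    params, samples, comparisons = parse_tablesheet(EXAMPLE)
    print(params)
    print([sample["SampleID"] for sample in samples])
    print(comparisons)
```

Single-end samples leave the Read2 column empty, so the corresponding value in the parsed row is an empty string.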
The general parameters section must contain the following variables:
Variable name | description |
---|---|
PATHIN | string: location of all experimental input data files (.fastq, .fastq.gz or .fastq.bz2) |
PATHOUT | string: ultimate location of the output data and report files |
GENOMEPATH | string: location of the reference genome data for alignment |
GENOME_VERSION | string: a UCSC assembly release name, e.g. "hg19" |
bismark_args | string: optional arguments supplied to bismark during alignment, e.g. " -N 0 -L 20 ". See the Bismark User Guide. |
fastqc_args | string: optional arguments supplied to FastQC during quality control, e.g. "". See the FastQC documentation. |
trim_galore_args | string: optional arguments supplied to Trim Galore! during read trimming, e.g. "". See the Trim Galore! documentation. |
bam_methCall_args_mincov | string: minimum read coverage required for inclusion in the methylKit objects; defaults to 10. Any methylated base or region in the text files with coverage below this value is ignored. |
bam_methCall_args_minqual | string: minimum phred quality score to call a methylation status for a base, e.g. "10" |
cluster_run | string: a boolean indicating whether the pipeline should be run on a cluster, e.g. "FALSE" |
numjobs | string: number of jobs sent to cluster, e.g. "6" |
contact_email | string: email address to which information about cluster job is sent |
bismark_cores | string: number of cores used by bismark, e.g. "3" |
bismark_MEM | string: amount of memory used by bismark, e.g. "19G" |
MEM_default | string: amount of memory used for all jobs besides bismark, e.g. "8G" |
qname | string: queue name (used for cluster jobs), e.g. "all" |
h_stack | string: stack size limit (used for cluster jobs), e.g. "128m" |
diffmeth_cores | integer: number of cores used for parallel differential methylation calculations, e.g. "20" |
NICE | integer: from -20 to 19; higher values give the pipeline lower scheduling priority, making its execution less demanding on computational resources, e.g. "19" |
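The table above can be turned into a quick sanity check before starting a run. The sketch below is illustrative and not part of PiGx: `check_general_parameters` is a hypothetical helper that takes the general parameters as a plain dict of strings, reports missing variables, and checks that the values described as numbers parse as integers, with NICE restricted to the documented range of -20 to 19.

```python
# Variable names taken from the table above.
REQUIRED_VARIABLES = [
    "PATHIN", "PATHOUT", "GENOMEPATH", "GENOME_VERSION", "bismark_args",
    "fastqc_args", "trim_galore_args", "bam_methCall_args_mincov",
    "bam_methCall_args_minqual", "cluster_run", "numjobs", "contact_email",
    "bismark_cores", "bismark_MEM", "MEM_default", "qname", "h_stack",
    "diffmeth_cores", "NICE",
]

# Assumption: these variables hold whole numbers, as their descriptions
# and example values suggest.
NUMERIC_VARIABLES = [
    "bam_methCall_args_mincov", "bam_methCall_args_minqual", "numjobs",
    "bismark_cores", "diffmeth_cores", "NICE",
]


def check_general_parameters(params):
    """Return a list of human-readable problems; empty if none were found."""
    problems = [
        "missing variable: " + name
        for name in REQUIRED_VARIABLES
        if name not in params
    ]
    for name in NUMERIC_VARIABLES:
        if name not in params:
            continue  # already reported as missing
        try:
            value = int(params[name])
        except (TypeError, ValueError):
            problems.append(
                "{} is not an integer: {!r}".format(name, params[name])
            )
            continue
        if name == "NICE" and not -20 <= value <= 19:
            problems.append(
                "NICE must be between -20 and 19, got {}".format(value)
            )
    return problems


if __name__ == "__main__":
    for problem in check_general_parameters({"PATHIN": "in/", "NICE": "25"}):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets all issues in a tablesheet be fixed in a single pass.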
Make sure that all input files (paired or single end) are present in the folder indicated by PATHIN. All output produced by the pipeline will be written to the folder indicated by PATHOUT, with subdirectories corresponding to the various stages of the process. The directory pointed to by GENOMEPATH has to contain the reference genome being mapped to.
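A simple pre-flight check can confirm that every read file named in the samples table is actually present in PATHIN. The following sketch is not part of PiGx: `missing_input_files` is a hypothetical helper, and it assumes each sample is given as a dict keyed by the column names of the samples table (Read1, Read2), with Read2 left empty for single-end samples.

```python
from pathlib import Path


def missing_input_files(pathin, samples):
    """List read files named in the samples that are absent from pathin."""
    missing = []
    for sample in samples:
        for column in ("Read1", "Read2"):
            name = sample.get(column, "")
            # An empty Read2 marks a single-end sample and is skipped.
            if name and not (Path(pathin) / name).is_file():
                missing.append(name)
    return missing


if __name__ == "__main__":
    samples = [
        {"Read1": "PE_1.fq.gz", "Read2": "PE_2.fq.gz", "SampleID": "PEsample"},
        {"Read1": "SE_techrep1.fq.gz", "Read2": "", "SampleID": "SEsample"},
    ]
    for name in missing_input_files("in/", samples):
        print("missing input file: " + name)
```

The same idea extends to GENOMEPATH, although which files count as a valid reference genome depends on the aligner and is not checked here.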