#Pipeliner
Pipeliner provides access to some of the NGS data analysis pipelines used by CCBR on the NIH Biowulf Linux Cluster. Pipeliner allows the user to select a set of data files, such as fastq sequence reads, and process them via a sequence of programs that constitute an analysis 'pipeline' to reach an endpoint, such as a list of mutations with accompanying annotations. The program provides a graphical interface in which pipeline options are configured using pulldowns and text entry fields. Once configured, the pipeline is executed on the Biowulf cluster.
##Requirements
- An account on the NIH Biowulf Linux Cluster
Pipeliner runs on the NIH Biowulf cluster and uses a graphical interface. To use the program, log in to Biowulf using ssh with X11 forwarding enabled. For instance:
ssh -Y [email protected]
Once logged in, launch the Pipeliner program:
/path/to/pipeliner/install/runpipe.sh
If you are running your own installation of Pipeliner, you will need to edit the paths to the programs and files given in these configuration files located within the Pipeliner installation directory:
- hg19.json
- mm10.json
- standard-bin.json
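
After editing these files, it can help to confirm that every path they mention actually exists on your system. The following is a minimal sketch, not part of Pipeliner: it assumes only that the files are valid JSON and that program and reference locations are stored as absolute path strings somewhere in the structure. The function names `find_missing_paths` and `check_config_file` are hypothetical.

```python
import json
import os


def find_missing_paths(config, prefix=""):
    """Recursively collect (location, path) pairs for absolute paths that do not exist."""
    missing = []
    if isinstance(config, dict):
        for key, value in config.items():
            missing.extend(find_missing_paths(value, f"{prefix}/{key}"))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            missing.extend(find_missing_paths(value, f"{prefix}[{index}]"))
    elif isinstance(config, str) and os.path.isabs(config):
        # Assumption: only absolute path strings are checked; other values are ignored.
        if not os.path.exists(config):
            missing.append((prefix, config))
    return missing


def check_config_file(json_path):
    """Load one JSON configuration file and report its missing paths."""
    with open(json_path) as handle:
        return find_missing_paths(json.load(handle))
```

Calling `check_config_file("hg19.json")` from the installation directory would return an empty list when every absolute path in the file resolves.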
Within the Pipeliner program, the following steps must be executed in order:
- Set the working directory--this is the directory in which all output files will appear and must already exist
- Set the data source--this is the directory containing the starting data, which is generally a set of files containing paired-end fastq reads
- Initialize the working directory--this step first clears the working directory, then makes a few subdirectories within it and populates these with a number of files needed to run the pipelines
- Link the data files to the working directory--this step creates symbolic links from the data files into the working directory to prevent the need to copy or move large files
- Perform a dry run to verify that the pipeline is configured properly--if the dry run produces errors, these must be corrected prior to launching
- Launch the pipeline--this submits the pipeline job to the Biowulf cluster
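
The initialization and linking steps above can be sketched in a few lines of Python. This is an illustration of the described behavior, not Pipeliner's own code: the function names, the subdirectory names, and the `*.fastq*` file pattern are all assumptions.

```python
import shutil
from pathlib import Path


def initialize_workdir(workdir, subdirs=("QC", "Reports", "Scripts")):
    """Clear an existing working directory, then recreate its subdirectories.

    The subdirectory names are assumed for illustration only.
    """
    workdir = Path(workdir)
    if not workdir.is_dir():
        raise FileNotFoundError(f"working directory must already exist: {workdir}")
    for entry in workdir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    for name in subdirs:
        (workdir / name).mkdir()


def link_data_files(datadir, workdir, pattern="*.fastq*"):
    """Symlink matching data files into workdir instead of copying them."""
    links = []
    for source in sorted(Path(datadir).glob(pattern)):
        target = Path(workdir) / source.name
        target.symlink_to(source.resolve())
        links.append(target)
    return links
```

Because initialization deletes everything in the working directory, a sketch like this should only ever be pointed at a directory reserved for pipeline output.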
Pipeliner uses a program called Snakemake to manage pipeline workflows. Snakemake, in turn, accepts a configuration file formatted in JSON (JavaScript Object Notation).
hg19.json
:references for human genome version 19/GRCh37
standard-bin.json
:paths to programs used in the pipeline
rules.json
:lists of Snakemake rules assigned to named pipelines
cluster.json
:SLURM parameters for each rule (memory requested, time, # cpus)
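
To illustrate how per-rule SLURM parameters might be turned into job submission options, here is a minimal sketch. The key names `mem`, `time`, and `threads`, and the `__default__` fallback entry, are assumptions about the layout of cluster.json; check the file in your installation for the actual keys. The `--mem`, `--time`, and `--cpus-per-task` options are standard `sbatch` options.

```python
def sbatch_options(cluster_config, rule):
    """Build an sbatch option string for a rule, falling back to default values."""
    # Assumption: a "__default__" entry supplies values that a rule does not override.
    params = dict(cluster_config.get("__default__", {}))
    params.update(cluster_config.get(rule, {}))
    options = []
    if "mem" in params:
        options.append(f"--mem={params['mem']}")
    if "time" in params:
        options.append(f"--time={params['time']}")
    if "threads" in params:
        options.append(f"--cpus-per-task={params['threads']}")
    return " ".join(options)


# Hypothetical configuration; the rule name and values are illustrative only.
example = {
    "__default__": {"mem": "8g", "time": "04:00:00", "threads": 2},
    "bwa_mem": {"mem": "32g", "threads": 16},
}
print(sbatch_options(example, "bwa_mem"))
```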
pipeliner.py
:main program, including GUI components
makeasnake.py
:called by Pipeliner to build Snakefiles required by Snakemake
stats2html.py
:builds reports at end of a pipeline run
Reports
:contains scripts for aggregate report generation and aggregate reports created by Pipeliner
QC
:contains QC reports generated during various pipeline steps
Scripts
:contains custom scripts used by some pipelines
Data
:resides within the Pipeliner installation directory and contains files, in fasta format, specifying adapter sequences to be trimmed from reads. These should be referenced in the appropriate reference json file, e.g. hg19.json.
##Pipelines
- ExomeSeqPairs
- ExomeSeqGermline
- RNASeq (counts-based)
- ChipSeq
- mirSeq

##Genomes
- Human (hg19)
- Mouse (mm10)
You can configure new pipelines by adding Snakemake rules to the Rules directory and referencing them in 'rules.json'. In that file, each key is the name of a rule and its value is the list of pipelines that require the rule. To make a rule a requirement of a pipeline, add the rule name as a key (if it is not already present) and add the pipeline name to that key's list. A new entry for the pipeline must also be added to the pulldown in the GUI, which requires editing pipeliner.py.
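
The rules.json layout described above (rule name as key, list of pipeline names as value) can be edited by hand or with a short script. The sketch below assumes exactly that layout and nothing more. The helper names and the rule and pipeline names in the example are hypothetical.

```python
import json


def add_rule(rules, rule, pipeline):
    """Register `rule` as a requirement of `pipeline`; repeated calls are harmless."""
    pipelines = rules.setdefault(rule, [])
    if pipeline not in pipelines:
        pipelines.append(pipeline)
    return rules


def rules_for(rules, pipeline):
    """Return the names of all rules required by `pipeline`, in key order."""
    return [rule for rule, pipelines in rules.items() if pipeline in pipelines]


# Hypothetical content standing in for rules.json.
rules = {"trim_adapters": ["ExomeSeqPairs", "RNASeq"], "align_reads": ["ExomeSeqPairs"]}
add_rule(rules, "trim_adapters", "MyNewPipeline")
add_rule(rules, "call_peaks", "MyNewPipeline")
print(json.dumps(rules, indent=2))
print(rules_for(rules, "MyNewPipeline"))
```

In practice the dictionary would be read from rules.json with `json.load` and written back with `json.dump` after the change.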