
NASC-seq data processing pipeline

This wiki explains the basics of the NASC-seq data processing pipeline.

An example of basic usage of the NASC-seq data processing pipeline is shown below. For more details see one of the following topics:

NASC-seq experimental protocol

Installation

NASC-seq data analysis with Amazon Web Services

NASC-seq pipeline required file structures

NASC-seq data sources


Usage

-h, --help show this help message and exit

-e, --experimentdir [dir] experiment directory. See NASC-seq required file structure for more information.

-p, --numCPU [integer] number of CPUs to use

-f, --flag [flag] flag specifying the step in the NASC-seq pipeline that will be performed:

trim Trim fastq files using trimgalore

align Align fastq files to hg38 using STAR (uses 4 threads / cell)

removegenome Removes genome from shared memory

removeduplicates Removes duplicates from aligned bam files using Picard

annotate Annotate features in bam files using Rsubread and index files

conversiontag Tag conversions in bam file headers

vcfFilter Select possible SNPs from shared mismatches between cells

tagFilter Remove SNPs from conversion tags and index files (uses 1 thread / cell)

cellQC Perform basic QC visualization to decide on quality cutoffs, which can then be added to the config.py file

cellFilter Filter bam files based on QC cutoffs in the config.py file

calculatePE Calculate error probability

prepareData Prepares pickles from data for scalable processing using AWS or similar

processData Processes pickles and creates the output pickle that will be used for data summarization. Processes 1 cell at a time on the given number of threads (-p / --numCPU) and loops over all cells in the prepared data.

summarize Summarize corrected data and prepare files with new and old reads, as well as additional files with the modes, confidence intervals, standard deviations and means of the estimates.
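
As a minimal sketch of what a single invocation looks like, the command below runs one step of the pipeline. The entry-point script name (NASCseqMain.py) and the paths are assumptions for illustration only; substitute the actual script and paths from your installation.

    import subprocess

    # Run the 'trim' step on an experiment directory using 8 CPUs.
    # 'NASCseqMain.py' is a placeholder for the actual pipeline entry point.
    subprocess.run(
        ["python", "NASCseqMain.py",
         "-e", "/path/to/experimentdir",
         "-p", "8",
         "-f", "trim"],
        check=True,  # raise an error if the step fails
    )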

Performing NASC-seq analysis

A configuration file should be prepared in the root of the experiment directory. For the layout of the configuration file, see the example config file (/NASC-seq/data/config_example.py). In addition to the locations of some of the dependencies, users will have to provide a STAR genome index (gnv), a gtf file (gtf) and an SJDB file (sjdb) to facilitate memory sharing while aligning. The config file furthermore includes paths to a file with strand information for all features in the genome (strandFile) and a Stan model file (stanFile).
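
For orientation, a minimal sketch of such a configuration is shown below. The variable names follow the keys mentioned above (gnv, gtf, sjdb, strandFile, stanFile); the paths and the commented QC cutoff are placeholders, so consult /NASC-seq/data/config_example.py for the authoritative layout and any additional dependency locations.

    # Sketch of config.py; all paths are placeholders.
    gnv = "/path/to/STAR_genome_index"        # STAR genome index (hg38)
    gtf = "/path/to/annotation.gtf"           # gene annotation used for feature annotation
    sjdb = "/path/to/sjdbList.out.tab"        # splice junction database for STAR memory sharing
    strandFile = "/path/to/strand_info.txt"   # strand information for all features in the genome
    stanFile = "/path/to/model.stan"          # Stan model used for estimation

    # QC cutoffs can be added here after inspecting the cellQC output
    # (the name below is illustrative, not taken from the example config):
    # min_reads_per_cell = 20000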

Running the analysis can be done step by step by following the flags in the order presented under 'Usage'. When rerunning the analysis on the example data, exclude the 'vcfFilter' step, since it depends on the availability of broader data (i.e. a position is excluded when it is found to be converted in many cells). Instead, the supplied result of this step in the QC/vcfFilter folder can be used to remove potential SNPs from the example data.
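
A minimal sketch of running all steps in order on the example data, skipping 'vcfFilter', is shown below; it reuses the hypothetical NASCseqMain.py entry point from the usage sketch above.

    import subprocess

    EXPERIMENT_DIR = "/path/to/experimentdir"   # root containing config.py
    NUM_CPU = "8"

    # Step order as presented under 'Usage'.
    STEPS = ["trim", "align", "removegenome", "removeduplicates", "annotate",
             "conversiontag", "vcfFilter", "tagFilter", "cellQC", "cellFilter",
             "calculatePE", "prepareData", "processData", "summarize"]

    for step in STEPS:
        if step == "vcfFilter":
            continue  # use the supplied result in QC/vcfFilter for the example data
        # Note: in practice, inspect the cellQC output and add cutoffs to
        # config.py before running 'cellFilter'.
        subprocess.run(["python", "NASCseqMain.py",
                        "-e", EXPERIMENT_DIR, "-p", NUM_CPU, "-f", step],
                       check=True)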

Output

While running the pipeline, a number of fastq and bam files will be produced and saved in the fastqFiles and bamFiles folders, respectively.

These folders will contain the following partially processed data files:

  Experimentdir
      bamFiles
          aligned_bam             STAR output
          duplRemoved_bam         PICARD duplicate removal output
          annotated_bam           Rsubread annotated output
          annotated_sorted_bam    Sorted Rsubread annotated output
          tagged_bam              Conversion-tagged output
          filteredTagged_bam      SNP-filtered conversion-tagged output