Home
This wiki explains the basics of the NASC-seq data processing pipeline.
An example of basic usage of the NASC-seq data processing pipeline is shown below. For more details see one of the following topics:
NASC-seq experimental protocol
NASC-seq data analysis with Amazon Web Services
NASC-seq pipeline required file structures
Usage
-h, --help
    show this help message and exit
-e, --experimentdir [dir]
    experiment directory. See NASC-seq required file structure for more information.
-p, --numCPU [integer]
    number of CPUs to use
-f, --flag [flag]
    flag specifying the step in the NASC-seq pipeline that will be performed:
        trim
            Trim fastq files using trimgalore
        align
            Align fastq files using STAR to hg38 (uses 4 threads / cell)
        removegenome
            Removes genome from shared memory
        removeduplicates
            Removes duplicates from aligned bam files using picard
        annotate
            Annotate features in bam files using Rsubread and index files
        conversiontag
            Tag conversions in bam file headers
        vcfFilter
            Select possible SNPs from shared mismatches between cells
        tagFilter
            Remove SNPs from conversion tags and index files (uses 1 thread / cell)
        cellQC
            Perform basic QC visualization to decide on QC cutoffs. This can be used
            to decide on quality cutoffs that can be added to the config.py file.
        cellFilter
            Filter bam files based on QC cutoffs in the config.py file
        calculatePE
            Calculate error probability
        prepareData
            Prepares pickles from data for scalable processing using AWS or similar
        processData
            Processes pickles and creates output pickle which will be used for data summarization
            Will process 1 cell at a time on the given number of threads (-p / --numCPU), and loop
            over all cells in the prepared data.
        summarize
            Summarize corrected data and prepare files with new and old reads as well as
            additional files with the modes of the estimates, the confidence intervals,
            the standard deviations and the means.
A configuration file should be prepared in the root of the experiment directory. For its layout, see the example config file (/NASC-seq/data/config_example.py). In addition to the locations of some of the dependencies, users have to specify a genome for use with STAR (gnv), a gtf file (gtf) and an SJDB file (sjdb) to facilitate memory sharing while aligning. The config file also includes the paths to a file with strand information for all features in the genome (strandFile) and a Stan model file (stanFile).
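Before starting a run it can help to check that the configuration file defines the entries named above. The following is a minimal sketch, assuming the entries are plain module-level variables in config.py (the actual layout is given by the example config file and may differ). The function name `missing_config_names` is hypothetical and not part of the pipeline.

```python
import importlib.util

# Entries the wiki names for the config file, besides the dependency locations.
REQUIRED_NAMES = ("gnv", "gtf", "sjdb", "strandFile", "stanFile")


def missing_config_names(config_path):
    """Return the required names that a config.py-style file does not define.

    Assumption: each entry is a module-level variable. Note that loading the
    file executes it, so only use this on config files you trust.
    """
    spec = importlib.util.spec_from_file_location("nasc_config", str(config_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return [name for name in REQUIRED_NAMES if not hasattr(module, name)]
```

Calling `missing_config_names("<experimentdir>/config.py")` returns an empty list when all five entries are present. It does not check that the paths they point to exist.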
Running the analysis can be done step-by-step by following the flags in the order presented under 'Usage'. When rerunning the analysis on the example data, exclude the 'vcfFilter' step since this depends on the availability of broader data (i.e. a position is excluded when it is found converted in many cells). Instead, the supplied result of this step in the QC/vcfFilter folder can be used to remove potential SNPs from the example data.
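The step order can be written down as a small helper that builds one command line per flag. This is only a sketch: the `-e`, `-p` and `-f` options and the step names come from the usage text, while the script path, the `python` launcher and the function name `build_commands` are assumptions made for illustration.

```python
# Pipeline steps in the order they are listed under the -f / --flag option.
STEPS = [
    "trim", "align", "removegenome", "removeduplicates", "annotate",
    "conversiontag", "vcfFilter", "tagFilter", "cellQC", "cellFilter",
    "calculatePE", "prepareData", "processData", "summarize",
]


def build_commands(script, experiment_dir, num_cpu, skip=()):
    """Return one argument list per pipeline step, in order, omitting skipped steps.

    `script` is a placeholder for the path to the pipeline script (assumption).
    """
    return [
        ["python", script, "-e", str(experiment_dir), "-p", str(num_cpu), "-f", step]
        for step in STEPS
        if step not in skip
    ]


if __name__ == "__main__":
    # Example data: leave out vcfFilter and rely on the supplied QC/vcfFilter result.
    for cmd in build_commands("path/to/pipeline_script.py", "/path/to/experimentdir", 8,
                              skip=("vcfFilter",)):
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` to execute the steps one after another. Since cellQC is meant to inform the QC cutoffs in config.py, it may be preferable to stop after that step and update the config before running cellFilter.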
While the pipeline runs, a number of fastq and bam files are produced and saved in the fastqFiles and bamFiles folders, respectively.
The folders will contain the following partially processed data files:
Experimentdir
    bamFiles
        aligned_bam             STAR output
        duplRemoved_bam         PICARD duplicate removal output
        annotated_bam           Rsubread annotated output
        annotated_sorted_bam    Sorted Rsubread annotated output
        tagged_bam              Conversion-tagged output
        filteredTagged_bam      SNP-filtered conversion-tagged output
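To see how far a run has progressed, the bam subfolders can be inspected in the order listed above. The sketch below assumes that each subfolder sits directly under bamFiles and that a non-empty folder means the corresponding step has produced output. The function name `latest_bam_stage` is hypothetical.

```python
from pathlib import Path

# bamFiles subfolders in processing order, with the output each one holds.
BAM_STAGES = [
    ("aligned_bam", "STAR output"),
    ("duplRemoved_bam", "PICARD duplicate removal output"),
    ("annotated_bam", "Rsubread annotated output"),
    ("annotated_sorted_bam", "Sorted Rsubread annotated output"),
    ("tagged_bam", "Conversion-tagged output"),
    ("filteredTagged_bam", "SNP-filtered conversion-tagged output"),
]


def latest_bam_stage(experiment_dir):
    """Return the name of the last listed bamFiles subfolder that exists and is non-empty.

    Returns None when no listed subfolder contains any files yet.
    """
    bam_root = Path(experiment_dir) / "bamFiles"
    latest = None
    for name, _description in BAM_STAGES:
        folder = bam_root / name
        if folder.is_dir() and any(folder.iterdir()):
            latest = name
    return latest
```

For an experiment directory in which only aligned_bam and duplRemoved_bam contain files, the function returns `"duplRemoved_bam"`.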