This pipeline is used to analyze ssDRIP-seq data. The following operations can be automatically performed:
BaseAnalysis
- Alignment of reads to reference sequences
- Duplicates removing
- Strand splitting
- Peak calling
- Normalized bam to bigwig file
DeseqAnalysis
- DESeq2 for peaks(merge the peaks of all samples)
- Peaks annotation
DownstreamAnalysis
- Mfuzz cluster(difference peak with qvalue<=0.01)
- Correlation of samples
- Motif for peaks(used HOMER)
- Peaks length distribution
- GCskew and ATskew
- Sense and Antisense metaplot
- Peaks content distribution(the proportion of peaks in TSS,TTS and gene body region)
ssDripPipeline can be installed using the conda package manager Bioconda. To install ssDripPipeline using Bioconda, download and install Anaconda Python 3.7, following the instructions at: https://www.anaconda.com/distribution/.
To install ssDripPipeline into the current conda environment:
$ conda install -c conda-forge -c bioconda ssdrippipeline
Alternately, to create a new environment named ssDripPipeline_env
with ssDripPipeline:
$ conda create -n ssDripPipeline_env -c conda-forge -c bioconda ssdrippipeline python=3.7
Activate your conda environment:
$ conda activate ssDripPipeline_env
Verify that ssDripPipeline is installed using the command:
$ ssDRIPSeqAnalysis.py
Usage:
ssDRIPSeqAnalysis.py <DripConfig.json> <BaseAnalysis|DeseqAnalysis|DownstreamAnalysis|AllPip>
ssDRIPSeqAnalysis.py contains four subcommands:
BaseAnalysis
$ ssDRIPSeqAnalysis.py DripConfig.json BaseAnalysis
DeseqAnalysis
The results of BaseAnalysis is required
$ ssDRIPSeqAnalysis.py DripConfig.json DeseqAnalysis
DownstreamAnalysis
The results of DeseqAnalysis and BaseAnalysis are required
$ ssDRIPSeqAnalysis.py DripConfig.json DownstreamAnalysis
AllPip
Execute BaseAnalysis, DeseqAnalysis, DownstreamAnalysis in turn
$ ssDRIPSeqAnalysis.py DripConfig.json AllPip
ssDRIPSeqAnalysis.py requires a configuration file in json format as input.
If you copy the following json configuration file, please delete the comments, because json does not support comments.
DripConfig.json
{
//The name of your project.
"ProjectName":"test",
//A file that contains the sample name and its corresponding data.
"Target":"target",
//Path of sequencing data, it must be an absolute path.
"SeqDataPath":"/data/yeat/cleandata/",
//bowtie2-build outputs, it must be an absolute path. GenomeIndex indicates the prefix name of bowtie2-build results.
"GenomeIndex":"/data/yeast/genome/GenomeIndex",
//Effective genome size, it used for peak calling, normalization and peaks proportion calculation.
"GenomeSize":"12345",
//Threads of some third-party software.
"Thread":"10",
//Filter out chromosomes that do not need to be analyzed, such as chloroplasts, mitochondria and spike-in chromosomes.
"FilterChromFile":"/data/yeast/Analysis/filter.txt",
//Size of the bins, in base, for the bamCoverage and plotCorrelation in the deeptools.
"BinSize":"5",
//The number of times a set random regions are selected as a background for motif analysis and normalization. The number and length distribution of a set of random regions are consistent with the number and length distribution of peak regions.
"RepeatNum":"10",
//The file contains two columns. The first column is the chromosome name, and the second column is the chromosome length. The chromosome name must be consistent with the name in the genome fasta file. The filtered chromosomes should not be included in this file. In addition, the absolute path cannot be missing.
"ChromSize":"/data/yeast/Analysis/chrom.size",
//Fasta file of genome for GetATGCSkewBw and FindMotif. The absolute path cannot be missing.
"GenomeFastaFile":"/data/yeast/genome/Genome.fasta",
//If there is no control sample for DESeq2 analysis, you can write a random name, such as hehe.
"ControlSample":"hehe",
//Bed file of gene. The absolute path cannot be missing.
"GeneBed":"/data/yeast/genome/gene.bed",
//Distance upstream of the start site of the regions and distance downstream of the end site of the regions. Used for skew metaplot and sense/antisense metaplot.
"MetaplotExtend":"1000",
//TSS/TTS left and right extension distance. TSS extends a certain distance from left and right as a promoter region. TTS extends a certain distance from left and right as a terminator region.
"ContentExtend":"150",
//The name of the chromosome in the fasta file of the Escherichia coli genome. If you do not do spike-in normalization, please ignore this.
"SpikeNameList":["Ecoli"],
//Debug
"Debug":"False",
//Divide each input interval to fixed-sized windows in chromosome.
"Win":"100",
//How many base pairs to step before creating a new window. This parameter is used together with the win parameter to calculate ATskew and GCskew.
"Step":"50"
}
The target file format is as follows:
If you don’t use spike-in, the target file will contain four columns. Each column is separated by Tab. The first column is the group name. The second column is the sample name. Repeat samples have the same group name. The third and fourth columns are paired-end sequencing data in fastq format.
SampleA SampleA_rep1 SampleA_rep1_R1.fastq.gz SampleA_rep1_R2.fastq.gz
SampleA SampleA_rep2 SampleA_rep2_R1.fastq.gz SampleA_rep2_R2.fastq.gz
SampleB SampleB_rep1 SampleB_rep1_R1.fastq.gz SampleB_rep1_R2.fastq.gz
SampleB SampleB_rep2 SampleB_rep2_R1.fastq.gz SampleB_rep2_R2.fastq.gz
If you use spike-in, the targe file will contain seven columns. The first four columns are the same as above. The fifth column is the sample name of the input. The sixth and seventh columns are the paired-end sequencing data of input.
SampleA SampleA_rep1 SampleA_rep1_R1.fastq.gz SampleA_rep1_R2.fastq.gz SampleA_rep1_input SampleA_rep1_input_R1.fastq.gz SampleA_rep1_input_R2.fastq.gz
SampleA SampleA_rep2 SampleA_rep2_R1.fastq.gz SampleA_rep2_R2.fastq.gz SampleA_rep2_input SampleA_rep2_input_R1.fastq.gz SampleA_rep2_input_R2.fastq.gz
SampleB SampleB_rep1 SampleB_rep1_R1.fastq.gz SampleB_rep1_R2.fastq.gz SampleB_rep1_input SampleB_rep1_input_R1.fastq.gz SampleB_rep1_input_R2.fastq.gz
SampleB SampleB_rep2 SampleB_rep2_R1.fastq.gz SampleB_rep2_R2.fastq.gz SampleB_rep2_input SampleB_rep2_input_R1.fastq.gz SampleB_rep2_input_R2.fastq.gz
The following is the directory structure of the output results.
├── SampleA_rep1
│ ├── SampleA_rep1_fwd_nucleus_norm.bw
│ ├── SampleA_rep1_fwd_peaks.bed
│ ├── SampleA_rep1_nucleus_norm.bw
│ ├── SampleA_rep1_peaks.bed
│ ├── SampleA_rep1_pip.sh #Scripts used in BaseAnalysis
│ ├── SampleA_rep1_rev_nucleus_norm.bw
│ ├── SampleA_rep1_rev_peaks.bed
│ ├── SampleA_rep1_scale.xls
├── SampleA_rep2
├── SampleB_rep1
├── SampleB_rep2
├── test_deseq #Analysis results of DESeq2. *_anno.xls is the result of difference analysis. *_norm_anno.xls is the normalized R-loop abundance. *_counts_final_anno.xls is the original read counts.
│ ├── all #No split strand
│ │ ├── merge_test.bed
│ │ ├── test_Callus_Flagleaf_diffexpr_results_anno.xls
│ │ ├── test_Callus_Flagleaf_diffexpr_results.xls
│ │ ├── test_Callus_Spike_diffexpr_results_anno.xls
│ │ ├── test_Callus_Spike_diffexpr_results.xls
│ │ ├── test_counts_final_anno.xls
│ │ ├── test_counts_final.xls
│ │ ├── test_counts.npz
│ │ ├── test_counts.xls
│ │ ├── test_deseq.r #Scripts used in DESeq2
│ │ ├── test_deseq_scale.txt
│ │ ├── test_norm_anno.xls
│ │ ├── test_norm.xls
│ │ ├── test_pca_deseq2.pdf
│ │ ├── test_Spike_Flagleaf_diffexpr_results_anno.xls
│ │ └── test_Spike_Flagleaf_diffexpr_results.xls
│ ├── fwd #Fwd strand
│ └── rev #Rev strand
├── test_analysis # The results of DownstreamAnalysis
│ ├── cluster
│ ├── correlation
│ ├── motif
│ ├── peaks_content_distribution
│ │ ├── test_peaks_content_distribution.xls
│ │ ├── tss_150.bed
│ │ └── tts_150.bed
│ ├── peaks_length_distribution
│ │ └── test_peaks_length_distribution.npy
│ ├── sense_antisense
│ │ ├── SampleA_rep1_antisense.gz
│ │ ├── SampleA_rep1_sense.gz
│ │ ├── SampleA_rep2_antisense.gz
│ │ ├── SampleA_rep2_sense.gz
│ │ ├── SampleB_rep1_antisense.gz
│ │ ├── SampleB_rep1_sense.gz
│ │ ├── SampleB_rep2_antisense.gz
│ │ └── SampleB_rep2_sense.gz
│ └── skew
│ ├── SampleA_rep1_fwd_GCATSkew.gz
│ ├── SampleA_rep1_GCATSkew.gz
│ ├── SampleA_rep1_rev_GCATSkew.gz
│ ├── SampleA_rep2_fwd_GCATSkew.gz
│ ├── SampleA_rep2_GCATSkew.gz
│ ├── SampleA_rep2_rev_GCATSkew.gz
│ ├── SampleB_rep1_fwd_GCATSkew.gz
│ ├── SampleB_rep1_GCATSkew.gz
│ ├── SampleB_rep1_rev_GCATSkew.gz
│ ├── SampleB_rep2_fwd_GCATSkew.gz
│ ├── SampleB_rep2_GCATSkew.gz
│ ├── SampleB_rep2_rev_GCATSkew.gz
│ ├── gene_GCATSkew.gz
│ ├── test_ATSkew.bw
│ └── test_GCSkew.bw
├── test_stat.xls #Sample statistics
├── DripConfig.json
├── filter.txt #DripConfig[FilterChromFile]
├── test_ana.sh #Scripts used in DownstreamAnalysis
└── test_bwscale.xls #Scale factor for normalization
The following software is used by this pipeline. When installing ssDripPipeline, these software will be installed automatically, and you do not need to install them yourself.
- Only supports paired-end sequencing data.
- ssDripPipeline does not include quality control, adapter cutting and tail trimming.
- Effective genome size
- Step by step protocols wiki.
- Sam/Bam format
- Bigwig format
- Bed format
- Metaplot
- A detailed explanation of the ssDripPipeline results can be found in these two articles(paper1,paper2).
Xu, W., Li, K., Li, Q., Li, S., Zhou, J., & Sun, Q. (2022). Quantitative, Convenient, and Efficient Genome-Wide R-Loop Profiling by ssDRIP-Seq in Multiple Organisms. Methods in molecular biology (Clifton, N.J.), 2528, 445–464.