StoatyDive is a tool to evaluate and classify predicted peak profiles to assess the binding specificity of a protein to its targets. It can be used for sequencing data such as CLIP-seq or ChIP-Seq, or any other type of peak profile data.
You can install StoatyDive via conda or with a direct download of this repository.
conda install stoatydive
git clone https://github.com/BackofenLab/StoatyDive.git
cd StoatyDive
sudo apt-get install -y r-base
python setup.py install
or
https://github.com/BackofenLab/StoatyDive/archive/v1.1.0.tar.gz
Requirements: python >= 3.6, bedtools >= 2.27.1, numpy>=1.13.3, matplotlib>=2.1, scipy >= 1.3, R >= 3.4.4
StoatyDive.py [-h] [options] -a *.bed -b *.bam/*bed -c *.txt
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-a *.bed, --input_bed *.bed
Path to the peak file in bed6 format.
-b *.bam/*.bed, --input_bam *.bam/*.bed
Path to the read file used for the peak calling in bed
or bam format.
-c *.txt, --chr_file *.txt
Path to the chromosome length file.
-o path/, --output_folder path/
Write results to this path. [Default: Operating Path]
-t float, --thresh float
Set a normalized CV threshold to divide the peak
profiles into more specific (0) and more unspecific
(1). [Default: 1.0]
--peak_correction Activate peak correction. The peaks are recentered
(shifted) for the correct sumit.
--max_translocate Set this flag if you want to shift the peak profiles
based on the maximum value inside the profile instead
of a Gaussian blur translocation.
--peak_length int Set maximum peak length for the constant peak length.
--max_norm_value float
Provide a maximum value for CV to make the normalized
CV plot more comparable.
--border_penalty Adds a penalty for non-centered peaks.
--scale_max float Provide a maximum value for the CV plot.
--maxcl int Maximal number of clusters of the kmeans clustering of
the peak profiles. The algorithm will be optimized,
i.e., the parameter is just a constraint and not
absolute. [Default: 15]
-k int, --numcl int You can forcefully set the number of cluster of peak
profiles.
--sm Turn on the peak profile smoothing for the peak
profile classification. It is recommended to turn it
on.
--lam float Parameter for the peak profile classification. Set
lambda for the smoothing of the peak profiles. A
higher value (> default) will underfit. A lower value
(< default) will overfit. [Default: 0.3]
--turn_off_classification
Turn off the peak profile classification.
It is recommended to use StoatyDive with --border_penalty
and --peak_correction
.
Adding the border penalty takes care of peaks that are not correctly centered and might
just overlap with a short appendage of a read stack. The length normalization takes
care of different sized peaks. All peak are extended to a certain length with --peak_correction
. The user can either provide a peak length with peak_length
or
StoatyDive just takes the maximal peak length of the given peak set.
The user can set a CV threshold with -t, --thresh
to divide the predicted
peaks into more specific (0) and more unspecific sites (1). The default is set
to 1.0
.
StoatyDive runs uMAP and k-means clustering to classify the peak profiles.
You can skip this step with --turn_off_classification
. The parameter --maxcl
is crucial for the k-means clustering. Just leave it in default, since the number
of clusters will be optimized internally. The parameter --maxcl
is just an upper
bound. If you assume that 15 cluster is not enough then increase the parameter. You
can also force StoatyDive to use k clusters with -k
.
StoatyDive can smooths the peak profiles, with a spline regression, using the
option --sm
, which is recommended. This helps reduce the amount of noise. The
parameter --lam
can be adjusted for your data. The default of 0.3
was optimized
with some test data. A higher value (> default) will underfit. A lower value (< default) will overfit.
StoatyDive translocates (centers) the profiles with a Gaussian blur, because the dimensional reduction
methods uMAP is not translation invariant. You can also decide to center the peak profiles based on the
maximal intensity value inside with --max_translocate
. This can be useful, for example, if you use
truncation events of iCLIP.
- You can set a maximal value for the normalized CV distribution plot with
max_norm_value
. This option helps, if you want to compare several normalized CV distribution plots from different experiments. Take the highest CV from all experiments as a maximal value. - You can set a maximal value for the CV distribution plot with
scale_max
. This option helps, if you want to compare several normalized CV distribution plots from different experiments. Take the highest CV from all experiments as a maximal value.
Example 1:
StoatyDive.py -a test/broad_peaks/peaks.bed -b test/broad_peaks/reads.bed -c test/chrom_sizes.txt --peak_correction --border_penalty --turn_off_classification -o test/broad_peaks/
Example 2:
StoatyDive.py -a test/sharp_peaks/peaks.bed -b test/sharp_peaks/reads.bed -c test/chrom_sizes.txt --peak_correction --border_penalty --turn_off_classification -o test/sharp_peaks/
Example 3:
StoatyDive.py -a test/mixed_peaks/peaks.bed -b test/mixed_peaks/reads.bam -c test/chrom_sizes.txt --peak_correction --peak_length 50 --border_penalty --sm -o test/mixed_peaks/
A | B |
---|---|
Figure 1. CV distributions of A: test/broad_peaks/, B: test/sharp_peaks. The diagram will give you a first impression of the binding specificity of your protein of interest. The diagram also tells you about the performance/quality of your experiment. An experiment with lots of unspecific binding sites will have a CV distribution close to zero, as in our example A. An experiment with lots of specific binding sites will have a CV distribution with a high expected CV, as in our example B.
A | B |
---|---|
Figure 2. Normalized CV distributions of A: test/broad_peaks/, B: test/sharp_peaks. The normalized CV distribution helps to identify specific and unspecific sites within an experiment. The normalized CV is in a range [0,1]. A specific site will have a value of 1. An unspecific site will have a value of 0.
The final tabular file is a ranked, tab separated list of your predicted binding sites:
- Chromosome
- Start of Peak
- End of peak
- Peak ID/Name
- CV
- Strand
- Peak length
- r (hyperparameter of negative binomial)
- p (hyperparameter of negative binomial)
- Normalized CV
- Difference between the maximal value of the left border and the maximal value of the center region of the peak. (Penalty used for the border penalty.)
- Difference between the maximal value of the right border and the maximal value of the center region of the peak. (Penalty used for the border penalty.)
- Internal peak index
- Type of Peak: 0 = More specifc binding site; 1 = More unpsecific binding site.
- Relative Position of the maximal coverage/intensity (sumit) inside the peak.
- Class of the Peak (Cluster).
Use column 14
to divide your peak into the two general categories of
specific and unspecific. Use column 16
to find peaks with a specific shape. If the user
skipped the classification, then the final_tab_.bed is the CV_tab_.bed. The
tabular has no 15/16th column.
You will get some plots for the classification, saved in the folder clustering_*
.
A | B |
---|---|
Figure 3. Example cluster profile for data test/mixed_peaks/; Cluster 3 A: raw profile, B: smoothed profile. If you turned on the smoothing you will get four types of cluster sets. The first one shows you some example raw peak profiles assigned to the specific cluster (e.g. cluster_3.pdf for cluster 3; Figure A). The second one shows you some example smoothed and sometimes translocated peak profiles to the specific cluster (e.g. cluster_smoothed3.pdf for cluster 3; Figure B). Profile like figure B are used for the classification. The profiles are colored based on the clusters as seen as in the uMAP plot.
C |
---|
D |
---|
Figure 4. Cluster profile summary for data test/mixed_peaks/. The third one is an example profile, such as Figure A for each cluster in one plot (overview_cluster.pdf; Figure C). The fourth one are the average profiles of each cluster (e.g. cluster_average_profiles.pdf; Figure D).
Figure 5. Kmeans optimization for data test/mixed_peaks/.
The plot kmeans_Optimization.pdf
shows you the optimization scheme. If you data
has a very low complexity, that is to say, you have lots of similar peak profiles,
then the percent of variance explained will be very low (second plot). It is also
indicated by strong fluctuations in the other diagrams. If you have very distinguishable
peak profiles, as in our example, then the variance explained will be > 90%
.
Figure 6. uMAP plot for data test/mixed_peaks/.
The plot uMAP.pdf
shows you the data in the new dimension found by the uMAP
dimensional reduction algorithm. In correspondance to the the k-means optimization,
highly distinguishable peaks will appear in the plot as very clearly separated
point clusters.