Cupcake: supporting scripts for Iso-Seq after the clustering step
Last Updated: 08/23/2022
IMPORTANT: Cupcake now uses Python 3.7!
Using Cupcake scripts:
- Collapse redundant isoforms (has genome)
- Collapse redundant isoforms (no genome)
- Obtain associated count information
- Obtain associated count information by barcode
- Filter collapse results by minimum FL count support
- Filter away 5' degraded isoforms
- Chaining samples together
- Finding fusion genes
- Summarizing and scrubbing junctions across multiple samples
- Making colored BED format shaded by isoform abundance
- Summarizing and plotting transcript/exon/intron stats after collapse
Cupcake is a set of supporting scripts for processing Iso-Seq data. It expects either high-quality isoform sequences (output of Iso-Seq clustering step) or full-length reads (output of Iso-Seq classify step).
Cupcake is not Iso-Seq! It is a set of downstream supporting scripts. Official Iso-Seq software documentation is available through (pb)BioConda here.
Understand that, after running through the Iso-Seq pipeline, you should now have the HQ isoform sequences or FLNC reads. Each sequence is assumed to be full-length, but sequences may be redundant as they represent the same transcripts.
So, the recommended next steps are:
- Best practice for aligning Iso-Seq to a reference genome: minimap2, GMAP, STAR, BLAT
- Collapse identical isoforms to obtain final set of unique, full-length, high-quality isoforms
- Obtain associated count information for each unique isoform
- Running SQANTI3 for classification and library artifact removal
- Fusion gene finding
Unlike most of the other scripts in Cupcake (which you can use by directly copying the script or adding it to your PYTHONPATH), using Cupcake does require installation.
The prerequisites for installation are BioPython and bx-python. I also recommend using a Python distribution like Anaconda so you can maintain a clean install environment. Anaconda already comes with BioPython; if you already use Cogent or ANGEL, you likely have Anaconda installed. Installing Cupcake then looks like below.
To create an anaconda environment for Python 3.7:
export PATH="/home/anacondaPy37/bin:$PATH"
conda -V
conda update conda
conda create -n anaCogent python=3.7 anaconda
# remember to log out and log back in
export PATH="/home/anacondaPy37/bin:$PATH"
conda activate anaCogent
Note: if you don't already have Cython, you have to install it
conda install -n anaCogent -c anaconda cython
After the anaconda environment is created:
# activate anaconda environment, here I assume it's called anaCogent
$ source activate anaCogent
(anaCogent)$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
(anaCogent)$ cd cDNA_Cupcake
(anaCogent)$ python setup.py build
(anaCogent)$ python setup.py install
After installation is complete, confirm the Cupcake scripts are in your bin:
(anaCogent)$ which collapse_isoforms_by_sam.py
~/anaconda3/envs/anaCogent/bin/collapse_isoforms_by_sam.py
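As an extra sanity check that the library modules (not just the scripts) were installed, you can try importing one of Cupcake's modules, e.g. cupcake.io.GFF (a quick check, assuming the standard setup.py install above):
(anaCogent)$ python -c "from cupcake.io import GFF; print('Cupcake OK')"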
UPDATE 08/23/2022!! collapse_isoforms_by_sam.py is now implemented as isoseq3 collapse in the official Iso-Seq software suite. It is recommended that you transition to isoseq3 collapse, as the Cupcake collapse script will be deprecated by 2023.
The collapse script usage is:
usage: collapse_isoforms_by_sam.py [-h] [--input INPUT] [--fq] -s SAM -o
PREFIX [-c MIN_ALN_COVERAGE]
[-i MIN_ALN_IDENTITY]
[--max_fuzzy_junction MAX_FUZZY_JUNCTION]
[--max_5_diff MAX_5_DIFF]
[--max_3_diff MAX_3_DIFF]
[--flnc_coverage FLNC_COVERAGE]
[--gen_mol_count] [--dun-merge-5-shorter]
Test data is available in the cupcake/test_data subdirectory. The input we will need is hq_isoforms.fastq, which is the output from running Iso-Seq.
I've already provided the aligned, sorted sam file hq_isoforms.fastq.sorted.sam, but in case you want to try it for yourself, the minimap2 command is:
minimap2 -ax splice -t 30 -uf --secondary=no -C5 \
hg38.fa hq_isoforms.fastq > hq_isoforms.fastq.sam
sort -k 3,3 -k 4,4n hq_isoforms.fastq.sam > hq_isoforms.fastq.sorted.sam
Now we can run the collapse script:
collapse_isoforms_by_sam.py --input hq_isoforms.fastq --fq \
-s hq_isoforms.fastq.sorted.sam --dun-merge-5-shorter -o test
The output files are test.collapsed.gff, test.collapsed.rep.fq, and test.collapsed.group.txt.
There's also test.ignored_ids.txt, which shows us which sequences were ignored because they were unmapped, had too low coverage (you can change the cutoff using the -c option, default is 0.99), or had too low identity (the -i option, default is 0.95).
We can also look at test.collapsed.group.txt to see how much collapse/merging of identical isoforms there is. For example,
PB.4.1 sample1_HQ_transcript/59
PB.4.2 sample1_HQ_transcript/63,sample1_HQ_transcript/64
PB.4.3 sample1_HQ_transcript/71
shows that the unique isoform PB.4.2 collapsed two identical isoforms.
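If you want a quick overview of how much merging happened overall, you can count the comma-separated members behind each unique isoform (a minimal sketch, assuming the two-column, tab-separated group.txt format shown above):
# print each pbid with the number of HQ isoforms collapsed into it, largest first
awk -F'\t' '{n = split($2, a, ","); print $1 "\t" n}' test.collapsed.group.txt | sort -k2,2nr | head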
The naming system for post-collapse isoforms is PB.<loci_index>.<isoform_index>. Each locus is strand-specific and consists of isoforms that overlap by at least 1 bp. So PB.11.1 and PB.11.2 mean that locus 11 has two isoforms.
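Following this naming scheme, you can tally how many isoforms each locus has with a one-liner (again assuming the group.txt format above; the locus ID is the first two dot-separated fields of the pbid):
# count isoforms per strand-specific locus, most isoform-rich loci first
cut -f1 test.collapsed.group.txt | cut -d'.' -f1-2 | sort | uniq -c | sort -rn | head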
If you do not have a reference genome, please look at the tutorial here for using either CD-HIT or Cogent for collapse.
The usage for obtaining count information after collapse is:
usage: Get abundance/read stat information after running collapse script.
[-h] collapse_prefix cluster_report
Now that we have unique isoforms, we can go back and retrieve the number of associated FL and nFL counts. To get this information, we need cluster_report.csv. The CSV report tells us every full-length read that was associated with each cluster. Note that not every cluster becomes an HQ isoform, and not all HQ isoforms make it into the post-collapse unique isoforms (they may be filtered out due to lack of mapping or low identity/coverage), so not all reads will contribute to the unique isoform count information.
Let's run the abundance script:
get_abundance_post_collapse.py test.collapsed cluster_report.csv
And we get two outputs. One is the detailed per-read report test.collapsed.read_stat.txt, which associates each FL read (id) with an isoform (pbid):
id length is_fl stat pbid
m64012_190727_053041/105120134/ccs NA Y unique PB.16.1
m64012_190727_053041/21628950/ccs NA Y unique PB.16.1
m64012_190727_053041/114950281/ccs NA Y unique PB.16.1
m64012_190727_053041/87229073/ccs NA Y unique PB.16.1
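Because the read stat file is plain whitespace-delimited text, simple command-line filters work on it; for example, to list all FL reads assigned to one isoform (assuming the column layout shown above, with pbid in column 5):
awk '$5 == "PB.16.1" {print $1}' test.collapsed.read_stat.txt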
The next file is test.collapsed.abundance.txt, which provides the FL count information.
#
# -----------------
# Field explanation
# -----------------
# count_fl: Number of associated FL reads
# norm_fl: count_fl / total number of FL reads, mapped or unmapped
# Total Number of FL reads: 1065
#
pbid count_fl norm_fl
PB.1.1 2 1.8779e-03
PB.1.2 6 5.6338e-03
PB.1.3 3 2.8169e-03
PB.2.1 3 2.8169e-03
PB.3.1 2 1.8779e-03
PB.4.1 8 7.5117e-03
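As a sanity check, you can recompute norm_fl from count_fl and the total FL read count given in the header (1065 in this example; a sketch assuming the tab-delimited layout above):
# skip '#' comment lines and the column header, then divide count_fl by the total
grep -v '^#' test.collapsed.abundance.txt | awk 'NR > 1 {printf "%s\t%.4e\n", $1, $2/1065}'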
If you barcoded your samples and ran IsoSeq with the barcodes pooled, you can use the demux scripts to get demultiplexed FL count information per barcode/sample.
The usage for filter_by_count.py is:
usage: filter_by_count.py
[--min_count MIN_COUNT] [--dun_use_group_count] input_prefix
Now that we have the collapsed result and the count information associated with each unique isoform, we can further filter out isoforms with low FL count support. Currently, by default, an HQ isoform (and hence a unique isoform) must have at least 2 FL reads supporting it.
If we use a minimum of 10 FL counts as a cutoff:
filter_by_count.py --min_count 10 --dun_use_group_count test.collapsed
We will see that the output files are test.collapsed.min_fl_10.gff, test.collapsed.min_fl_10.abundance.txt, and test.collapsed.min_fl_10.rep.fq.
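To see how many isoforms the cutoff removed, you can compare isoform counts in the abundance files before and after filtering (a small sketch, assuming the abundance file layout shown earlier):
for f in test.collapsed.abundance.txt test.collapsed.min_fl_10.abundance.txt; do
  echo "$f: $(grep -v '^#' "$f" | tail -n +2 | wc -l) isoforms"
done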
The usage for filter_away_subset.py is:
usage: filter_away_subset.py [--fuzzy_junction] input_prefix
If we revisit the collapsed results, we'll notice that PB.10.3, PB.10.4, PB.10.5, and PB.10.6 have the same exonic structures as PB.10.1 or PB.10.2 but are missing some of the 5' exons.
Since the Clontech SMARTer cDNA kit used to create the full-length cDNA does not do cap trap, it is possible these are 5' RNA degradation products and not biologically meaningful. In that case, we may wish to filter them away.
Using the output from the collapsed step:
filter_away_subset.py test.collapsed
We now obtain output files named test.collapsed.filtered.gff, test.collapsed.filtered.abundance.txt, and test.collapsed.filtered.rep.fq.
We can see that the filtered result leaves only PB.10.1 and PB.10.2.
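If you want to list exactly which isoform IDs were filtered away as potential 5' degradation products, you can diff the pbid columns of the two abundance files (assuming the tab-delimited abundance format shown earlier):
comm -23 \
  <(grep -v '^#' test.collapsed.abundance.txt | tail -n +2 | cut -f1 | sort) \
  <(grep -v '^#' test.collapsed.filtered.abundance.txt | tail -n +2 | cut -f1 | sort)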
If you run multiple Iso-Seq jobs, after mapping them back to the genome and getting the non-redundant, collapsed transcripts, each sample will have its own set of unique IDs, such as PB.1.1, PB.1.2, PB.2.1, and so on.
This tutorial shows how to chain samples together to display the correspondence between the different IDs.
You will need to first make a config file:
SAMPLE=<sample name>;<path>
SAMPLE=<sample name>;<path>
GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq
You can add as many SAMPLE lines as you have samples. Note: do not include blank spaces, semicolons, or other punctuation in the sample names. <path> denotes the directory in which the sample output resides.
In each sample <path>, the script expects the following three files to exist: GROUP_FILENAME, GFF_FILENAME, and COUNT_FILENAME. If your file names are different, please rename them or change the names in the config file. FASTQ_FILENAME is optional.
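Before running the chaining script, it can save a failed run to verify that each sample directory contains the three required files (a minimal sketch; path1 and path2 are placeholder sample directories):
for d in path1 path2; do
  for f in touse.group.txt touse.gff touse.count.txt; do
    [ -f "$d/$f" ] || echo "MISSING: $d/$f"
  done
done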
Full usage is:
chain_samples.py [-h] [--fuzzy_junction FUZZY_JUNCTION]
[--dun-merge-5-shorter] [--max_3_diff MAX_3_DIFF]
[--cpus CPUS]
config_file {norm_fl,count_fl}
positional arguments:
config_file
{norm_fl,count_fl} Which count field to use for chained sample (default:
count_fl)
optional arguments:
-h, --help show this help message and exit
--fuzzy_junction FUZZY_JUNCTION
Max allowed distance in junction to be considered
identical (default: 0 bp)
--dun-merge-5-shorter
Don't collapse shorter 5' transcripts (default: off)
--max_3_diff MAX_3_DIFF
Maximum 3' difference allowed (default: 30bp)
--cpus CPUS Number of CPUs to use for multi-threading (default: 8)
where --fuzzy_junction is the maximum number of base differences allowed at junction sites for isoforms to still be chained together. By default, a truncated 5' isoform in sample A is chained together with a longer, compatible isoform in sample B; turning on --dun-merge-5-shorter keeps such truncated isoforms separate.
Try running the test data with two sets of params to see the difference:
## case 1: treat 5' truncated isoforms as same
chain_samples.py sample.config count_fl
## case 2: treat 5' truncated isoforms as different
chain_samples.py sample.config count_fl --dun-merge-5-shorter
The first command will generate a chained set of 2 isoforms; the latter will generate 3. This is because PB.2.1 in sample A is a truncated form of PB.2.1 in sample B.
NOTE: to use iterative chaining, you must first start with a clean directory containing absolutely nothing besides the config files ('sample.config1', 'sample.config2', etc.). And once you start running them iteratively, you should never delete any files (intermediate files are used as input to the next iteration) or change the order in which you iteratively chain the files.
The way you run iterative chaining is by making a series of config files. Let's say you have 5 samples. We will start with chaining 3 of them, then add the 4th to the result of the first 3, then the 5th to the result of the first 4.
To do so, you need to have a series of config files.
# sample.config1
SAMPLE=sample1;path1
SAMPLE=sample2;path2
SAMPLE=sample3;path3
GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq
Again, we expect that in the path1/ directory there will be files path1/touse.group.txt, path1/touse.gff, path1/touse.count.txt, and so on.
We will first run this with chain_samples.py sample.config1 count_fl. The intermediate output is tmp_<sample>.XXX; you will see tmp files for sample2 and sample3 but not sample1. DO NOT delete these. We need all of them for "unpacking" the chains as we go on.
The next config file builds up on the last one:
# sample.config2
tmpSAMPLE=sample1;path1
tmpSAMPLE=sample2;path2
tmpSAMPLE=sample3;path3
SAMPLE=sample4;path4
GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq
It is important that the first three lines from sample.config1 remain unchanged except for the prefix change from SAMPLE= to tmpSAMPLE=. Do not change the order of the samples from one config file to another! Whenever you add a new sample, it must come after the existing samples. Basically, by using the prefix tmpSAMPLE= we specify that the first 3 samples have already been chained.
Now we run chain_samples.py sample.config2 count_fl. You will see that we now have tmp_sample4.XXX.
# sample.config3
tmpSAMPLE=sample1;path1
tmpSAMPLE=sample2;path2
tmpSAMPLE=sample3;path3
tmpSAMPLE=sample4;path4
SAMPLE=sample5;path5
GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq
Now we run chain_samples.py sample.config3 count_fl. The final results are tmp_sample5.XXX, which also gets copied as all_samples.chained_XXX. So you can directly use the latest all_samples.chained_XXX as your final output.
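For example, to confirm the final chained outputs are in place after the last iteration (file names follow the all_samples.chained_XXX pattern described above):
ls -l all_samples.chained*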
A few final notes on iterative chaining:
- As good practice, keep the GROUP_FILENAME, GFF_FILENAME, COUNT_FILENAME, and FASTQ_FILENAME configuration the same for all samples.
- Always start with an initially clean directory, with nothing but sample.config files.
- When you add on chaining, simply add more SAMPLE= lines after the previously chained samples, and change the previously chained samples from SAMPLE= to tmpSAMPLE=.
- You can add more than one new sample at a time.
To drive home my point, here's a valid config file:
# THIS IS A VALID CONFIG
tmpSAMPLE=sample1;path1
tmpSAMPLE=sample4;path4
SAMPLE=sample5;path5
SAMPLE=sample6;path6
SAMPLE=sample7;path7
GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq
And here is an invalid config file:
# THIS IS NOT A VALID CONFIG. Chained samples must precede all new samples.
tmpSAMPLE=sample1;path1
tmpSAMPLE=sample4;path4
SAMPLE=sample5;path5
tmpSAMPLE=sample6;path6
tmpSAMPLE=sample7;path7
GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq
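If you want an automated guard against this mistake, a one-line check can flag any tmpSAMPLE= line that appears after a SAMPLE= line (a minimal sketch against the config format above; sample.config is a placeholder filename):
awk '/^SAMPLE=/ {seen = 1} /^tmpSAMPLE=/ && seen {print "INVALID: tmpSAMPLE= after SAMPLE= at line " NR; exit 1}' sample.config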
The usage of the script fusion_finder.py is pretty much identical to that of the collapse_isoforms_by_sam.py script. In fact, collapse_isoforms_by_sam.py by default ignores chimeric transcripts because it requires that a single continuous alignment have at least 99% coverage. (The 99% threshold can be lowered using the -c option.)
usage: fusion_finder.py [-h] [--input INPUT] [--fq] -s SAM -o PREFIX
[--cluster_report_csv CLUSTER_REPORT_CSV]
[--dun-merge-5-shorter] [-c MIN_LOCUS_COVERAGE]
[--min_locus_coverage_bp MIN_LOCUS_COVERAGE_BP]
[-t MIN_TOTAL_COVERAGE] [-d MIN_DIST_BETWEEN_LOCI]
optional arguments:
--input INPUT Input FA/FQ filename
--fq Input is a fastq file (default is fasta)
-s SAM, --sam SAM Sorted GMAP SAM filename
-o PREFIX, --prefix PREFIX
Output filename prefix
--cluster_report_csv CLUSTER_REPORT_CSV
cluster_report.csv, optional, if given will generate
count info.
--dun-merge-5-shorter
Don't collapse shorter 5' transcripts (default: turned
off)
-c MIN_LOCUS_COVERAGE, --min_locus_coverage MIN_LOCUS_COVERAGE
Minimum per-locus coverage in percentage (default:
0.05)
--min_locus_coverage_bp MIN_LOCUS_COVERAGE_BP
Minimum per-locus coverage in bp (default: 1 bp)
-t MIN_TOTAL_COVERAGE, --min_total_coverage MIN_TOTAL_COVERAGE
Minimum total coverage (default: 0.99)
-d MIN_DIST_BETWEEN_LOCI, --min_dist_between_loci MIN_DIST_BETWEEN_LOCI
Minimum distance between loci, in bp (default: 10000)
The criteria fusion_finder.py currently uses for a fusion candidate are that a single transcript must:
- Map to two or more separate loci
- Have at least 5% of the transcript aligned at each mapped locus
- Have a combined alignment coverage (over all mapped loci) of at least 99%
- Have mapped loci that are at least 10 kb apart
The thresholds (5% per-locus coverage, 99% total coverage, 10 kb between loci) are rather ad hoc and are meant to rule out generally low-quality transcripts, junk transcripts, and mismappings. If you have better suggestions for parameter values, please raise them via the Issues, and I could eventually open up the parameter choices.
If you don't already have a SAM alignment, create one using the following commands:
minimap2 -ax splice -t 30 -uf --secondary=no -C5 \
hg38.fa hq_isoforms.fastq > hq_isoforms.fastq.sam
sort -k 3,3 -k 4,4n hq_isoforms.fastq.sam > hq_isoforms.fastq.sorted.sam
IMPORTANT: You MUST run minimap2 with --secondary=no so there are no secondary alignments!
Next, run fusion_finder.py:
fusion_finder.py --input hq_isoforms.fastq --fq -s hq_isoforms.fastq.sorted.sam \
    -o hq_isoforms.fasta.fusion \
    --cluster_report_csv cluster_report.csv
The output GFF file hq_isoforms.fasta.fusion.gff contains the information for the fusion candidates. Each candidate will have two or more "fusion components". For example, you may see GFF entries for PBfusion.1.1, PBfusion.1.2, PBfusion.2.1, and PBfusion.2.2. The pair (PBfusion.1.1, PBfusion.1.2) denotes the two fusion components of the transcript PBfusion.1. This naming system is a little different from the output of collapse_isoforms_by_sam.py, where PB.1.1 and PB.1.2 mean two separate transcripts mapping to the same locus.
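Given this naming scheme, a quick way to count distinct fusion transcripts is to strip the component suffix from the PBfusion IDs (assuming the IDs appear in the GFF attributes as shown):
# 'PBfusion.1.1' and 'PBfusion.1.2' both match as 'PBfusion.1'
grep -oE 'PBfusion\.[0-9]+' hq_isoforms.fasta.fusion.gff | sort -u | wc -l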
cluster_report.csv is optional. You can find this file in your SMRTLink job directory at <jobid>/tasks/pbtranscript.tasks.combine_cluster_bins-0/cluster_report.csv. If provided, an abundance file hq_isoforms.fasta.fusion.abundance.txt will be generated.
Read more about how to follow up with classifying and filtering fusion candidates.
We now recommend using SQANTI3 to summarize junctions and filter for potential artifacts.
See the Alzheimer Iso-Seq UCSC track for an example of colored BED files.
Input:
- GTF/collapsed GFF
- SQANTI3 classification output (which used the input GTF above)
Output:
- A BED12 file containing RGB color codes, where the darkest color (black) marks the most abundant isoform in a gene and light pink the least abundant
Prerequisite:
- Cupcake installed (this repo)
- gtfToGenePred and genePredToBed (download from UCSC utilities)
Given the GTF/collapsed GFF output from running collapse_isoforms_by_sam.py (or other tools), you can convert the GTF to BED format, then color the isoforms using the FL count information, normalized per gene.
Usage for the script is:
usage: color_bed12_post_sqanti.py [-h]
class_filename bed12_filename output_prefix
positional arguments:
class_filename SQANTI(3) classification filename
bed12_filename Input BED12 filename (converted from same SQANTI3 input GTF)
output_prefix
For example:
gtfToGenePred input.gtf input.genePred
genePredToBed input.genePred input.bed12
color_bed12_post_sqanti.py input.sqanti_classification.txt input.bed12 output.colored
The script will automatically look for the FL column in the SQANTI3 classification output (you must run SQANTI3 with --fl_count supplied). If there are multiple samples, the FL counts will appear as columns FL.sample1, FL.sample2, etc. in the SQANTI3 classification output file, and the script will output files such as output.colored.CPM.sample1.bed12, output.colored.CPM.sample2.bed12, etc.
Right now we are using a unit called CPM ("full-length counts per million"), and that number is displayed for each isoform in the format "PB.X.Y | SQANTI3-classified gene name | CPM".
NOTE: This is a substitute for those who do not have a reference annotation with which to run SQANTI3, which generates more comprehensive stats. The only prerequisite for generating the stats & figures here is a reference genome that you can align to and run collapse_isoforms_by_sam.py on.
The script usage is:
usage: simple_stats_post_collapse.py [-h] input_prefix
positional arguments:
input_prefix Input prefix, ex: hq.5merge.collapsed
For example:
# first you need to run collapse script to get a GFF file
$ collapse_isoforms_by_sam.py --input hq_isoforms.fastq --fq \
-s hq_isoforms.fastq.sorted.sam --dun-merge-5-shorter -o test
$ simple_stats_post_collapse.py test.collapsed
This will give you two output files, test.collapsed.simple_stats.txt and test.collapsed.exon_stats.txt. You can then use an R script such as beta/plot_simple_stats_post_collapse.R to plot figures.
(The R prerequisites are ggplot2, dplyr, and ggthemes.)
$ Rscript <path_to_Cupcake>/beta/plot_simple_stats_post_collapse.R \
test.collapsed.simple_stats.txt \
test.collapsed.exon_stats.txt \
test.myfigures
This will generate several test.myfigures.Rplot.*.png files in your directory.