Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Targeted Iso Seq QC Wiki

Elizabeth Tseng edited this page Aug 23, 2022 · 11 revisions

Last Updated: 01/23/2020

This wiki shows scripts used for targeted Iso-Seq analysis.

Python Requirement

How to set up the scripts

No installation is required. You just download the Cupcake repository from GitHub and add the appropriate paths.

$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
$ export PYTHONPATH=$PYTHONPATH:<path_to>/cDNA_Cupcake/sequence

To see that you have successfully gotten BioPython, bx-python, and the sequence/ subdir working, try the following in Python interpreter:

$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
>>> import Bio  # testing BioPython
>>> import bx   # testing bx-python
>>> import BED  # testing Cupcake sequence/

All of the above should work without error messages.

Calculate on-target rate from SAM alignment + probe BED file

You will need:

  1. The input FASTA or FASTQ file
  2. The aligned SAM file or GTF
  3. A tab-delimited probe BED file with at least the first three columns and an optional 4th column of <chrom> <start> <end> <gene_name>. No headers allowed!

You can generate the aligned SAM file using minimap2:

minimap2 -ax splice -t 30 -uf --secondary=no \
   hg38.fa input.fasta > input.fasta.sam

If your input is GTF instead of SAM, you must use the --gtf option.

The script usage is:

python calc_probe_hit_from_sam.py [BED] [FASTA/FASTQ] [SAM/GTF] 
  --start_base {0 or 1}
  --end_base   {0 or 1}
  -o [OUTPUT]
  [--gtf]

Where --start_base, --end_base indicates whether the start/end index is 0-based or 1-based.

For example:

python <path_to>/cDNA_Cupcake/sequence/calc_probe_hit_from_sam.py \
         --gtf \
         my-probes.bed input.fasta input.gtf \
         --start_base 0 --end_base 0 \
         -o input.gtf.probe_hit.txt

The output file will have the following fields:

read_id         : the read ID
read_len        : length of read
num_probe       : number of distinct probes overlapped
num_base_overlap: total base of the read that overlaps with a probe
loci            : mapped start-end of the read on the genome
genes           : the 4th column of the probe BED file, if given