Skip to content

EG Workshop LRT 2

GenomeRIK edited this page Nov 12, 2020 · 17 revisions

Processing Nanopore cDNA sequencing data

See TAMA Wiki for TAMA Manual: https://github.com/GenomeRIK/tama/wiki

Path to TAMA:

  /home/training/bin/tama

Path to reference files:

  /home/training/long_read_transcriptomics/rkuo/0_ref_files

Basecalled Fastq reads (using Guppy):

  /home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1

Run SeqKit

  #!/bin/bash
  hpath='/home/training/long_read_transcriptomics/rkuo/0_ref_files/fastq_1/' # path to fastq files from Guppy
  opath=`pwd` # define output path
  cd ${hpath} # go to fastq folder
  files=(`ls | grep "fastq.gz$" `) # get all fastq file names
  numfiles=`ls | grep "fastq.gz$" | wc -l` # count number of files
  let "numfiles=$numfiles-1"
  for j in $(seq 0 $numfiles) # loop through processing each file
  do
  echo ${j}
  input="${files[$j]}"
  outfile=`echo ${files[$j]} | awk -F ".gz" '{print "seqkit_"$1}'`
  seqkit seq -Q 7 ${input} > ${opath}/${outfile} # run sekit on each file
  done

Example to run: bash script.sh

The output is a new fastq file for each original fastq file with low quality reads removed.

Combine all the fastq outputs from seqkit

  files=(`ls | grep "^seqkit" |grep "fastq$"`)
  numfiles=`ls |grep "^seqkit" |grep "fastq$" | wc -l`
  let "numfiles=$numfiles-1"
  outfile="all_seqkit.fastq"
  rm ${outfile}
  for j in $(seq 0 $numfiles)
  do
  echo ${j}
  cat ${files[$j]} >> ${outfile}
  done

The output is a single file containing all the sequences from all the input fastq files.

Run Pychopper

Remember to activate the base conda environment: conda activate base

  pychopper='/home/training/miniconda3/bin/cdna_classifier.py'
  fastq='/home/training/long_read_transcriptomics/rkuo/2_nanopore/2_seqkit/all_seqkit.fastq' # replace this with the path to your seqkit fastq file
  suffix='sample_ont'
  ppath='/home/training/long_read_transcriptomics/rkuo/2_nanopore/3_pychopper/'
  pfile='primers_teloprime_ont.fa'
  python ${pychopper} -b ${ppath}${pfile} -m edlib -r report_${suffix}.pdf -u unclassified_${suffix}.fq -w rescued_${suffix}.fq ${fastq} full_length_output_${suffix}.fq

Outputs:

  cdna_classifier_report.tsv
  full_length_output_sample_ont.fq
  report_sample_ont.pdf
  rescued_sample_ont.fq
  unclassified_sample_ont.fq

Convert fastq to fasta

full_length_output_sample_ont.fq -> full_length_output_sample_ont.fa

  spath='/home/training/bin/tama/tama_go/format_converter/'
  pscript='tama_convert_nanopore_fastq_fasta.py'
  fastq='full_length_output_sample_ont.fq'
  outfile='full_length_output_sample_ont.fa'
  python ${spath}${pscript} ${fastq} ${outfile}

Map reads to the genome

Use the fasta file we just created: full_length_output_sample_ont.fa

  ref='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa'
  query='full_length_output_sample_ont.fa' # make sure to adjust path for your folder structure
  outfile='mm2_sample_ont_chrom28.sam'
  minimap2 --secondary=no -ax splice  -t 2 ${ref} ${query} > ${outfile}

Output is a SAM file.

Convert SAM to BAM

SAM is text file. BAM is a compressed binary version of SAM.

  sam='mm2_sample_ont_chrom28.sam'
  bam=`echo ${sam} | sed 's/sam$/bam/'`
  samtools view -bS  ${sam} > ${bam}

Sort BAM file

  bam='mm2_sample_ont_chrom28.bam'
  outfile=`echo ${bam} | sed 's/.bam/_sort.bam/'` # just an automatic way to make new name
  samtools sort ${bam} -o ${outfile}

Check BAM file:

  samtools view mm2_sample_ont_chrom28.bam | awk '{print $2}' | sort | uniq -c

Remember to change conda environment:

  conda deactivate

Run TAMA Collapse

  spath='/home/training/bin/tama/'
  pscript='tama_collapse.py'
  fpath='/home/training/long_read_transcriptomics/rkuo/2_nanopore/4_mapping/' ## adjust this pathway for where your BAM file is
  sam='mm2_sample_ont_chrom28_sort.bam' #the bam file we just made, adjust the name if you used a different name
  prefix='tc_mm2_sample_ont_chrom28'
  capflag='capped'
  awobble='100'
  zwobble='100'
  sjt='10'
  fasta='/home/training/long_read_transcriptomics/rkuo/0_ref_files/Gallus_gallus.GRCg6a.dna_sm.chromosome.28.fa' # chaeck that your fasta file is there and looks ok
  python ${spath}${pscript} -s ${fpath}${sam} -f ${fasta} -p ${prefix} -d merge_dup -x ${capflag} -a ${awobble} -z ${zwobble} -sj sj_priority -sjt ${sjt} -log log_off -b BAM

Outputs:

  tc_mm2_sample_ont_chrom28.bed
  tc_mm2_sample_ont_chrom28_local_density_error.txt
  tc_mm2_sample_ont_chrom28_polya.txt
  tc_mm2_sample_ont_chrom28_read.txt
  tc_mm2_sample_ont_chrom28_strand_check.txt
  tc_mm2_sample_ont_chrom28_trans_read.bed
  tc_mm2_sample_ont_chrom28_trans_report.txt
  tc_mm2_sample_ont_chrom28_varcov.txt
  tc_mm2_sample_ont_chrom28_variants.txt