Skip to content

Latest commit

 

History

History
157 lines (116 loc) · 5.22 KB

README.md

File metadata and controls

157 lines (116 loc) · 5.22 KB

pbtk logo

pbtk

PacBio BAM toolkit


Availability

Latest version can be installed via bioconda package pbtk.

Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.

Tools

This repository is replacing individual tool repositories and binaries from pbbam. In bioconda, pbtk is a dependency of pbbam, so you won't see immediately that those binaries are longer from pbbam directly.

  • bam2fasta
  • bam2fastq
  • ccs-kinetics-bystrandify
  • extracthifi
  • pbindex
  • pbindexdump
  • pbmerge
  • zmwfilter

Usage

bam2fastx

Tools bam2fasta and bam2fastq have identical interfaces and transform multiple PacBio BAM and/or DataSet XML files into a compressed FASTA or FASTQ file, respectively:

# generates out.fasta.gz
bam2fasta -o out in.bam
bam2fasta -o out in.xml

# generates out.fastq.gz
bam2fastq -o out in_1.bam in_2.bam in_3.xml in_4.bam

Option -u disables compression (drops .gz extension), while option -c <int> determines the Gzip compression level.

Option -p/--seqid-prefix <str> adds the provided prefix to each sequence header.

Additionally, input files can be split depending on barcode pairs into multiple files:

# generates multiple out.{barcode}_{barcodePair}.fasta.gz
bam2fasta --split-barcodes -o out in1.bam in2.bam

ccs-kinetics-bystrandify

Converts a PacBio BAM or DataSet XML file containing CCS kinetics tags to a pseudo-bystrand file with pw and ip tags that can be used as a substitute for subreads in applications expecting such kinetics information:

ccs-kinetics-bystrandify in.bam out.bam
ccs-kinetics-bystrandify in.xml out.xml

Option --min-coverage <int> specifies the minimum number of passes per strand (tags fn and rn) for creating a strand-specific read.

extracthifi

Simple tool for extracting reads with accuracy above QV 20 (0.99) from a given BAM file:

extracthifi in.bam out.bam

pbindex

Minimalistic tool which creates an index file that enables random access into PacBio BAM files:

# generates in.bam.pbi
pbindex in.bam

pbindexdump

Tool which transforms PBI files to JSON or c++ format:

pbindexdump in.bam.pbi > out.json
pbindexdump --format cpp in.bam.pbi > out.cpp

Option --json-indent-level <int> defines the indentation of the JSON file, while option --json-raw modifies the output JSON file to more closely reflect the PBI file format.

Alternatively, hole numbers in plain text can be reported with:

pbindexdump --zmws-only in.bam.pbi > out.txt

Note: in case of subreads, the output text file can contain multiple equal hole numbers (as opposed to zmwfilter --show-all which reports only unique ones).

pbmerge

Simple tool which merges several PacBio BAM files together, either by providing them on the command line, a DataSet XML or a file containing one file name per line:

pbmerge in1.bam in2.bam in3.bam > out.bam
pbmerge -o out.bam in.xml
pbmerge in.fofn > out.bam

Option --no-pbi disables creation of the index file.

zmwfilter

Utility tool for filtering PacBio BAM, DataSet XML or FASTX files. Plain filtering based on ZMW hole numbers is supported for any input format, given that the output format is the same, by providing an include list or an exclude list. That can be either in form of a comma separated list on the command line or a single file containing one hole number per line:

zmwfilter --include 1,2,4,8,16 in.bam out.bam
zmwfilter --include hole_numbers.txt in.fasta out.fasta

zmwfilter --exclude 42 in.xml out.bam
zmwfilter --exclude hole_numbers.txt in.xml out.fastq

ZMW hole numbers present in a PacBio file can be obtained with option --show-all and without providing an output file:

zmwfilter --show-all in.bam > out.txt

Note: Functionality described below is for BAM and DataSet XML files only.

Filtering reads by their names can be achieved by providing a file which contains one read name per line (following PacBio query template name convention):

zmwfilter --names read_names.txt in.bam out.bam

BAM files can also be randomly downsampled to a provided number of ZMWs or to a fraction of the total count (for reproducibility use a fixed seed):

zmwfilter --downsample 0.333 in.xml out.bam
zmwfilter --downsample-count 1024 --downsample-seed 42 in.bam out.bam

Additionally, filtering can be constrained by providing a minimal number of passes (incompatible with --names <str>):

zmwfilter --num-passes 2 --include hole_numbers.txt in.bam out.bam
zmwfilter --num-passes 4 --downsample 0.333 in.bam out.bam

Note: options --include <str>, --exclude <str>, --show-all, --names <str>, --downsample <float> and --downsample-count <int> are all mutually exclusive!

Changelog

  • 3.4.0

    • SMRT Link v25.1 release
    • Support mixed BAM types in pbmerge
  • 3.1.1

    • SMRT Link v13 release
    • Fix ccs-bystrandify-kinetics output
  • 3.0.0

    • Add zmwfilter —show-all
    • Add pbindexdump —zmws-only
    • Add REVIO platform
  • 1.0.0

    • Gather all tools into pbtk