Tandem repeat genotyping and visualization from PacBio HiFi data

PacificBiosciences/trgt

TRGT

Tandem repeat genotyping tool for HiFi sequencing data

TRGT is a tool for targeted genotyping of tandem repeats from PacBio HiFi data. In addition to basic size genotyping, TRGT profiles the sequence composition, mosaicism, and CpG methylation of each analyzed repeat, and it can visualize the reads overlapping each repeat.

Early version warning

Please note that TRGT is still under active development. We anticipate some changes to the input and output file formats of TRGT.

Availability

Joint analysis of multiple samples

TRGT outputs VCFs containing repeat alleles from each region in the repeat catalog. To facilitate analysis of repeats across multiple samples, VCFs can be either merged into a multi-sample VCF using the merge sub-command or converted into a database using the TDB tool (formerly called TRGTdb). TDB offers many advantages over multi-sample VCFs, including simpler data extraction, support for queries, and reduced file sizes.
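
As a rough illustration of the data extraction that a multi-sample VCF requires, the sketch below pulls one per-sample FORMAT field out of a single VCF record using only standard VCF column conventions. The record, the sample names, and the helper `format_values` are made up for this example; real TRGT output contains more fields, and their exact layout should be checked against the VCF header.

```python
from typing import Dict, List


def format_values(vcf_line: str, sample_names: List[str], key: str) -> Dict[str, List[str]]:
    """Return the comma-separated values of one FORMAT key for every sample in a VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")   # column 9 lists the FORMAT keys
    idx = format_keys.index(key)
    result = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = sample_field.split(":")
        # The VCF spec allows trailing FORMAT values to be dropped; treat those as missing.
        raw = values[idx] if idx < len(values) else "."
        result[name] = raw.split(",")
    return result


# Hypothetical two-sample record for a CAG repeat. The leading "G" in REF/ALT is the
# padding base described in the 1.0.0 changelog entry; MC holds per-allele motif counts.
record = "\t".join([
    "chr1", "1000", ".", "GCAGCAG", "GCAGCAGCAG,G", ".", ".", "TRID=example_locus",
    "GT:MC", "0/1:2,3", "1/2:3,0",
])

print(format_values(record, ["sampleA", "sampleB"], "MC"))
```

For anything beyond a quick check, an established VCF parser or the TDB tool mentioned above is likely a better fit than hand-rolled string splitting.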

Documentation

Need help?

If you notice any missing features or bugs, or need assistance with analyzing the output of TRGT, please don't hesitate to reach out by email or open a GitHub issue.

Support information

TRGT is currently in active development and is intended for research use only and not for use in diagnostic procedures. While efforts have been made to ensure that TRGT lives up to the quality that PacBio strives for, we make no warranty regarding this software.

As TRGT is not covered by any service level agreement or the like, please do not contact PacBio Field Applications Scientists or PacBio Customer Service for assistance with any TRGT release. Please report all issues through GitHub instead. We make no warranty that any such issue will be addressed, to any extent or within any time frame.

Citation

Please consider citing the paper describing TRGT:

Dolzhenko E, English A, Dashnow H, De Sena Brandine G, Mokveld T, Rowell WJ, Karniski C, Kronenberg Z, Danzi MC, Cheung W, Bi C, Farrow E, Wenger A, Martínez-Cerdeño V, Bartley TD, Jin P, Nelson D, Zuchner S, Pastinen T, Quinlan AR, Sedlazeck FJ, Eberle MA. Characterization and visualization of tandem repeats at genome scale. 2024

Full Changelog

  • 0.3.4
    • Improved label spacing in TRVZ plots
  • 0.4.0
    • Added TRVZ tutorial
    • Added sample karyotype parameter (XX or XY)
    • Renamed VCF genotype field ALCI to ALLR
    • Made genotyping algorithm changes to improve accuracy
  • 0.5.0
    • The genotyper now uses information about SNPs adjacent to repeats
    • BAM files now contain read-to-allele assignments
    • Added support for gzip compressed repeat files
    • Improved error handling and error messages
  • 0.6.0
    • Add alignment CIGARs to spanning.bam reads
    • Increase read extraction region
    • Cluster genotyper reports confidence intervals
    • Improved error handling of invalid input files (genome, catalog and reads)
  • 0.7.0
    • Read phasing information can now be used during repeat genotyping (via HP tags)
    • Users can now define complex repeats by specifying motif sequences in the MOTIFS field and setting STRUC to <locus_name>
    • The original MAPQ values in the input reads are now reported in the BAM output
    • BAMlet sample name can now be provided using the --sample-name flag; if it is not provided, it is extracted from the input BAM or file stem (addressing issue #18)
  • 0.8.0
    • Breaking change: Motif spans and counts (MS and MC fields) and allele purity (AP field) are now computed with an HMM-based algorithm for all repeats; expect some differences in results relative to previous versions
    • Allele purity of zero-length alleles is now reported as a missing value in the VCFs
    • The spanning.bam output file now carries over the QUAL values and mapping strand from the input reads
    • Added an advanced flag --output-flank-len that controls the number of flanking bases reported in the spanning.bam files and shown in trvz plots
    • A crash that may occur on BAMs where methylation was called twice has been fixed
    • Optimizations to the --genotyper=cluster mode, including haploid genotyping of the X chromosome when --karyotype is set to XY
  • 0.9.0
    • Add support for polyalanine repeats (by allowing N characters in the motif sequence)
    • Fix a bug causing TRVZ to error out on polyalanine repeats
  • 1.0.0
    • Breaking change: TRGT and TRVZ are now merged into a single binary. Users need to run subcommands trgt genotype and trgt plot for genotyping and visualization, respectively.
    • Breaking change: A padding base is now automatically added to all genotyped allele sequences in the VCF file, ensuring better compliance with VCF standards and handling of zero-length alleles.
    • Added a new subcommand trgt validate. This command allows for validation of a repeat catalog against a given reference genome and reports statistics for any malformed entries.
    • Lower memory footprint: Better memory management significantly reduces memory usage with large repeat catalogs.
    • Updated error handling: Malformed entries are now logged as errors without terminating the program.
    • Added shorthand CLI options to simplify command usage.
  • 1.1.0
    • Added a new subcommand trgt merge. This command merges VCF files generated by trgt genotype into a joint VCF file. Works with VCFs generated by all versions of TRGT (the resulting joint VCF will always be in the TRGT ≥v1.0.0 format which includes padding bases).
    • Added subsampling of regions with ultra-high coverage (>MAX_DEPTH * 3, by default 750); implemented via reservoir sampling.
    • Fixed a cluster genotyper bug that occurred when only a single read covered a locus.
    • Added new logic for filtering non-HiFi reads: remove up to 3% of lower quality reads that do not match the expected repeat sequence.
  • 1.1.1
    • Hotfix: Read filtering logic no longer removes reads without RQ tags.
  • 1.1.2
    • Hotfix: Prevent genotyping without reads.
    • Added the --disable-bam-output flag to trgt genotype, allowing users to disable BAMlet generation. However, please note that BAMlets are still required for downstream tasks like trgt plot.
  • 1.2.0
    • trgt merge:
      • Multi-sample VCF Merging: Added support for merging TRGT VCFs with any number of samples, allowing updates to large, population-scale datasets with new samples.
      • Synced contig indexing: Introduced support for VCFs with inconsistent contig orderings. Additionally, the new --contigs flag allows specifying a comma-separated list of contigs to be merged.
      • The reference genome is no longer required when merging TRGT VCFs from version 1.0.0 or later.
      • Merging now skips and logs problematic loci by default. Use the --quit-on-errors flag to terminate on errors. Statistics are logged post-merge, including counts of failed and skipped TRs.
    • trgt validate
      • Always outputs statistics directly to stdout and stderr instead of logging them.
    • Bug fix:
      • Resolved issue with handling bgzip-compressed BED files.
  • 1.3.0
    • Plotting code has been refactored as we prepare to revamp repeat visualizations
    • The maximum number of reads per allele to plot can now be specified by --max-allele-reads
    • Bug fix: repeat identifiers are now permitted to contain commas
  • 1.4.0
    • Parameters appropriate for targeted sequencing can now be set with the --preset targeted option
    • Waterfall plots no longer panic when there are no reads in a locus
    • Algorithmic changes to --genotyper cluster allow fewer reads to be assigned to an allele; this may result in minor changes to consensus sequence and read assignment
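
The 1.1.0 entry above mentions that ultra-high-coverage regions are subsampled via reservoir sampling. The following is a minimal sketch of the general technique (the classic "Algorithm R"), not TRGT's actual implementation; the function name `reservoir_sample`, the parameter `k`, and the integer stand-ins for reads are all hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick up to k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for reads overlapping a very deep locus (hypothetical numbers).
reads = range(10_000)
kept = reservoir_sample(reads, k=5, seed=0)
print(len(kept))
```

The appeal of this approach is that it needs only a single pass over the reads and memory proportional to `k`, so the total depth at a locus does not have to be known in advance.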

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.