Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add longtr support #232

Merged
merged 9 commits into from
Sep 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion doc/CALLERS.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
Supported TR Genotypers
=======================

TRTools currently supports 5 tandem repeat genotypers. It also supports the Beagle imputation software (see :ref:`below <Beagle_section>`).
TRTools currently supports 6 tandem repeat genotypers. It also supports the Beagle imputation software (see :ref:`below <Beagle_section>`).
We summarize them in the first table and provide some basic parameters of their functionality in the second.
For more information on a genotyper, please see its website linked below.

Expand Down Expand Up @@ -39,6 +39,9 @@ For more information on a genotyper, please see its website linked below.
| PopSTR_ (v2.0) | Designed for genome-wide genotyping |
| | of short or expanded TRs. |
+----------------------------+--------------------------------------+
| LongTR_ (v1.0) | Designed for genome-wide genotyping |
| | of STRs and VNTRs from long reads. |
+----------------------------+--------------------------------------+

|

Expand All @@ -58,6 +61,8 @@ For more information on a genotyper, please see its website linked below.
+----------------------------+--------------------------+----------------------------+------------------------+--------------------------+-------------------------+------------------------+
| PopSTR_ (v2.0) | 1-6bp | Yes | Length | 540,1401 (hg38) | Illumina | Many |
+----------------------------+--------------------------+----------------------------+------------------------+--------------------------+-------------------------+------------------------+
| LongTR_ (v1.0) | 1+bp | No | Length, sequence | No ref provided | PacBio HiFi, ONT | Many |
+----------------------------+--------------------------+----------------------------+------------------------+--------------------------+-------------------------+------------------------+

Since each of these tools take as input a list of TRs to genotype, they could also be used on custom panels of TR loci.
Tool information and reference panel numbers shown above are based on downloads from the github repository of each tool as of July 2, 2020.
Expand All @@ -74,6 +79,7 @@ see :ref:`Contributing` for more information.
.. _GangSTR: https://github.com/gymreklab/gangstr
.. _HipSTR: https://hipstr-tool.github.io/HipSTR/
.. _PopSTR: https://github.com/DecodeGenetics/popSTR
.. _LongTR: https://github.com/gymrek-lab/longtr

.. _Beagle_section:

Expand Down
4 changes: 3 additions & 1 deletion doc/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,17 +121,19 @@ TRTools supports VCFs from the following TR genotyping tools:
* GangSTR_ version 2.4 or higher
* HipSTR [`main repo <https://github.com/tfwillems/HipSTR>`_] [`Gymrek Lab repo <https://github.com/gymrek-lab/hipstr>`_]
* PopSTR_ version 2 or higher
* LongTR_

See our description of the :doc:`features and example use-cases </CALLERS>` of each of these tools.

..
please ensure this list of links remains the same as the one in the main README
please ensure this list of links remains the same as the one in CALLERS.rst

.. _AdVNTR: https://advntr.readthedocs.io
.. _ExpansionHunter: https://github.com/Illumina/ExpansionHunter
.. _GangSTR: https://github.com/gymreklab/gangstr
.. _HipSTR: https://hipstr-tool.github.io/HipSTR/
.. _PopSTR: https://github.com/DecodeGenetics/popSTR
.. _LongTR: https://github.com/gymrek-lab/longtr

Testing
-------
Expand Down
Binary file added example-files/HG002_htt_test.vcf.gz
Binary file not shown.
Binary file added example-files/HG002_htt_test.vcf.gz.tbi
Binary file not shown.
Binary file added example-files/HG003_htt_test.vcf.gz
Binary file not shown.
Binary file added example-files/HG003_htt_test.vcf.gz.tbi
Binary file not shown.
Binary file added example-files/longtr_testfile.vcf.gz
Binary file not shown.
Binary file added example-files/longtr_testfile.vcf.gz.tbi
Binary file not shown.
12 changes: 12 additions & 0 deletions test/cmdline_tests.sh
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,7 @@ runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_gangstr.sorted.vcf.gz --ou
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_gangstr.sorted.vcf.gz --out stdout --mean --vcftype eh"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_gangstr.sorted.vcf.gz --out stdout --mean --vcftype advntr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_gangstr.sorted.vcf.gz --out stdout --mean --vcftype popstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_gangstr.sorted.vcf.gz --out stdout --mean --vcftype longtr"

runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_hipstr.sorted.vcf.gz --out stdout --mean --vcftype gangstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_hipstr.sorted.vcf.gz --out stdout --mean --vcftype eh"
Expand All @@ -129,16 +130,20 @@ runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out std
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out stdout --mean --vcftype hipstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out stdout --mean --vcftype advntr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out stdout --mean --vcftype popstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out stdout --mean --vcftype longtr"

runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_popstr.sorted.vcf.gz --out stdout --mean --vcftype gangstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_popstr.sorted.vcf.gz --out stdout --mean --vcftype hipstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_popstr.sorted.vcf.gz --out stdout --mean --vcftype advntr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_popstr.sorted.vcf.gz --out stdout --mean --vcftype eh"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_popstr.sorted.vcf.gz --out stdout --mean --vcftype longtr"

runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --out stdout --mean --vcftype gangstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --out stdout --mean --vcftype hipstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --out stdout --mean --vcftype popstr"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --out stdout --mean --vcftype eh"
runcmd_fail "statSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --out stdout --mean --vcftype longtr"


# Test mergeSTR on all supported tools
# AdVNTR
Expand Down Expand Up @@ -191,6 +196,11 @@ FILE2=${EXDATADIR}/NA12891_chr21_popstr.sorted.vcf.gz
FILE3=${EXDATADIR}/NA12892_chr21_popstr.sorted.vcf.gz
runcmd_pass "mergeSTR --vcfs ${FILE1},${FILE2},${FILE3} --out ${TMPDIR}/test_merge_popstr --vcftype popstr"

# LongTR
FILE1=${EXDATADIR}/HG002_htt_test.vcf.gz
FILE2=${EXDATADIR}/HG003_htt_test.vcf.gz
runcmd_pass "mergeSTR --vcfs ${FILE1},${FILE2} --out ${TMPDIR}/test_merge_longtr --vcftype longtr"

# Test mergeSTR on a file with list of VCFs
FILE1=${EXDATADIR}/NA12878_chr21_hipstr.sorted.vcf.gz
FILE2=${EXDATADIR}/NA12891_chr21_hipstr.sorted.vcf.gz
Expand All @@ -211,6 +221,7 @@ runcmd_pass "dumpSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --adv
runcmd_pass "dumpSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out ${TMPDIR}/test_dumpstr_eh --eh-min-call-LC 50 --num-records 10 --drop-filtered"
runcmd_pass "dumpSTR --vcf ${EXDATADIR}/trio_chr21_gangstr.sorted.vcf.gz --out ${TMPDIR}/test_dumpstr_gangstr --min-locus-callrate 0.9 --num-records 10"
runcmd_pass "dumpSTR --vcf ${EXDATADIR}/trio_chr21_hipstr.sorted.vcf.gz --vcftype hipstr --out ${TMPDIR}/test_dumpstr_hipstr --filter-hrun --num-records 10"
runcmd_pass "dumpSTR --vcf ${EXDATADIR}/longtr_testfile.vcf.gz --vcftype longtr --out ${TMPDIR}/test_dumpstr_longtr --use-length --filter-hrun --num-records 10"
runcmd_pass "dumpSTR --vcf ${EXDATADIR}/trio_chr21_popstr.sorted.vcf.gz --out ${TMPDIR}/test_dumpstr_popstr --min-locus-callrate 0.9 --popstr-min-call-DP 10 --num-records 100"

FILE1=${TMPDIR}/NA12878_advntr_reheader.vcf.gz
Expand All @@ -228,6 +239,7 @@ runcmd_pass "qcSTR --vcf ${EXDATADIR}/trio_chr21_hipstr.sorted.vcf.gz --out ${TM
runcmd_pass "qcSTR --vcf ${EXDATADIR}/NA12878_chr21_eh.sorted.vcf.gz --out ${TMPDIR}/test_qc_eh"
runcmd_pass "qcSTR --vcf ${EXDATADIR}/NA12878_chr21_advntr.sorted.vcf.gz --out ${TMPDIR}/test_qc_advntr"
runcmd_pass "qcSTR --vcf ${EXDATADIR}/trio_chr21_popstr.sorted.vcf.gz --out ${TMPDIR}/test_qc_popstr"
runcmd_pass "qcSTR --vcf ${EXDATADIR}/longtr_testfile.vcf.gz --vcftype longtr --out ${TMPDIR}/test_qc_longtr"

runcmd_pass "qcSTR --vcf ${TMPDIR}/test_merge_gangstr.vcf --out ${TMPDIR}/test_qc_gangstr --period 4 --quality per-locus"
runcmd_pass "qcSTR --vcf ${TMPDIR}/test_merge_hipstr.vcf --out ${TMPDIR}/test_qc_hipstr --vcftype hipstr --samples ${EXDATADIR}/ex-samples.txt"
Expand Down
4 changes: 2 additions & 2 deletions trtools/annotaTR/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ Required parameters:

Other general parameters:

* :code:`--vcftype <string>`: Which genotyping tool generated the input VCF. Default = :code:`auto`. Necessary if it cannot be automatically inferred. One of: :code:`gangstr`, :code:`advntr`, :code:`hipstr`, :code:`eh`, :code:`popstr`.
* :code:`--vcftype <string>`: Which genotyping tool generated the input VCF. Default = :code:`auto`. Necessary if it cannot be automatically inferred. One of: :code:`gangstr`, :code:`advntr`, :code:`hipstr`, :code:`longtr`, :code:`eh`, :code:`popstr`.
* :code:`--outtype <string>`: Which output format to generate. Supported output types are :code:`vcf` or :code:`pgen`. If multiple types are listed (e.g. :code:`--outtype vcf pgen`), all specified output formats are generated. By default, only VCF output is generated at :code:`$outprefix.vcf`. If PGEN output is specified, the files :code:`$outprefix.pgen`, :code:`$outprefix.pvar`, and :code:`$outprefix.psam` are generated. See more on the `pgen format here <https://www.cog-genomics.org/plink/2.0/formats#pgen>`_
* :code:`--vcf-outtype <string>`: Type of VCF output to produce. This option is ignored unless using :code:`--outtype vcf`. Options: z=compressed VCF, v=uncompressed VCF, b=compressed BCF, u=uncompressed BCF, s=stdout
* :code:`--region <str>`: Restrict to processing a specific genomic region. Syntax: chr:start-end.
Expand Down Expand Up @@ -111,7 +111,7 @@ Additional relevant options:

If generating a VCF output file, this command will output a new file containing only STRs, with the following fields added back depending on the genotyper used to generate the reference panel:

* For HipSTR-based reference panels: INFO fields START, END, PERIOD are added
* For HipSTR-based or LongTR-based reference panels: INFO fields START, END, PERIOD are added
* For adVNTR: INFO fields RU, VID are added
* For GangSTR: INFO field RU is added
* For ExpansionHunter: INFO fields RU, VARID, RL are added
Expand Down
1 change: 1 addition & 0 deletions trtools/annotaTR/annotaTR.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
# Info fields copied from reference panel for each tool
INFOFIELDS = {
trh.VcfTypes.hipstr: ["START","END","PERIOD"],
trh.VcfTypes.longtr: ["START","END","PERIOD"],
trh.VcfTypes.advntr: ["RU", "VID"],
trh.VcfTypes.gangstr: ["RU"],
trh.VcfTypes.eh: ["RU","VARID","RL"]
Expand Down
2 changes: 1 addition & 1 deletion trtools/associaTR/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ Required positional parameters:

Optional input parameters:

* :code:`--vcftype` One of :code:`eh, hipstr, gangstr, popstr, advntr`
* :code:`--vcftype` One of :code:`eh, hipstr, longtr, gangstr, popstr, advntr`
Specify which caller produced the TR VCF, useful when the VCF is ambiguous
and the caller cannot be automatically inferred.
* :code:`--same-samples` - see the description of traits above
Expand Down
3 changes: 2 additions & 1 deletion trtools/associaTR/associaTR.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
from . import load_and_filter_genotypes
import trtools
import trtools.utils.utils as utils
import trtools.utils.tr_harmonizer as trh

pval_precision = 2

Expand Down Expand Up @@ -491,7 +492,7 @@ def run():
'Traits and the phenotype will be standardized to mean 0 and std 1 prior to regression, but '
'coefficients/standard errors are transformed back to the original scale before being written out.'
)
parser.add_argument('--vcftype', choices=['eh', 'hipstr', 'gangstr', 'popstr', 'advntr'],
parser.add_argument('--vcftype', choices=[str(item) for item in trh.VcfTypes.__members__],
help="Specify which caller produced the TR VCF, useful when the VCF is ambiguous "
"and the caller cannot be automatically inferred.")
parser.add_argument('--same-samples', default=False, action='store_true',
Expand Down
17 changes: 14 additions & 3 deletions trtools/dumpSTR/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ See a description of output files below.
Other general parameters:

* :code:`--vcftype <string>`: Which genotyping tool generated the input VCF. Default = :code:`auto`.
Necessary if it cannot be automatically inferred. One of: :code:`gangstr`, :code:`advntr`, :code:`hipstr`, :code:`eh`, :code:`popstr`.
Necessary if it cannot be automatically inferred. One of: :code:`gangstr`, :code:`advntr`, :code:`hipstr`, :code:`longtr`, :code:`eh`, :code:`popstr`.
* :code:`--zip` to output a bgzipped filtered VCF (:code:`$out.vcf.gz`) and tabix index (:code:`$out.vcf.gz.tbi`) instead of :code:`$out.vcf`.
* :code:`--num-records <int>`: only process this many records from the input VCF file

Expand Down Expand Up @@ -60,10 +60,10 @@ These filters are not specific to any tool and can be applied to any VCF file:
where p_i is the frequency of allele i

* :code:`--max-locus-het <float>`: Filters loci with high heterozygosity
* :code:`--use-length`: Use allele lengths, rather than sequences, to compute heterozygosity and HWE (only relevant for HipSTR, which reports sequence level differences in TR alleles)
* :code:`--use-length`: Use allele lengths, rather than sequences, to compute heterozygosity and HWE (only relevant for HipSTR and LongTR, which report sequence level differences in TR alleles)
* :code:`--filter-regions <BEDFILE[,BEDFILE12,...]>`: Filter TRs overlapping the specified set of regions. Must be used with :code:`--filter-regions-names`. Can supply a comma-separated list to each to apply multiple region filters. Bed files must be sorted and tabix-indexed. Note that regions are 0-based and inclusive of the start position, but exclusive of the end position.
* :code:`--filter-regions-names <string[,string2,...]>`: Filter names for each BED file specified in :code:`--filter-regions`.
* :code:`--filter-hrun`: Filter repeats with long homopolymer runs. Only used for HipSTR VCF files otherwise ignored. Filters pentanucleotides with homopolymer runs of 5bp or longer, or hexanucleotides with homopolymer runs of 6bp or longer.
* :code:`--filter-hrun`: Filter repeats with long homopolymer runs. Only used for HipSTR or LongTR VCF files otherwise ignored. Filters pentanucleotides with homopolymer runs of 5bp or longer, or hexanucleotides with homopolymer runs of 6bp or longer.
* :code:`--drop-filtered`: Do not output loci that were filtered, or loci with no calls remaining after filtering.

TRs passing all locus-level filters will be marked as "PASS" in the FILTER field.
Expand Down Expand Up @@ -133,6 +133,14 @@ HipSTR call-level filters
* :code:`--hipstr-max-call-DP <int>`: Maximum call coverage. Based on DP field.
* :code:`--hipstr-min-call-Q <float>`: Minimum call quality score. Based on Q field.

LongTR call-level filters
**************************
* :code:`--longtr-max-call-flank-indel <float>`: Maximum call flank indel rate. Computed as DFLANKINDEL/DP
* :code:`--longtr-min-supp-reads <int>`: Minimum supporting reads for each allele. Based on ALLREADS and GB fields
* :code:`--longtr-min-call-DP <int>`: Minimum call coverage. Based on DP field.
* :code:`--longtr-max-call-DP <int>`: Maximum call coverage. Based on DP field.
* :code:`--longtr-min-call-Q <float>`: Minimum call quality score. Based on Q field.

PopSTR call-level filters
**************************
* :code:`--popstr-min-call-DP <int>`: Minimum call coverage. Based on DP field.
Expand Down Expand Up @@ -170,3 +178,6 @@ Below are :code:`dumpSTR` examples using VCFs from supported TR genotypers. Data
# PopSTR
dumpSTR --vcf trio_chr21_popstr.sorted.vcf.gz --out test_dumpstr_popstr --min-locus-callrate 0.9 --popstr-min-call-DP 10 --num-records 100

# LongTR
dumpSTR --vcf longtr_testfile.vcf.gz --vcftype longtr --out test_dumpstr_longtr --min-locus-callrate 0.9 --longtr-min-call-Q 0.9

Loading
Loading