Skip to content

Commit

Permalink
add gzvcf output and update docs
Browse files Browse the repository at this point in the history
  • Loading branch information
Melissa Gymrek authored and Melissa Gymrek committed Sep 9, 2024
1 parent fd95517 commit 1703b33
Show file tree
Hide file tree
Showing 2 changed files with 8 additions and 3 deletions.
8 changes: 7 additions & 1 deletion trtools/annotaTR/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,8 @@ Required parameters:
Other general parameters:

* :code:`--vcftype <string>`: Which genotyping tool generated the input VCF. Default = :code:`auto`. Necessary if it cannot be automatically inferred. One of: :code:`gangstr`, :code:`advntr`, :code:`hipstr`, :code:`eh`, :code:`popstr`.
* :code:`--outtype <string>`: Which output format to generate. Supported output types are :code:`vcf` or :code:`pgen`. If multiple types are listed (e.g. :code:`--outtype vcf pgen`), all specified output formats are generated. By default, only VCF output is generated at :code:`$outprefix.vcf`. If PGEN output is specified, the files :code:`$outprefix.pgen`, :code:`$outprefix.pvar`, and :code:`$outprefix.psam` are generated. See more on the `pgen format here <https://www.cog-genomics.org/plink/2.0/formats#pgen>`_
* :code:`--outtype <string>`: Which output format to generate. Supported output types are :code:`vcf`, :code:`gzvcf`, or :code:`pgen`. If multiple types are listed (e.g. :code:`--outtype vcf pgen`), all specified output formats are generated. By default, only VCF output is generated at :code:`$outprefix.vcf`. You cannot output both VCF and gzipped VCF in the same command. If PGEN output is specified, the files :code:`$outprefix.pgen`, :code:`$outprefix.pvar`, and :code:`$outprefix.psam` are generated. See more on the `pgen format here <https://www.cog-genomics.org/plink/2.0/formats#pgen>`_
* :code:`--region <str>`: Restrict to processing a specific genomic region. Syntax: chr:start-end.
* :code:`--chunk-size <int>`: If writing PGEN output, load dosages in chunks of this many variants at a type. This can avoid out of memory errors if you are processing very large VCF files.

In addition to specifying input and output options above, you must specify at least one annotation operation to perform. These are described below.
Expand Down Expand Up @@ -64,6 +65,10 @@ where the argument to the :code:`--dosages` option specifies the method to compu
* :code:`beagpleap`: dosages for imputed calls are computed in a way that takes into account imputation uncertainty. This option requires the Beagle VCF FORMAT fields :code:`AP1` and :code:`AP2` to be present, which give the probability of each alternate allele on each of the two haplotypes of an individual. AP-based dosages are generated by computing a weighted sum of allele lengths for each haplotype and added together to get the total dosage for the genotype (:math:`\sum_{hap=1,2} \sum_{allele a} P(a_{hap})len(a)`). In the case of imputation with no uncertainty (all AP fields are either 0 or 1), dosages computed in this way should match best guess dosages.
* :code:`beagleap_norm`: same as above, but scaled to be between 0 and 2.

Additional dosage options:

* :code:`--warn-on-AP-error`: Output a warning, rather than quit, if invalid AP fields are encountered (i.e. if AP values are not found, do not sum to 1 or are negative, or if invalid dosage values are encountered after normalizing).

If annotating dosages, the following fields are added to VCF output files:

* FORMAT field :code:`TRDS`. This is a float value giving the estimated TR dosage for each call.
Expand Down Expand Up @@ -101,6 +106,7 @@ Additional relevant options:
* **rawalleles** means loci are matched on :code:`chrom:pos:ref:alt`
* **trimmedalleles** means loci are matched on :code:`chrom:pos:ref:alt` but ref and alt alleles are trimmed to remove common prefixes/suffixes. The trimmedalleles option must be used if you merged samples in your target VCF file using :code:`bcftools merge`, since that tool will modify alleles to remove common sequence (see `this issue <https://github.com/samtools/bcftools/issues/726>`_)
* :code:`--ignore-duplicates`: This flag outputs a warning if duplicate loci are detected in the reference. If this flag is not set and a duplicate locus is detected, the program quits.
* :code:`--update-ref-alt`: Update the REF/ALT allele sequences from the reference panel. Fixes issue with alleles being chopped after bcftools merge. Use with caution as this assumes allele order is exactly the same between the refpanel and target VCF. Only works when matching on locus id.

If generating a VCF output file, this command will output a new file containing only STRs, with the following fields added back depending on the genotyper used to generate the reference panel:

Expand Down
3 changes: 1 addition & 2 deletions trtools/annotaTR/annotaTR.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,8 +61,7 @@ def CheckAlleleCompatibility(record_ref, record_alt, panel_ref, panel_alt):
of the panel and target VCF are compatible. In particular:
- is the number of ALT alleles the same
- are all alleles are offset by the same number of bp
- are all the ALTs in the target VCF substrings of the
refpanel ALTS.
- are all the ALTs in the target VCF substrings of the refpanel ALTS.
If any of these fail then the alleles are deemed incompatible.
Parameters
Expand Down

0 comments on commit 1703b33

Please sign in to comment.