feat: Add options to facilitate debugging annotaTR runs on large files #234

gymreklab · 2024-08-30T23:29:55Z

This PR introduces multiple options to annotaTR that help with debugging issues that arose when processing large files.

Added option --region to enable running annotaTR on a specified genomic region (chr:start-end). Helpful for debugging.
Added option --update-ref-alt to force annotaTR top copy over the ref/alt alleles from the reference panel. Helpful in cases where we have run bcftools merge in which case allele sequences may be modified. This in some cases causes problems since the INFO/END field is not updated accordingly, causing parsing HipSTR records to fail (related to the offsets set here: https://github.com/gymrek-lab/TRTools/blob/master/trtools/utils/tr_harmonizer.py#L340). Required adding function CheckAlleleCompatibility to annotaTR.
Added ability to output to bgzipped VCF output with --outtype gzvcf. Useful when output files are huge and we will end up zipping them anyway.
It adds option --warn-on-AP-error which results in skipping loci where checks on AP fields fail. In these cases, rather than the program quitting, we output nan values for dosages. In particular the checks this is relevant to are:

Checking if the AP1/2 fields exist
Checking if they sum to more than 1
Checking for negative values
Checking if normalized values end up being >=2.1 or <=-0.1

Most of these invalid AP cases should still never happen. We have encountered rare cases where values sum to more than 1, likely due to rounding errors in cases with huge numbers of alleles. The main motivation is in cases where we run annotaTR on huge VCF files which takes many hours only to encounter a bad AP field at the very end and crash, or when the vast majority of AP fields are fine but a few problematic loci cause the whole run to fail.

Other specific changes related to this option: (1) Added option strict to GetDosages(). This defaults to true, in which case we throw ValueError for the cases above. If this is false, we output a warning and return all dosage values as np.nan. (2) Regardless of whether the strict option is set, added info to the error/warning messages about which locus was problematic to help with tracking down those cases.

Finally, I added a couple additional tests unrelated to these when attempting to get good test coverage on annotaTR.py.

Checklist

[ x] I've checked to ensure there aren't already other open pull requests for the same update/change
[ x] I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
[ x] At the top of the PR, I've listed any open issues that this PR will resolve. For example, "resolves #0" if this PR resolves issue #0

[ x] I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them

[x ] I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
[ x] All directories with large test files are listed in the "exclude" section of our pyproject.toml so that they do not appear in our PyPI distribution. All new files are also smaller than 0.5 MB.
[ x] I've updated the relevant REAMDEs with any new usage information and checked that the newly built documentation is formatted properly
[ x] All functions, modules, classes etc. still conform to numpy docstring standards
[x ] (if applicable) I've updated the pyproject.toml file with any changes I've made to TRTools's dependencies, and I've run poetry lock --no-update to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions
[ x] In the body of this PR, I've included a short address to the reviewer highlighting one or two items that might deserve their focus

aryarm

It's interesting to see this get refined and closer and closer to production! I have just a few small comments which are scattered throughout. Thanks for asking me for a review!

One thing that wasn't immediately clear to me: How does pgenlib.PgenWriter handle np.nan values? Are those getting encoded properly as missing? And, if so, maybe it would be good to have a test that checks for that via pgenlib.PgenReader ? I can help write such a test, if you'd like.

trtools/annotaTR/annotaTR.py

trtools/annotaTR/README.rst

trtools/utils/tr_harmonizer.py

trtools/annotaTR/tests/test_annotaTR.py

Co-authored-by: Arya Massarat <[email protected]>

gymreklab · 2024-09-16T16:58:09Z

Thanks for these comments @aryarm!

I updated the VCF output options as suggested (I wasn't totally sure if compressed BCF files should still be .bcf or should they be .bcf.gz? )
Regarding setting all samples with bad AP fields to np.nan vs. only the failing samples: I had deliberately set them all to np.nan for now. This is because it happens very rarely (like 2 out of 50K loci which are actually removed from our cleaned up panel) and at what look to be problematic loci, and I would rather just exclude those entire loci from downstream analysis. I am also not sure what sort of biases get introduced if we remove only the failing samples. Would could always change in the future to only set the failing ones to np.nan but my preference is to keep it this way for now.
For testing the downstream pgen output with np.nan values: curious on ideas for the best way to do this. For now, I have added a test file to force this case (trtools/testsupport/sample_vcfs/beagle/beagle_badap.vcf.gz) (used in test/cmdline_tests.sh). Running plink2 --validate on this file doesn't return any problems. Are there other checks we should do?

…xpect

aryarm

ok, sounds good! I think it just uses .bcf (not .bcf.gz), so we should be fine

I also added commit 692a65b to test that missing values are properly being written to the PGEN file output with --warn-on-AP-error

trtools/annotaTR/README.rst

Co-authored-by: Arya Massarat <[email protected]>

Add option to make annotaTR less strict on Beagle AP field checks

e9f23ae

gymreklab marked this pull request as ready for review August 30, 2024 23:32

gymreklab requested a review from nicholema August 30, 2024 23:34

gymreklab marked this pull request as draft August 31, 2024 04:20

Melissa Gymrek and others added 22 commits August 30, 2024 21:24

adding region flag to annotatr

55116b4

adding region flag to annotatr

e0ed3c7

adding region flag to annotatr

a4bf277

adding region flag to annotatr

fd8bfec

adding region flag to annotatr

38d2254

adding region flag to annotatr

65a3ad1

adding region flag to annotatr

232fa3c

adding print debug statements

1508c0e

adding print debug statements

4832a8e

adding print debug statements

66d6aa9

flag to address bcftools merge chopping

c77fecc

flag to address bcftools merge chopping

b0b0b38

change behavior of fixing ref/alt alleles in annotaTR

d6be025

move update ref alt option to annotation options

88a96f0

adding tests for new annotatr options

3e815d4

minor comment update

3b07862

add gzvcf outtype to annotatr

fd95517

add gzvcf output and update docs

1703b33

argument checking for update-ref-alt

6cb53fc

add warning for incompatible alleles

1e47575

add warning for suspicious allele lengths

f36bcd3

add warning for suspicious allele lengths

7ee4059

gymreklab changed the title ~~feat: Add option to make annotaTR less strict on Beagle AP field checks~~ feat: Add options to facilitate debugging annotaTR runs on large files Sep 9, 2024

update warning for suspicious allele lengths

ec532c7

gymreklab marked this pull request as ready for review September 9, 2024 04:18

gymreklab requested a review from aryarm September 9, 2024 04:18

aryarm reviewed Sep 13, 2024

View reviewed changes

gymreklab and others added 5 commits September 16, 2024 09:12

Update trtools/annotaTR/annotaTR.py

0cf4ccb

Co-authored-by: Arya Massarat <[email protected]>

Update trtools/annotaTR/README.rst

71defbb

Co-authored-by: Arya Massarat <[email protected]>

Update trtools/annotaTR/tests/test_annotaTR.py

9b3fb96

Co-authored-by: Arya Massarat <[email protected]>

adding more flexible vcf output options

ef192fa

add files to force bad AP error

0653dfb

add a test to check that the pgen output is consistent with what we e…

692a65b

…xpect

aryarm approved these changes Sep 17, 2024

View reviewed changes

trtools/annotaTR/README.rst Outdated Show resolved Hide resolved

Update trtools/annotaTR/README.rst

34ef793

Co-authored-by: Arya Massarat <[email protected]>

gymreklab merged commit 0d5d96b into master Sep 17, 2024
13 checks passed

gymreklab deleted the annotatr-less-strict-2 branch September 17, 2024 17:45

github-actions bot mentioned this pull request Aug 29, 2024

chore(master): release 6.1.0 #222

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add options to facilitate debugging annotaTR runs on large files #234

feat: Add options to facilitate debugging annotaTR runs on large files #234

gymreklab commented Aug 30, 2024 •

edited

Loading

aryarm left a comment •

edited

Loading

gymreklab commented Sep 16, 2024

aryarm left a comment

feat: Add options to facilitate debugging annotaTR runs on large files #234

feat: Add options to facilitate debugging annotaTR runs on large files #234

Conversation

gymreklab commented Aug 30, 2024 • edited Loading

Checklist

aryarm left a comment • edited Loading

Choose a reason for hiding this comment

gymreklab commented Sep 16, 2024

aryarm left a comment

Choose a reason for hiding this comment

gymreklab commented Aug 30, 2024 •

edited

Loading

aryarm left a comment •

edited

Loading