feat: Add options to facilitate debugging annotaTR runs on large files #234
It's interesting to see this get refined and closer and closer to production! I have just a few small comments which are scattered throughout. Thanks for asking me for a review!
One thing that wasn't immediately clear to me: how does `pgenlib.PgenWriter` handle `np.nan` values? Are those getting encoded properly as missing? And, if so, maybe it would be good to have a test that checks for that via `pgenlib.PgenReader`? I can help write such a test, if you'd like.
Co-authored-by: Arya Massarat <[email protected]>
Thanks for these comments @aryarm!
ok, sounds good! I think it just uses `.bcf` (not `.bcf.gz`), so we should be fine
I also added commit 692a65b to test that missing values are properly being written to the PGEN file output with --warn-on-AP-error
This PR introduces multiple options to annotaTR that help with debugging issues that arose when processing large files.
- `--region` to enable running annotaTR on a specified genomic region (`chr:start-end`). Helpful for debugging.
- `--update-ref-alt` to force annotaTR to copy over the ref/alt alleles from the reference panel. Helpful in cases where we have run `bcftools merge`, in which case allele sequences may be modified. In some cases this causes problems because the INFO/END field is not updated accordingly, which makes parsing HipSTR records fail (related to the offsets set here: https://github.com/gymrek-lab/TRTools/blob/master/trtools/utils/tr_harmonizer.py#L340). Required adding function `CheckAlleleCompatibility` to annotaTR.
- `--outtype gzvcf`. Useful when output files are huge and we will end up zipping them anyway.
- `--warn-on-AP-error`, which results in skipping loci where checks on AP fields fail. In these cases, rather than the program quitting, we output nan values for dosages.

Most of these invalid AP cases should still never happen. We have encountered rare cases where values sum to more than 1, likely due to rounding errors in cases with huge numbers of alleles. The main motivation is cases where we run annotaTR on huge VCF files, which takes many hours, only to encounter a bad AP field at the very end and crash, or where the vast majority of AP fields are fine but a few problematic loci cause the whole run to fail.

Other specific changes related to this option:

1. Added option `strict` to `GetDosages()`. This defaults to `true`, in which case we throw `ValueError` when one of the AP checks fails. If this is `false`, we output a warning and return all dosage values as `np.nan`.
2. Regardless of whether the `strict` option is set, added info to the error/warning messages about which locus was problematic, to help with tracking down those cases.

Finally, I added a couple of additional tests unrelated to these while trying to get good test coverage on `annotaTR.py`.
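To make the `strict` behavior concrete, here is a minimal sketch of the control flow described above. It is not the TRTools implementation: the function name `get_dosages_sketch`, its arguments, the input layout, and the dosage formula are all hypothetical, and the only check shown is the "values sum to more than 1" case mentioned in this PR. It returns `math.nan` where the real code is described as returning `np.nan`.

```python
import math
import warnings


def get_dosages_sketch(ap_values, locus, strict=True):
    """Hypothetical illustration of strict vs. non-strict AP checking.

    ap_values: one list of allele probabilities per sample (assumed layout).
    locus: identifier added to the error/warning message.
    """
    dosages = []
    for sample_idx, probs in enumerate(ap_values):
        total = sum(probs)
        # Example check only: probabilities for a sample must not sum to more than 1.
        if total > 1:
            msg = "Invalid AP field at locus %s (sample %d): sum=%s" % (
                locus, sample_idx, total)
            if strict:
                # Strict mode: stop immediately.
                raise ValueError(msg)
            # Non-strict mode: warn and return NaN for every sample at this locus.
            warnings.warn(msg)
            return [math.nan] * len(ap_values)
        # Placeholder dosage: allele index weighted by probability (illustrative only).
        dosages.append(sum(i * p for i, p in enumerate(probs, start=1)))
    return dosages
```

With `strict=False`, a single bad sample turns the whole locus into NaN dosages instead of ending the run, which matches the motivation of not losing hours of processing to one problematic record.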
Checklist
- [...] `fix:`. Otherwise, if it introduces a new feature, please prefix it with `feat:`. If it introduces a breaking change, please add an exclamation before the colon, like `feat!:`. If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
- [...] `poetry lock --no-update` to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions