Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add options to facilitate debugging annotaTR runs on large files #234

Merged
merged 31 commits into from
Sep 17, 2024

Conversation

gymreklab
Copy link
Contributor

@gymreklab gymreklab commented Aug 30, 2024

This PR introduces multiple options to annotaTR that help with debugging issues that arose when processing large files.

  1. Added option --region to enable running annotaTR on a specified genomic region (chr:start-end). Helpful for debugging.
  2. Added option --update-ref-alt to force annotaTR top copy over the ref/alt alleles from the reference panel. Helpful in cases where we have run bcftools merge in which case allele sequences may be modified. This in some cases causes problems since the INFO/END field is not updated accordingly, causing parsing HipSTR records to fail (related to the offsets set here: https://github.com/gymrek-lab/TRTools/blob/master/trtools/utils/tr_harmonizer.py#L340). Required adding function CheckAlleleCompatibility to annotaTR.
  3. Added ability to output to bgzipped VCF output with --outtype gzvcf. Useful when output files are huge and we will end up zipping them anyway.
  4. It adds option --warn-on-AP-error which results in skipping loci where checks on AP fields fail. In these cases, rather than the program quitting, we output nan values for dosages. In particular the checks this is relevant to are:
  • Checking if the AP1/2 fields exist
  • Checking if they sum to more than 1
  • Checking for negative values
  • Checking if normalized values end up being >=2.1 or <=-0.1

Most of these invalid AP cases should still never happen. We have encountered rare cases where values sum to more than 1, likely due to rounding errors in cases with huge numbers of alleles. The main motivation is in cases where we run annotaTR on huge VCF files which takes many hours only to encounter a bad AP field at the very end and crash, or when the vast majority of AP fields are fine but a few problematic loci cause the whole run to fail.

Other specific changes related to this option: (1) Added option strict to GetDosages(). This defaults to true, in which case we throw ValueError for the cases above. If this is false, we output a warning and return all dosage values as np.nan. (2) Regardless of whether the strict option is set, added info to the error/warning messages about which locus was problematic to help with tracking down those cases.

Finally, I added a couple additional tests unrelated to these when attempting to get good test coverage on annotaTR.py.

Checklist

  • [ x] I've checked to ensure there aren't already other open pull requests for the same update/change
  • [ x] I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
  • [ x] At the top of the PR, I've listed any open issues that this PR will resolve. For example, "resolves #0" if this PR resolves issue #0
  • [ x] I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them
  • [x ] I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
  • [ x] All directories with large test files are listed in the "exclude" section of our pyproject.toml so that they do not appear in our PyPI distribution. All new files are also smaller than 0.5 MB.
  • [ x] I've updated the relevant REAMDEs with any new usage information and checked that the newly built documentation is formatted properly
  • [ x] All functions, modules, classes etc. still conform to numpy docstring standards
  • [x ] (if applicable) I've updated the pyproject.toml file with any changes I've made to TRTools's dependencies, and I've run poetry lock --no-update to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions
  • [ x] In the body of this PR, I've included a short address to the reviewer highlighting one or two items that might deserve their focus

@gymreklab gymreklab marked this pull request as ready for review August 30, 2024 23:32
@gymreklab gymreklab requested a review from nicholema August 30, 2024 23:34
@gymreklab gymreklab marked this pull request as draft August 31, 2024 04:20
@gymreklab gymreklab changed the title feat: Add option to make annotaTR less strict on Beagle AP field checks feat: Add options to facilitate debugging annotaTR runs on large files Sep 9, 2024
@gymreklab gymreklab marked this pull request as ready for review September 9, 2024 04:18
@gymreklab gymreklab requested a review from aryarm September 9, 2024 04:18
Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting to see this get refined and closer and closer to production! I have just a few small comments which are scattered throughout. Thanks for asking me for a review!

One thing that wasn't immediately clear to me: How does pgenlib.PgenWriter handle np.nan values? Are those getting encoded properly as missing? And, if so, maybe it would be good to have a test that checks for that via pgenlib.PgenReader ? I can help write such a test, if you'd like.

trtools/annotaTR/annotaTR.py Outdated Show resolved Hide resolved
trtools/annotaTR/README.rst Outdated Show resolved Hide resolved
trtools/annotaTR/README.rst Outdated Show resolved Hide resolved
trtools/utils/tr_harmonizer.py Show resolved Hide resolved
trtools/utils/tr_harmonizer.py Show resolved Hide resolved
trtools/utils/tr_harmonizer.py Show resolved Hide resolved
trtools/annotaTR/tests/test_annotaTR.py Show resolved Hide resolved
@gymreklab
Copy link
Contributor Author

Thanks for these comments @aryarm!

  • I updated the VCF output options as suggested (I wasn't totally sure if compressed BCF files should still be .bcf or should they be .bcf.gz? )

  • Regarding setting all samples with bad AP fields to np.nan vs. only the failing samples: I had deliberately set them all to np.nan for now. This is because it happens very rarely (like 2 out of 50K loci which are actually removed from our cleaned up panel) and at what look to be problematic loci, and I would rather just exclude those entire loci from downstream analysis. I am also not sure what sort of biases get introduced if we remove only the failing samples. Would could always change in the future to only set the failing ones to np.nan but my preference is to keep it this way for now.

  • For testing the downstream pgen output with np.nan values: curious on ideas for the best way to do this. For now, I have added a test file to force this case (trtools/testsupport/sample_vcfs/beagle/beagle_badap.vcf.gz) (used in test/cmdline_tests.sh). Running plink2 --validate on this file doesn't return any problems. Are there other checks we should do?

Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good! I think it just uses .bcf (not .bcf.gz), so we should be fine

I also added commit 692a65b to test that missing values are properly being written to the PGEN file output with --warn-on-AP-error

trtools/annotaTR/README.rst Outdated Show resolved Hide resolved
@gymreklab gymreklab merged commit 0d5d96b into master Sep 17, 2024
13 checks passed
@gymreklab gymreklab deleted the annotatr-less-strict-2 branch September 17, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants