feat: Add option to make annotaTR less strict on Beagle AP field checks #233

Closed
wants to merge 1 commit into from

gymreklab
Contributor

The main change is a new option, --warn-on-AP-error, which makes annotaTR skip loci where checks on the Beagle AP fields fail. Instead of quitting at such a locus, the program outputs nan values for its dosages. The relevant checks are:

  • Checking if the AP1/2 fields exist
  • Checking if they sum to more than 1
  • Checking for negative values
  • Checking if normalized values end up being >=2.1 or <=-0.1

Most of these should still never happen. We have encountered cases where values sum to more than 1, likely due to rounding errors in cases with huge numbers of alleles.
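As a rough illustration of how rounding alone could push a sum above 1, here is a toy case. The six-allele haplotype and the two-decimal precision are assumptions made for the example, not a description of how Beagle actually writes AP values.

```python
# Hypothetical haplotype: six equally likely alternate alleles
# (reference allele probability taken to be 0 for simplicity).
true_probs = [1 / 6] * 6

# Assumption: each value is stored with two decimal places.
rounded = [round(p, 2) for p in true_probs]  # each 1/6 becomes 0.17
total = sum(rounded)

# The true probabilities sum to 1, but the rounded ones overshoot.
exceeds_one = total > 1
print(rounded, exceeds_one)
```

With more alleles there are more values to round, so more chances for the small overshoots to add up.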

This is a somewhat dangerous flag and its use should not be encouraged. The main motivation is runs of annotaTR on huge VCF files that take many hours, only to hit a bad AP field at the very end and crash, or runs where the vast majority of AP fields are fine but a few problematic loci cause the whole run to fail.

Other specific changes:

  1. Added a strict option to GetDosages(). It defaults to true, in which case we raise ValueError for the cases above. If it is false, we output a warning and return all dosage values as np.nan.
  2. Regardless of whether the strict option is set, added the problematic locus to the error/warning messages to help with tracking down those cases.
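For reviewers, the strict/non-strict behaviour described above can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: the function name `dosages_sketch`, its parameters, the message wording, and the length-based normalization formula are all assumptions made for illustration.

```python
import warnings
import numpy as np


def dosages_sketch(ap1, ap2, allele_lengths, n_samples, locus_id, strict=True):
    """Illustrative only; names and dosage formula are assumptions, not TRTools code.

    ap1, ap2: per-haplotype alt-allele probabilities, shape (n_samples, n_alt),
              or None when the field is absent.
    allele_lengths: reference length followed by each alternate allele length.
    """
    def fail(reason):
        # Always name the locus; raise when strict, otherwise warn and return NaNs.
        msg = f"{reason} at locus {locus_id}"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)
        return np.full(n_samples, np.nan)

    if ap1 is None or ap2 is None:
        return fail("AP1/AP2 field missing")
    ap1 = np.asarray(ap1, dtype=float)
    ap2 = np.asarray(ap2, dtype=float)
    if (ap1 < 0).any() or (ap2 < 0).any():
        return fail("Negative AP value")
    if (ap1.sum(axis=1) > 1).any() or (ap2.sum(axis=1) > 1).any():
        return fail("AP values sum to more than 1")

    lengths = np.asarray(allele_lengths, dtype=float)
    span = lengths.max() - lengths.min()
    if span == 0:
        # All alleles have the same length; nothing to normalize.
        return np.zeros(n_samples)

    def hap_dosage(ap):
        # Expected allele length on one haplotype, scaled to [0, 1].
        ref_prob = 1 - ap.sum(axis=1)
        expected_len = ref_prob * lengths[0] + ap @ lengths[1:]
        return (expected_len - lengths.min()) / span

    dosages = hap_dosage(ap1) + hap_dosage(ap2)
    if (dosages >= 2.1).any() or (dosages <= -0.1).any():
        return fail("Normalized dosage out of range")
    return dosages
```

Calling it with `strict=False` on a locus whose AP values sum to more than 1 emits a warning naming the locus and returns an array of `np.nan`, one per sample; with the default `strict=True` the same input raises `ValueError`.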

Checklist

  • [x] I've checked to ensure there aren't already other open pull requests for the same update/change
  • [x] I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
  • [x] At the top of the PR, I've listed any open issues that this PR will resolve. For example, "resolves #0" if this PR resolves issue #0
  • [x] I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them
  • [x] I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
  • [x] All directories with large test files are listed in the "exclude" section of our pyproject.toml so that they do not appear in our PyPI distribution. All new files are also smaller than 0.5 MB.
  • [x] I've updated the relevant READMEs with any new usage information and checked that the newly built documentation is formatted properly
  • [x] All functions, modules, classes etc. still conform to numpy docstring standards
  • [x] (if applicable) I've updated the pyproject.toml file with any changes I've made to TRTools's dependencies, and I've run poetry lock --no-update to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions
  • [x] In the body of this PR, I've included a short address to the reviewer highlighting one or two items that might deserve their focus

@gymreklab gymreklab changed the base branch from master to add-longtr-support August 30, 2024 23:21
@gymreklab gymreklab closed this Aug 30, 2024