-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using cigar complexity to break ties in uninformative reads' best haplotypes #5359
Conversation
@ldgauthier Reminder that this is needed for the M2 paper evaluation. I forgot about this PR amid all the other MC3 stuff and would have reminded you earlier had I remembered. |
@@ -421,11 +422,12 @@ private void normalizeLikelihoodsPerRead(final double maximumBestAltLikelihoodDi | |||
* @param sampleIndex including sample index. | |||
* @param readIndex target read index. | |||
* | |||
* @param useReferenceIfUninformative | |||
* @param priorities An array of allele priorities (lower score is higher priority) to be used, if present, to break ties for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be having a dyslexic moment, but is lower score higher priority? If we prefer the reference and more complex haplotypes have more negative scores, then higher scores should have higher priorities, yes? Later in the code you update the best allele if the candidatePriority > bestPriority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. The code is right and the comment is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it.
Codecov Report
@@ Coverage Diff @@
## master #5359 +/- ##
===============================================
+ Coverage 86.903% 87.002% +0.099%
- Complexity 30311 31188 +877
===============================================
Files 1849 1908 +59
Lines 140507 144073 +3566
Branches 15475 15937 +462
===============================================
+ Hits 122105 125347 +3242
- Misses 12793 12965 +172
- Partials 5609 5761 +152
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks for the test change explanation and the painstaking effort of looking at changed likelihoods.
@ldgauthier this finishes what we started in #4858 and is necessary for the pileup-calls-on-bamouts MC3 validation. The cause is the same, in that Pair-HMM has a tiny bias in favor of shorter haplotypes and thus it prefers deletion haplotypes when reads end inside STRs. In #4858 we broke near-ties in favor of the reference; this PR fixes the case where two alt haplotypes share a SNV and one of them has a spurious deletion.
One important sanity check was that when I set
cigarTerm
to zero inAssemblyBasedCallerUtils.java
no tests broke. This means that the refactoring needed to set up the change didn't affect behavior.I looked at most of the sites where
PL
s and/orDP
s changed in the integration test vcfs and in every case the difference was from a fake deletion that this PR fixed. I also went through the diff of the bamouts in IGV and found the same thing.Finally, the changes to test vcfs in
GenotypeGVCFsIntegrationTest
andGenomicsDBImporterIntegrationTest
are a consequence of changes to theHaplotypeCallerIntegrationTest
vcfs.