Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap likelihoods symmetrically #6196

Merged
merged 2 commits into from
Oct 23, 2019
Merged

Conversation

ldgauthier
Copy link
Contributor

Big events or haplotypes with a lot of events can lead to very disparate likelihoods between the reference allele and the alt allele. We put a cap on how unlikely an alt read can be, but the behavior was not symmetric with respect to the reference. As a result, some sites with large indels or many SNPs could get genotyped as heterozygous with >90% alt alleles, where the one or two reference reads are likely derived from cross-sample contamination. This genotyping artifact is now fixed, though the cap is still set by the "global mismapping penalty", or the probability that a read is derived from another locus in the genome and not either of the haplotypes represented in this ActiveRegion.

@@ -206,7 +206,7 @@ public static CachingIndexedFastaSequenceFile createReferenceReader(final String
* @return never {@code null}.
*/
public static ReadLikelihoodCalculationEngine createLikelihoodCalculationEngine(final LikelihoodEngineArgumentCollection likelihoodArgs) {
final double log10GlobalReadMismappingRate = likelihoodArgs.phredScaledGlobalReadMismappingRate < 0 ? - Double.MAX_VALUE
final double log10GlobalReadMismappingRate = likelihoodArgs.phredScaledGlobalReadMismappingRate < 0 ? Double.NEGATIVE_INFINITY
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs say you can turn off the cap by setting this value negative, but it wasn't working before because subsequent methods were looking for -Inf.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to @ldgauthier

@@ -352,7 +352,8 @@ public void normalizeLikelihoods(final double maximumLikelihoodDifferenceCap) {
private void normalizeLikelihoodsPerEvidence(final double maximumBestAltLikelihoodDifference,
final double[][] sampleValues, final int sampleIndex, final int evidenceIndex) {

final BestAllele bestAlternativeAllele = searchBestAllele(sampleIndex,evidenceIndex,false);
//allow the best "alternative" allele to be the reference because asymmetry leads to strange artifacts like het calls with >90% alt reads
final BestAllele bestAlternativeAllele = searchBestAllele(sampleIndex,evidenceIndex,true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we then just call this bestAllele?

};
runCommandLine(argsNoMaxAlternateAlleles);

List<VariantContext> vcs = VariantContextTestUtils.readEntireVCFIntoMemory(outputContaminatedHomVarDeletions.getAbsolutePath()).getRight();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VariantContextTestUtils.getVariantContexts is all you need.


// Run both with and without --max-alternate-alleles over our interval, so that we can
// prove that the argument is working as intended.
final String[] argsNoMaxAlternateAlleles = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an ArgumentBuilder would be cleaner:

final ArgumentBuilder args = new ArgumentBuilder().addInput(bam)
   .addReference(hg38Reference)
   .addInterval(new SimpleInterval(intervals))
   .addOutput(outputContaminatedHomVarDeletions)
   .addNumericArgument(IntervalArgumentCollection.INTERVAL_PADDING_LONG_NAME, 50);
runCommandLine(args);


final File outputContaminatedHomVarDeletions = createTempFile("testContaminatedHomVarDeletions", ".vcf");

// Run both with and without --max-alternate-alleles over our interval, so that we can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it run "with"?

@ldgauthier
Copy link
Contributor Author

@davidbenjamin take another look?

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ldgauthier ldgauthier force-pushed the ldg_capLikelihoodsSymmetrically branch from 36caaa9 to c5b0ba6 Compare October 23, 2019 18:07
@ldgauthier ldgauthier merged commit b2e219b into master Oct 23, 2019
@ldgauthier ldgauthier deleted the ldg_capLikelihoodsSymmetrically branch December 4, 2019 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants