Middle ground between --dont-trim-active-regions true and --dont-trim-active-regions false #5791
Feature request
Tool(s) or class(es) involved
HaplotypeCaller (latest version from master as of 3/12/2019)
Description
I'm running into situations where the HaplotypeCaller does the wrong thing when genotyping, when it is allowed to trim active regions. A good example is a fairly high frequency variant at HG19:chr11 -6411935 aka HG38:chr11:6390705. I'm looking at this data in the 24 samples that Broad did deep PCR-free 2x250 WGS sequencing on for the 1000G project. In these samples there are a pair of variants at that locus: one is a SNP and the other is a 6bp deletion. It's right on the boundary of a simple sequence repeat.
What appears to happen is that when the active region is trimmed, sometimes the deletion allele is lost. From looking at the gVCFs created and also at the output with --debug, it's clear the allele is discovered, but then when genotyping is done on the trimmed active region the allele disappears. Here's an example pair of calls where the only difference in HC invocation was that the first one used --dont-trim-active-regions true:
and the second one didn't:
Note how in the second case, there are two alts in the gVCF, but only one of them has depth!
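
For anyone wanting to scan their own gVCFs for this symptom, here is a minimal sketch of a check for ALT alleles that are listed in a record but have zero allelic depth. It assumes a standard VCF line with an AD entry in the FORMAT column; the function name `alts_without_depth` is made up for illustration, and the record below is synthetic (invented position, alleles, and depths), not one of the calls from this issue.

```python
def alts_without_depth(vcf_line, sample_index=0):
    """Return ALT alleles (excluding <NON_REF>) whose AD value is 0.

    Hypothetical helper: expects one tab-separated VCF data line with
    FORMAT and sample columns present.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")

    if "AD" not in format_keys:
        return []
    ad_index = format_keys.index("AD")
    if ad_index >= len(sample_values):
        # Trailing FORMAT fields may be dropped from the sample column.
        return []

    # AD lists the reference depth first, then one depth per ALT allele.
    depths = sample_values[ad_index].split(",")
    missing = []
    for alt, depth in zip(alts, depths[1:]):
        if alt == "<NON_REF>":
            continue  # the gVCF placeholder allele is expected to be empty
        if depth != "." and int(depth) == 0:
            missing.append(alt)
    return missing


# Synthetic record (NOT real data): a 6bp deletion and a SNP at one site.
record = "\t".join([
    "chr11", "100", ".", "CAAAAAA", "C,TAAAAAA,<NON_REF>", ".", ".", ".",
    "GT:AD:DP", "0/2:20,0,18,0:38",
])
suspicious = alts_without_depth(record)
```

On the synthetic record, `suspicious` holds the deletion allele, since it appears in ALT but has no supporting depth.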
The only way to recover these cases is to run with --dont-trim-active-regions, but that makes the HC run approximately 5 times slower, which is obviously not ideal.
What I'd like to suggest is that the HC have some automated way to detect when this kind of error is likely to happen or has happened, and work around it. My suggestion(s) would be:
I think this really only happens in repetitive regions. I wonder if it would be possible to have the HC automatically trim active regions when assembly at kmer size 10 works, and disable it when it has to escalate to a higher kmer size?
Trim the active region, but also retain the untrimmed active region. Genotype using the trimmed region. If any allele receives count=0, re-genotype using the untrimmed region.
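
To make the two suggestions concrete, here is a rough sketch of the control flow in Python. Nothing here is GATK code: `should_trim`, `genotype_with_fallback`, and the `trim` and `genotype` callables are hypothetical stand-ins, and the sketch assumes that genotyping reports a count for every discovered allele, including ones that end up with zero.

```python
def should_trim(kmer_size_used, base_kmer_size=10):
    """Suggestion 1: trim only if assembly succeeded at the base kmer size.

    kmer_size_used is assumed to be the kmer size at which assembly
    finally worked; anything above the base size disables trimming.
    """
    return kmer_size_used <= base_kmer_size


def genotype_with_fallback(region, trim, genotype):
    """Suggestion 2: genotype the trimmed region, retry untrimmed on loss.

    trim(region) -> trimmed region (hypothetical callable)
    genotype(region) -> dict mapping allele -> supporting count
    """
    counts = genotype(trim(region))
    if any(count == 0 for count in counts.values()):
        # At least one discovered allele lost all support after trimming,
        # so pay the extra cost and genotype the original region instead.
        counts = genotype(region)
    return counts
```

The first function decides up front and never pays for a second genotyping pass; the second always genotypes the trimmed region and only falls back to the untrimmed one when an allele comes back empty.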
My thinking here is that not trimming the active regions really only makes a difference at a small fraction of sites, on the order of 1/1000, but to rescue those sites we have to pay a 5x performance penalty at every site. It would be great if trimming could be auto-disabled at only those sites that are problematic, so we could have our cake and eat it too.
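
As a back-of-the-envelope check on that trade-off, the sketch below estimates runtime relative to always trimming. It takes the rough figures above (about 1/1000 of sites affected, about 5x slowdown without trimming) and adds an assumption of its own: that cost is uniform across sites, which may not hold if the problematic repetitive regions are also the expensive ones. The function name `relative_cost` is hypothetical.

```python
def relative_cost(problem_fraction, untrimmed_slowdown, retry=False):
    """Estimated runtime relative to always trimming (which costs 1.0).

    retry=False: trimming is skipped up front at problematic sites only.
    retry=True:  every site is genotyped trimmed first, and problematic
                 sites are then genotyped again untrimmed.
    """
    if retry:
        return 1.0 + problem_fraction * untrimmed_slowdown
    return (1.0 - problem_fraction) + problem_fraction * untrimmed_slowdown


skip_up_front = relative_cost(1 / 1000, 5.0)
retry_on_loss = relative_cost(1 / 1000, 5.0, retry=True)
never_trim = 5.0  # the current workaround: untrimmed everywhere
```

Under these assumptions both selective strategies cost well under 1% more than always trimming, compared with the 5x cost of disabling trimming everywhere.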