Continue investigation of count collection strategies. #4551
Comments
@LeeTL1220 @ldgauthier @cwhelan @mwalker174 @SHuang-Broad @yfarjoun might find this analysis of interest.
Very interesting! @mbabadi What were your settings in IGV (under View->Prefs->Alignments) for the MQ threshold and for filtering secondary/supplementary alignments? Could MQ maybe explain the random dropouts in the inversion case?
Interesting! Thanks for generating these. I am already convinced by #4519 that we should at least switch over to a `CollectReadCounts` strategy for initial evaluations. A few comments:

- I'm guessing that the equal insert size and uniform sampling are enhancing many of these artifacts to a level that we probably don't see in the real world. Can we take a look at some real-world examples?
- Same goes for the fact that homs will be unlikely.
- Not sure about the dropouts. Might be worth running without SNPs as a confounding factor.
- How flexible is SVGen? Might be worth putting together a more realistic simulated data set. Any chance @MartonKN might be able to use it to cook up some realistic tumor data?
- I don't recall having a

In any case, I think along with findings from the other issue, we should issue a quick PR for

Speaking of which, this PR should not delay getting the first round of automated evaluations up and running. Again, the whole point of those is to have a reproducible baseline metric against which we can easily experiment with and adopt these sorts of changes. Although these sorts of theoretical/simulated/thought experiments are clearly useful to us, unfortunately, they may not be as compelling to some of our users as demonstrable improvement on real data!
@samuelklee Valentin wrote it. While automated evaluations are definitely the way to go (eventually), there is real value in understanding the interaction between BWA and our coverage collection strategy in a controlled setting. Regarding automatic evaluations -- let's talk in person when you get back.
Also, just to provide some context to all tagged: certain users of the old CNV pipeline expressed somewhat vague concerns with the non-fragment-based coverage collection strategies (which also differed across WES and WGS, to boot), but didn't offer any compelling demonstrations that fragment-based strategies were better. For the new version of the pipelines, the main priority was to pick a single strategy to unify WES/WGS coverage collection. We decided to give a simple fragment-based strategy a shot, with the intention of using automated evaluations to test it in a rigorous manner. Although those aren't in place yet, I'm comfortable with making the call against it at this point.
Ah, that's right, @mbabadi, thanks for reminding me. We can always quickly reimplement such a strategy as a

Tasks aimed towards getting the first round of automatic evaluations in place comprise the majority of our goals this quarter, so let's get them done. Once they're in place, we'll be able to experiment more freely and confidently!
One more question: did you examine events shorter than the fragment size?
Update: Here's how the coverage looks using

Summary: marked improvement in all cases; however, the error modes are different. Any improvement over

(Figures: unbalanced translocation, balanced translocation, inversion, deletion, tandem duplication.)
Also, MQ filtering results in stochastic coverage dropout. It is likely that low-MQ regions significantly overlap across samples, in which case the downstream CNV callers can learn such biases and correct the coverage. Will test this in validations.
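One quick way to make the cross-sample claim testable ahead of the validations is to compute the per-bin fraction of low-MQ reads for a few samples and check how strongly those fractions correlate; if low-MQ regions largely coincide across samples, the dropout looks like a shared, learnable bias. A rough sketch (the paths, bin size, MQ threshold, and contig name/length are assumptions for illustration):

```python
# Sketch: per-bin fraction of low-MQ reads for a few samples, to check whether
# low-MQ regions coincide across samples. Paths, bin size, MQ threshold, and
# the contig name/length (hg19 chr22) are assumptions for illustration.
import numpy as np
import pysam

BIN_SIZE = 1000
MQ_THRESHOLD = 30

def low_mq_fraction(bam_path, contig, contig_length):
    n_bins = contig_length // BIN_SIZE + 1
    low, total = np.zeros(n_bins), np.zeros(n_bins)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            b = read.reference_start // BIN_SIZE
            total[b] += 1
            if read.mapping_quality <= MQ_THRESHOLD:
                low[b] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total > 0, low / total, np.nan)

# Correlation of low-MQ fractions between two samples over bins covered in both:
f1 = low_mq_fraction("sample1.bam", "chr22", 51_304_566)
f2 = low_mq_fraction("sample2.bam", "chr22", 51_304_566)
mask = ~np.isnan(f1) & ~np.isnan(f2)
print(np.corrcoef(f1[mask], f2[mask])[0, 1])
```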
Thanks, @samuelklee, for keeping me updated.
Pursuing issue #4519, I have tried to evaluate the performance of `CollectFragmentCounts` (the current coverage collection tool for the somatic and germline CNV callers) on a synthetic dataset.

Methodology: SVGen was used to generate a set of random canonical SVs (deletion, tandem duplication, inversion, balanced translocation, unbalanced translocation) on chr22 of hg19, plus random SNPs at observed population frequencies. The SVs were applied to the reference to generate the SV genome. Paired-end reads with equal lengths (100bp) and insert sizes (500bp) were uniformly sampled from the SV genome and mapped to chr22 using BWA-MEM (default arguments). Coverage on 100bp uniform bins was collected using `CollectFragmentCounts` (default arguments: MQ > 30, both mates aligned, and only innies). The coverage was studied case-by-case on a few SVs.
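For concreteness, here is a rough pysam sketch of the kind of fragment-level filtering described above (MQ > 30, both mates mapped, FR/"innie" orientation only, one count per fragment). It is only an illustration of the filtering rules, not the actual `CollectFragmentCounts` implementation, and the function and constant names are mine:

```python
# Sketch of fragment-based counting in the spirit of CollectFragmentCounts:
# keep only high-MQ, FR-oriented ("innie") pairs and count each fragment once
# at its center. Illustrative only; not the GATK implementation.
import pysam

MIN_MQ = 30       # "MQ > 30" per the description above
BIN_SIZE = 100    # 100bp uniform bins, as in the experiment

def is_innie(read):
    """FR orientation: the forward-strand read lies upstream of its reverse-strand mate."""
    if read.is_reverse == read.mate_is_reverse:
        return False  # FF or RR pair
    # For a proper FR pair, TLEN is positive on the forward read and negative on the reverse read.
    return read.template_length > 0 if not read.is_reverse else read.template_length < 0

def fragment_bin_counts(bam_path, contig, contig_length):
    """Count one event per FR fragment, assigned to the bin containing the fragment center."""
    counts = [0] * (contig_length // BIN_SIZE + 1)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped or read.mate_is_unmapped:
                continue  # both mates must be aligned
            if read.is_secondary or read.is_supplementary or read.is_duplicate:
                continue  # chimeric/supplementary alignments are ignored altogether
            if read.mapping_quality <= MIN_MQ:
                continue
            if not read.is_read1:
                continue  # count each fragment once, via read1
            if read.next_reference_id != read.reference_id or not is_innie(read):
                continue  # drop cross-contig pairs and FF/RR/outie pairs
            start = min(read.reference_start, read.next_reference_start)
            center = start + abs(read.template_length) // 2
            counts[center // BIN_SIZE] += 1
    return counts
```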
Case-by-case study:

Balanced translocation:
Here, an event is shown where a ~3kb region of chr22 is translocated to another region. Ideally, there should be no coverage loss. The IGV inspection shows excess coverage on the left side and depletion on the right side. Upon inspecting the conjugate translocation site, a similar scenario is seen. This situation is hardly avoidable -- depending on the mappability of the two loci, one captures the chimeric fragment of the other with higher probability (right?). The situation is worse for `CollectFragmentCounts` because chimeras are ignored altogether.

Deletion:
For deletions, both coverage collection strategies work well, though the deletion region is not captured perfectly by either method.
Tandem Duplication:
For tandem duplications, neglecting FF and RR fragments leads to an underestimation of the size of the duplicated region by `CollectFragmentCounts`. IGV does not seem to get it quite right either (@cwhelan does the IGV plot make sense to you? could it be that there's a bug in SVGen in generating tandem duplications?).
Inversion:
IGV performs well, with very little coverage depletion at the boundaries. `CollectFragmentCounts` shows significant coverage depletion at the boundaries, plus random dropouts (why?). Here's another example in a less mappable region (the IGV track shows the GMA Illumina mappability track):
Again, IGV does a much better job. In general, keeping only FR pairs seems to lead to noisy coverage, especially in low mappability regions.
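As a side note on why a strict orientation filter erodes coverage near these events: pairs that straddle an inversion breakpoint map as FF or RR, and pairs that straddle a tandem-duplication junction map as RF ("outies"), so an FR-only filter discards exactly the junction-spanning fragments. A small pysam sketch (illustrative only; the helper names are mine) that classifies pair orientation directly from the SAM flags:

```python
# Pairs spanning an inversion breakpoint map as FF or RR; pairs spanning a
# tandem-duplication junction map as RF ("outies"). An FR-only ("innie")
# filter therefore discards exactly the junction-spanning fragments,
# depleting coverage near the breakpoints. Illustrative sketch, not GATK code.
import pysam

def pair_orientation(read):
    """Return 'FR', 'RF', 'FF', or 'RR' for a mapped, paired read on one contig."""
    if read.is_reverse == read.mate_is_reverse:
        return "RR" if read.is_reverse else "FF"
    # Strands differ: FR if the forward-strand mate is the leftmost one, else RF ("outie").
    forward_start = read.next_reference_start if read.is_reverse else read.reference_start
    reverse_start = read.reference_start if read.is_reverse else read.next_reference_start
    return "FR" if forward_start <= reverse_start else "RF"

def orientation_histogram(bam_path, contig, start, end):
    """Tally pair orientations over a window, e.g. around a putative breakpoint."""
    tally = {"FR": 0, "RF": 0, "FF": 0, "RR": 0}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, end):
            if not read.is_paired or read.is_unmapped or read.mate_is_unmapped:
                continue
            if read.is_secondary or read.is_supplementary:
                continue
            if read.reference_id != read.next_reference_id:
                continue  # inter-chromosomal pairs would need separate handling
            if not read.is_read1:
                continue  # tally each pair once
            tally[pair_orientation(read)] += 1
    return tally
```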
Unbalanced Translocation:
A clear win for IGV, and a good reason for keeping split reads (notice the coverage loss on the left side of the event in `CollectFragmentCounts`).

Here is an example in a low complexity region:
No good strategy here -- it's better to blacklist such regions altogether for CNV calling.
Another win for IGV. I do not understand the reason for the coverage dropouts in `CollectFragmentCounts`. It might have something to do with the SNPs (though all reads have high MQ).

Summary and Conclusion: We should not filter out FF and RR pairs; rather, we should consider them as individual reads. IGV's strategy (base coverage on individual reads, no filtering except for mismatching bases and clipped regions) conforms significantly better to the expected coverage than `CollectFragmentCounts`. We should consider reviving `CollectBaseCallCoverage` from GATK4 beta and evaluating it.

I suggest addressing this as a priority issue before we consider g/s-CNV for production use, or coverage collection on any large dataset.
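For what it's worth, here is a rough pysam sketch of the per-base counting rule being proposed (count individual reads with no pairing/orientation requirements, skipping clipped segments and mismatching bases). This is not the old `CollectBaseCallCoverage` code, just an illustration of the rule, and it assumes a reference FASTA is available for the mismatch check:

```python
# Per-base coverage in the spirit of IGV / the proposed base-call strategy:
# count individual reads (no FR-pair requirement), skip soft/hard-clipped
# segments and indels, and drop bases that mismatch the reference.
# Illustrative sketch only; names and parameters are assumptions.
import pysam

def base_call_coverage(bam_path, fasta_path, contig, start, end, min_mq=0):
    with pysam.FastaFile(fasta_path) as fasta:
        ref = fasta.fetch(contig, start, end).upper()
    coverage = [0] * (end - start)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, end):
            # Supplementary (split-read) alignments are deliberately kept here,
            # per the unbalanced-translocation discussion above.
            if read.is_unmapped or read.is_secondary or read.is_duplicate:
                continue
            if read.mapping_quality < min_mq:
                continue
            seq = read.query_sequence
            if seq is None:
                continue
            # matches_only=True yields only positions where a read base is aligned
            # to a reference base, i.e. clipped bases and indels are skipped.
            for q_pos, r_pos in read.get_aligned_pairs(matches_only=True):
                if start <= r_pos < end and seq[q_pos].upper() == ref[r_pos - start]:
                    coverage[r_pos - start] += 1
    return coverage
```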