
Continue investigation of count collection strategies. #4551

Closed

mbabadi opened this issue Mar 21, 2018 · 10 comments
@mbabadi
Contributor

mbabadi commented Mar 21, 2018

Pursuing issue #4519, I have tried to evaluate the performance of CollectFragmentCounts (the current coverage collection tool for the somatic and germline CNV callers) on a synthetic dataset.

Methodology: SVGen was used to generate a set of random canonical SVs (deletion, tandem duplication, inversion, balanced translocation, unbalanced translocation) on chr22 of hg19, plus random SNPs at observed population frequencies. The SVs were applied to the reference to generate the SV genome. Paired-end reads w/ equal lengths (100bp) and insert sizes (500bp) were uniformly generated from the SV genome and mapped back to chr22 using BWA-MEM (default arguments). Coverage on uniform 100bp bins was collected using CollectFragmentCounts (default arguments: MQ > 30, both mates aligned, and only innies). The coverage was then studied case by case on a few SVs.
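
(For reference, here is a minimal Python/pysam sketch of the fragment-binning logic described above, i.e. MQ > 30, both mates mapped, "innies" only, 100bp bins. It is only an illustration of the filters, not the GATK implementation, and the BAM path and contig name are hypothetical.)

```python
# Minimal sketch of fragment counting with CollectFragmentCounts-like filters
# (MQ > 30, both mates mapped, "innies" only), binned at 100bp.
# Illustrative only -- not the GATK implementation.
import pysam

BIN_SIZE = 100
MIN_MQ = 30

def fragment_counts(bam_path, contig, contig_length):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    counts = [0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            # Count each fragment once, via its primary first-in-pair record.
            if not read.is_read1 or read.is_secondary or read.is_supplementary:
                continue
            if read.is_unmapped or read.mate_is_unmapped or read.is_duplicate:
                continue
            if read.mapping_quality <= MIN_MQ:
                continue
            if read.reference_id != read.next_reference_id:
                continue  # mates on different contigs
            # Keep only "innies" (FR): opposite strands, forward mate upstream.
            if read.is_reverse == read.mate_is_reverse:
                continue  # FF or RR pair -- discarded by this strategy
            fwd_start = read.next_reference_start if read.is_reverse else read.reference_start
            rev_start = read.reference_start if read.is_reverse else read.next_reference_start
            if fwd_start > rev_start:
                continue  # RF ("outie") pair
            if read.template_length == 0:
                continue
            # Assign the fragment to the bin containing its midpoint.
            start = min(read.reference_start, read.next_reference_start)
            midpoint = start + abs(read.template_length) // 2
            counts[min(midpoint // BIN_SIZE, n_bins - 1)] += 1
    return counts

# counts = fragment_counts("sv_genome_reads.bam", "chr22", 51_304_566)
```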

Case-by-case study:

Balanced translocation:
baltr-1-igv

Here, an event is shown where a ~3kb region of chr22 is translocated to another region. Ideally, there should be no coverage loss. The IGV inspection shows excess coverage on the left side and depletion on the right side. Upon inspecting the conjugate translocation site, a similar scenario is seen. This situation is hardly avoidable -- depending on the mappability of the two loci, one captures the chimeric fragment of the other with higher probability (right?). The situation is worse for CollectFragmentCounts because chimeras are ignored altogether:

baltr-1

Deletion:
For deletions, both coverage collection strategies work reasonably well, though neither captures the boundaries of the deleted region perfectly.

del-1-igv

del-1

Tandem Duplication:
For tandem duplications, neglecting FF and RR fragments leads to an underestimation of the size of the duplicated region by CollectFragmentCounts. IGV does not seem to get it quite right either (@cwhelan, does the IGV plot make sense to you? Could there be a bug in how SVGen generates tandem duplications?)
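
(To make the orientation filtering concrete, here is a small sketch of how a read pair's orientation can be classified from its flags; anything other than FR ("innie") is dropped by the default CollectFragmentCounts filters discussed above. Illustrative only -- not the GATK code.)

```python
# Sketch: classify a read pair's orientation (FR, RF, FF, RR) from its flags.
# `read` is a pysam.AlignedSegment, as in the sketch above.
def pair_orientation(read):
    """Return 'FR', 'RF', 'FF', or 'RR' for a mapped pair on one contig, else None."""
    if read.is_unmapped or read.mate_is_unmapped:
        return None
    if read.reference_id != read.next_reference_id:
        return None  # inter-chromosomal pair; handled separately
    if read.is_reverse == read.mate_is_reverse:
        return "RR" if read.is_reverse else "FF"
    # Opposite strands: FR if the forward-strand mate is the leftmost one.
    self_is_leftmost = read.reference_start <= read.next_reference_start
    self_is_forward = not read.is_reverse
    return "FR" if self_is_leftmost == self_is_forward else "RF"
```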

dup-1-igv

dup-1

Inversion:
IGV performs well, with very little coverage depletion at the boundaries. CollectFragmentCounts shows significant coverage depletion at the boundaries + random dropouts (why?).

inv-1-igv

inv-1

Here's another example in a less mappable region (the extra IGV track shows the GMA Illumina mappability track):

inv-2-igv

inv-2

Again, IGV does a much better job. In general, keeping only FR pairs seems to lead to noisy coverage, especially in low mappability regions.

Unbalanced Translocation:
A clear win for IGV, and a good reason for keeping split reads (notice the coverage loss on the left side of the event in CollectFragmentCounts).

unbtr-1-igv

unbtr-1

Another example, in a low-complexity region:

unbtr-2-igv

unbtr-2

No good strategy here -- it's better to blacklist such regions altogether for CNV calling.

Another win for IGV. I do not understand the reason for the coverage dropouts in CollectFragmentCounts. It might have something to do with the SNPs (though all reads have high MQ).

unbtr-3-igv

unbtr-3

Summary and Conclusion: We should not filter out FF and RR pairs; rather, we should count them as individual reads. IGV's strategy (base coverage from individual reads, with no filtering except for mismatching bases and clipped regions) conforms significantly better to the expected coverage than CollectFragmentCounts does. We should consider reviving CollectBaseCallCoverage from the GATK4 beta and evaluating it.
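
(For reference, a minimal pysam sketch of a base-level strategy along the lines described above: every mapped read contributes one count per aligned, non-mismatching base, while clipped and inserted bases are skipped because they have no reference position. It assumes the BAM carries MD tags so the reference base can be reconstructed; this is only an illustration, not IGV's or any GATK tool's implementation.)

```python
# Sketch of an IGV-like base-level coverage collection: each mapped read
# contributes one count per aligned base; mismatching bases are skipped
# (lowercase reference base, reconstructed from the MD tag), and clipped or
# inserted bases are skipped because they have no reference position.
import pysam

BIN_SIZE = 100

def base_call_coverage(bam_path, contig, contig_length):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    coverage = [0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped or read.is_secondary or read.is_duplicate:
                continue
            # with_seq=True requires the MD tag; substituted positions come
            # back as lowercase reference characters.
            for qpos, rpos, ref_base in read.get_aligned_pairs(with_seq=True):
                if qpos is None or rpos is None or ref_base is None:
                    continue  # clipped base, insertion, or deletion
                if ref_base.islower():
                    continue  # mismatching base
                coverage[rpos // BIN_SIZE] += 1
    return coverage
```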

I suggest addressing this as a priority issue before we consider g/s-CNV for production use, or coverage collection on any large dataset.

@mbabadi
Contributor Author

mbabadi commented Mar 21, 2018

@LeeTL1220 @ldgauthier @cwhelan @mwalker174 @SHuang-Broad @yfarjoun might find this analysis of interest.

@mwalker174
Contributor

Very interesting! @mbabadi What were your settings in IGV (under View->Prefs->Alignments) for the MQ threshold and for filtering secondary/supplementary alignments? Could MQ maybe explain the random dropouts in the inversion cases?

@samuelklee
Contributor

samuelklee commented Mar 21, 2018

Interesting! Thanks for generating these. I am already convinced by #4519 that we should at least switch over to a ‘CollectReadCounts’ strategy for initial evaluations. A few comments:

-I’m guessing that the equal insert size and uniform sampling are enhancing many of these artifacts to a level that we probably don’t see in the real world. Can we take a look at some real-world examples?

-Same goes for the fact that homs will be unlikely.

-Not sure about the dropouts. Might be worth running without SNPs as a confounding factor.

-How flexible is SVGen? Might be worth putting together a more realistic simulated data set. Any chance @MartonKN might be able to use it to cook up some realistic tumor data?

-I don’t recall having a CollectBaseCallCoverage type tool in beta—which tool are you thinking of? On a related note, it seems there is some demand to port DepthOfCoverage from GATK3. However, I’d prefer that we roll a CNV-specific version of the tool even if it does get ported.

In any case, I think along with findings from the other issue, we should issue a quick PR for CollectReadCounts and go ahead and change the CollectCounts WDL task to call it—it’s for this very reason that the task is named generically! @sooheelee note that we may have to update the tutorials, etc. at some point, but perhaps the right time will be once all evaluations are more complete.

Speaking of which, this PR should not delay getting the first round of automated evaluations up and running. Again, the whole point of those is to have a reproducible baseline metric against which we can easily experiment with and adopt these sorts of changes. Although these sorts of theoretical/simulated/thought experiments are clearly useful to us, unfortunately, they may not be as compelling to some of our users as demonstrable improvement on real data!

@mbabadi
Contributor Author

mbabadi commented Mar 21, 2018

@samuelklee Valentin wrote it:
https://github.com/broadinstitute/gatk-protected/blob/1.0.0.0-alpha1.2/src/main/java/org/broadinstitute/hellbender/tools/exome/CalculateTargetBaseCallCoverage.java
We dropped it once we got rid of the XHMM port.

While automated evaluations are definitely the way to go (eventually), there is real value in understanding the interaction between BWA and our coverage collection strategy in a controlled setting.

Regarding automatic evaluations -- let's talk in person when you get back.

@samuelklee
Contributor

Also, just to provide some context to all tagged: certain users of the old CNV pipeline expressed somewhat vague concerns with the non-fragment-based coverage collection strategies (which also differed across WES and WGS, to boot), but didn’t offer any compelling demonstrations that fragment-based strategies were better.

For the new version of the pipelines, the main priority was to pick a single strategy to unify WES/WGS coverage collection. We decided to give a simple fragment-based strategy a shot, with the intention of using automated evaluations to test it in a rigorous manner. Although those aren’t in place yet, I’m comfortable with making the call against it at this point.

@samuelklee
Contributor

samuelklee commented Mar 21, 2018

Ah, that’s right, @mbabadi, thanks for reminding me. We can always quickly reimplement such a strategy as a FeatureWalker (preferably in a dev branch) and add it to the bake-off.

Tasks aimed towards getting the first round of automatic evaluations in place comprise the majority of our goals this quarter, so let’s get them done. Once they’re in place, we’ll be able to experiment more freely and confidently!

@samuelklee
Contributor

One more question: did you examine events shorter than the fragment size?

@mbabadi
Contributor Author

mbabadi commented Mar 23, 2018

Update:

Here's what the coverage looks like using CollectReadCounts (w/ and w/o the MQ > 30 filter) vs. CollectFragmentCounts. The lines are offset by +10 and +20 for better visibility.

Summary: a marked improvement in all cases; however, the error modes are different. CollectFragmentCounts tends to underestimate the size of SV regions and uniformly leads to coverage depletion near the breakpoints, whereas CollectReadCounts estimates the size of SV regions better but its coverage near the breakpoints tends to be less predictable (sometimes depletion, sometimes accumulation). Still, IGV seems to do the best job.
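
(For clarity, here is a minimal pysam sketch of what a read-counting strategy boils down to: each primary-aligned read is assigned to the bin containing its alignment start, with an optional MQ threshold. Only an illustration, not the actual CollectReadCounts code; the BAM path and contig name are hypothetical.)

```python
# Sketch of a read-counting strategy: each primary-aligned read contributes
# one count to the bin containing its alignment start, with an optional MQ
# threshold. Illustrative only -- not the CollectReadCounts implementation.
import pysam

BIN_SIZE = 100

def read_counts(bam_path, contig, contig_length, min_mq=None):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    counts = [0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if (read.is_unmapped or read.is_secondary
                    or read.is_supplementary or read.is_duplicate):
                continue
            if min_mq is not None and read.mapping_quality <= min_mq:
                continue
            counts[read.reference_start // BIN_SIZE] += 1
    return counts

# with_mq_filter    = read_counts("sv_genome_reads.bam", "chr22", 51_304_566, min_mq=30)
# without_mq_filter = read_counts("sv_genome_reads.bam", "chr22", 51_304_566)
```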

Any improvement over CollectReadCounts requires using supplementary alignment information (e.g. weight sharing among supplementary alignments; this would likely fix the coverage asymmetry at translocation breakpoints), read clipping information, and mismatches. The latter two require a base-level coverage collection strategy (like IGV and CalculateTargetBaseCallCoverage).
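
(One possible form of the weight-sharing idea, sketched below: each read record, primary or supplementary, contributes 1 / (number of alignments listed for that read) at its own position, so a chimeric read straddling a breakpoint splits its count between the loci it touches. Purely illustrative; the SA-tag handling is simplified and this is not an existing GATK behavior.)

```python
# Sketch of weight sharing among supplementary alignments: a read with N
# additional alignment records listed in its SA tag contributes 1/(N+1)
# at each of its alignment positions instead of a full count at the
# primary one. Illustrative only.
import pysam

BIN_SIZE = 100

def shared_weight_counts(bam_path, contig, contig_length):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    counts = [0.0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped or read.is_secondary or read.is_duplicate:
                continue
            # The SA tag lists the read's other (chimeric) alignment records
            # as semicolon-terminated entries.
            n_alignments = 1
            if read.has_tag("SA"):
                n_alignments += len(read.get_tag("SA").rstrip(";").split(";"))
            counts[read.reference_start // BIN_SIZE] += 1.0 / n_alignments
    return counts
```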

Unbalanced translocation:

unbtr-1

unbtr-2

unbtr-3

Balanced translocation:

baltr-1

Inversion:

inv-1

inv-2

Deletion:

del-1

Tandem Duplication:

dup-1

@mbabadi
Contributor Author

mbabadi commented Mar 23, 2018

Also, MQ filtering results in stochastic coverage dropout. It is likely that low-MQ regions significantly overlap across samples, in which case the downstream CNV tools can learn such biases and correct the coverage. Will test this in validations.

@sooheelee
Contributor

@sooheelee note that we may have to update the tutorials, etc. at some point, but perhaps the right time will be once all evaluations are more complete.

Thanks, @samuelklee for keeping me updated.

@samuelklee samuelklee changed the title CollectFragmentCounts performs poorly on a synthetic SV callset Continue investigation of count collection strategies. Jan 31, 2019
@samuelklee samuelklee removed their assignment Mar 31, 2020