
Continue investigation of count collection strategies. #4551

Closed

mbabadi opened this issue Mar 21, 2018 · 10 comments
@mbabadi
Contributor

mbabadi commented Mar 21, 2018

Pursuing issue #4519, I have tried to evaluate the performance of CollectFragmentCounts (the current coverage collection tool for the somatic and germline CNV callers) on a synthetic dataset.

Methodology: SVGen was used to generate a set of random canonical SVs (deletion, tandem duplication, inversion, balanced translocation, unbalanced translocation) on chr22 of hg19, plus random SNPs at observed population frequencies. The SVs were applied to the reference to generate the SV genome. Paired-end reads w/ equal lengths (100bp) and insert sizes (500bp) were uniformly generated from the SV genome and mapped back to chr22 using BWA-MEM (default arguments). Coverage on uniform 100bp bins was collected using CollectFragmentCounts (default arguments: MQ > 30, both mates aligned, and only innies). The coverage was then studied case by case on a few SVs.
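
(For reference, here is a minimal Python/pysam sketch of the fragment-binning logic described above, i.e. MQ > 30, both mates mapped, "innies" only, 100bp bins. It is only an illustration of the filters, not the GATK implementation, and the BAM path and contig name are hypothetical.)

```python
# Minimal sketch of fragment counting with CollectFragmentCounts-like filters
# (MQ > 30, both mates mapped, "innies" only), binned at 100bp.
# Illustrative only -- not the GATK implementation.
import pysam

BIN_SIZE = 100
MIN_MQ = 30

def fragment_counts(bam_path, contig, contig_length):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    counts = [0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            # Count each fragment once, via its primary first-in-pair record.
            if not read.is_read1 or read.is_secondary or read.is_supplementary:
                continue
            if read.is_unmapped or read.mate_is_unmapped or read.is_duplicate:
                continue
            if read.mapping_quality <= MIN_MQ:
                continue
            if read.reference_id != read.next_reference_id:
                continue  # mates on different contigs
            # Keep only "innies" (FR): opposite strands, forward mate upstream.
            if read.is_reverse == read.mate_is_reverse:
                continue  # FF or RR pair -- discarded by this strategy
            fwd_start = read.next_reference_start if read.is_reverse else read.reference_start
            rev_start = read.reference_start if read.is_reverse else read.next_reference_start
            if fwd_start > rev_start:
                continue  # RF ("outie") pair
            if read.template_length == 0:
                continue
            # Assign the fragment to the bin containing its midpoint.
            start = min(read.reference_start, read.next_reference_start)
            midpoint = start + abs(read.template_length) // 2
            counts[min(midpoint // BIN_SIZE, n_bins - 1)] += 1
    return counts

# counts = fragment_counts("sv_genome_reads.bam", "chr22", 51_304_566)
```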

Case-by-case study:

Balanced translocation:
baltr-1-igv

Here, an event is shown where a ~3kb region of chr22 is translocated to another region. Ideally, there should be no coverage loss. The IGV inspection shows excess coverage on the left side and depletion on the right side. Upon inspecting the conjugate translocation site, a similar scenario is seen. This situation is hardly avoidable -- depending on the mappability of the two loci, one captures the chimeric fragment of the other with higher probability (right?). The situation is worse for CollectFragmentCounts because chimeras are ignored altogether:

baltr-1

Deletion:
For deletions, both coverage collection strategies work reasonably well, though neither captures the boundaries of the deleted region perfectly.

del-1-igv

del-1

Tandem Duplication:
For tandem duplications, neglecting FF and RR fragments leads to an underestimation of the size of the duplicated region by CollectFragmentCounts. IGV does not seem to get it quite right either (@cwhelan, does the IGV plot make sense to you? Could there be a bug in how SVGen generates tandem duplications?)
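
(To make the orientation filtering concrete, here is a small sketch of how a read pair's orientation can be classified from its flags; anything other than FR ("innie") is dropped by the default CollectFragmentCounts filters discussed above. Illustrative only -- not the GATK code.)

```python
# Sketch: classify a read pair's orientation (FR, RF, FF, RR) from its flags.
# `read` is a pysam.AlignedSegment, as in the sketch above.
def pair_orientation(read):
    """Return 'FR', 'RF', 'FF', or 'RR' for a mapped pair on one contig, else None."""
    if read.is_unmapped or read.mate_is_unmapped:
        return None
    if read.reference_id != read.next_reference_id:
        return None  # inter-chromosomal pair; handled separately
    if read.is_reverse == read.mate_is_reverse:
        return "RR" if read.is_reverse else "FF"
    # Opposite strands: FR if the forward-strand mate is the leftmost one.
    self_is_leftmost = read.reference_start <= read.next_reference_start
    self_is_forward = not read.is_reverse
    return "FR" if self_is_leftmost == self_is_forward else "RF"
```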

dup-1-igv

dup-1

Inversion:
IGV performs well, with very little coverage depletion at the boundaries. CollectFragmentCounts shows significant coverage depletion at the boundaries + random dropouts (why?).

inv-1-igv

inv-1

Here's another example in a less mappable region (the extra IGV track shows the GMA Illumina mappability track):

inv-2-igv

inv-2

Again, IGV does a much better job. In general, keeping only FR pairs seems to lead to noisy coverage, especially in low mappability regions.

Unbalanced Translocation:
A clear win for IGV, and a good reason for keeping split reads (notice the coverage loss on the left side of the event in CollectFragmentCounts).

unbtr-1-igv

unbtr-1

Another example, in a low-complexity region:

unbtr-2-igv

unbtr-2

No good strategy here -- it's better to blacklist such regions altogether for CNV calling.

Another win for IGV. I do not understand the reason for the coverage dropouts in CollectFragmentCounts. It might have something to do with the SNPs (though all reads have high MQ).

unbtr-3-igv

unbtr-3

Summary and Conclusion: We should not filter out FF and RR pairs; rather, we should count them as individual reads. IGV's strategy (base coverage from individual reads, with no filtering except for mismatching bases and clipped regions) conforms significantly better to the expected coverage than CollectFragmentCounts does. We should consider reviving CollectBaseCallCoverage from the GATK4 beta and evaluating it.
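
(For reference, a minimal pysam sketch of a base-level strategy along the lines described above: every mapped read contributes one count per aligned, non-mismatching base, while clipped and inserted bases are skipped because they have no reference position. It assumes the BAM carries MD tags so the reference base can be reconstructed; this is only an illustration, not IGV's or any GATK tool's implementation.)

```python
# Sketch of an IGV-like base-level coverage collection: each mapped read
# contributes one count per aligned base; mismatching bases are skipped
# (lowercase reference base, reconstructed from the MD tag), and clipped or
# inserted bases are skipped because they have no reference position.
import pysam

BIN_SIZE = 100

def base_call_coverage(bam_path, contig, contig_length):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    coverage = [0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped or read.is_secondary or read.is_duplicate:
                continue
            # with_seq=True requires the MD tag; substituted positions come
            # back as lowercase reference characters.
            for qpos, rpos, ref_base in read.get_aligned_pairs(with_seq=True):
                if qpos is None or rpos is None or ref_base is None:
                    continue  # clipped base, insertion, or deletion
                if ref_base.islower():
                    continue  # mismatching base
                coverage[rpos // BIN_SIZE] += 1
    return coverage
```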

I suggest addressing this as a priority issue before we consider g/s-CNV for production use, or coverage collection on any large dataset.

@mbabadi
Contributor Author

mbabadi commented Mar 21, 2018

@LeeTL1220 @ldgauthier @cwhelan @mwalker174 @SHuang-Broad @yfarjoun might find this analysis of interest.

@mwalker174
Contributor

Very interesting! @mbabadi What were your settings in IGV (under View->Prefs->Alignments) for the MQ threshold and for filtering secondary/supplementary alignments? Could MQ maybe explain the random dropouts in the inversion cases?

@samuelklee
Contributor

samuelklee commented Mar 21, 2018

Interesting! Thanks for generating these. I am already convinced by #4519 that we should at least switch over to a ‘CollectReadCounts’ strategy for initial evaluations. A few comments:

-I’m guessing that the equal insert size and uniform sampling are enhancing many of these artifacts to a level that we probably don’t see in the real world. Can we take a look at some real-world examples?

-Same goes for the fact that homs will be unlikely.

-Not sure about the dropouts. Might be worth running without SNPs as a confounding factor.

-How flexible is SVGen? Might be worth putting together a more realistic simulated data set. Any chance @MartonKN might be able to use it to cook up some realistic tumor data?

-I don’t recall having a CollectBaseCallCoverage type tool in beta—which tool are you thinking of? On a related note, it seems there is some demand to port DepthOfCoverage from GATK3. However, I’d prefer that we roll a CNV-specific version of the tool even if it does get ported.

In any case, I think along with findings from the other issue, we should issue a quick PR for CollectReadCounts and go ahead and change the CollectCounts WDL task to call it—it’s for this very reason that the task is named generically! @sooheelee note that we may have to update the tutorials, etc. at some point, but perhaps the right time will be once all evaluations are more complete.

Speaking of which, this PR should not delay getting the first round of automated evaluations up and running. Again, the whole point of those is to have a reproducible baseline metric against which we can easily experiment with and adopt these sorts of changes. Although these sorts of theoretical/simulated/thought experiments are clearly useful to us, unfortunately, they may not be as compelling to some of our users as demonstrable improvement on real data!

@mbabadi
Contributor Author

mbabadi commented Mar 21, 2018

@samuelklee Valentin wrote it:
https://github.com/broadinstitute/gatk-protected/blob/1.0.0.0-alpha1.2/src/main/java/org/broadinstitute/hellbender/tools/exome/CalculateTargetBaseCallCoverage.java
We dropped it once we got rid of the XHMM port.

While automated evaluations are definitely the way to go (eventually), there is real value in understanding the interaction between BWA and our coverage collection strategy in a controlled setting.

Regarding automatic evaluations -- let's talk in person when you get back.

@samuelklee
Contributor

Also, just to provide some context to all tagged: certain users of the old CNV pipeline expressed somewhat vague concerns with the non-fragment-based coverage collection strategies (which also differed across WES and WGS, to boot), but didn’t offer any compelling demonstrations that fragment-based strategies were better.

For the new version of the pipelines, the main priority was to pick a single strategy to unify WES/WGS coverage collection. We decided to give a simple fragment-based strategy a shot, with the intention of using automated evaluations to test it in a rigorous manner. Although those aren’t in place yet, I’m comfortable with making the call against it at this point.

@samuelklee
Contributor

samuelklee commented Mar 21, 2018

Ah, that’s right, @mbabadi, thanks for reminding me. We can always quickly reimplement such a strategy as a FeatureWalker (preferably in a dev branch) and add it to the bake-off.

Tasks aimed towards getting the first round of automatic evaluations in place comprise the majority of our goals this quarter, so let’s get them done. Once they’re in place, we’ll be able to experiment more freely and confidently!

@samuelklee
Contributor

One more question: did you examine events shorter than the fragment size?

@mbabadi
Contributor Author

mbabadi commented Mar 23, 2018

Update:

Here's what the coverage looks like using CollectReadCounts (w/ and w/o the MQ > 30 filter) vs. CollectFragmentCounts. The lines are offset by +10 and +20 for better visibility.

Summary: a marked improvement in all cases; however, the error modes are different. CollectFragmentCounts tends to underestimate the size of SV regions and uniformly leads to coverage depletion near the breakpoints, whereas CollectReadCounts estimates the size of SV regions better but its coverage near the breakpoints tends to be less predictable (sometimes depletion, sometimes accumulation). Still, IGV seems to do the best job.
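
(For clarity, here is a minimal pysam sketch of what a read-counting strategy boils down to: each primary-aligned read is assigned to the bin containing its alignment start, with an optional MQ threshold. Only an illustration, not the actual CollectReadCounts code; the BAM path and contig name are hypothetical.)

```python
# Sketch of a read-counting strategy: each primary-aligned read contributes
# one count to the bin containing its alignment start, with an optional MQ
# threshold. Illustrative only -- not the CollectReadCounts implementation.
import pysam

BIN_SIZE = 100

def read_counts(bam_path, contig, contig_length, min_mq=None):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    counts = [0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if (read.is_unmapped or read.is_secondary
                    or read.is_supplementary or read.is_duplicate):
                continue
            if min_mq is not None and read.mapping_quality <= min_mq:
                continue
            counts[read.reference_start // BIN_SIZE] += 1
    return counts

# with_mq_filter    = read_counts("sv_genome_reads.bam", "chr22", 51_304_566, min_mq=30)
# without_mq_filter = read_counts("sv_genome_reads.bam", "chr22", 51_304_566)
```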

Any improvement over CollectReadCounts requires using supplementary alignment information (e.g. weight sharing among supplementary alignments; this would likely fix the coverage asymmetry at translocation breakpoints), read clipping information, and mismatches. The latter two require a base-level coverage collection strategy (like IGV and CalculateTargetBaseCallCoverage).
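
(One possible form of the weight-sharing idea, sketched below: each read record, primary or supplementary, contributes 1 / (number of alignments listed for that read) at its own position, so a chimeric read straddling a breakpoint splits its count between the loci it touches. Purely illustrative; the SA-tag handling is simplified and this is not an existing GATK behavior.)

```python
# Sketch of weight sharing among supplementary alignments: a read with N
# additional alignment records listed in its SA tag contributes 1/(N+1)
# at each of its alignment positions instead of a full count at the
# primary one. Illustrative only.
import pysam

BIN_SIZE = 100

def shared_weight_counts(bam_path, contig, contig_length):
    n_bins = (contig_length + BIN_SIZE - 1) // BIN_SIZE
    counts = [0.0] * n_bins
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped or read.is_secondary or read.is_duplicate:
                continue
            # The SA tag lists the read's other (chimeric) alignment records
            # as semicolon-terminated entries.
            n_alignments = 1
            if read.has_tag("SA"):
                n_alignments += len(read.get_tag("SA").rstrip(";").split(";"))
            counts[read.reference_start // BIN_SIZE] += 1.0 / n_alignments
    return counts
```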

Unbalanced translocation:

unbtr-1

unbtr-2

unbtr-3

Balanced translocation:

baltr-1

Inversion:

inv-1

inv-2

Deletion:

del-1

Tandem Duplication:

dup-1

@mbabadi
Contributor Author

mbabadi commented Mar 23, 2018

Also, MQ filtering results in stochastic coverage dropout. It is likely that low-MQ regions significantly overlap across samples, in which case the downstream CNV tools can learn such biases and correct the coverage. Will test this in validations.

@sooheelee
Contributor

@sooheelee note that we may have to update the tutorials, etc. at some point, but perhaps the right time will be once all evaluations are more complete.

Thanks, @samuelklee for keeping me updated.

@samuelklee samuelklee changed the title CollectFragmentCounts performs poorly on a synthetic SV callset Continue investigation of count collection strategies. Jan 31, 2019
@samuelklee samuelklee removed their assignment Mar 31, 2020