An insert- and bait-based coverage calculation tool for WES BAMs #2947

Closed
droazen opened this issue Jun 5, 2017 · 1 comment

Comments

@droazen
Contributor

droazen commented Jun 5, 2017

@mbabadi commented on Fri Feb 17 2017

The current coverage tool in the GATK CNV pipeline, i.e. CalculateTargetCoverage, has important caveats that make the ab initio modeling of coverage data and the subsequent CNV analysis inaccurate.

We need to write a principled tool for calculating read-depth histograms from targeted sequencing data. Given that a major source of coverage variance in targeted sequencing stems from the variance in bait efficiencies, the most reasonable read-depth calculation scheme is to associate inserts with baits rather than single reads with targets. No more arbitrary interval padding (which was a hacky way to get away with not thinking about inserts).
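
As a rough illustration of what insert-based counting could look like (a minimal sketch, not an existing GATK tool; it assumes paired-end data in a coordinate-sorted, indexed BAM readable by pysam, and the bait intervals and function name are illustrative):

```python
# Minimal sketch: count inserts, rather than single reads, against bait
# intervals. Baits are illustrative 0-based half-open (contig, start, end)
# tuples; this is not an existing GATK tool.
import pysam

def insert_counts_per_bait(bam_path, baits):
    counts = [0] * len(baits)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for i, (contig, start, end) in enumerate(baits):
            seen = set()
            # Caveat: fetching over the bait interval misses pairs whose reads
            # both lie outside the bait even though the insert spans it; a real
            # tool would stream the whole BAM once instead.
            for read in bam.fetch(contig, start, end):
                if (not read.is_proper_pair or read.is_secondary
                        or read.is_supplementary or read.mate_is_unmapped):
                    continue
                if read.query_name in seen:  # count each fragment once per bait
                    continue
                seen.add(read.query_name)
                insert_start = min(read.reference_start, read.next_reference_start)
                insert_end = insert_start + abs(read.template_length)
                if insert_start < end and insert_end > start:
                    counts[i] += 1
    return counts
```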

There is a subtle problem, though: inserts often overlap with more than one bait. In such cases, we need a model for estimating the probability that the insert is captured by each of the overlapping baits. The modeling can be done in the following semi-empirical fashion (thanks @yfarjoun), which only needs to be done once for each capture technology (Agilent, ICE):

  • We locate isolated baits (i.e. those that are separated from one another by a few standard deviations of the average insert size)
  • We take a number of BAMs and calculate the empirical distribution of inserts around the isolated baits
  • We fit a simple parametric distribution to the obtained empirical distributions, parameterized by bait length and insert length; we probably don't need to go all-in here, though the reference context of the bait is also likely to be an important covariate.

Once these distributions are known, we can easily calculate the membership share of each bait in ambiguous cases and give each bait the appropriate share.
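
Once such a capture-probability model is in hand, the ambiguous-insert accounting could look roughly like this (a hedged sketch; `p_capture` stands for whatever fitted distribution the calibration above produces, and the names are illustrative):

```python
# Sketch of the "membership share" step: split each ambiguous insert's single
# count across the baits it overlaps, in proportion to the fitted capture
# probabilities. p_capture(insert, bait) is assumed to come from the
# semi-empirical calibration described above.
def membership_shares(insert, overlapping_baits, p_capture):
    weights = {bait: p_capture(insert, bait) for bait in overlapping_baits}
    total = sum(weights.values())
    if total == 0.0:
        # Degenerate case: fall back to an even split across overlapping baits.
        return {bait: 1.0 / len(overlapping_baits) for bait in overlapping_baits}
    return {bait: w / total for bait, w in weights.items()}

# Per-bait coverage then accumulates fractional counts, e.g.:
#   for bait, share in membership_shares(insert, overlapping, p_capture).items():
#       coverage[bait] += share
```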

Bonus:

The empirical distribution of inserts around baits also lets us assign a more reasonable GC content to each bait. GC bias is a property of the fragments pulled by the baits, so a reasonable measure of the "GC content" of a bait is the expected GC content of the fragments it pulls (not the GC content of the bait or target itself), and this can be calculated easily from the previously obtained empirical distributions.


@mbabadi commented on Fri Feb 17 2017

@yfarjoun @samuelklee @davidbenjamin agree?


@yfarjoun commented on Fri Feb 17 2017

I agree with the idea. The devil, as always, is in the details.

  • I think that the model you are interested in is parametrized by
    insert length and the position of the bait within the insert, not
    bait length and insert length. Bait length is going to be a constant,
    or almost so, I suspect (please check).
  • I suspect that of these two covariates, the position from the (nearest)
    end will be the most informative, and it will probably decay to a
    constant with a decay parameter on the order of the bait length (a toy
    parametric form is sketched after this list).
  • GC bias will affect the overall efficacy of the bait, since it affects
    the entire insert when amplified, and so it will introduce noise that is
    orthogonal to the question you are after. I would design the measurement
    to be as unaffected by this noise as possible. Your later model will find
    this constant, of course.
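
For concreteness, one toy functional form consistent with the second bullet (purely an assumption for illustration; nothing here fixes the actual shape, which is what the calibration is for, and the 77 bp default is just the ICE bait length quoted later in the thread):

```python
# Toy saturating-exponential form for relative capture efficiency as a function
# of d, the distance of the bait from the nearest end of the insert. Purely
# illustrative: the plateau, edge penalty, and decay length are placeholders,
# with the decay length set to the bait length as suggested above.
import math

def relative_capture_efficiency(d, bait_length=77.0, plateau=1.0, edge_factor=0.5):
    # edge_factor * plateau at d = 0, rising toward plateau as d >> bait_length
    return plateau - (1.0 - edge_factor) * plateau * math.exp(-d / bait_length)
```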

happy to talk about this more!

Yossi.


@mbabadi commented on Sat Feb 18 2017

@yfarjoun, thanks for your comments. On the first two points, I agree. Let me clarify: I was going to use the bait length and insert length as hyperparameters of the pdf, where the pdf itself gives the probability of having an insert in a certain configuration relative to the bait. I think the parametrization you proposed, i.e. the distance between the nearest ends of insert and bait, is very reasonable, since the pdf is going to be reflection-symmetric once averaged over all baits; and you're right, the bait length is constant (77 bp for ICE), so we can drop it from the analysis.

If the fragment capture efficiency is insensitive to the relative position of the bait sequence in the fragment, we expect the pdf to be approximately uniform (save for boundary effects at the scale of the bait length), with 0.5 x (insert length - bait length) setting the upper bound of the distribution. However, some dependence on the position of the bait is expected: e.g. if the bait sequence is on the dangling end of a fragment, it is less likely to stay bound than if it is in the middle of the fragment. I'm curious to see what comes out (who knows -- maybe another 6-bp periodicity!).
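
As a baseline for comparison, that uniform null can be written down directly (a small sketch under the stated assumptions; x is the offset of the bait center from the insert center, and the 77 bp default is the ICE bait length mentioned above):

```python
# Uniform null model for the bait-center offset x relative to the insert
# center, supported on |x| <= 0.5 * (insert_length - bait_length). This is the
# "no positional preference" baseline; deviations from it are what the
# empirical fit would capture.
def uniform_offset_pdf(x, insert_length, bait_length=77.0):
    half_width = 0.5 * (insert_length - bait_length)
    if half_width <= 0.0 or abs(x) > half_width:
        return 0.0
    return 1.0 / (2.0 * half_width)
```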

Regarding GC bias -- I agree with what you said; I guess I had a different point, though. To find the GC curve, one needs to regress the coverage depth against the GC content in some way. Now, all we have is a collection of baits, along with the number of inserts pulled by each bait (let's call it the bait pileup). So, we have to assign a measure of GC content to each bait pileup in order to regress the pileup size against it. A reasonable choice is the expected value of the GC content of the pulled fragments for each bait:

effective GC of a bait = \sum_{all inserts} GC(insert) x Pr(insert) x Pr(insert is pulled by bait)

Alternatively, we can calculate the average GC content of each bait pileup, for each sample, as we calculate the bait pileups. This is probably too expensive and unnecessary (there shouldn't be much sample-to-sample variation anyway, right?).
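
A sketch of how that expectation could be computed from the fitted distributions (the helper names are hypothetical; note that dividing by the total weight turns the sum above into a proper expectation over the fragments the bait pulls):

```python
# Effective GC of a bait: weighted average of fragment GC, weighted by the
# prior probability of each candidate insert configuration and the fitted
# probability that the bait pulls it. gc_of, p_insert, and p_pulled_by are
# placeholders for quantities derived from the empirical calibration.
def effective_gc(bait, candidate_inserts, gc_of, p_insert, p_pulled_by):
    num, denom = 0.0, 0.0
    for insert in candidate_inserts:
        w = p_insert(insert) * p_pulled_by(insert, bait)
        num += gc_of(insert) * w
        denom += w
    return num / denom if denom > 0.0 else float("nan")
```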

@samuelklee
Contributor

Let's consider this a dupe of #4551.
