The current coverage tool in the GATK CNV pipeline, i.e. CalculateTargetCoverage, has important caveats that make the ab-initio modeling of coverage data and the subsequent CNV analysis inaccurate.
We need to write a principled tool for calculating read-depth histograms from targeted sequencing data. Given that a major source of coverage variance in targeted sequencing stems from the variance in bait efficiencies, the most reasonable read-depth calculation scheme is to associate inserts with baits rather than single reads with targets. No more arbitrary interval padding (which was a hacky way to get away with not thinking about inserts).
There is a subtle problem, though: inserts often overlap with more than one bait. In such cases, we need a model for estimating the probability that the insert was captured by each of the overlapping baits. The modeling can be done in the following semi-empirical fashion (thanks @yfarjoun), which needs to be done only once for each capture technology (Agilent, ICE):
1. We locate isolated baits, i.e. those that are separated from one another by a few standard deviations of the average insert size.
2. We take a number of BAMs and calculate the empirical distribution of inserts around the isolated baits.
3. We fit a simple parametric distribution to the obtained empirical distributions, parameterized by bait length and insert length; we probably don't need to go all-in here, though the reference context of the bait is also likely to be an important covariate.
Once these distributions are known, we can easily calculate the membership share of each bait in ambiguous cases and give each bait the appropriate share.
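As a concrete illustration, here is a minimal sketch of how the membership shares might be computed once such a distribution has been fitted. The `capture_pdf` function, the half-open `(start, end)` interval representation, and the full-containment requirement are assumptions made for this sketch, not part of the proposed tool:

```python
# Sketch: splitting an ambiguous insert among the baits it overlaps, given a
# fitted capture_pdf(d) returning the relative probability of capture as a
# function of d, the bait's distance from the nearest end of the insert.
# Intervals are half-open (start, end) tuples; all names here are hypothetical.

def nearest_end_distance(insert, bait):
    """Distance (bp) from the bait to the nearer end of the insert, or None
    if the bait is not fully contained in the insert."""
    insert_start, insert_end = insert
    bait_start, bait_end = bait
    if bait_start < insert_start or bait_end > insert_end:
        return None
    return min(bait_start - insert_start, insert_end - bait_end)

def membership_shares(insert, overlapping_baits, capture_pdf):
    """Fractional share of one insert assigned to each overlapping bait."""
    weights = []
    for bait in overlapping_baits:
        d = nearest_end_distance(insert, bait)
        weights.append(capture_pdf(d) if d is not None else 0.0)
    total = sum(weights)
    if total == 0.0:
        return [0.0] * len(overlapping_baits)
    return [w / total for w in weights]

# Example: a 400-bp insert covering two 77-bp baits, with a flat capture pdf.
if __name__ == "__main__":
    flat_pdf = lambda d: 1.0
    insert = (1000, 1400)
    baits = [(1010, 1087), (1300, 1377)]
    print(membership_shares(insert, baits, flat_pdf))  # -> [0.5, 0.5]
```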
Bonus:
The empirical distribution of inserts around baits also allows us to associate a more reasonable GC content to each bait. Since GC bias is a property of the fragments that are pulled by the baits, a reasonable measure of "GC content" of each bait has to be calculated from the expected value of the GC content of the fragments that the bait pulls (not the GC content of the baits or targets), and this can be easily calculated from the previously obtained empirical distributions.
@mbabadi commented on Fri Feb 17 2017
@yfarjoun @samuelklee @davidbenjamin agree?
@yfarjoun commented on Fri Feb 17 2017
I agree with the idea. The devil, as always, is in the details.
I think that the model you are interested in is parametrized by insert length and the position of the bait in the insert, not bait length and insert length. Bait length is going to be constant, or almost so, I suspect (please check).
I suspect that of these two covariates, position from the (nearest) end will be the most informative and will probably decay to a constant with a decay parameter of the order of the bait length.
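One possible parametric family matching this description (a sketch for concreteness, not something specified in the thread) would let the capture weight decay to a constant in the nearest-end distance d:

```latex
% A possible parametric form for the capture weight as a function of d, the
% bait's distance from the nearest insert end (a sketch; the sign of a and
% the value of \lambda come from the fit, with \lambda expected ~ L_bait):
\[
  w(d) = c + a\, e^{-d/\lambda}, \qquad \lambda \sim L_{\mathrm{bait}} .
\]
```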
GC bias will affect the overall efficacy of the bait, since it will affect the entire insert when amplified, and so it will introduce noise that is orthogonal to the question you are after. I would design the measurement to be as unaffected by this noise as possible; your later model will find this constant, of course.
Happy to talk about this more!
Yossi.
@mbabadi commented on Sat Feb 18 2017
@yfarjoun, thanks for your comments. On the first two points, I agree. Let me clarify: I was going to use the bait length and insert length as hyperparameters of the pdf, where the pdf itself gives the probability of having an insert in a certain configuration relative to the bait. I think the parametrization you proposed, i.e. the distance between the nearest ends of the insert and the bait, is very reasonable, since the pdf is going to be reflection-symmetric once averaged over all baits; and you're right, the bait length is constant (77 bp for ICE), so we can drop it from the analysis.
If the fragment capture efficiency is insensitive to the relative position of the bait sequence in the fragment, we expect the pdf to be approximately uniform (save for boundary effects at the scale of the bait length), with 0.5 x (insert length - bait length) setting the upper bound of the distribution. However, some dependence on the position of the bait is expected: e.g. if the bait sequence is on the dangling end of a fragment, it is less likely to stay bound than if it is in the middle of the fragment. I'm curious to see what comes out (who knows -- maybe another 6-bp periodicity!).
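For concreteness, a short sketch of why the 0.5 factor appears under the uniform assumption (this derivation is added here for clarity and assumes the bait lies fully inside the insert):

```latex
% Uniform-baseline sketch. Let x be the offset of the bait's left end from
% the insert's left end, x ~ Uniform(0, L_ins - L_bait). The nearest-end
% distance d = min(x, L_ins - L_bait - x) then has density
\[
  p(d) = \frac{2}{L_{\mathrm{ins}} - L_{\mathrm{bait}}},
  \qquad 0 \le d \le \tfrac{1}{2}\left(L_{\mathrm{ins}} - L_{\mathrm{bait}}\right),
\]
% i.e. flat, with the stated upper bound of 0.5 (insert length - bait length).
```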
Regarding GC bias -- I agree with what you said; I guess I had a different point, though. To find the GC curve, one needs to regress the coverage depth against the GC content in some way. Now, all we have is a collection of baits, along with the number of inserts pulled by each bait (let's call it the bait pileup). So, we have to assign a measure of GC content to each bait pileup in order to regress the pileup size against it. A reasonable choice is to use the expected value of the GC content of the pulled fragments for each bait:
effective GC of a bait = \sum_{all inserts} GC(insert) x Pr(insert) x Pr(insert is pulled by bait)
Alternatively, we can calculate the average GC content of each bait pileup, for each sample, as we calculate the bait pileups. This is probably too expensive and unnecessary (there shouldn't be much sample-to-sample variation anyway, right?).
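A minimal sketch of the first option, computing the effective GC of a bait from a fitted insert-length distribution and capture pdf. The function names, the `gc_of_interval` reference lookup, and the normalization are assumptions made for illustration, not existing GATK code:

```python
# Sketch: effective GC of a bait as the (normalized) expected GC content of
# the fragments it pulls. `insert_length_pmf` maps insert length -> probability,
# `capture_pdf(d)` is the fitted capture weight at nearest-end distance d, and
# `gc_of_interval(start, end)` returns the reference GC fraction of an interval;
# all three are assumed inputs.

def effective_gc(bait, insert_length_pmf, capture_pdf, gc_of_interval):
    """Expected GC of fragments pulled by `bait`, a (start, end) tuple."""
    bait_start, bait_end = bait
    bait_len = bait_end - bait_start
    numerator, normalizer = 0.0, 0.0
    for insert_len, p_len in insert_length_pmf.items():
        if insert_len < bait_len:
            continue  # the bait cannot be fully contained in a shorter insert
        # Enumerate placements of the insert around the bait; `shift` is the
        # distance from the insert's left end to the bait's left end.
        for shift in range(insert_len - bait_len + 1):
            insert_start = bait_start - shift
            insert_end = insert_start + insert_len
            d = min(shift, insert_len - bait_len - shift)  # nearest-end distance
            w = p_len * capture_pdf(d)  # Pr(insert) * Pr(pulled by bait | insert)
            numerator += w * gc_of_interval(insert_start, insert_end)
            normalizer += w
    return numerator / normalizer if normalizer > 0.0 else float("nan")
```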