Read filtering based on MQ is not a substitute for "mappability" filtering #4558
@samuelklee mappability filtering is highly likely to get rid of a lot of the deletion calls in the
A few comments to add on to what we discussed in person:

I think the coverage distribution is indeed the correct summary statistic to model for this problem. Total coverage just doesn't provide enough information, but subsampling bins or fitting a per-bin bias model is overkill. However, I think a straightforward, self-contained modeling or masking approach (which need not rely on a mappability track) within the ploidy-determination tool is still quite feasible. I think that if we can easily solve the problem without requiring a mappability track then we should try to do it, as that is a relatively expensive resource to create.

For example, some very naive hard filtering (red) of the histogram yields a peak that is easily fit by a negative binomial (green)---even a Poisson fit does not appear to bias the depth estimates, and certainly does not result in incorrect ploidy estimates. (Incidentally, it is helpful to plot on a log scale when checking the fit of these distributions.) This strategy also gives us a way to ignore low-level mosaicism or large germline events, which filtering on mappability may not address.

So let's try to encapsulate changes to the ploidy tool. I agree that the histogram creation can be easily done on the Java side, to save on intermediate file writing. We can probably just cap the maximum bin to

I agree that there is still a lot of important work to be done in exploring our best practices for coverage collection, and I know that you have been interested in improving them for a while. Ultimately, we may want to consider incorporating mappability or other informative metadata, as we've discussed. However, this will require some non-trivial investment in method/tool development time. Since our preliminary evaluations show that even with the current, naive strategies the tool is performing reasonably well, I am prioritizing cutting a release and improving/automating the evaluations. As we discussed, this will both allow users to start using the tool (which will hopefully result in useful feedback) and establish a baseline for us. This will ultimately provide the necessary foundation for future exploratory work and method development---which always takes more time than we think it will!
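As a rough illustration of the naive hard-filtering step described above (not the actual tool code; the input file, the low-count cutoff, and the ±50% window around the peak are assumptions), something like the following is enough to isolate the main peak and obtain an unbiased depth estimate:

```python
# Minimal sketch of the naive hard filtering described above; file name,
# low-count cutoff, and the +/-50% window around the peak are hypothetical.
import numpy as np

counts = np.loadtxt("contig_bin_counts.txt", dtype=int)  # one fragment count per bin (hypothetical input)
hist = np.bincount(counts)

# Find the main non-zero peak, ignoring the pile-up of low-count bins.
peak = np.argmax(hist[10:]) + 10
keep = counts[(counts > 0.5 * peak) & (counts < 1.5 * peak)]

# Even a plain Poisson-style depth estimate (the mean of the retained bins)
# lands on the peak, so ploidy determination is not biased by the filtered tails.
print(f"main peak at {peak}, estimated depth {keep.mean():.1f}")
```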
Also, I think that being able to express joint contig-contig ploidy priors will help immensely. That is, rather than giving X a prior that evenly weights CN = 1 and CN = 2 and Y a separate prior that evenly weights CN = 0 and CN = 1, we'd want to be able to specify that the combinations XX and XY should be evenly weighted (otherwise we are equally likely to get X and XXY).
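To make the joint-prior point concrete, here is a toy sketch (the weights are purely illustrative) contrasting independent per-contig priors with a joint prior over (CN_X, CN_Y) karyotypes:

```python
# Toy illustration of independent vs. joint contig ploidy priors; the weights
# are made up and only serve to show why the joint specification matters.
import itertools

prior_x = {1: 0.5, 2: 0.5}   # X: CN = 1 and CN = 2 equally weighted
prior_y = {0: 0.5, 1: 0.5}   # Y: CN = 0 and CN = 1 equally weighted

independent = {(cx, cy): prior_x[cx] * prior_y[cy]
               for cx, cy in itertools.product(prior_x, prior_y)}
# independent[(2, 1)] == 0.25, i.e. XXY is a priori as likely as XX or XY.

joint = {(2, 0): 0.5,   # XX
         (1, 1): 0.5}   # XY
# The joint table puts no mass on X0 or XXY unless we explicitly allow them.
```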
Just to clarify -- your proposed approach is certainly useful for filtering large germline events, etc., and must be part of the solution (for more robustness). What I'm saying here is that some elementary filtering gives us great mileage, both in terms of our peace of mind in determining the hard-filtering region in the ploidy tool and in reducing false calls.
Thanks for kicking those off! A few more comments:

Just to summarize so far---I am arguing that we probably don't need mappability filtering for ploidy calling, as naive bin filtering suffices to clean up the relevant histograms. And, as you showed, we can hopefully rely on per-bin bias modeling to at least partially account for mappability in gCNV calling (and we certainly wouldn't want to filter out a significant fraction of the genome, in any case). Do we agree?

To answer your first question, the criterion for choosing the peak is quite hacky at the moment, but I found that filtering low-count bins to first check for the presence of a high peak, and then falling back to the peak at zero, works perfectly fine in practice (a rough sketch is given at the end of this comment). We can certainly try to do something smarter, since, as you say, bin filtering may be desirable---even if we implement mappability filtering---to remove large germline events (it's true that the "example" I showed above is indeed from the PAR-like region on X, as you point out, but this is roughly how a large arm-level event would appear even after mappability filtering). Although the model I fit above, which is simply a sparse mixture of NBs with regularly-spaced means (modulo some sample-specific and contig-specific jitter), could conceivably capture such events as well, we want to avoid models where a single NB might try to capture two or more peaks.

Also, just to clarify, the weird mosaic examples are the bottom two plots out of the four above---you can see the shifted (non-X, in one of the examples) single peaks. However, it's interesting that the PARs are still showing up in XY---I'm pretty sure I used the blacklist you provided, although I will double check. Did that only include the "official" PARs, or also the additional ones you found? In any case, are we comfortable calling in those regions (here I'm talking about gCNV, not ploidy)? As I show above, I don't think we need mappability to nail the baseline ploidy. Can we then rely on the per-bin bias to account for these regions in gCNV (pinning them back to the correct CN) without mappability filtering? And with mappability filtering, how substantial is the hit to coverage in these regions? Should we blacklist them for the time being?

To summarize, I think the order of events I'd like to see is this:
Again, getting an initial beta release and some reasonable parameters to users is a high priority, so thanks for kicking off the evaluations, and thanks for your willingness to discuss our options. Let me know if you agree with the rest of the plan!
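For reference, here is a rough sketch of the hacky peak-selection criterion mentioned above (the low-count cutoff and the minimum-mass check are assumptions, not the tool's actual values):

```python
# Rough sketch of the peak-selection heuristic described above: ignore
# low-count bins, accept a well-supported high peak if present, and fall back
# to the peak at zero otherwise. The thresholds are hypothetical.
import numpy as np

def select_depth_peak(counts, low_count_cutoff=10, min_mass_fraction=0.1):
    hist = np.bincount(counts)
    if hist.size > low_count_cutoff:
        high = hist[low_count_cutoff:]
        # Accept the high peak only if enough bins support it.
        if high.sum() >= min_mass_fraction * hist.sum():
            return int(np.argmax(high)) + low_count_cutoff
    return 0  # fall back to the peak at zero (e.g., an absent contig)
```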
As a step toward productionizing gCNV, I explore the idea of aggressive filtering of reads (or targets, or bins) based on known annotations, such as genome mappability, complexity, SINEs, LINEs, repeats, etc.
This post is about mappability filtering, in particular for WGS samples. Let us consider coverage data collected by CollectFragmentCounts (default tool args: well-formed paired, FR orientation, MQ > 30) for several WGS samples (hg38, BWA-MEM). Here's what the coverage distribution looks like for a random sample:

Notice the main peak at ~90 fragments/bin, the diffuse counts below 90, and the peak at 0.
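As a rough sketch of how such a histogram can be built from a CollectFragmentCounts-style table (the file name and the COUNT column name are assumptions about the output format):

```python
# Sketch: per-bin fragment-count histogram from a CollectFragmentCounts-style
# table; the file name and column name are assumptions about the output format.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sample.fragment_counts.tsv", sep="\t", comment="@")
counts = df["COUNT"].to_numpy()

plt.hist(counts, bins=200, log=True)   # log scale makes the diffuse tail and the zero peak visible
plt.xlabel("fragments per 1 kb bin")
plt.ylabel("number of bins")
plt.show()
```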
To proceed, I calculated the average mappability score of each bin using the recently published multi-read Umap track. It is not an ideal mappability scoring scheme (since it does not consider Illumina-like errors and paired-end reads), but it is the best I could find for hg38. I filtered out all bins with a mappability score below 100% (very aggressive). Here's what the filtered coverage distribution looks like:
Evidently, the diffuse low-coverage part of the distribution has completely disappeared, along with the peak at 0.
Here's a plot of the number of retained bins after filtering with various mean mappability cutoffs:
Except for chrY, the most aggressive filtering only gets rid of 20% of the genome.
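A sketch of how the per-bin mean mappability and the retained fraction can be computed from a bigWig mappability track (the track path, bin size, and cutoffs are assumptions; pyBigWig is one possible way to query the track):

```python
# Sketch: mean mappability per 1 kb bin from a bigWig Umap track, and the
# fraction of bins retained at several cutoffs. Path, bin size, and cutoffs
# are assumptions.
import numpy as np
import pyBigWig

bw = pyBigWig.open("umap_multi_read.hg38.bw")   # hypothetical track path
bin_size = 1000

mean_map = []
for chrom, length in bw.chroms().items():
    for start in range(0, length - bin_size + 1, bin_size):
        m = bw.stats(chrom, start, start + bin_size, type="mean")[0]
        mean_map.append(0.0 if m is None else m)
mean_map = np.asarray(mean_map)

for cutoff in (0.5, 0.8, 0.9, 1.0):
    print(f"mappability >= {cutoff}: {np.mean(mean_map >= cutoff):.1%} of bins retained")
```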
Let us fit different distributions to the filtered coverage data:
As we expect, the negative binomial is almost a perfect fit! This is very reassuring, because the gCNV model is built on the negative binomial (for purely theoretical reasons).
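A sketch of the fit check (method-of-moments parameter estimates; the input is a hypothetical array with one count per retained bin), with the fitted distributions overlaid on a log scale as suggested earlier:

```python
# Sketch: overlay Poisson and negative-binomial fits on the filtered coverage
# histogram on a log scale. `filtered_counts` is a hypothetical array holding
# one fragment count per retained (mappability-filtered) bin.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

filtered_counts = np.loadtxt("filtered_counts.txt", dtype=int)   # hypothetical input

mean, var = filtered_counts.mean(), filtered_counts.var()
p = mean / var                      # method-of-moments NB fit (requires var > mean)
r = mean * p / (1.0 - p)

x = np.arange(filtered_counts.max() + 1)
empirical = np.bincount(filtered_counts) / filtered_counts.size
plt.semilogy(x, empirical, ".", label="data")
plt.semilogy(x, stats.poisson.pmf(x, mean), label="Poisson")
plt.semilogy(x, stats.nbinom.pmf(x, r, p), label="negative binomial")
plt.legend()
plt.show()
```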
Now, let us look at unfiltered coverage and regress coverage against mappability. Here's what it looks like for 3 different contigs and 5 different samples:
Clearly, there is a strong correlation here. In particular, low-mappability bins show two distinct modes: one at the overall coverage mode (~100 fragments/bin) and one that bifurcates toward lower values.
I conjecture that the bimodality results from heterogeneity of mappability scores at different positions within the same bin. The bins are 1 kb wide, and it is plausible that some positions are highly mappable while others are not. This conjecture can be tested by collecting coverage on smaller bins and checking whether the bimodality weakens. If it does, I suggest filtering based on read position, similar to Genome STRiP, as opposed to filtering whole bins.
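A sketch of that test (the file names and the COUNT/MAPPABILITY columns are assumptions): collect counts at two bin sizes, restrict to low-mappability bins, and compare the normalized coverage histograms.

```python
# Sketch of the proposed test: compare the coverage distribution of
# low-mappability bins at two bin sizes and check whether the bimodality
# weakens. File names and the MAPPABILITY/COUNT columns are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, path, label in zip(axes, ("counts_1kb.tsv", "counts_250bp.tsv"), ("1 kb bins", "250 bp bins")):
    df = pd.read_csv(path, sep="\t", comment="@")
    low = df.loc[df["MAPPABILITY"] < 0.9, "COUNT"]
    ax.hist(low / low.mean(), bins=100)    # normalize by the mean so the two bin sizes are comparable
    ax.set_title(label)
    ax.set_xlabel("normalized coverage")
plt.show()
```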
Also, there is little sample-to-sample variation in coverage-mappability scatter plots (as opposed to, let's say, GC). Therefore, there is no reason to consider mappability as a bias covariate. The mappability coverage bias can be captured by a cohort-wide mean bias.
Finally, let us study the NB overdispersion of different samples for different contigs:
There's a clear structure here: some samples have higher overdispersion than others. This could be due to degraded samples, a less even GC curve, different chemistry, etc. In any event, we can regress the residual variance $\psi_{sj}$ (for sample s, contig j) with a linear model:
$\psi_{sj} \sim N(a_s \, \psi_j + b_s, \beta)$
Here's what the regression looks like:
Pretty much everything is explained by the linear model. This provides support for our choice of linear-NB model in gCNV.
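A minimal sketch of that per-sample regression (with synthetic placeholder values for $\psi_{sj}$; here $\psi_j$ is taken to be the cohort median overdispersion of contig j):

```python
# Minimal sketch of the per-sample regression psi_sj ~ a_s * psi_j + b_s.
# `psi` is a placeholder (num_samples x num_contigs) array of fitted NB
# overdispersions; real values would come from the per-contig NB fits above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
psi = rng.gamma(shape=2.0, scale=0.01, size=(5, 24))   # synthetic placeholder values

psi_contig = np.median(psi, axis=0)                    # cohort-wide psi_j
for s in range(psi.shape[0]):
    fit = stats.linregress(psi_contig, psi[s])
    print(f"sample {s}: a_s = {fit.slope:.2f}, b_s = {fit.intercept:.4f}, R^2 = {fit.rvalue ** 2:.2f}")
```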
Lastly, let us examine whether there is a correlation between $a_s$, $b_s$, and depth of coverage:
There is absolutely no correlation, which is again expected.