-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732
Conversation
@vdauwera note that this modifies the path of the CNV methods doc (which is mostly out of date, but is still linked to in some Comms materials) from docs/CNVs/CNV-methods.pdf to docs/CNV/archived/archived-CNV-methods.pdf. The extended abstract on gCNV is a very technical description of the probabilistic model, but it might be worth referencing it for interested users until a preprint or publication is available. |
Codecov Report
@@ Coverage Diff @@
## master #5732 +/- ##
================================================
- Coverage 86.985% 37.377% -49.607%
+ Complexity 31863 17994 -13869
================================================
Files 1943 1975 +32
Lines 146768 147415 +647
Branches 16223 16225 +2
================================================
- Hits 127666 55100 -72566
- Misses 13189 87388 +74199
+ Partials 5913 4927 -986
|
…hods, and deleted some legacy documentation.
33824ef
to
4ad1d58
Compare
@mwalker174 can you assign someone to review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to see up to date documentation on gCNV going in. How are users supposed to find the updated pdf? Could we add a link to it in the GermlineCNVCaller doc?
If this is going to be a primary reference there needs to be some way to translate between the HHMM parameters and actual tool arguments. IMO, the current tool argument documentation is too vague (most users probably wonder what the heck coherence length is). I think it would be immensely helpful to give the corresponding HHMM parameters from the pdf. @vdauwera Are greek characters / subscripts permitted in tool argument documentation?
I have a few minor suggestions mostly just to improve clarity. The intro could use some citations in a few places if you have them.
@@ -86,7 +85,7 @@ | |||
|
|||
\begin{document} | |||
|
|||
\title{Notes on CNV Methods} | |||
\title{(ARCHIVED) Notes on CNV Methods} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a DEPRECATED or ARCHIVED watermark to each page?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added a header to each page---a little less distracting than a watermark.
\footnotetext{A slightly edited version of this extended abtract was accepted for poster presentation at \textit{PROBPROG 2018, October 04--06, 2018, Cambridge, MA, USA}. (Last edited \today.)} | ||
|
||
\section{Introduction} | ||
Copy-number variation (CNV) is a class of genomic {\em structural variation} where the {\em integer copy-number state} of a large genomic region (typically $>$1000 base pairs, spanning several consecutive exons or genes) is altered with respect to a reference genome. Inferring these copy-number states is an important problem in computational genomics, both for research applications and clinical practice~\cite{zhang2009copy}. Inferring integer CNV states from NGS read-depth data begins with aligning short reads (sequences of $\sim$100 base pairs from a sample genome) to the reference genome and counting the number of reads that align to each region. Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.
Citation needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many benchmarking and review papers. We can cite a number of them, e.g.:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S11-S1
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5569469/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these. I think I'm OK with fleshing out the intro with citations of reviews, etc. in the full preprint.
\section{Introduction} | ||
Copy-number variation (CNV) is a class of genomic {\em structural variation} where the {\em integer copy-number state} of a large genomic region (typically $>$1000 base pairs, spanning several consecutive exons or genes) is altered with respect to a reference genome. Inferring these copy-number states is an important problem in computational genomics, both for research applications and clinical practice~\cite{zhang2009copy}. Inferring integer CNV states from NGS read-depth data begins with aligning short reads (sequences of $\sim$100 base pairs from a sample genome) to the reference genome and counting the number of reads that align to each region. Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes. | ||
|
||
Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}. CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.
Citation or explanation needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue has been discussed in CODEX paper in relation to common variants and sensitivity:
https://www.ncbi.nlm.nih.gov/pubmed/25618849
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a parenthetical to explicitly say this is discussed in that reference (just adding a citation could be misconstrued as pointing out that particular tool as lacking in this regard).
|
||
Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}. CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter. | ||
|
||
Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity. Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we propose and implement
-> Here we present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity. Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods. | ||
|
||
\section{Summary of Methods}\label{sec:methods} | ||
\noindent{\em\bf Modeling read-depth ---} We seek to model the data $n_{st}$, the integer count of aligned reads for sample $s=1, 2, \ldots, S$ in genomic region $t=1, 2, \ldots, T$. A negative-binomial distribution for $n_{st}$ is theoretically and empirically justified; taking $\lambda_{st}$ as the Poisson parameter and $\alpha_{st}$ as the Gamma distribution parameter, we may write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A negative-binomial distribution for $n_{st}$ is theoretically and empirically justified;
Citation needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can not think of a citation... perhaps we can briefly explain the idea:
Coverage is a random process of assignment of reads to the genome and as such, can be modeled as a Poisson process with a rate that is proportional to ploidy (copy number) and library prep bias. We can approximate the bias with a GLM w/ sample-specific and target-specific factors but then have to deal with the unexplained variance. The unexplained variance is then modeled with a Gamma-distributed multiplicative factor in the Poisson rate, leading to negative binomial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just going to leave it at We model $n_{st}$ with a negative-binomial distribution...
I think an aside about Poisson/negative binomial processes is probably unnecessary since these are standard choices for modeling count data.
\begin{multline} | ||
p(\tau_t \rightarrow \tau_{t+1}) = \exp(-\Delta_{t, t+1}/d_\tau) \, \delta(\tau_t, \tau_{t+1}) \\+ [1 - \exp(-\Delta_{t, t+1}/d_\tau)]\,p(\tau_{t+1}), | ||
\end{multline} | ||
where $p(\tau_{t})$ the prior region class probability, $\Delta_{t, t+1}$ is the genomic distance between region $t$ and $t+1$, and $d_\tau$ is the typical size of regions with similar CNV activity rates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$\Delta_{t, t+1}$ is the genomic distance between region $t$ and $t+1$
How is genomic distance calculated? Between starts? Between mid-points?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Between the midpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
\caption{The performance of GATK gCNV, XHMM \cite{fromer_discovery_2012}, and CODEX \cite{jiang_codex:_2015} in detecting CNV events from WES data against a matched WGS callset obtained using Genome STRiP \cite{handsaker_large_2015}. Performance metrics are reported per WES target.} | ||
\label{fig:benchmark} | ||
\end{figure} | ||
Finally, we benchmark GATK gCNV using a cohort of whole-exome sequencing (WES) blood-normal samples. We use a manually validated and FDR-controlled callset obtained from matched whole-genome sequencing (WGS) samples using Genome STRiP~\cite{handsaker_large_2015} as the truth callset. To compare, we also include the CNV calls obtained using two popular CNV calling tools, XHMM \cite{fromer_discovery_2012} and CODEX \cite{jiang_codex:_2015}, in the benchmark. The results are shown in Fig.~\ref{sec:bench}. We find that GATK gCNV yields nearly $20\%$ higher sensitivity and $50\%$ higher specificity over all CNV events compared to the other methods. Although not presented here, benchmarks that stratify by truth variant frequency also show that GATK gCNV performs favorably on common CNV events. More extensive benchmarking results demonstrate even further improvement, primarily resulting from GATK gCNV hyperparmeter optimization using additional WGS truth callsets, and will be presented elsewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should at least briefly describe how true positives were matched (some fraction reciprocal overlap?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also how were true negatives counted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We collapse the WGS calls onto WES targets and assign an integer copy number call to each target by quantization (rounding to nearest integer copy number). Now, counting TP, FP, TN, and FN is trivial: if both WGS and WES say CN=2 on an autosomal target, that's a count for TN, if both say CN=3, that's a count for TP, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for confirming. This is already briefly alluded to in the caption, we can add more detail in the preprint (and update for more recent validations, if appropriate).
\section{Benchmarking}\label{sec:bench} | ||
\begin{figure} | ||
\includegraphics[width=\columnwidth]{figures/germline-cnv-caller-model/benchmark.pdf} | ||
\caption{The performance of GATK gCNV, XHMM \cite{fromer_discovery_2012}, and CODEX \cite{jiang_codex:_2015} in detecting CNV events from WES data against a matched WGS callset obtained using Genome STRiP \cite{handsaker_large_2015}. Performance metrics are reported per WES target.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these shown over varying QS? Also are the dotted lines on the right plot f1 isoclines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, added a sentence.
Perhaps we can leave adding more detailed references to the preprint/publication version? I think we will want to significantly expand the intro/motivation sections anyway. @mbabadi can you briefly fill in some of the other details that @mwalker174 asked about? Commenting here or adding commits to this branch are both fine, whichever is easier for you. |
@mwalker174 @mbabadi @droazen might be nice to get this one in before the release and highlight the whitepaper in the release notes. |
@mwalker174 @samuelklee I replies to Mark's questions. Please let me know if you need more help. |
Excellent, thanks! I'll add to the tex sometime tomorrow. |
Thanks @mbabadi, back to you @mwalker174! |
Thanks for answering my questions @mbabadi and @samuelklee. Merge when ready. |
Closes #3005.