Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

samuelklee · 2019-02-27T18:25:27Z

Closes #3005.

samuelklee · 2019-02-27T18:29:43Z

@vdauwera note that this modifies the path of the CNV methods doc (which is mostly out of date, but is still linked to in some Comms materials) from docs/CNVs/CNV-methods.pdf to docs/CNV/archived/archived-CNV-methods.pdf.

The extended abstract on gCNV is a very technical description of the probabilistic model, but it might be worth referencing it for interested users until a preprint or publication is available.

codecov-io · 2019-03-06T21:56:49Z

Codecov Report

Merging #5732 into master will decrease coverage by 49.607%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##              master     #5732        +/-   ##
================================================
- Coverage     86.985%   37.377%   -49.607%     
+ Complexity     31863     17994     -13869     
================================================
  Files           1943      1975        +32     
  Lines         146768    147415       +647     
  Branches       16223     16225         +2     
================================================
- Hits          127666     55100     -72566     
- Misses         13189     87388     +74199     
+ Partials        5913      4927       -986

Impacted Files	Coverage Δ	Complexity Δ
...s/copynumber/models/AlleleFractionInitializer.java	`89.063% <ø> (ø)`	`17 <0> (ø)`	⬇️
...r/tools/copynumber/models/AlleleFractionState.java	`100% <ø> (ø)`	`7 <0> (ø)`	⬇️
...umber/utils/optimization/PersistenceOptimizer.java	`77.419% <ø> (-10.753%)`	`24 <0> (-4)`
...bender/tools/copynumber/models/CopyRatioState.java	`100% <ø> (ø)`	`5 <0> (ø)`	⬇️
...copynumber/utils/segmentation/KernelSegmenter.java	`95.671% <ø> (-2.164%)`	`45 <0> (-3)`
...s/copynumber/models/AlleleFractionLikelihoods.java	`97.826% <ø> (ø)`	`11 <0> (ø)`	⬇️
...ools/copynumber/models/AlleleFractionModeller.java	`88.406% <ø> (-5.797%)`	`14 <0> (-4)`
...ls/variant/writers/GVCFBlockCombiningIterator.java	`0% <0%> (-100%)`	`0% <0%> (-1%)`
...ls/walkers/genotyper/HeterogeneousPloidyModel.java	`0% <0%> (-100%)`	`0% <0%> (-14%)`
...nder/utils/downsampling/FractionalDownsampler.java	`0% <0%> (-100%)`	`0% <0%> (-17%)`
... and 1353 more

…hods, and deleted some legacy documentation.

samuelklee · 2019-03-11T14:07:04Z

@mwalker174 can you assign someone to review?

mwalker174

Glad to see up to date documentation on gCNV going in. How are users supposed to find the updated pdf? Could we add a link to it in the GermlineCNVCaller doc?

If this is going to be a primary reference there needs to be some way to translate between the HHMM parameters and actual tool arguments. IMO, the current tool argument documentation is too vague (most users probably wonder what the heck coherence length is). I think it would be immensely helpful to give the corresponding HHMM parameters from the pdf. @vdauwera Are greek characters / subscripts permitted in tool argument documentation?

I have a few minor suggestions mostly just to improve clarity. The intro could use some citations in a few places if you have them.

mwalker174 · 2019-03-13T19:25:09Z

docs/CNV/archived/archived-CNV-methods.tex

@@ -86,7 +85,7 @@

 \begin{document}

-\title{Notes on CNV Methods}
+\title{(ARCHIVED) Notes on CNV Methods}


Can we add a DEPRECATED or ARCHIVED watermark to each page?

Just added a header to each page---a little less distracting than a watermark.

mwalker174 · 2019-03-13T19:41:54Z

docs/CNV/germline-cnv-caller-model.tex

+\footnotetext{A slightly edited version of this extended abtract was accepted for poster presentation at \textit{PROBPROG 2018, October 04--06, 2018, Cambridge, MA, USA}.  (Last edited \today.)}
+
+\section{Introduction}
+Copy-number variation (CNV) is a class of genomic {\em structural variation} where the {\em integer copy-number state} of a large genomic region (typically $>$1000 base pairs, spanning several consecutive exons or genes) is altered with respect to a reference genome.  Inferring these copy-number states is an important problem in computational genomics, both for research applications and clinical practice~\cite{zhang2009copy}. Inferring integer CNV states from NGS read-depth data begins with aligning short reads (sequences of $\sim$100 base pairs from a sample genome) to the reference genome and counting the number of reads that align to each region.  Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.


Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.
Citation needed

There are many benchmarking and review papers. We can cite a number of them, e.g.:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S11-S1
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5569469/

Thanks for these. I think I'm OK with fleshing out the intro with citations of reviews, etc. in the full preprint.

mwalker174 · 2019-03-13T19:42:30Z

docs/CNV/germline-cnv-caller-model.tex

+\section{Introduction}
+Copy-number variation (CNV) is a class of genomic {\em structural variation} where the {\em integer copy-number state} of a large genomic region (typically $>$1000 base pairs, spanning several consecutive exons or genes) is altered with respect to a reference genome.  Inferring these copy-number states is an important problem in computational genomics, both for research applications and clinical practice~\cite{zhang2009copy}. Inferring integer CNV states from NGS read-depth data begins with aligning short reads (sequences of $\sim$100 base pairs from a sample genome) to the reference genome and counting the number of reads that align to each region.  Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.
+
+Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}.  CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.


Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.
Citation or explanation needed

This issue has been discussed in CODEX paper in relation to common variants and sensitivity:
https://www.ncbi.nlm.nih.gov/pubmed/25618849

Added a parenthetical to explicitly say this is discussed in that reference (just adding a citation could be misconstrued as pointing out that particular tool as lacking in this regard).

mwalker174 · 2019-03-13T19:43:01Z

docs/CNV/germline-cnv-caller-model.tex

+
+Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}.  CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.
+
+Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity.  Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods.


Here, we propose and implement -> Here we present

mwalker174 · 2019-03-13T19:43:50Z

docs/CNV/germline-cnv-caller-model.tex

+Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity.  Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods.
+
+\section{Summary of Methods}\label{sec:methods}
+\noindent{\em\bf Modeling read-depth ---} We seek to model the data $n_{st}$, the integer count of aligned reads for sample $s=1, 2, \ldots, S$ in genomic region $t=1, 2, \ldots, T$.  A negative-binomial distribution for $n_{st}$ is theoretically and empirically justified; taking $\lambda_{st}$ as the Poisson parameter and $\alpha_{st}$ as the Gamma distribution parameter, we may write


A negative-binomial distribution for $n_{st}$ is theoretically and empirically justified;
Citation needed

I can not think of a citation... perhaps we can briefly explain the idea:

Coverage is a random process of assignment of reads to the genome and as such, can be modeled as a Poisson process with a rate that is proportional to ploidy (copy number) and library prep bias. We can approximate the bias with a GLM w/ sample-specific and target-specific factors but then have to deal with the unexplained variance. The unexplained variance is then modeled with a Gamma-distributed multiplicative factor in the Poisson rate, leading to negative binomial.

I'm just going to leave it at We model $n_{st}$ with a negative-binomial distribution... I think an aside about Poisson/negative binomial processes is probably unnecessary since these are standard choices for modeling count data.

mwalker174 · 2019-03-13T19:56:46Z

docs/CNV/germline-cnv-caller-model.tex

+\begin{multline}
+p(\tau_t \rightarrow \tau_{t+1}) = \exp(-\Delta_{t, t+1}/d_\tau) \, \delta(\tau_t, \tau_{t+1}) \\+ [1 - \exp(-\Delta_{t, t+1}/d_\tau)]\,p(\tau_{t+1}),
+\end{multline}
+where $p(\tau_{t})$ the prior region class probability, $\Delta_{t, t+1}$ is the genomic distance between region $t$ and $t+1$, and $d_\tau$ is the typical size of regions with similar CNV activity rates.


$\Delta_{t, t+1}$ is the genomic distance between region $t$ and $t+1$
How is genomic distance calculated? Between starts? Between mid-points?

Between the midpoints.

mwalker174 · 2019-03-13T20:17:20Z

docs/CNV/germline-cnv-caller-model.tex

+\caption{The performance of GATK gCNV, XHMM \cite{fromer_discovery_2012}, and CODEX \cite{jiang_codex:_2015} in detecting CNV events from WES data against a matched WGS callset obtained using Genome STRiP \cite{handsaker_large_2015}.  Performance metrics are reported per WES target.}  
+\label{fig:benchmark}
+\end{figure}
+Finally, we benchmark GATK gCNV using a cohort of whole-exome sequencing (WES) blood-normal samples. We use a manually validated and FDR-controlled callset obtained from matched whole-genome sequencing (WGS) samples using Genome STRiP~\cite{handsaker_large_2015} as the truth callset. To compare, we also include the CNV calls obtained using two popular CNV calling tools, XHMM \cite{fromer_discovery_2012} and CODEX \cite{jiang_codex:_2015}, in the benchmark. The results are shown in Fig.~\ref{sec:bench}. We find that GATK gCNV yields nearly $20\%$ higher sensitivity and $50\%$ higher specificity over all CNV events compared to the other methods.  Although not presented here, benchmarks that stratify by truth variant frequency also show that GATK gCNV performs favorably on common CNV events. More extensive benchmarking results demonstrate even further improvement, primarily resulting from GATK gCNV hyperparmeter optimization using additional WGS truth callsets, and will be presented elsewhere.


We should at least briefly describe how true positives were matched (some fraction reciprocal overlap?).

Also how were true negatives counted?

We collapse the WGS calls onto WES targets and assign an integer copy number call to each target by quantization (rounding to nearest integer copy number). Now, counting TP, FP, TN, and FN is trivial: if both WGS and WES say CN=2 on an autosomal target, that's a count for TN, if both say CN=3, that's a count for TP, etc.

Great, thanks for confirming. This is already briefly alluded to in the caption, we can add more detail in the preprint (and update for more recent validations, if appropriate).

mwalker174 · 2019-03-13T20:25:32Z

docs/CNV/germline-cnv-caller-model.tex

+\section{Benchmarking}\label{sec:bench}
+\begin{figure}
+\includegraphics[width=\columnwidth]{figures/germline-cnv-caller-model/benchmark.pdf}
+\caption{The performance of GATK gCNV, XHMM \cite{fromer_discovery_2012}, and CODEX \cite{jiang_codex:_2015} in detecting CNV events from WES data against a matched WGS callset obtained using Genome STRiP \cite{handsaker_large_2015}.  Performance metrics are reported per WES target.}  


Are these shown over varying QS? Also are the dotted lines on the right plot f1 isoclines?

Yes, and yes.

Thanks, added a sentence.

samuelklee · 2019-03-14T14:07:15Z

Perhaps we can leave adding more detailed references to the preprint/publication version? I think we will want to significantly expand the intro/motivation sections anyway. @mbabadi can you briefly fill in some of the other details that @mwalker174 asked about? Commenting here or adding commits to this branch are both fine, whichever is easier for you.

samuelklee · 2019-03-21T14:43:20Z

@mwalker174 @mbabadi @droazen might be nice to get this one in before the release and highlight the whitepaper in the release notes.

mbabadi · 2019-03-21T19:15:30Z

@mwalker174 @samuelklee I replies to Mark's questions. Please let me know if you need more help.

samuelklee · 2019-03-21T19:36:47Z

Excellent, thanks! I'll add to the tex sometime tomorrow.

samuelklee · 2019-03-22T16:45:15Z

Thanks @mbabadi, back to you @mwalker174!

mwalker174 · 2019-03-22T18:44:43Z

Thanks for answering my questions @mbabadi and @samuelklee. Merge when ready.

samuelklee requested a review from mwalker174 March 6, 2019 20:44

samuelklee added 2 commits March 7, 2019 00:25

Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV met…

09eadc9

…hods, and deleted some legacy documentation.

Updated links for documentation on topological persistence.

4ad1d58

samuelklee force-pushed the sl_cnv_docs branch 2 times, most recently from 33824ef to 4ad1d58 Compare March 7, 2019 14:34

mwalker174 self-assigned this Mar 11, 2019

mwalker174 requested changes Mar 13, 2019

View reviewed changes

Added ARCHIVED headers.

8262ffd

Addressed PR comments.

00e8c29

mwalker174 approved these changes Mar 22, 2019

View reviewed changes

samuelklee merged commit 02b95b3 into master Mar 22, 2019

samuelklee deleted the sl_cnv_docs branch March 22, 2019 19:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

samuelklee commented Feb 27, 2019

samuelklee commented Feb 27, 2019 •

edited

Loading

codecov-io commented Mar 6, 2019 •

edited

Loading

samuelklee commented Mar 11, 2019

mwalker174 left a comment

mwalker174 Mar 13, 2019

samuelklee Mar 14, 2019

mwalker174 Mar 13, 2019

mbabadi Mar 21, 2019

samuelklee Mar 22, 2019

mwalker174 Mar 13, 2019

mbabadi Mar 21, 2019

samuelklee Mar 22, 2019

mwalker174 Mar 13, 2019

samuelklee Mar 22, 2019

mwalker174 Mar 13, 2019

mbabadi Mar 21, 2019

samuelklee Mar 22, 2019

mwalker174 Mar 13, 2019

mbabadi Mar 21, 2019

samuelklee Mar 22, 2019

mwalker174 Mar 13, 2019

mwalker174 Mar 13, 2019

mbabadi Mar 21, 2019

samuelklee Mar 22, 2019 •

edited

Loading

mwalker174 Mar 13, 2019

mbabadi Mar 21, 2019

samuelklee Mar 22, 2019

samuelklee commented Mar 14, 2019

samuelklee commented Mar 21, 2019

mbabadi commented Mar 21, 2019

samuelklee commented Mar 21, 2019

samuelklee commented Mar 22, 2019

mwalker174 commented Mar 22, 2019


		Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}. CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.

		Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity. Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods.

Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

Conversation

samuelklee commented Feb 27, 2019

samuelklee commented Feb 27, 2019 • edited Loading

codecov-io commented Mar 6, 2019 • edited Loading

Codecov Report

samuelklee commented Mar 11, 2019

mwalker174 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samuelklee Mar 22, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samuelklee commented Mar 14, 2019

samuelklee commented Mar 21, 2019

mbabadi commented Mar 21, 2019

samuelklee commented Mar 21, 2019

samuelklee commented Mar 22, 2019

mwalker174 commented Mar 22, 2019

samuelklee commented Feb 27, 2019 •

edited

Loading

codecov-io commented Mar 6, 2019 •

edited

Loading

samuelklee Mar 22, 2019 •

edited

Loading