Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

Merged
merged 4 commits into from
Mar 22, 2019

Conversation

samuelklee
Copy link
Contributor

Closes #3005.

@samuelklee
Copy link
Contributor Author

samuelklee commented Feb 27, 2019

@vdauwera note that this modifies the path of the CNV methods doc (which is mostly out of date, but is still linked to in some Comms materials) from docs/CNVs/CNV-methods.pdf to docs/CNV/archived/archived-CNV-methods.pdf.

The extended abstract on gCNV is a very technical description of the probabilistic model, but it might be worth referencing it for interested users until a preprint or publication is available.

@samuelklee samuelklee requested a review from mwalker174 March 6, 2019 20:44
@codecov-io
Copy link

codecov-io commented Mar 6, 2019

Codecov Report

Merging #5732 into master will decrease coverage by 49.607%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##              master     #5732        +/-   ##
================================================
- Coverage     86.985%   37.377%   -49.607%     
+ Complexity     31863     17994     -13869     
================================================
  Files           1943      1975        +32     
  Lines         146768    147415       +647     
  Branches       16223     16225         +2     
================================================
- Hits          127666     55100     -72566     
- Misses         13189     87388     +74199     
+ Partials        5913      4927       -986
Impacted Files Coverage Δ Complexity Δ
...s/copynumber/models/AlleleFractionInitializer.java 89.063% <ø> (ø) 17 <0> (ø) ⬇️
...r/tools/copynumber/models/AlleleFractionState.java 100% <ø> (ø) 7 <0> (ø) ⬇️
...umber/utils/optimization/PersistenceOptimizer.java 77.419% <ø> (-10.753%) 24 <0> (-4)
...bender/tools/copynumber/models/CopyRatioState.java 100% <ø> (ø) 5 <0> (ø) ⬇️
...copynumber/utils/segmentation/KernelSegmenter.java 95.671% <ø> (-2.164%) 45 <0> (-3)
...s/copynumber/models/AlleleFractionLikelihoods.java 97.826% <ø> (ø) 11 <0> (ø) ⬇️
...ools/copynumber/models/AlleleFractionModeller.java 88.406% <ø> (-5.797%) 14 <0> (-4)
...ls/variant/writers/GVCFBlockCombiningIterator.java 0% <0%> (-100%) 0% <0%> (-1%)
...ls/walkers/genotyper/HeterogeneousPloidyModel.java 0% <0%> (-100%) 0% <0%> (-14%)
...nder/utils/downsampling/FractionalDownsampler.java 0% <0%> (-100%) 0% <0%> (-17%)
... and 1353 more

@samuelklee samuelklee force-pushed the sl_cnv_docs branch 2 times, most recently from 33824ef to 4ad1d58 Compare March 7, 2019 14:34
@samuelklee
Copy link
Contributor Author

@mwalker174 can you assign someone to review?

@mwalker174 mwalker174 self-assigned this Mar 11, 2019
Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see up to date documentation on gCNV going in. How are users supposed to find the updated pdf? Could we add a link to it in the GermlineCNVCaller doc?

If this is going to be a primary reference there needs to be some way to translate between the HHMM parameters and actual tool arguments. IMO, the current tool argument documentation is too vague (most users probably wonder what the heck coherence length is). I think it would be immensely helpful to give the corresponding HHMM parameters from the pdf. @vdauwera Are greek characters / subscripts permitted in tool argument documentation?

I have a few minor suggestions mostly just to improve clarity. The intro could use some citations in a few places if you have them.

@@ -86,7 +85,7 @@

\begin{document}

\title{Notes on CNV Methods}
\title{(ARCHIVED) Notes on CNV Methods}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a DEPRECATED or ARCHIVED watermark to each page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a header to each page---a little less distracting than a watermark.

\footnotetext{A slightly edited version of this extended abtract was accepted for poster presentation at \textit{PROBPROG 2018, October 04--06, 2018, Cambridge, MA, USA}. (Last edited \today.)}

\section{Introduction}
Copy-number variation (CNV) is a class of genomic {\em structural variation} where the {\em integer copy-number state} of a large genomic region (typically $>$1000 base pairs, spanning several consecutive exons or genes) is altered with respect to a reference genome. Inferring these copy-number states is an important problem in computational genomics, both for research applications and clinical practice~\cite{zhang2009copy}. Inferring integer CNV states from NGS read-depth data begins with aligning short reads (sequences of $\sim$100 base pairs from a sample genome) to the reference genome and counting the number of reads that align to each region. Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.
Citation needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these. I think I'm OK with fleshing out the intro with citations of reviews, etc. in the full preprint.

\section{Introduction}
Copy-number variation (CNV) is a class of genomic {\em structural variation} where the {\em integer copy-number state} of a large genomic region (typically $>$1000 base pairs, spanning several consecutive exons or genes) is altered with respect to a reference genome. Inferring these copy-number states is an important problem in computational genomics, both for research applications and clinical practice~\cite{zhang2009copy}. Inferring integer CNV states from NGS read-depth data begins with aligning short reads (sequences of $\sim$100 base pairs from a sample genome) to the reference genome and counting the number of reads that align to each region. Calling CNV events from these counts is a challenging problem due to strong systematic biases, which can arise from batch effects in sample preparation and sequencing library preparation protocols, variation in sequencing efficiency across genomic regions, and numerous other hidden processes.

Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}. CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.
Citation or explanation needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue has been discussed in CODEX paper in relation to common variants and sensitivity:
https://www.ncbi.nlm.nih.gov/pubmed/25618849

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a parenthetical to explicitly say this is discussed in that reference (just adding a citation could be misconstrued as pointing out that particular tool as lacking in this regard).


Many previous methods for CNV detection from read-depth data attempt to remove systematic biases via PCA denoising \cite{fromer_discovery_2012} or regression~\cite{jiang_codex:_2015, klambauer_cn.mops:_2012}, or try to obviate the issue by pre-clustering samples and genomic regions~\cite{handsaker_large_2015, packer_clamms:_2016}. CNVs are subsequently detected using hidden Markov models (HMM) or non-parametric change-point detection algorithms~\cite{olshen_circular_2004}. Crucially, these methods suffer from a lack of self-consistency between data normalization and event detection, which results in inadvertent removal of signal in the former and decreased sensitivity in the latter.

Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity. Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we propose and implement -> Here we present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Here, we propose and implement GATK gCNV, a principled Bayesian approach for learning global and sample-specific biases of read-depth data from large cohorts while {\em simultaneously} inferring copy-number states. Our model combines a negative-binomial factor analysis module for learning batch effects with a hierarchical HMM (HHMM) for detecting both sample-specific CNV events and global regions of high and low CNV activity. Self-consistency between bias modeling and CNV calling greatly improves the performance of our algorithm with respect to existing methods.

\section{Summary of Methods}\label{sec:methods}
\noindent{\em\bf Modeling read-depth ---} We seek to model the data $n_{st}$, the integer count of aligned reads for sample $s=1, 2, \ldots, S$ in genomic region $t=1, 2, \ldots, T$. A negative-binomial distribution for $n_{st}$ is theoretically and empirically justified; taking $\lambda_{st}$ as the Poisson parameter and $\alpha_{st}$ as the Gamma distribution parameter, we may write
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A negative-binomial distribution for $n_{st}$ is theoretically and empirically justified;
Citation needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not think of a citation... perhaps we can briefly explain the idea:

Coverage is a random process of assignment of reads to the genome and as such, can be modeled as a Poisson process with a rate that is proportional to ploidy (copy number) and library prep bias. We can approximate the bias with a GLM w/ sample-specific and target-specific factors but then have to deal with the unexplained variance. The unexplained variance is then modeled with a Gamma-distributed multiplicative factor in the Poisson rate, leading to negative binomial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just going to leave it at We model $n_{st}$ with a negative-binomial distribution... I think an aside about Poisson/negative binomial processes is probably unnecessary since these are standard choices for modeling count data.

\begin{multline}
p(\tau_t \rightarrow \tau_{t+1}) = \exp(-\Delta_{t, t+1}/d_\tau) \, \delta(\tau_t, \tau_{t+1}) \\+ [1 - \exp(-\Delta_{t, t+1}/d_\tau)]\,p(\tau_{t+1}),
\end{multline}
where $p(\tau_{t})$ the prior region class probability, $\Delta_{t, t+1}$ is the genomic distance between region $t$ and $t+1$, and $d_\tau$ is the typical size of regions with similar CNV activity rates.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$\Delta_{t, t+1}$ is the genomic distance between region $t$ and $t+1$
How is genomic distance calculated? Between starts? Between mid-points?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between the midpoints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

\caption{The performance of GATK gCNV, XHMM \cite{fromer_discovery_2012}, and CODEX \cite{jiang_codex:_2015} in detecting CNV events from WES data against a matched WGS callset obtained using Genome STRiP \cite{handsaker_large_2015}. Performance metrics are reported per WES target.}
\label{fig:benchmark}
\end{figure}
Finally, we benchmark GATK gCNV using a cohort of whole-exome sequencing (WES) blood-normal samples. We use a manually validated and FDR-controlled callset obtained from matched whole-genome sequencing (WGS) samples using Genome STRiP~\cite{handsaker_large_2015} as the truth callset. To compare, we also include the CNV calls obtained using two popular CNV calling tools, XHMM \cite{fromer_discovery_2012} and CODEX \cite{jiang_codex:_2015}, in the benchmark. The results are shown in Fig.~\ref{sec:bench}. We find that GATK gCNV yields nearly $20\%$ higher sensitivity and $50\%$ higher specificity over all CNV events compared to the other methods. Although not presented here, benchmarks that stratify by truth variant frequency also show that GATK gCNV performs favorably on common CNV events. More extensive benchmarking results demonstrate even further improvement, primarily resulting from GATK gCNV hyperparmeter optimization using additional WGS truth callsets, and will be presented elsewhere.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least briefly describe how true positives were matched (some fraction reciprocal overlap?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also how were true negatives counted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We collapse the WGS calls onto WES targets and assign an integer copy number call to each target by quantization (rounding to nearest integer copy number). Now, counting TP, FP, TN, and FN is trivial: if both WGS and WES say CN=2 on an autosomal target, that's a count for TN, if both say CN=3, that's a count for TP, etc.

Copy link
Contributor Author

@samuelklee samuelklee Mar 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for confirming. This is already briefly alluded to in the caption, we can add more detail in the preprint (and update for more recent validations, if appropriate).

\section{Benchmarking}\label{sec:bench}
\begin{figure}
\includegraphics[width=\columnwidth]{figures/germline-cnv-caller-model/benchmark.pdf}
\caption{The performance of GATK gCNV, XHMM \cite{fromer_discovery_2012}, and CODEX \cite{jiang_codex:_2015} in detecting CNV events from WES data against a matched WGS callset obtained using Genome STRiP \cite{handsaker_large_2015}. Performance metrics are reported per WES target.}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these shown over varying QS? Also are the dotted lines on the right plot f1 isoclines?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added a sentence.

@samuelklee
Copy link
Contributor Author

Perhaps we can leave adding more detailed references to the preprint/publication version? I think we will want to significantly expand the intro/motivation sections anyway. @mbabadi can you briefly fill in some of the other details that @mwalker174 asked about? Commenting here or adding commits to this branch are both fine, whichever is easier for you.

@samuelklee
Copy link
Contributor Author

@mwalker174 @mbabadi @droazen might be nice to get this one in before the release and highlight the whitepaper in the release notes.

@mbabadi
Copy link
Contributor

mbabadi commented Mar 21, 2019

@mwalker174 @samuelklee I replies to Mark's questions. Please let me know if you need more help.

@samuelklee
Copy link
Contributor Author

Excellent, thanks! I'll add to the tex sometime tomorrow.

@samuelklee
Copy link
Contributor Author

Thanks @mbabadi, back to you @mwalker174!

@mwalker174
Copy link
Contributor

Thanks for answering my questions @mbabadi and @samuelklee. Merge when ready.

@samuelklee samuelklee merged commit 02b95b3 into master Mar 22, 2019
@samuelklee samuelklee deleted the sl_cnv_docs branch March 22, 2019 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants