% Use only LaTeX2e, calling the article.cls class and 12-point type.
\documentclass[12pt]{article}
% Users of the {thebibliography} environment or BibTeX should use the
% scicite.sty package, downloadable from *Science* at
% http://www.sciencemag.org/authors/preparing-manuscripts-using-latex
% This package should properly format in-text
% reference calls and reference-list numbers.
\usepackage{scicite}
\usepackage{times}
% Custom packages.
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{hyperref}
% The preamble here sets up a lot of new/revised commands and
% environments. It's annoying, but please do *not* try to strip these
% out into a separate .sty file (which could lead to the loss of some
% information when we convert the file to other formats). Instead, keep
% them in the preamble of your main LaTeX source file.
% The following parameters seem to provide a reasonable page setup.
\topmargin 0.0cm
\oddsidemargin 0.2cm
\textwidth 16cm
\textheight 21cm
\footskip 1.0cm
% Custom commands.
\newcommand \TODO \fbox
\newcommand \blank {{\,\pmb\cdot\,}}
%The next command sets up an environment for the abstract to your paper.
\newenvironment{sciabstract}{%
\begin{quote} \bf}
{\end{quote}}
% Include your paper's title here
\title{Supplementary material}
% Place the author information here. Please hand-code the contact
% information and notecalls; do *not* use \footnote commands. Let the
% author contact information appear immediately below the author names
% as shown. We would also prefer that you don't change the type-size
% settings shown here.
\author
{
% John Smith,$^{1\ast}$ Jane Doe,$^{1}$ Joe Scientist$^{2}$\\
% \\
% \normalsize{$^{1}$Department of Chemistry, University of Wherever,}\\
% \normalsize{An Unknown Address, Wherever, ST 00000, USA}\\
% \normalsize{$^{2}$Another Unknown Address, Palookaville, ST 99999, USA}\\
% \\
% \normalsize{$^\ast$To whom correspondence should be addressed; E-mail: [email protected].}
}
% Include the date command, but leave its argument blank.
\date{}
%%%%%%%%%%%%%%%%% END OF PREAMBLE %%%%%%%%%%%%%%%%
\begin{document}
% Double-space the manuscript.
\baselineskip24pt
% Make the title.
\maketitle
% Place your abstract within the special {sciabstract} environment.
% \begin{sciabstract}
% This document presents a number of hints about how to set up your
% {\it Science\/} paper in \LaTeX\ . We provide a template file,
% \texttt{scifile.tex}, that you can use to set up the \LaTeX\ source
% for your article. An example of the style is the special
% \texttt{\{sciabstract\}} environment used to set up the abstract you
% see here.
% \end{sciabstract}
% In setting up this template for *Science* papers, we've used both
% the \section* command and the \paragraph* command for topical
% divisions. Which you use will of course depend on the type of paper
% you're writing. Review Articles tend to have displayed headings, for
% which \section* is more appropriate; Research Articles, when they have
% formal topical divisions at all, tend to signal them with bold text
% that runs into the paragraph, for which \paragraph* is the right
% choice. Either way, use the asterisk (*) modifier, as shown, to
% suppress numbering.
\section*{Materials and methods}
\paragraph*{Data and Code Availability}
Source code for data preprocessing and modeling is available at
\url{https://github.com/broadinstitute/pyro-cov}.
GISAID sequence data is publicly available at
\url{https://gisaid.org}.
PANGO lineage aliases are available at \url{https://cov-lineages.org/} (website source code at \url{https://github.com/cov-lineages/lineages-website}), with the alias definitions themselves maintained at \url{https://github.com/cov-lineages/pango-designation}.
\paragraph*{Data Preparation}
We downloaded 2,231,068 samples from GISAID
% determined by: wc -l ~/data/gisaid/provision.json
\cite{elbe2017gisaid} on 2021-07-06.
Each sample record includes labels for time, location, PANGO lineage annotation \cite{rambaut2020dynamic}, and genetic sequence.
We discard records with missing time, location, or lineage; 2,161,248 records remain.
We call mutations using the Nextclade tool \cite{aksamentov2020nextclade}, discarding sequences whose alignment quality is not reported as ``good'' (sequences discarded in this step are excluded from the mutation features $X_{sf}$ but are still included in the counts $y_{tps}$), and discarding the seven lineages with fewer than 5 good alignments; 1281 lineages remain.
Because PANGO lineages are not genetically homogeneous (there is small variation within each lineage), we create continuous $[0, 1]$-valued features $X_{sf}$ denoting, for each lineage (``strain'') $s$ and mutation (``feature'') $f$, the fraction of samples in that lineage exhibiting the mutation.
We discard mutations that do not occur in the majority of samples in any single lineage (i.e.~features $f$ with $X_{sf}<\tfrac{1}{2}$ for all $s$); 2337 amino acid mutations pass this threshold.
We bin time into 14-day intervals, choosing a multiple of 7 to minimize weekly seasonality but binning coarser than a week to reduce memory requirements; this results in 42 time bins.
Because sample counts vary widely across GISAID geographic regions (by as much as five orders of magnitude), we aggregate regions into coarser partitions: each country counts as a region, and any first-level subregion of a country counts as a region if it has at least 50 samples; otherwise it is aggregated into a whole-country bin.
Note that this means a country may be split into its larger subregions, with its smaller subregions subsumed into an aggregate country-level bin.
We then drop regions without samples in at least two different time intervals, resulting in 1070 regions in total.
After preprocessing, the model input data are an array $y_{tps}\in\mathbb N$ of counts with shape $T\times P\times S = 42 \times 1070 \times 1281$ (this array is sparse, but not along any of its two-dimensional marginals), and an array $X_{sf}\in[0,1]$ of mutation features with shape $S\times F = 1281 \times 2337$.
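As a concrete illustration of the feature construction, the following sketch computes the lineage-level mutation fractions $X_{sf}$ and applies the majority threshold; it assumes a hypothetical pandas DataFrame \texttt{df} with one row per sample and columns \texttt{lineage} and \texttt{mutations} (the list of amino acid mutations called for that sample), and is not the exact preprocessing code in the repository.
\begin{verbatim}
# Minimal sketch of the X_{sf} feature construction (illustrative only).
# Assumes a DataFrame `df` with columns "lineage" (str) and "mutations"
# (list of amino acid mutation strings per sample).
import pandas as pd

def lineage_features(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode mutations per sample, then average within each lineage
    # to obtain the fraction of samples in lineage s carrying mutation f.
    onehot = df["mutations"].str.join("|").str.get_dummies()
    X = onehot.groupby(df["lineage"]).mean()  # rows: lineages, cols: mutations
    # Keep only mutations present in a majority of samples of some lineage.
    return X.loc[:, (X >= 0.5).any(axis=0)]
\end{verbatim}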
\paragraph*{Probabilistic Model}
% Comment by Sagar Gosai:
% Another way to help justify your choices of priors and conditional
% likelihoods would be to generate prior predictive samples for the final
% counts and see how well those match the marginal distribution of the data.
%
% This paper by Gabry et. al. provides some nice front-end strategies to
% support your modeling choices:
% https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12378.
We model relative lineage growth with a hierarchical Bayesian regression model with a multinomial likelihood.
Arrays in the model are indexed by one or more of: $T{=}42$ time steps $t$; $S{=}1281$ PANGO lineages (``strains'') $s$; $P{=}1070$ regions (``places'') $p$; and $F{=}2337$ amino acid mutations (``features'') $f$.
The model, shown below, regresses lineage counts $y_{tps}\in\mathbb N$ in each time-region-lineage bin against amino acid mutation covariates $X_{sf} \in [0,1]$.
The variables $y$ and $X$ are observed and all other variables in the model are latent. Each latent variable is governed by a prior distribution.
The full model is specified as follows:
%%%
\begin{align*}
\textstyle
\alpha_s &\sim \operatorname{Normal}(0, \sigma_1) &
\sigma_1 &\sim \operatorname{LogNormal}(0, 2) \\
\alpha_{ps} &\sim \operatorname{Normal}(\alpha_s, \sigma_2) &
\sigma_2 &\sim \operatorname{LogNormal}(0, 2) \\
\beta_f &\sim \operatorname{Logistic}(0,\, \sigma_3) &
\sigma_3 &= \frac{1}{200} \\
\beta_{ps} &\sim \operatorname{Normal}\Bigl(
\sum_f \beta_f X_{sf},\, \sigma_4
\Bigr) &
\sigma_4 &\sim \operatorname{LogNormal}(-4, 2) \\
\underline{y_{tps}} &\sim \operatorname{Multinomial}\Bigl(
\sum_s y_{tps},\, \operatorname{softmax}(\alpha_{p\blank} + t\beta_{p\blank}/\tau)_s
\Bigr)
\end{align*}
%%%
The proportion of lineages in a single time-region bin is modeled as a Multinomial distribution whose probability parameter is a multivariate logistic growth function $\operatorname{softmax}(\alpha_{p\blank} + t\beta_{p\blank}/\tau)$ with intercept $\alpha_{ps}$ and slope $\beta_{ps}$ in units of generation time $\tau=5.5$ days, where $\operatorname{softmax}$ inputs and outputs vectors and is defined as
${
\operatorname{softmax}(x)_i = \frac {\exp(x_i)} {\sum_j \exp(x_j)}
}$, and where the dot subscripts $\alpha_{p\blank}\in\mathbb R^S$ and $\beta_{p\blank}\in\mathbb R^S$ denote vectors over lineages.
Early iterations of the model used overdispersed likelihoods such as Dirichlet-Multinomial to account for additional variability not directly encoded in the generative process.
However, we found that we can obtain much more accurate model predictions by using a Multinomial likelihood and accounting for model misfit by adding hierarchical structure elsewhere.
The intercepts $\alpha_{ps}$ denote initial relative log prevalence of lineage $s$ in region $p$; these are modeled hierarchically around global relative log prevalences $\alpha_s$ of each lineage.
The slopes $\beta_{ps}$ are modeled hierarchically around global per-lineage growth rates $\sum_f \beta_f X_{sf}$ that are linearly regressed against amino acid mutation features $X_{sf}$.
These linear coefficients $\beta_f$ can be directly interpreted as the effect of a mutation on a lineage's growth rate, all else being equal.
In figures we plot ${\mathbb E\left[\beta_f\right] =\hspace{-0.1ex}: \Delta\log R}$ as an estimate of effect size and plot the z score
${
|\mathbb E\left[\beta_f\right]| / \sqrt{\mathbb V\left[\beta_f\right]} =\hspace{-0.1ex}: |\mu|/\sigma
}$
as a proxy for statistical significance.
Note that by regressing against mutations we obviate the need to directly incorporate phylogenetic information into the model: if two lineages are close together in a phylogeny, then their mutation features are likely also similar, so their regressed growth rates will likely be similar.
By sharing statistical strength in this way we are also able to make more accurate predictions for emergent lineages with few observations.
Both of these hierarchies in $\alpha$ and $\beta$ improve model fit in the presence of heavily skewed observations (e.g.~most samples are from the UK, and there is a long tail of sparsely sampled regions).
We place weak priors on the scale parameters $\sigma_1$, $\sigma_2$, and $\sigma_4$ (these denote standard deviations, i.e.~the square roots of the prior variances).
The $\sigma_4$ prior is centered around the smaller value $e^{-4}\approx 0.018$ because we expect little variation of relative growth rate across geographic regions a priori.
We fix the linear regression scale parameter $\sigma_3$ to a small value, forcing the regression problem towards a sparse solution (i.e.~we assume a priori that most observed mutations have little effect on growth rate).
We choose a Logistic prior on regression coefficients because it is heavier-tailed than a Normal prior, but not so heavy-tailed that the regression problem becomes multimodal (as it would for e.g.~a Cauchy or Student's t prior).
Like the Laplace distribution, the Logistic distribution's exponential tails are maximally heavy while ensuring the conditional log density is concave, leading to a conditionally unimodal posterior and robust inference.
Unlike the Laplace distribution, the Logistic distribution is smooth, with density given by
$$
\operatorname{Logistic}(x;\mu,s) = \frac
{\exp\left(-\frac{x-\mu}{s}\right)}
{s\left(1 + \exp\left(-\frac{x-\mu}{s}\right)\right)^2}.
$$
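For completeness, the log-concavity claim can be verified directly: writing $z=(x-\mu)/s$ and $\sigma(z)=1/(1+e^{-z})$, the log density is $-z - \log s - 2\log(1+e^{-z})$, whose second derivative in $x$ is
$$
\frac{d^2}{dx^2}\log\operatorname{Logistic}(x;\mu,s)
= -\frac{2}{s^2}\,\sigma(z)\bigl(1-\sigma(z)\bigr) < 0.
$$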
% This small value was chosen based on 2-fold cross validation.
This proportional-growth model differs from many forecasting models in the literature that are formulated in terms of absolute sample counts.
We choose to model relative portions rather than absolute counts because the relative model is robust to a number of sources of bias, including:
sampling bias across regions (e.g.~one region samples 1000x more than another);
sampling bias over time (e.g.~change in sampling rate over time); and
change in absolute growth rate of all lineages, in any (time, region) bin (e.g.~due to changes in local policies or weather, as long as those changes affect all lineages equally).
However, the model is susceptible to the following sources of bias:
biased sampling within any (time, region) cell (e.g.~sequencing only in case of S-gene target failure); and changes in sampling bias within a single region over time (e.g.~a country initially has a sequencing lab in only one city and later spins up a second lab in a distant city with different lineage proportions).
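For concreteness, the generative model above can be sketched in the Pyro probabilistic programming language (used for our implementation, as described in the next section). The sketch below is illustrative rather than the exact code in the repository: it writes the Multinomial term as a \texttt{pyro.factor} statement (dropping the constant multinomial coefficient), assumes the \texttt{Logistic} distribution available in recent Pyro releases, and assumes time measured in days with 14-day bins.
\begin{verbatim}
# Illustrative Pyro sketch of the hierarchical multinomial regression
# (not the exact code in the pyro-cov repository).
import torch
import pyro
import pyro.distributions as dist

TAU = 5.5  # generation time in days

def model(X, y):
    # X: [S, F] mutation features in [0, 1];  y: [T, P, S] counts.
    T, P, S = y.shape
    F = X.shape[-1]
    time = 14.0 * torch.arange(float(T))  # days, assuming 14-day bins

    sigma1 = pyro.sample("sigma1", dist.LogNormal(0.0, 2.0))
    sigma2 = pyro.sample("sigma2", dist.LogNormal(0.0, 2.0))
    sigma4 = pyro.sample("sigma4", dist.LogNormal(-4.0, 2.0))

    lineage = pyro.plate("lineage", S, dim=-1)
    place = pyro.plate("place", P, dim=-2)
    feature = pyro.plate("feature", F, dim=-1)

    with lineage:
        alpha_s = pyro.sample("alpha_s", dist.Normal(0.0, sigma1))        # [S]
    with feature:
        beta_f = pyro.sample("beta_f", dist.Logistic(0.0, 1.0 / 200))     # [F]
    with place, lineage:
        alpha_ps = pyro.sample("alpha_ps", dist.Normal(alpha_s, sigma2))  # [P, S]
        beta_ps = pyro.sample("beta_ps", dist.Normal(X @ beta_f, sigma4)) # [P, S]

    # Multinomial likelihood, written as a factor so that bins with
    # different total counts share one statement (constant term dropped).
    logits = alpha_ps + time[:, None, None] * beta_ps / TAU               # [T, P, S]
    pyro.factor("obs", (y * logits.log_softmax(-1)).sum())
\end{verbatim}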
\paragraph*{Probabilistic Inference}
The model is implemented in the Pyro probabilistic programming language \cite{bingham2019pyro} built on PyTorch \cite{paszke2017automatic}.
To fit an approximate joint posterior distribution over all latent variables (a space of dimension 2,744,961), we train a flexible reparametrized variational distribution using stochastic variational inference.
Our variational approach starts by reparametrizing the model via a sequence of learnable but distribution-preserving transforms: decentering transforms \cite{gorinova2020automatic} on the $\alpha$ and $\beta$ latent variables, and a learnable per-region per-lineage time shift in each linear function $\alpha_{ps}+t\beta_{ps}/\tau$.
Reparametrizing is particularly helpful in avoiding Neal's funnel pathologies \cite{neal2003slice} by smoothing out the geometry of latent variables with a Normal prior whose scale parameter is itself a latent variable.
After reparametrizing, we model the posterior over the reparametrized linear coefficients $\beta_f$ as a low-rank multivariate Normal distribution (rank-200 covariance plus diagonal noise), and model all remaining latent variables with mean-field transformed Normal distributions.
The low-rank multivariate Normal distribution on $\beta_f$ allows inference to capture correlated posterior uncertainty among competing mutations, each of which might explain an increased growth rate.
The combined variational distribution has 7,334,172 parameters.
Variational inference is run for 10,000 iterations using the Adam optimizer with gradient clipping, an exponentially decaying learning rate schedule, and initial learning rates between 0.05 and 0.0025 for different parameter groups.
Optimization proceeds in batch-mode, i.e.~without any data subsampling.
We initialize model parameters to median prior values with a small amount of noise added to avoid scale parameters collapsing early in training.
After inference we make predictions by drawing 1000 posterior samples.
See source code for detailed optimizer and initialization configuration.
Inference and prediction on a single GPU (NVIDIA Tesla P100 with 16GB of RAM) takes 10 minutes, which is about the same amount of time required to download and preprocess each daily snapshot of data from GISAID.
Inference cost is $O((TP+F)S)$ but does not depend directly on the number of genetic samples, since samples are aggregated into counts $y$ of constant shape $T\times P\times S$.
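The following sketch illustrates this training and prediction loop with Pyro's built-in components; it is a simplification of the actual configuration, substituting a single \texttt{AutoLowRankMultivariateNormal} guide and a single learning rate for the custom reparametrized guide and per-group learning rates described above, with illustrative optimizer settings.
\begin{verbatim}
# Illustrative SVI loop (a simplification of the actual configuration);
# `model`, `X`, and `y` are as in the model sketch above.
from pyro.infer import SVI, Trace_ELBO, Predictive
from pyro.infer.autoguide import AutoLowRankMultivariateNormal
from pyro.optim import ClippedAdam

guide = AutoLowRankMultivariateNormal(model, rank=200)
# ClippedAdam clips gradient norms and decays the learning rate by a
# constant factor per step; lrd below gives roughly a 10x decay over
# 10,000 steps (illustrative values).
optim = ClippedAdam({"lr": 0.05, "lrd": 0.1 ** (1 / 10000), "clip_norm": 10.0})
svi = SVI(model, guide, optim, Trace_ELBO())
for step in range(10000):
    loss = svi.step(X, y)

# Posterior summaries for the mutation coefficients beta_f.
predictive = Predictive(model, guide=guide, num_samples=1000,
                        return_sites=["beta_f"])
beta_f = predictive(X, y)["beta_f"].squeeze()
delta_log_R = beta_f.mean(0)                    # effect size E[beta_f]
z_score = beta_f.mean(0).abs() / beta_f.std(0)  # |mu| / sigma
\end{verbatim}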
\paragraph*{Validation}
We considered the possibility of biased submission to the GISAID database, and compared results obtained from the full dataset with results obtained from disjoint subsets.
We partition the data for 2-fold cross-validation, splitting the world into Europe and the remaining countries.
This split is motivated by the fact that most samples originate from the UK: we widened the region around the UK until the region and its complement both had roughly equivalent statistical strength and narrow posterior estimates.
Restricting to CDC data or to CDC's randomly sampled NS3 dataset appears to provide insufficient diversity and leads to unclear results (Pearson correlations 0.49 and 0.28, respectively).
Our model assumes that each single point mutation contributes independently and linearly to the change in growth rate.
A natural generalization is to search for groups of mutations that only affect growth rate as a whole.
To explore this we fit a similar model of both single and pair mutations, considering only pairs that lie within the same gene.
Fitting this model discovered no mutation pairs with effects stronger than those of the top 100 single mutations.
We suspect the growth rate data are underpowered to discover higher-order interactions between mutations.
% TODO Check this on more recent data.
\bibliography{main}
\bibliographystyle{Science}
\end{document}