# District-Party Ideology and Primary Outcomes {#ch:voting}
$\renewcommand{\ind}[0]{\perp \!\!\! \perp}$
$\renewcommand{\doop}[1]{\mathit{do}\left(#1\right)}$
$\renewcommand{\diff}[1]{\, \mathrm{d}#1}$
$\renewcommand{\E}[1]{\mathbb{E}\left[#1\right]}$
$\renewcommand{\p}[1]{p\left(#1\right)}$
```{r stopper, eval = FALSE, cache = FALSE, include = FALSE}
knitr::knit_exit()
```
```{r knitr-05-1-voting, include = FALSE, cache = FALSE}
source(here::here("assets-bookdown", "knitr-helpers.R"))
```
```{r}
library("here")
library("magrittr")
library("tidyverse")
library("broom")
library("ggdag")
library("ggtext")
library("scales")
library("latex2exp")
library("patchwork")
library("ggforce")
```
How does district-party ideology matter for primary outcomes?
The strategic positioning dilemma theory predicts that candidates position themselves as a compromise between the district electorate and the partisan primary electorate.
They care about the primary electorate to fend off competition in primaries, either by deterring primary competitors from running at all or by being the "best fitting" candidate to represent the district-party constituency.
I find in Chapter \@ref(ch:positioning) that district-party ideology does affect candidate positioning, even after controlling for aggregate partisan voting in the district.
This chapter asks whether this effort by candidates to position themselves toward their partisan constituency ultimately helps them win primary elections.
Do more conservative districts nominate more conservative nominees?
I argue that this research question presents problems for statistical modeling and causal inference that come from the same basic limitation of the data: district-party ideology does not vary across candidates in the same district.
While it is possible to measure the correlation between district-party ideology and the CF score of the primary _winner_, this method selects on the dependent variable.
We already know from Chapter \@ref(ch:positioning) that candidates are generally more conservative when the electorate is more conservative, so the simple correlation between district-party ideology and nominee ideology does not capture whether more conservative electorates prefer more conservative candidates, _conditional on the set of available candidates_ in the primary.
Conditioning on a primary race, however, also conditions on district-party ideology, removing all statistical (and therefore causal) variation in district-party ideology across candidates in the same district.
To understand the role of district-party ideology in primary outcomes, we must reframe the research question around causal quantities that are actually identifiable.
I confront these statistical and causal challenges using an augmented conditional logit modeling approach.
Traditionally, conditional logit is a model that predicts discrete choice based on covariates that vary across _alternatives_ within a choice set (candidates), holding data on the chooser (the electorate) constant within the choice set.
This modeling limitation means that chooser-level features, such as district-party ideology, cannot directly affect candidate choice, although they can affect it indirectly.
I discuss these indirect effects below, and I devise a statistical model that flexibly estimates a related causal quantity: the causal effect of candidate ideology on primary outcomes, with heterogeneous effects that vary across primary electorates with different district-party ideologies.
Using this modeling approach, I find a noisy effect of candidate ideology on primary outcomes.
Candidates are less likely to win their primaries when their CF scores are especially centrist or especially extreme, but the estimates are imprecise.
Furthermore, I find no evidence that the effect varies with district-party ideology across primary electorates.
Although candidates appear to position themselves strategically to fit the particular partisan constituency in their district (Chapter \@ref(ch:positioning)), I find no evidence that partisan constituencies reward these positioning maneuvers any differently as a function of public ideology.
## Spatial Voting and Candidate Choice
```{r utility-data}
chooser_data <-
  tibble(
    x = -2,
    utility = 0,
    math_label = "bar(theta)[g]",
    label = "District-Party\nIdeology"
  )
util_data <- tibble(
  cand = seq(-10, 10, .1),
  u_distance = cand - chooser_data$x,
  utility = -u_distance^2
)
```
```{r utility-model}
ggplot(util_data) +
  aes(x = cand, y = utility) +
  geom_line(color = primary) +
  geom_hline(yintercept = 0) +
  geom_point(data = chooser_data, aes(x = x)) +
  geom_text(
    data = chooser_data, aes(x = x, label = math_label), parse = TRUE, vjust = 2
  ) +
  geom_text(
    data = chooser_data, aes(x = x, label = label), vjust = -0.5
  ) +
  annotate(
    geom = "text", label = "← Candidate position →",
    x = 10, y = 0,
    hjust = 1, vjust = 2
  ) +
  annotate(
    geom = "richtext",
    label = glue::glue("<b style='color:{primary}'>Candidate utility</b> decreases<br>when candidate is farther<br>from group ideal point"),
    x = 4, y = -60,
    hjust = 1,
    label.color = NA, fill = NA,
    family = font_fam
  ) +
  coord_cartesian(ylim = c(-100, 20)) +
  theme_mgd_dag() +
  labs(
    x = NULL, y = NULL,
    title = "Spatial Model of Candidate Choice"
  )
```
How do the ideal points of candidates and electorates affect primary elections?
Spatial voting models argue that primary candidates are more likely to win the nomination when they position their candidacies closer in ideological space to the median primary voter [@downs:1957:economic-theory; @aldrich:1983:downsian-parties].
This is an essential mechanism underlying the strategic positioning dilemma theory, which states that a candidate must strike a balance between the median partisan voter and the district median voter to win both the primary and the general election [@burden:2001:polarizing-primaries; @brady-han-pope:2007:out-of-step].
This intuition appears to hold in general elections for U.S. House: candidates who are too progressive or too conservative perform worse than candidates who are "just right" [@canes-wrone-et-al:2002:out-of-step; @simas:2013:house-proximity].
Figure \@ref(fig:utility-model) plots the key claim of spatial voting models: a candidate is most appealing to a constituency when the candidate's ideological location (represented on a left–right ideological continuum) matches the constituency's preferred ideological outcome.
The candidate is less appealing, or provides less _utility_ (or "value") to the constituency, when the ideological distance between the candidate and the constituency grows larger.
This utility loss occurs whether the candidate is too progressive or too conservative.
```{r utility-model, include = TRUE, fig.scap = "Spatial proximity and candidate utility.", fig.cap = "A spatial voting model's description of candidate utility (value) as a function of candidate position and district-party ideology. Candidate value is maximized at the group ideal point $\\bar{\\theta}_{g}$ and decreases in either direction. The example in this plot assumes quadratic utility loss."}
```
One important shortcoming of existing primary elections research is the inability of empirical models to capture this "optimal positioning" in primaries.
Studies often measure the relationship between candidate "extremity" and performance in primary elections—finding that more extreme candidates are more likely to win primary elections [@king-et-al:2016:twitter-primary] or that this effect is limited to extreme Republicans [@nielson-visalvanick:2017:primary-elections]—but extremity is allowed only a constant or monotonic effect on the candidate's primary performance [@hall-snyder:2015:ideology; @king-et-al:2016:twitter-primary; @nielson-visalvanick:2017:primary-elections].
Without the possibility of non-monotonicity in the extremity–victory relationship, these empirical models do not reflect their underlying theoretical models.
Furthermore, without a measure of the partisan constituency's ideal point (district-party ideology), these studies have no way to know whether the optimal candidate ideology is different in more conservative or more progressive electorates.
Because my project measures district-party ideology, I can estimate the optimal candidate ideology in different districts with different partisan constituency ideologies.
Another important limitation of many existing studies is that the factors affecting primary choice cannot be inferred from studying only incumbent members of Congress or primary nominees, a design that selects on the dependent variable [e.g. @brady-han-pope:2007:out-of-step; @hirano-et-al:2010:primary-polarization; @mcghee-et-al:2014:nomination-systems; @kujala:2020:primary-donors].
Without somehow accounting for the menu of candidates that a primary electorate can choose from, we cannot infer whether candidates with certain ideological positions are actually preferred over candidates with other ideological positions.
The analysis in this chapter confronts this problem by modeling primary candidate choice using a conditional logit approach, similar to other recent studies of primary choice or multiparty elections [@alvarez-nagler:1998:clogit; @porter-treul:2020:primary-experience; @simas:2017:primary-electability; @ansolabehere-et-al:2004:direct-primary-party-loyalty].
As I discuss in Chapter \@ref(ch:arg), there are several reasons to doubt the explanatory power of a spatial model for House primary voting.
Few voters are likely to be aware of candidate positioning in contexts where the party label does not provide differentiating information between candidates [@norrander:1989:primary-voters].
Voters do respond to policy differences between candidates if they are made aware of those differences [@lelkes:2019:policy-over-party], but learning about the issue positions of House primary candidates is costly.
Voters may also strategically prefer more electable candidates even if those candidates are not closest to their ideal points, but even sophisticated primary voters may not be familiar with House candidates enough to know which candidates present the starkest ideology–electability trade-offs [@simas:2017:primary-electability].
Non-ideological traits like incumbency, an "outsider" reputation [@porter-treul:2020:primary-experience], early fundraising [@bonica:2020:lawyers-in-congress], gender [especially in Democratic races, @thomsen:2020:ideology-gender], or other "valence" features [@nyhuis:2018:separate-valences] may be easier for primary voters to detect and act upon than candidates' ideological stances.
It is also possible that candidate ideology's effect on primary elections is mainly a selection function, deterring moderate candidates from entering a primary at all, leaving only the ideologically faithful candidates who present less ideological contrast to voters [@thomsen:2014:moderate-candidates].
### Causal and statistical identifiability
This project is interested in understanding district-party ideology and how it shapes primary elections.
An essential constraint of this chapter's analysis is that the "effect of district-party ideology on primary election outcomes" is not a convenient causal quantity to work with.
This is because district-party ideology is constant across all candidates who compete in the same primary contest, so it has no direct effect on the probability that any one candidate wins.
It can only have indirect effects that interact with other characteristics of the candidates.
This section discusses these indirect effects, how they interact with the modeling constraints for primary election data, and how we define causal estimands under these constraints.
Consider a primary race $r$ containing $n_{r} > 1$ primary candidates, each candidate indexed $i$.
Let $y_{r} = i$ signify that candidate $i$ wins race $r$, with the probability that $i$ wins $r$ given by $\psi_{ir}$.
Choice settings such as this, where one chooser must select among several alternatives in a choice set, are traditionally modeled using a conditional logit likelihood [@mcfadden:1973:conditional-logit].
Conditional logit has been employed to study candidate choice in U.S. primaries by @ansolabehere-et-al:2004:direct-primary-party-loyalty, @culbert:2015:strategic-voting-presidential-primaries, @simas:2017:primary-electability, and @porter-treul:2020:primary-experience.
Conditional logit supposes that the chooser—in this case, the electorate for race $r$—selects a candidate $i$ by comparing the utility they receive from each candidate in the race.
Suppose that this utility $\omega_{ir}$ contains a systematic component $u_{ir}$ and a stochastic component $e_{ir}$.
\begin{align}
\omega_{ir} &= u_{ir} + e_{ir}
(\#eq:utility-error)
\end{align}
The probability that $i$ is chosen is defined as the probability that $\omega_{ir}$ is greatest among the alternatives in $r$.
Because the error term $e_{ir}$ is unknown and idiosyncratic to the chooser-choice pairing, conditional logit makes a distributional assumption for the error term and calculates the probability that $\omega_{ir}$ is greatest given knowledge of the systematic utility component only.
This probability is calculated as the softmax function of the systematic components in the choice set,
\begin{align}
\begin{split}
p\left(y_{r} = i\right) &= \psi_{ir} \\
&= \frac{\text{exp}\left(u_{ir}\right) }{\sum\limits_{j = 1}^{n_{r}}\text{exp}\left(u_{jr}\right)}
\end{split}
(\#eq:softmax-probability)
\end{align}
which follows from the assumption that $e_{ir}$ is distributed Gumbel, as in logistic regression.
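To make Equation \@ref(eq:softmax-probability) concrete, here is a small sketch in R. The utility values are purely illustrative and are not drawn from the chapter's data.

```r
# Illustrative systematic utilities for a hypothetical three-candidate race
u <- c(cand_A = 1.0, cand_B = 0.2, cand_C = -0.5)

# Softmax: each candidate's win probability given the choice set
psi <- exp(u) / sum(exp(u))

sum(psi)        # probabilities sum to 1
which.max(psi)  # the candidate with the largest u_ir is favored
```

Because the softmax is invariant to adding a constant to every $u_{ir}$, chooser-level utility shocks that are common to all candidates in $r$ cancel out of $\psi_{ir}$; this is exactly the identifiability constraint on chooser attributes discussed next.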
The distinguishing feature of conditional logit is that _chooser_ attributes, additive utility shocks that are specific to the chooser, do not have identifiable effects on the choice probability because the chooser is fixed within a choice set.
As a result, researchers using conditional logit tend to model the choice problem as a function of the alternatives only.
In the case of primary elections, choosers are primary electorates, and district-party ideology is fixed for a given electorate.^[
District-party groups are not perfectly synonymous with primary electorates, since some constituents who belong to the district-party do not vote in the primary, and some primary voters may not identify with the party.
While this conceptual gap could be explored in future research projects, this project tolerates the inconsistency because the most recent evidence on the representativeness of primary electorates finds that they resemble the demographic profile and policy attitudes of the district-party public [@sides-et-al:2018:primary-representativeness].
That analysis covers more years of data and relies on fewer modeling assumptions than studies concluding that primary electorates are more polarized than district-parties [@jacobson:2012:polarization-origins; @hill:2015:nominating-institution].
]
This means that district-party ideology, $\bar{\theta}_{g[r]}$ for group $g$ in which $r$ takes place, cannot _directly_ affect the probability that a candidate is chosen.
This is consistent with the spatial model intuition from Figure \@ref(fig:utility-model): shifting the district-party ideal point $\bar{\theta}_{g[r]}$ left or right affects utility only because it changes the distance between $\bar{\theta}_{g[r]}$ and the candidate location, so the interaction between district-party ideology and candidate location is key.
More generally, chooser-level features can be included in conditional logit models as long as there is some cross-level interaction with the choice-level data for statistical identification [@fox-et-al:2012:random-coef-logit].^[
I refer to an "interaction" generally as a function that depends on both district-party ideology and candidate location.
It does not necessarily imply a multiplicative "interaction term" that is more common in linear modeling, although multiplicative interaction terms are an example of such a function.
]
Building a statistical model that enables this interactivity is an important contribution of this research design.
This conditional logit model's identifiability constraint matters for causal inference as well, because it affects which causal quantities are feasible to estimate.
Consider the potential outcome $\omega_{ir}\left(\text{CF}_{ir}, \bar{\theta}_{g}\right)$, the candidate utility resulting from a given candidate ideology and district-party ideology.^[
For notational convenience, let $g$ imply $g[r]$.
]
Imagine that we intervene on district-party ideology and measure the average utility effect^[
For the current discussion, we consider the effect on utility instead of the effect on win probability.
This is because win probability is complicated by the presence of other candidates, whereas utility is a straightforward function of chooser and choice features.
It is important to understand the relationship between the causal model structure and the outcome scale because treatments can have different effects on different scales [@vanderweele:2009:interaction-modification].
I discuss causal effects on win probability in Section \@ref(sec:causal-probs).
]
of setting $\bar{\theta}_{g} = \theta$ versus some other value $\theta'$, $\E{\omega_{ir}\left(\text{CF}_{ir}, \theta\right) - \omega_{ir}\left(\text{CF}_{ir}, \theta'\right)}$.
This effect does exist for individual candidates: changing district-party ideology affects the primary electorate's candidate utility by increasing or decreasing the ideological distance between the district-party and the candidate.
As a result, the average effect of district-party ideology is an average over all of its interactive effects with candidate ideology.
But because the conditional logit model does not provide an easy interface for modeling chooser-level effects directly, it is impractical to condition on other district-level characteristics to render district-party ideology ignorable.
It is much simpler, instead, to consider the average effect of candidate positioning on candidate utility.
Conditioning on candidate features is more straightforward with conditional logit, so causal identification of alternative-level effects is more analytically straightforward as well.
The conditional average effect of CF score on candidate utility would thus be
$\E{\omega_{ir}\left(\text{CF}, \theta\right) - \omega_{ir}\left(\text{CF}', \theta\right) \mid C_{ir} = c, r}$,
for a comparison of two values $\text{CF}$ and $\text{CF}'$, fixing the district-party ideology at $\theta$ and conditioning on other candidate-varying attributes $C_{ir} = c$ and the race $r$.^[
Conditioning on the race, which defines the choice set, is inherent to conditional logit.
Conditioning on the choice set is what undermines the identifiability of chooser-level effects without cross-level interactions.
]
This effect is also an average over the interactive effects with district-party ideology, but conditioning on confounders is much easier.
Because this project is focused on the added value of my district-party ideology measure, I go one step further to model effect heterogeneity over district-party ideology instead of holding it constant.
Because identifying ignorable variation in $\bar{\theta}_{g}$ is a challenge in conditional logit, I approach this heterogeneity from an effect modification perspective.
This means that any heterogeneity in causal effects over district-party ideology is not causally attributed to district-party ideology.
Instead, it reflects only the causal effects of CF scores conditional on a given district-party ideology value [see @kam-trussler:2017:HTEs].
To clarify this point, I rewrite the potential outcome as $\omega_{ir}(\text{CF}_{ir})$, removing the causal effect of $\bar{\theta}_{g}$ from the notation.
Formally, we say that district-party ideology is an "indirect modifier" if the CF score effect ($\text{CF}$ versus $\text{CF}'$) varies across levels of district-party ideology ($\theta$ versus $\theta'$), conditional on stratum $c$ and race $r$ [@vanderweele-robins:2007:effect-modification].
In other words, the conditional average effect of candidate ideology is heterogeneous over district-party ideology if the following quantity is not zero:
\begin{align}
\E{\omega_{ir}\left(\text{CF}\right) - \omega_{ir}\left(\text{CF}'\right) \mid \bar{\theta}_{g} = \theta, c, r}
&-
\E{\omega_{ir}\left(\text{CF}\right) - \omega_{ir}\left(\text{CF}'\right) \mid \bar{\theta}_{g} = \theta', c, r}.
(\#eq:hte)
\end{align}
<!------- TO DO ---------
- or maybe we should just say the expectation is over i \in r?
instead of conditioning on R?
------------------------->
```{r choice-dag}
clogit_dag <-
  dagify(
    Y ~ CF + C + U,
    CF ~ G + C,
    G ~ U,
    exposure = "G",
    outcome = "Y",
    coords = tribble(
      ~ name , ~ x , ~ y ,
      "C"    ,   1 ,   2 ,
      "CF"   ,   1 ,   1 ,
      "G"    ,   0 ,   1 ,
      "U"    ,   0 ,   0 ,
      "Y"    ,   2 ,   1
    ),
    labels = c(
      "G"  = "bar(theta)[g]",
      "CF" = "CF[ir]",
      "C"  = "C[ir]",
      "Y"  = "omega[ir]",
      "U"  = "U"
    )
  ) %>%
  tidy_dagitty() %>%
  as_tibble() %>%
  print()
```
```{r plot-choice-dag}
ggplot(clogit_dag) +
  aes(x = x, y = y, xend = xend, yend = yend) +
  geom_dag_edges(data_directed = filter(clogit_dag, name != "U")) +
  geom_dag_edges(
    data_directed = clogit_dag %>%
      filter(name == "U" & to == "G"),
    edge_color = "gray",
    edge_linetype = 2
  ) +
  geom_dag_edges_arc(
    data = clogit_dag %>% filter(to == "Y" & name == "U"),
    curvature = -0.3,
    edge_linetype = 2,
    edge_color = "gray"
  ) +
  geom_dag_point(
    data = filter(clogit_dag, name != "U"),
    color = "gray80"
  ) +
  geom_dag_node(
    data = filter(clogit_dag, name == "U"),
    internal_color = "gray",
    color = "white"
  ) +
  geom_dag_text(
    aes(label = label),
    parse = TRUE,
    color = "black",
    family = font_fam
  ) +
  theme_mgd_dag() +
  theme(legend.position = "none") +
  labs(
    x = NULL, y = NULL,
    title = "How CF Score Affects Primary Victory",
    subtitle = "Indirect modification by district-party ideology"
  ) +
  NULL
```
Figure \@ref(fig:plot-choice-dag) plots a causal graph of the system under consideration.
The causal effect of candidate position $\text{CF}_{ir}$ on candidate utility $\omega_{ir}$ is unidentified without conditioning on pre-treatment candidate features $C_{ir}$.
District-party ideology is included as an indirect modifier of the CF score effect $\text{CF}_{ir} \rightarrow \omega_{ir}$, represented with the path $\bar{\theta}_{g} \rightarrow \text{CF}_{ir}$ and no direct path between $\bar{\theta}_{g}$ and $\omega_{ir}$ [@vanderweele-robins:2007:effect-modification].
<!------- TO DO ---------
- could add labels to describe the U path, modification, etc.
------------------------->
Because district-party ideology is included as an indirect modifier instead of as a joint treatment, back-door paths that connect district-party ideology and candidate utility through unobserved variables $U$ are allowed to exist without confounding the CF score effect or the effect modification interpretation [@vanderweele:2009:interaction-modification].
They do confound the causal effects of district-party ideology, however, which is why effect heterogeneity cannot be described as the causal effect of district-party ideology.
```{r plot-choice-dag, include = TRUE, out.width = "60%", fig.height = 6, fig.width = 6, fig.scap = "Causal diagram of CF score effect on win probability.", fig.cap = "Causal diagram of CF score effect on win probability. District-party ideology is an indirect modifier because it has no direct effect on primary outcomes except through candidate proximity. Unobservables $U$ are uncontrolled, so the effect of district-party ideology is not identified. The CF score effect is identified by conditioning on $C$ and district-party ideology."}
```
## Modeling Causal Heterogeneity with Continuous Interactions
This section describes a statistical model for primary candidate choice that achieves two key objectives.
First, the model is designed to capture the heterogeneous causal effect of candidate positioning, conditional on district-party ideology.
That is, the model contains appropriate interactions to include chooser-level attributes in the conditional choice model.
And second, the model contains the flexibility to capture non-monotonic effects of candidate positioning: utility losses for candidates that position themselves too far from the district-party ideal point in either ideological direction.
The model detailed below achieves these objectives using two tactics.
The first tactic: I model candidate utility using a linear combination of CF scores and district-party ideology.
This linear combination projects CF scores and district-party ideology into a common space that can be interpreted as an "ideological distance" between CF scores and district-party ideology, allowing candidate utility to increase or decrease as a function of the ideological distance metric.
The second tactic: The distance metric's effect on candidate utility is modeled with a spline function.
The spline function serves the dual purpose of capturing nonlinearities in candidate utility—an essential component of the spatial voting model—and preserving the interaction between chooser and choice data through those nonlinearities.
This strategy enables the effect of candidate positioning on candidate choice to be heterogeneous across candidates with different CF scores and heterogeneous across primary electorates with different district-party ideology values.
The conditional logit model begins by defining the probability that candidate $i$ is chosen in race $r$ as a softmax function of $u_{ir}$, the systematic component of a candidate's utility conditional on the choice set.
\begin{align}
\begin{split}
p\left(y_{r} = i\right) &= \psi_{ir} \\
\psi_{ir} &= \frac{\text{exp}\left(u_{ir}\right) }{\sum\limits_{j = 1}^{n_{r}}\text{exp}\left(u_{jr}\right)} \\
u_{ir} &= f\left(\text{CF}_{ir}, \bar{\theta}_{g[r]}\right) + \mathbf{c}_{ir}^{\intercal}\gamma
\end{split}
(\#eq:clogit-likelihood)
\end{align}
I use $f()$ to represent a flexible function of candidate $i$'s CF score and the district-party public ideology $\bar{\theta}_{g[r]}$ for group $g$ in which race $r$ is held.
I include a vector of candidate-level covariates $\mathbf{c}_{ir}$ with regression coefficients $\gamma$.
Causal inference requires the assumption that conditioning on candidate features renders CF scores ignorable among the candidates in $r$, along with conditioning on all features of $r$.
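The race-level structure of Equation \@ref(eq:clogit-likelihood) can be sketched in R as follows. The data, the precomputed $f(\cdot)$ values, and the coefficient value are invented for illustration only.

```r
library(dplyr)

# Invented data: two races with two and three candidates
cands <- tibble(
  race        = c(1, 1, 2, 2, 2),
  f_delta     = c(0.4, -0.2, 1.0, 0.1, -0.8),  # f(CF, theta_bar), precomputed
  c_incumbent = c(1, 0, 0, 0, 1)               # one candidate-level covariate
)
gamma <- 2.0  # hypothetical coefficient on incumbency

# Systematic utility, then within-race softmax probabilities
probs <- cands %>%
  mutate(u = f_delta + gamma * c_incumbent) %>%
  group_by(race) %>%
  mutate(psi = exp(u) / sum(exp(u))) %>%
  ungroup()
```

The `group_by(race)` step is what makes the likelihood conditional: each candidate's probability is normalized only against the other candidates in the same choice set.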
I then construct $f()$ as a flexible spline function of CF scores and district-party ideal points.
Although CF scores and district-party ideology both represent ideal points, the two measures are not constructed in the same ideal point space, so calculating the absolute or squared distance between ideal points [e.g. @adams-et-al:2004:discounting-directional-voting] is not immediately possible.
To rectify this, I create a function that maps these two measures into a common space.
Let $\Delta_{ir}$ be a linear combination of $\text{CF}_{ir}$ and $\bar{\theta}_{g}$ with coefficients $\alpha$ and $\beta$,
\begin{align}
\begin{split}
\Delta_{ir} &= \alpha \text{CF}_{ir} + \beta\bar{\theta}_{g[r]} \\
\alpha^{2} + \beta^{2} &= 1
\end{split}
(\#eq:linear-combo)
\end{align}
which represents an assumption that CF scores and district-party ideology space are affine transformations of one another, similar to the way Aldrich–McKelvey scaling estimates an affine mapping between ideology spaces [@aldrich-mckelvey:1977:scaling; @hare-et-al:2015:bayes-aldrich-mckelvey].
Another way to interpret $\Delta_{ir}$ is that the common ideal point space is a weighted average of CF scores and district-party ideology, with weights that are estimated from the data.
The second line of \@ref(eq:linear-combo) restricts the coefficients to have a norm of $1$, which is an identifiability restriction on the location and scale of the $\Delta$ space that would otherwise be arbitrary.
The restriction implies a direct mapping between $\text{CF}$ space and $\bar{\theta}_{g}$ space, since $\beta$ is defined in terms of $\alpha$,
\begin{align}
\begin{split}
1 &= \alpha^{2} + \beta^{2} \\
\beta &= \pm\sqrt{(1 - \alpha^{2})}
\end{split}
\end{align}
which clarifies how the linear transformation is estimating essentially a scale factor between the two ideal point spaces, parameterized by $\alpha$ only.
Because $\Delta_{ir}$ is a linear transformation of CF scores and district-party ideology, it has the algebraic interpretation of a "distance measure" between the candidate's CF score and the district-party public ideology in the $\Delta$ space.
For convenience, I therefore refer to $\Delta_{ir}$ as "ideological distance."^[
It is important to note here that my use of "distance" refers more generally to vector spaces than it does to ideal point "differences."
The "difference" $(x - z)$ is a special case of the distance $\alpha x + \beta z$ where $\alpha = 1$ and $\beta = -1$.
For a linear regression of $y$ on $(\alpha x + \beta z)$, regression predictions for $y$ would be invariant to any nonzero combination of $\alpha$ and $\beta$ values.
So although I refer to $\Delta_{ir}$ as an ideal point "distance," it contains the same information as an ideal point "difference" up to an arbitrary rotation of the $\Delta$ space [e.g. @armstrong-et-al:2014-spatial-models, xv].
Restricting the rotation of $\Delta$—for example, by fixing $\beta < 0$—would improve the interpretation of $\Delta$ as an ideal point "difference," but it would make Bayesian estimation more difficult by introducing unnecessary boundaries and discontinuities in the posterior distribution over $\alpha$ and $\beta$.
For ease of estimation, I therefore leave the rotation of the $\Delta$ space unrestricted.
<!------- TO DO ---------
- some cite on identifiability in Bayesian models?
------------------------->
]
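The unit-norm mapping and the rotation-invariance property described above can be sketched in a few lines of R; the data here are simulated stand-ins, not the chapter's actual measures.

```r
set.seed(1)
# Illustrative stand-ins for CF scores, district-party ideology, and an outcome
cf    <- rnorm(100)
theta <- rnorm(100)
y     <- rnorm(100)

# Map raw coefficients onto the unit circle so that alpha^2 + beta^2 = 1
normalize <- function(a_raw, b_raw) c(a_raw, b_raw) / sqrt(a_raw^2 + b_raw^2)

ab1 <- normalize( 0.6,  0.8)  # one rotation of the Delta space
ab2 <- normalize(-0.6, -0.8)  # the reflected rotation

delta1 <- ab1[1] * cf + ab1[2] * theta
delta2 <- ab2[1] * cf + ab2[2] * theta

# Regression predictions are invariant to the arbitrary rotation of Delta
fit1 <- fitted(lm(y ~ delta1))
fit2 <- fitted(lm(y ~ delta2))
isTRUE(all.equal(fit1, fit2))  # TRUE
```

The sign flip reverses the slope coefficient but leaves the fitted values unchanged, which is why leaving the rotation of $\Delta$ unrestricted is harmless for prediction.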
I then create a function that lets candidate utility be a nonlinear function of the ideal point distance $\Delta_{ir}$.
This captures the spatial voting intuition: shocks to either CF scores or district-party ideology change the ideal point distance $\Delta_{ir}$, which has a nonlinear effect on candidate utility depending on whether the shock moves the ideal point distance toward or away from the optimal distance.
I create this nonlinear effect using B-splines.
I construct a set of basis functions of $\Delta_{ir}$ using a degree-$3$ polynomial basis with $30$ knots across the range of $\Delta_{ir}$.^[
I restrict the range of possible $\Delta_{ir}$ to be symmetric about $0$ by centering CF scores and district-party ideology within the model so their respective minima and maxima are equidistant from zero.
This implies a separate $\Delta$ space for each party, as CF scores and district-party ideology take different values for Republicans and Democrats.
It also means that the knot locations change for each combination of $\alpha$ and $\beta$ values.
]
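The symmetric centering described in this footnote amounts to subtracting the midpoint of each variable's range, as in this small illustrative sketch:

```r
# Shift x so that its minimum and maximum are equidistant from zero
center_symmetric <- function(x) x - (min(x) + max(x)) / 2

x  <- c(-2, 0, 5)
cx <- center_symmetric(x)
range(cx)  # -3.5  3.5
```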
Let $b_{k}(\Delta_{ir})$ be the $k$^th^ basis function out of $K$ total, each with a coefficient $\phi_{k}$.
The function $f()$ from \@ref(eq:clogit-likelihood) then results in a spline regression on $\Delta_{ir}$.
\begin{align}
\begin{split}
u_{ir} &= f\left(\text{CF}_{ir}, \bar{\theta}_{g[r]}\right) + \mathbf{c}_{ir}^{\intercal}\gamma \\
f\left(\text{CF}_{ir}, \bar{\theta}_{g[r]}\right) %_
&= \sum\limits_{k} b_{k}\left(\Delta_{ir}\right)\phi_{k}
\end{split}
(\#eq:spline-function)
\end{align}
The spline function ultimately is a sum of the weighted basis functions.
The spline enables a continuous interaction effect between CF scores and district-party ideology because the basis functions are nonlinear transformations of district-party ideology and CF scores.
Because the function is nonlinear, the chain rule ensures that the derivative of $u_{ir}$ with respect to CF scores (the instantaneous effect of CF scores) is a function that contains district-party ideology $\bar{\theta}_{g}$.
By specifying the interaction between chooser- and choice-level data in this way, I sidestep the identifiability limitation of a simpler conditional logit model, allowing the causal effect of CF score to vary in different electorates with different district-party ideologies.
Interacting two continuous variables through the spline function is much more flexible than a multiplicative interaction term between CF scores and district-party ideology, which would fail to capture both the utility optimum predicted by spatial voting models and any other non-constant interactions.
Creating the ideal point distance metric also has a generative interpretation that is superior to the multiplicative interaction, because the common ideal point metric is a more faithful representation of spatial voting models.
A multiplicative interaction has no comparable generative interpretation.
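As a concrete sketch of \@ref(eq:spline-function), the basis matrix and the weighted sum can be built with base R's `splines::bs()`; the coefficient draws below are illustrative, not estimates from the model.

```r
library(splines)
set.seed(2)

num_knots     <- 30
spline_degree <- 3

delta <- seq(-1, 1, length.out = 200)  # hypothetical Delta values

# Degree-3 B-spline basis with (num_knots + spline_degree) basis functions
B <- bs(delta,
        df        = num_knots + spline_degree,
        degree    = spline_degree,
        intercept = TRUE)

phi <- rnorm(ncol(B), mean = 0, sd = 0.5)  # illustrative spline coefficients

# f(Delta) = sum_k b_k(Delta) * phi_k, evaluated at every Delta value
f <- as.vector(B %*% phi)
```

Because every column of `B` is a nonlinear function of $\Delta_{ir}$, which itself combines CF scores and district-party ideology, the weighted sum induces the continuous interaction described above.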
Although the interpretation of $\Delta$ as a common ideal point metric is algebraically sensible, a limitation of the approach is that the model does not very accurately identify which $\alpha$ and $\beta$ values create more plausible common spaces in terms of posterior probability.
This is because the spline regression is flexible enough to create sensible regression functions out of the many configurations of $\Delta$ space, so there is no need for the model to detect a single "correct" configuration.
If a particular draw of $\alpha$ and $\beta$ values "compresses" the ideal point space in some way, the spline coefficients are able to "stretch" that space back out to fit the data.
As a result, the posterior distribution of spline functions is identified from the data even if its component parameters—$\alpha$, $\beta$, and spline weights $\phi_{k}$—are not strongly identifiable on their own.
This trade-off between global and local identifiability appears in other flexible modeling approaches such as neural networks [@mackay:1992:bayes-neural-net-backpropagation; @beck-et-al:2004:neural-net] and is naturally suited to a Bayesian framework because unidentified or over-parameterized models pose no special problem for probabilistic inference [@jackman:2009:bayesian, 272].
In short, while the model sacrifices some interpretability to fit a flexible regression function, the trade-off is worth the ability to capture nonlinear patterns in spatial voting while avoiding specific assumptions about the form of the candidate utility function or the ideal point mappings.
### Data {#sec:vote-data}
```{r data-5}
mcmc_path <- file.path("data", "mcmc", "5-voting")
fits_raw <- here(mcmc_path, "local_vb-main.rds") %>%
read_rds() %>%
group_by(party, control_spec) %>%
print()
fits_data <- fits_raw %>%
select(data, stan_data) %>%
print()
```
```{r data-no-incumbents}
noinc_raw <- here(mcmc_path, "local_vb-no_incumbents.rds") %>%
read_rds() %>%
group_by(party, control_spec) %>%
print()
```
The data for this analysis are drawn primarily from two secondary sources, the Database on Interests, Money in Politics, and Elections [DIME, @bonica:2019:dime] and the Primary Timing Project [PTP, @boatright-et-al:2020:primary-timing-data].
Cases are organized at the candidate-contest level, with identifiers for each primary contest indexing political party $\times$ congressional district $\times$ election cycle.
Because primary candidates can run unopposed, I restrict the data to primary races containing at least two candidates.
I keep only primary races where the number of winning candidates equals $1$, which removes any election where the winner lacked a CF score estimate (so is missing from the DIME) or where primary outcomes are miscoded in the original data sources.
I also drop any primary race where the outcome was decided by a convention instead of an election (coded in the PTP).
Lastly, I remove all blanket and top-two primary races, which are not limited to candidates in a single party.
<!------- TO DO ---------
- how many?
------------------------->
```{r}
link_tab <-
here("data", "_model-output", "05-voting", "link-sum.rds") %>%
read_rds() %>%
rename(
matched = `1`,
unmatched = `NA`
) %>%
mutate(
total = matched + unmatched
) %>%
print()
```
The DIME database contains most of the essential data used for this analysis: CF scores and primary outcome indicators.
Primary outcomes for the 2016 election cycle were less thoroughly coded than those for the 2012 and 2014 cycles, which led to substantial missing data.
Missing primary outcomes in the DIME were supplemented with primary outcome data from the PTP.
Matching the same candidacy across databases was not easy using candidate identifiers,^[
Candidate IDs in the DIME are regenerated with each vintage of the database, creating inconsistencies in the same candidate's IDs over time.
As a result, the DIME identifiers that were initially copied into the PTP do not match the DIME identifiers in more recent DIME vintages.
]
so I merge the databases using the probabilistic record-linkage algorithm developed by @enamorado-et-al:2018:record-linkage.
I link candidates by name, state, district number, election cycle, and political party.
This process matches `r filter(link_tab, cycle == 2016) %>% pull(p) %>% percent(accuracy = 1)` of candidacies in 2016 and
`r (sum(link_tab$matched) / sum(link_tab$total)) %>% percent(accuracy = 1)` of candidacies in the entire dataset.
For candidacies where the DIME and the PTP disagree about the outcome of a primary race, I defer to the PTP because its narrower substantive focus on primary elections lends it more credibility.
<!------- TO DO ---------
- ???
------------------------->
Predictive data include dynamic CF scores for every candidate and district-party ideal points from the IRT model in Chapter \@ref(ch:model).
The conditional logit does not identify district-level shocks to candidate utility because these variables are fixed for all candidates in a primary race, so the choice of controls in $\mathbf{c}_{ir}$ differs sharply from the district-level controls in Chapter \@ref(ch:positioning).
Instead of including district-level demographics, economic indicators, or political background characteristics such as the previous presidential vote in the district, $\mathbf{c}_{ir}$ contains candidate-level features that could affect their ideological positioning as well as their likelihood of winning the primary.^[
Features of a party-group or congressional district can certainly affect candidate CF scores _on average_, which is the focus of Chapter \@ref(ch:positioning).
It is helpful to think about the conditional logit as using only the "residual" variation in CF scores and other candidate features after these average effects are controlled by conditioning on the choice set.
]
I include an indicator variable for female candidates, which is associated with greater progressivism and a slightly higher primary win probability, at least among Democrats [@thomsen-swers:2017:women-run; @thomsen:2019:women-win; @thomsen:2020:ideology-gender].
I also include an indicator for incumbent candidates, who both have more moderate CF scores (seen in Chapter \@ref(ch:positioning)) and are more likely to win their primary reelections.
I include no additional indicators for challengers and open-seat candidates, since open-seat races only compare open-seat candidates to one another, and non-incumbency implies challenger status for any race containing an incumbent candidate.
The standard control specification includes one last covariate for the contribution amount that a candidate donates to their own campaign, which is logged and standardized.
This control is intended to block a back-door path from CF scores to primary victory through candidate wealth, which could affect both the candidate's ideological position and their win probability.
Although there are additional measures of a candidate's campaign fundraising and spending available in the DIME, I do not use these variables as controls to identify the CF score effect.
This is because previous research suggests that candidate ideology is more likely to influence a candidate's fundraising than vice-versa [@stone-simas:2010:candidate-valence; @barber-et-al:2016:ideological-donors; @thomsen-swers:2017:women-run].
The utility model underlying CF scores assumes that this is true _ex ante_, by modeling campaign contributions as a function of ideological affinity.
```{r clogit-n}
bind_rows(noinc_raw, fits_data) %>%
unnest(data) %>%
ungroup() %>%
mutate(
Party = ifelse(party == "D", "Democrats", "Republicans"),
party = NULL,
Subset = case_when(
control_spec == "main" ~ "Full data",
TRUE ~ "No incumbents"
),
control_spec = NULL
) %>%
group_by(Party, Subset) %>%
summarize(
`Primary Races` = comma(n_distinct(set)),
`Total Candidates` = comma(n())
) %>%
arrange(Subset) %>%
knitr::kable(
caption.short = "Number of primary races and primary candidates",
caption = "Number of primary races and primary candidates",
booktabs = TRUE
)
```
I estimate separate models for Republicans and Democrats because control variables may confound the treatment effect differently for each party.
For instance, gender is thought to have a greater impact in Democratic primaries than in Republican primaries [@thomsen-swers:2017:women-run; @thomsen:2019:women-win; @thomsen:2020:ideology-gender].
It also may be the case that causal effects vary across party, either because Republican or Democratic voters are not equally aware of candidate ideology or because district-party ideology has different modifying effects for Republicans and Democrats.
I also estimate the same model with the sample limited to primary contests with no incumbent present, a practice employed by earlier researchers to sidestep the overwhelming likelihood that incumbents win reelection [e.g. @porter-treul:2020:primary-experience].
Table \@ref(tab:clogit-n) displays the number of candidates and primary contests in each of these subsets of data.
```{r clogit-n, include = TRUE}
```
### Bayesian modeling, priors, and prior simulation
Like other models featured in this project, the Bayesian setup of this model provides several important benefits.
The most important benefit is regularization in the spline function.
Although the spline function is beneficial because it can fit many complex functions, complex models always run a risk of overfitting.
The trade-off between flexibility and overfitting is especially salient for modeling heterogeneous treatment effects because growing the number of possible comparisons will also grow the number of false positives if no additional methodological adjustments are made.
This concern has led researchers to use regularized estimators to detect heterogeneous effects, which introduce bias to shrink heterogeneities toward zero.
Bayesian additive regression trees, for example, model flexible interactions by regularizing the tree structure in favor of shorter trees and partial pooling of "leaf" estimates toward the mean of the data [@hill:2011:bart; @green-kern:2012:bart].
```{r spline-coef-prior}
plot_spline_coef_draws <- tibble(
raw = rnorm(10000),
eta = abs(1.5 * rt(10000, df = 3)),
phi = raw*eta,
) %>%
ggplot() +
aes(x = phi) +
geom_histogram(
boundary = 0, bins = 100, fill = primary, alpha = 0.7
) +
xlim(c(-10, 10)) +
labs(
x = TeX("Spline coefficient $\\phi_{k}$"),
y = "Count",
title = "Prior for Spline Coefficient",
subtitle = "Normal prior with T(3) scale"
)
```
```{r spline-prior}
# possible spline functions
n_coef_draws <- 10
num_knots <- 30
spline_degree <- 3
coef_draws <-
tibble(
k = 1:(num_knots + spline_degree)
) %>%
crossing(
rep = 1:n_coef_draws
) %>%
mutate(
eta = 1.5 * rt(n(), df = 3) %>% abs(),
phi_raw = rnorm(n()),
phi = phi_raw * eta
) %>%
select(rep, k, phi) %>%
pivot_wider(
names_from = "rep",
values_from = "phi",
) %>%
select(-k) %>%
print()
spline_data <- tibble(delta = seq(0, 1, length.out = 1000))
spline_data <- spline_data %$%
splines::bs(
delta,
df = num_knots + spline_degree,
degree = spline_degree,
intercept = TRUE
) %>%
(function(x) x %*% as.matrix(coef_draws)) %>%
as_tibble() %>%
set_names(~ str_glue("f_{.}")) %>%
bind_cols(spline_data, .) %>%
pivot_longer(
cols = starts_with("f_"),
names_to = "draw",
values_to = "spline"
) %>%
print()
plot_prior_spline_functions <-
ggplot(spline_data) +
aes(x = delta, y = spline) +
geom_line(
aes(group = draw),
color = primary
) +
geom_hline(yintercept = 0) +
coord_cartesian(ylim = c(-8, 8)) +
labs(
x = TeX("$\\Delta_{ir}$: Linear combination of CF score and $\\bar{\\theta}_{g}$"),
y = "Spline function",
title = "Prior Draws of Spline Function",
subtitle = str_glue("Prior simulations from {n_coef_draws} draws")
) +
scale_x_continuous(breaks = c(0, 1), labels = c("Min", "Max")) +
scale_y_continuous(breaks = seq(-6, 6, 3))
```
```{r plot-spline-priors}
plot_spline_coef_draws + plot_prior_spline_functions
```
I use a hierarchical prior for the spline coefficients to penalize the complexity of the spline function.
The prior for each basis function's coefficient $\phi_{k}$ has a Normal distribution,
\begin{align}
\phi_{k} &\sim \text{Normal}\left(0, \eta \right)
(\#eq:spline-marginal)
\end{align}
where $\eta$ is estimated from the data.
Estimating an adaptive prior distribution for the spline coefficients shrinks the coefficients toward zero through partial pooling.
This prior is implemented in Stan using a non-centered parameterization, which decomposes $\phi_{k}$ into a standard Normal variable $\tilde{\phi}_{k}$ and a scale factor $\eta$.
\begin{align}
\begin{split}
\phi_{k} &= \tilde{\phi}_{k}\eta \\ %_
\tilde{\phi}_{k} &\sim \text{Normal}\left(0, 1\right) %_
\end{split}
(\#eq:spline-shrinkage)
\end{align}
The non-centered parameterization stretches a standard Normal distribution to create a Normal distribution with a scale of $\eta$.
This parameterization is valuable for Bayesian estimation because it de-correlates random variables in the posterior distribution, creating an easier posterior geometry for estimation algorithms.
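A quick simulation confirms the non-centered construction: scaling standard Normal draws by $\eta$ yields draws whose standard deviation is $\eta$ (here $\eta$ is held fixed at an illustrative value, whereas the model estimates it).

```r
set.seed(4)
eta       <- 2.5
phi_tilde <- rnorm(1e5)        # standard Normal draws
phi       <- phi_tilde * eta   # implied Normal(0, eta) draws

sd(phi)  # approximately 2.5
```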
I give the scale factor $\eta$ a Half-$T$ prior with $3$ degrees of freedom and a scale of $1.5$,
\begin{align}
\eta &\sim \text{Half-T}\left(\nu = 3, \mu = 0, \sigma = 1.5\right)
(\#eq:scale-T)
\end{align}
which regularizes the scale value toward zero, but has a modestly flat tail to allow strong signals from the data to depart from the prior.
This Normal-T mixture is similar to a "horseshoe prior" [@carvalho-et-al:2010:horseshoe-prior; @piironen-vehtari:2017:horseshoe-hyperprior; @piironen-vehtari:2017:horseshoe-sparse-vs-reg], which is a popular prior for estimating sparse coefficients with regularization.^[
Note that "sparsity" in this context does not imply coefficients of exactly-zero as it does with non-Bayesian L1 regularization [@tibshirani:1996:lasso; @ratkovic-tingley:2017:sparse-lasso-plus].
Sparse priors may result in posterior _modes_ at zero, but posterior intervals will contain non-zero values [@park-casella:2008:bayesian-lasso].
]
Unlike the horseshoe, which uses a half-Cauchy scale, the Half-$T$ scale places lower probability on extremely large coefficients but does not regularize as strongly as a Half-Normal prior.
The left-side panel in Figure \@ref(fig:plot-spline-priors) plots a histogram of simulated coefficient draws from this prior, which features a spike at zero and flatter tails than a Normal-Normal mixture.^[
The tails are long enough that many draws actually fall far outside the region plotted in the figure.
These values are much rarer than the values contained in the plotted region, but they are much more probable than they would be under, for example, a Normal-Normal prior.
]
<!------- TO DO ---------
- plot a prior of the spline function???
- without a prior, this can oscillates to \pm infinity
- order-4 (3 degree) basis splines, 30 knots, N-C(0, 1) prior on 30 coefs
------------------------->
```{r plot-spline-priors, include = TRUE, fig.width = 9, fig.height = 5, out.width = "100%", fig.scap = "Prior draws of spline coefficient and spline function.", fig.cap = "Prior draws of spline coefficient and spline function. Left: histogram of prior draws for an individual spline coefficient. Right: draws from the implied prior over spline functions."}
```
The right panel of Figure \@ref(fig:plot-spline-priors) shows `r n_coef_draws` prior predictive draws of the spline function, resulting from `r n_coef_draws` coefficient vectors drawn from the hierarchical prior.
There are a few important details to note about the construction of this prior.
First, most of the "peaks" of the spline function are in a neighborhood near zero, especially within the $(-3, 3)$ interval.
Although at first this sounds like a very narrow prior, it is important to remember that the spline function is defined on the utility (logit) scale, where small changes in utility can have large, nonlinear effects.
For context, a coefficient of $3$ on the logit scale would increase the success probability from $.5$ to `r plogis(3) %>% round(2)` in a two-candidate choice set, which is a larger effect than almost anything that occurs regularly in elections.
Furthermore, a preference for spline functions near zero is essential for regularization, so this amount of prior information is appropriate for controlling the spline fit.
At the same time, there are several peaks that decisively escape the $(-3, 3)$ neighborhood.
These larger peaks reflect the flatter-tailed $T$ prior on $\eta$, allowing larger coefficients.
The shape of the $T$ tail retains enough flexibility to detect a spike in utility even if the center of the prior concentrates spline functions near zero.
This plot also shows that 30 knots are more than enough flexibility to capture a utility spike along $\Delta$ space.
For the remaining coefficients $\gamma$, I specify a weakly informative prior,
\begin{align}
\gamma &\sim \text{Normal}\left(0, 5\right)
(\#eq:clogit-wt-priors)
\end{align}
which rules out explosive coefficient values while still allowing candidate attributes like incumbency to exhibit large correlations with candidate utility.
For causal inference, it is important not to regularize confounding effects too much to avoid re-introducing bias into treatment effect estimates [@hahn-et-al:2018:regularization-confounding; @hahn-et-al:2020:bayesian-causal-forests].^[
For high dimensional problems where regularization cannot be avoided, recent work recommends separate treatment and response models [@hahn-et-al:2018:regularization-confounding; @hahn-et-al:2020:bayesian-causal-forests] with a split-sample approach [@ratkovic:2019:rehabilitating-regression].
]
```{r a-b-data}
unit_params <- tibble(
a_raw = rnorm(10000),
b_raw = rnorm(10000)
) %>%
mutate(
across(
.cols = c(a_raw, b_raw),
.fns = list(id = ~ . / sqrt(a_raw^2 + b_raw^2))
)
) %>%
rename(
`Constrained α` = a_raw_id,
`Constrained β` = b_raw_id
)
```
```{r a-b-priors}
unit_params %>%
pivot_longer(
cols = contains("Constrained"),
names_to = "param",
values_to = "value"
) %>%
ggplot() +
aes(x = value) +
facet_wrap(
~ param,
scales = "free"
) +
geom_histogram(alpha = 0.7, fill = primary) +
labs(
x = "Prior value", y = NULL,
title = "Prior Draws for Ideal Point Distance Coefficients"
) +
ggeasy::easy_remove_y_axis()
```
Because $\alpha$ and $\beta$ are constrained to have a norm of $1$, their values fall on the unit circle.
I give these parameters a joint prior that is flat along the unit circle.
Stan implements this prior automatically by drawing unnormalized parameters $\tilde{\alpha}$ and $\tilde{\beta}$ from independent standard Normal distributions and then dividing by their norm,
\begin{align}
\begin{split}
\tilde{\alpha}, \tilde{\beta} &\sim \text{Normal}\left(0, 1\right) \\
\alpha &= \tilde{\alpha} \Big/ \sqrt{\tilde{\alpha}^2 + \tilde{\beta}^2} \\
\beta &= \tilde{\beta} \Big/ \sqrt{\tilde{\alpha}^2 + \tilde{\beta}^2}
\end{split}
\end{align}
which creates a flat density over the unit circle.^[
Technically, this transformation is undefined if the norm is exactly zero, which realistically never happens.
]
The marginal densities for $\alpha$ and $\beta$, shown in Figure \@ref(fig:a-b-priors), are not exactly flat due to the nonlinear transformation from Cartesian coordinates to polar coordinates.
```{r a-b-priors, include = TRUE, out.width = "100%", fig.width = 9, fig.height = 4, fig.scap = "Prior draws for ideal point distance coefficients.", fig.cap = "Prior draws of coefficients that map CF scores and district-party ideology to the common ideal point distance metric $\\Delta$. These priors create a flat prior on unit circle coordinates, even though the marginal priors are not flat."}
```
```
## Findings {#sec:vote-findings}
```{r tidy-5}
fits <- fits_raw %>%
transmute(
tidy_fit = map(
.x = vb_fit,
.f = tidy,
conf.int = TRUE,
conf.level = 0.9
)
) %>%
print()
```
```{r draws-5}
vb_draws <- fits_raw %>%
transmute(
draws = map(
.x = vb_fit,
.f = rstan::extract
)
) %>%
print()
```
```{r spline-coefs-data}
main_coefs <- fits %>%
unnest(tidy_fit) %>%
ungroup() %>%
filter(
str_detect(term, "_post") == FALSE,
str_detect(term, "wt") | str_detect(term, "spline_scale")
) %>%
mutate(
index = parse_number(term),
term_label = case_when(
term == "wt[1]" ~ "Female",
term == "wt[2]" ~ "Incumbent",
term == "wt[3]" ~ "Log self-contribs (std.)",
str_detect(term, "wt_spline") ~
str_glue("Basis {index}") %>% as.character(),
term == "spline_scale" ~ "Spline scale"
),
prefix = case_when(
str_detect(term, "spline") ~ "Spline Parameters",
str_detect(term, "wt") ~ "Regression Coefs",
TRUE ~ "Aux"
)
) %>%
filter(str_detect(term, "linkers") == FALSE) %>%
mutate(
term_label = fct_reorder(term_label, index) %>% fct_rev(),
party_name = ifelse(party == "D", "Democrats", "Republicans")
) %>%
print()
```
```{r plot-spline-coefs}
ggplot(main_coefs) +
aes(
x = term_label,
y = estimate,
color = party_name
) +
geom_hline(yintercept = 0) +
geom_pointrange(
aes(ymin = conf.low, ymax = conf.high, shape = party_name),
position = position_dodge(width = -0.25),
fill = "white"
) +
facet_wrap(~ prefix, scales = "free") +
coord_flip() +
scale_color_manual(values = party_colors) +
scale_shape_manual(values = c("Democrats" = 16, "Republicans" = 22)) +
labs(
x = NULL,
y = "Posterior parameter value",
color = NULL,
shape = NULL,
title = "Conditional Logit Parameters",
subtitle = "Fullrank variational estimations"
) +
theme(
legend.position = c(0.25, 0.2),
legend.background = element_rect(fill = alpha("white", 0.9)),
plot.title.position = "plot"
)
```
```{r linker-plot}
vb_draws %>%
transmute(
link_draws = map(
.x = draws,
.f = ~ {
.x$linkers %>%
as_tibble(.name_repair = "unique") %>%
set_names(c("alpha", "beta"))
}
)
) %>%
unnest(link_draws) %>%
ggplot() +
aes(x = alpha, y = beta, color = party) +
geom_hline(yintercept = 0, color = "black") +
geom_vline(xintercept = 0, color = "black") +
geom_jitter(width = .1, height = .1, shape = 16, alpha = 0.5) +
facet_wrap(
~ party,
labeller = as_labeller(c("D" = "Democrats", "R" = "Republicans"))
) +
coord_fixed() +
scale_color_manual(values = party_code_colors) +
labs(
title = "Coefficients for Ideal Point Distance (Δ)",
subtitle = "Samples (jittered) from variational posterior",
x = "CF score weight (α)",
y = "District-party ideology\nweight (β)"
) +
theme(
legend.position = "none",
plot.title.position = "plot"
)
```
All models were estimated using Stan's full-rank variational inference algorithm, which approximates the posterior distribution as a collection of Normal distributions with a full-rank covariance matrix [@kucukelbir:2015:ADVI].
The main discussion of results focuses on the models estimated using the full datasets.
I briefly review the key trends among non-incumbent races in Section \@ref(sec:no-incumbents).
To facilitate the interpretation of the spline model, I first show Figure \@ref(fig:linker-plot), which contains posterior samples of the coefficients that map CF scores and district-party ideology into the common ideal point distance measure $\Delta$.
Because I identify the latent $\Delta$ space by constraining these coefficients to have a norm of $1$, all pairs of parameters fall on the unit circle.
Points are jittered in the plot to convey which values have greater posterior probability.
As mentioned above, many possible ideal point mappings can be rationalized as part of the spline function, so the posterior distribution does not concentrate very tightly around particular combinations of $\alpha$ and $\beta$ values.
This results in posterior samples that cover all four quadrants of the unit circle.
This is not concerning, however, because the common ideal point space is created to facilitate heterogeneous effects, not to be interpreted directly.
```{r linker-plot, include = TRUE, out.width = "100%", fig.height = 5, fig.width = 9, fig.scap = "Posterior draws of linear mapping parameters.", fig.cap = "Posterior draws of parameters that map CF scores and district-party ideology into $\\Delta$ space. Points all fall on the unit circle but are slightly jittered to convey posterior density."}
```
Coefficients from the candidate utility model are presented in Figure \@ref(fig:plot-spline-coefs).
The left panel shows regression coefficients for control variables in $\mathbf{c}_{ir}$: gender, incumbency, and candidate self-fundraising.
These coefficients show that the female indicator is positively related to candidate utility among Democrats more than among Republicans, a finding that reflects recent evidence from @thomsen:2019:women-win.
Unsurprisingly, incumbency has a strong, positive relationship to candidate utility in both parties.
Candidate self-fundraising does not strongly relate to candidate utility in either party.
This could be because heavier self-funders reflect a mixture of wealthy candidates, who may be advantaged because of their connections to other wealthy funders, and down-on-their-luck candidates who rely more heavily on self-fundraising to make up for meager fundraising receipts elsewhere.
The right panel shows all spline basis function coefficients and the scale parameter in the smoothing prior for the spline coefficients.
Most spline coefficients have posterior point estimates near zero, which is the intended result of the regularizing prior on the coefficients.
A few coefficients do depart from the prior, an initial indication that the spline regression detects a smooth function with a small number of "wiggles" rather than a highly variable function with many local peaks and troughs.
```{r plot-spline-coefs, include = TRUE, out.width = "100%", fig.height = 8, fig.width = 9, fig.scap = "Posterior parameters from conditional logit.", fig.cap = "Posterior parameters from conditional logit. Points and intervals are variational point estimates and 90 percent quantile intervals from approximate posterior. Left panel shows regression weights for covariates. Right panel shows basis function coefficients and hierarchical scale parameter. There are greater than 30 spline coefficients because higher spline degrees create additional basis functions."}
```
```{r spline-plot}
spline_means <- fits %>%
left_join(fits_data) %>%
mutate(
spline_means = map2(
.x = tidy_fit,
.y = stan_data,
.f = ~ {
rhs <- tibble(
CF = .y$CF,
i = .y$i
)
means <- .x %>%
filter(
str_detect(term, "spline_mean_post") |
str_detect(term, "spline_lower_post") |
str_detect(term, "spline_upper_post")
) %>%
mutate(i = parse_number(term))
left_join(means, rhs, by = "i")
}
)
) %>%
select(spline_means) %>%
unnest(spline_means) %>%
print()
```
```{r plot-spline-posterior}
spline_means %>%
filter(str_detect(term, "mean")) %>%
ggplot() +
aes(x = CF, y = estimate, color = party) +
geom_vline(xintercept = 0, color = "gray") +
geom_hline(yintercept = 0, color = "black") +
geom_ribbon(
aes(ymin = conf.low, ymax = conf.high, fill = party),
color = NA,
alpha = 0.3,
show.legend = FALSE
) +
geom_line(size = 1, show.legend = FALSE) +
geom_line(
data = filter(spline_means, str_detect(term, "mean") == FALSE),
aes(linetype = str_detect(term, "lower")),
color = "black"
) +
geom_rug(aes(y = NULL), alpha = 0.2, show.legend = FALSE) +
scale_color_manual(values = party_code_colors) +
scale_fill_manual(values = party_code_colors) +
scale_linetype_manual(
values = c(2, 3),
labels = c("TRUE" = "1 sd below mean", "FALSE" = "1 sd above mean")
) +
facet_wrap(
~ party,
scales = 'free_x',
labeller = as_labeller(c("D" = "Democrats", "R" = "Republicans"))
) +
labs(
title = "How CF Score Affects Candidate Utility",
subtitle = "Negligible interaction with district-party ideology",
x = "Candidate CF Score",
y = "Spline function of ideal point distance",
linetype = TeX("$\\bar{\\theta}_{g}$ value")
) +
theme(
legend.position = c(.9, .8),
legend.background = element_rect(fill = alpha("white", 0.5))
) +
# coord_cartesian(xlim = c(-5.5, 5.5))
NULL
```
The key finding from the conditional logit model is the spline function plotted in Figure \@ref(fig:plot-spline-posterior).
The spline is a function of the common ideal point metric $\Delta$, which means the spline is a function of both CF scores and district-party ideology.
This means that the shape of the spline function comes from two signals in the data: first, which $\Delta$ values are related to candidate utility, and second, which combination of CF scores and district-party ideology (in terms of $\alpha$ and $\beta$ values) more strongly predicts candidate utility.
I show candidate CF scores along the horizontal axis, and spline functions holding district-party ideology fixed at different values are plotted on the vertical axis.
Solid lines show the spline function conditioned on _average_ district-party ideology in each party, while dashed and dotted lines condition the spline function on district-party ideology values one standard deviation above and below the mean.
The shaded region shows the 90% posterior interval for the spline function conditioning on the average district-party ideology, calculated from samples from the variational posterior.^[
It is worth noting the value of Bayesian computation for generating uncertainty intervals for a complex function such as this.