---
title: Updating the Forecast on Election Night with R
author: "Pierre-Antoine Kremp"
output: 
  html_document:
    theme: readable
    highlight: tango
---


Election night promises to be interesting... Ballots will get counted; networks will start calling states; and electoral votes will be slowly allocated to candidates. 

But we will probably know the winner well before any candidate reaches 270 electoral votes. 

Why? Vote intentions are correlated across states, so results from just a handful of states should help us better predict who will win the remaining ones. For example, if Hillary Clinton loses or underperforms the model's forecast in New Hampshire, this might indicate that she will also underperform in other swing states (maybe because there were a lot of hidden Trump voters that polls missed in NH... and may equally have missed in PA, OH, FL, etc.). In other words, early results matter not only because they directly affect the electoral vote tally, but also because they help us refine our forecast in other states. 
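To make the intuition concrete, here is a toy two-state sketch in base R. All the numbers (means, standard deviations, and the correlation) are made up for illustration and are not the model's estimates; the point is only that conditioning on one state moves the forecast in a correlated state.

```r
# Toy illustration with made-up numbers (not the model's actual estimates).
set.seed(1)

mu    <- c(NH = 0.52, PA = 0.53)   # hypothetical mean Clinton two-way shares
sds   <- c(0.03, 0.03)             # hypothetical forecast standard deviations
rho   <- 0.8                       # hypothetical cross-state correlation
Sigma <- diag(sds) %*% matrix(c(1, rho, rho, 1), 2) %*% diag(sds)

# Draw correlated scores: z %*% chol(Sigma) has covariance Sigma.
n     <- 100000
z     <- matrix(rnorm(2 * n), ncol = 2)
draws <- sweep(z %*% chol(Sigma), 2, mu, "+")
colnames(draws) <- names(mu)

# Clinton's chance in PA, before and after learning that she lost NH.
p_pa_prior    <- mean(draws[, "PA"] > 0.5)
p_pa_given_nh <- mean(draws[draws[, "NH"] < 0.5, "PA"] > 0.5)
c(prior = p_pa_prior, given_NH_loss = p_pa_given_nh)
```

With a positive correlation, the second number comes out lower than the first: a loss in NH is bad news in PA too.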

# Enter `update_prob()`

I wrote a very simple function in R, `update_prob()`, that performs this calculation. It takes as input a set of events that you want to condition on (e.g. Clinton or Trump already won a set of states; or the two-way final score in a given state falls within a certain interval), and it displays the updated forecast by drawing from the conditional distribution. 

You can use it on election night, or before if you want to check how the forecast would change under various scenarios.

```{r echo=FALSE, message=FALSE}
setwd("~/GitHub/polls")
source('~/GitHub/polls/update_prob.R')
```

Here are just a few examples, using hypothetical and somewhat far-fetched cases: 

If we call `update_prob()` without adding any new information, the function returns the baseline prediction from the model:

```{r echo = TRUE} 
update_prob()
```

Now imagine that New Hampshire gets called for Clinton; her chance of winning the election naturally increases:

```{r echo=TRUE} 
update_prob(clinton_states = c("NH"))
```

What happens if Trump wins Pennsylvania? (This is an unlikely outcome here, since the model sees Pennsylvania as more favorable to Clinton than New Hampshire.) Things start looking bad for Clinton:

```{r echo = TRUE}
update_prob(clinton_states = c("NH"), 
            trump_states   = c("PA")) 
```

If Ohio falls in the Trump column as well, we don't actually get much new information (Ohio is more pro-Trump than Pennsylvania anyway):

```{r echo = TRUE} 
update_prob(clinton_states = c("NH"), 
            trump_states   = c("PA", "OH")) 
```

It is also possible to plug in score intervals. For example, we don't know who is winning Florida for sure, but it looks like Clinton and Trump are close, with perhaps a slight lead for Clinton... Let's put the final score for Clinton somewhere between 49 and 52% in that state (taking her score in a two-way race, ignoring third-party candidates):

```{r echo = TRUE} 
update_prob(clinton_states = c("NH"), 
            trump_states   = c("OH", "PA"), 
            clinton_scores_list = list("FL" = c(49, 52))) 
```

We can add multiple score conditions: let's say Nevada is in the Clinton column and her score there is somewhere between 51 and 54%.

```{r echo = TRUE} 
update_prob(clinton_states = c("NH"), 
            trump_states   = c("OH", "PA"), 
            clinton_scores_list = list("FL" = c(49, 52), 
                                       "NV" = c(51, 54))) 
```

# How to use it

To use `update_prob()`, you only need two files, available from my [GitHub repository](http://github.com/pkremp/polls): 

* `update_prob.R`, which loads the data and defines the `update_prob()` function,
* `last_sim.RData`, which contains 4,000 simulations from the posterior distribution of the last model update.

Put the files into your working directory or use `setwd()`.

If you don't already have the `mvtnorm` package installed, install it by typing `install.packages("mvtnorm")` in the console.

To load the function and start playing, type `source("update_prob.R")`; the `update_prob()` function will then be in your global environment.

The function accepts the following arguments:

* `clinton_states`: a character vector of states already called for Clinton;
* `trump_states`: a character vector of states already called for Trump;
* `clinton_scores_list`: a list of elements named with 2-letter state abbreviations; each element should be a numeric vector of length 2 containing the lower and upper bounds (in percent) of the interval in which Clinton's share of the Clinton + Trump score is expected to fall;
* `target_sim`: an integer indicating the minimum number of samples that should be drawn from the conditional distribution (set to 1000 by default);
* `show_all_states`: a logical value indicating whether to output the state by state expected win probabilities (set to `FALSE` by default).
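
As a sketch of a call that combines all five arguments: the scenario below is made up, and the `file.exists()` guard is only there so that the snippet does nothing if the two files are not in your working directory.

```r
# Hypothetical scenario: NH called for Clinton, OH for Trump, Florida close.
called_clinton <- c("NH")
called_trump   <- c("OH")
fl_interval    <- list("FL" = c(49, 52))  # lower and upper bound, in percent

# Guarded so the snippet is harmless when the files are missing.
if (file.exists("update_prob.R") && file.exists("last_sim.RData")) {
  source("update_prob.R")
  update_prob(clinton_states      = called_clinton,
              trump_states        = called_trump,
              clinton_scores_list = fl_interval,
              target_sim          = 2000,   # ask for more accepted draws
              show_all_states     = TRUE)   # also print state-by-state output
}
```

Raising `target_sim` should reduce the simulation noise in the result, at the cost of a longer run when the constraints are tight.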

# How it works

The function draws from a multivariate normal distribution of vote intentions (with the same mean and covariance as the simulations from the last model update) and rejects draws that do not match the constraints expressed in the `clinton_states`, `trump_states`, and `clinton_scores_list` arguments. Accepted draws (i.e. simulations that match the constraints) are recorded. The function keeps drawing from the multivariate normal distribution (in batches of 100,000 draws) until `target_sim` draws have been accepted.
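
The loop described above can be sketched in a few lines of base R. This is a minimal illustration under made-up assumptions, not the code in `update_prob.R`: the three-state forecast, the helpers `draw_batch()` and `rejection_sample()`, and the `max_batches` cap are all hypothetical, and the draws use a Cholesky factor rather than `mvtnorm`.

```r
set.seed(42)

# Hypothetical three-state forecast (Clinton two-way share, in percent).
states <- c("NH", "PA", "OH")
mu     <- c(NH = 52, PA = 53, OH = 49)
sds    <- c(3, 3, 3)
corr   <- matrix(0.7, 3, 3); diag(corr) <- 1
Sigma  <- diag(sds) %*% corr %*% diag(sds)
R      <- chol(Sigma)

# Hypothetical helper: one batch of multivariate normal draws.
draw_batch <- function(n) {
  z   <- matrix(rnorm(n * length(mu)), ncol = length(mu))
  out <- sweep(z %*% R, 2, mu, "+")
  colnames(out) <- states
  out
}

# Hypothetical helper: propose batches until target_sim draws are accepted,
# giving up after max_batches batches.
rejection_sample <- function(keep, target_sim = 1000,
                             batch_size = 100000, max_batches = 50) {
  accepted <- NULL
  for (i in seq_len(max_batches)) {
    batch    <- draw_batch(batch_size)
    accepted <- rbind(accepted, batch[keep(batch), , drop = FALSE])
    if (nrow(accepted) >= target_sim) return(accepted)
  }
  stop("Rejection rate too high; consider relaxing the constraints.")
}

# Constraints: Clinton wins NH, Trump wins PA.
sims <- rejection_sample(function(d) d[, "NH"] > 50 & d[, "PA"] < 50)
nrow(sims)                 # at least target_sim accepted draws
mean(sims[, "OH"] > 50)    # Clinton's conditional win probability in OH
```

Because whole batches are filtered at once, the number of accepted draws is usually larger than `target_sim`, which is why `target_sim` is a minimum rather than an exact count.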

It then calculates the electoral votes distribution and returns the proportion of cases in which Clinton wins the electoral college, along with the standard error of this proportion.
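
Here is a toy version of that last step. The accepted draws are faked with independent normals, the "safe" electoral vote totals are made up so that the example adds up to 538, and the binomial formula for the standard error is my assumption about what is reported; only the 2016 electoral vote counts for PA, OH, and FL are real.

```r
set.seed(7)

# Hypothetical accepted draws: Clinton two-way share (percent) in 3 states.
n_sims <- 1000
sims <- cbind(PA = rnorm(n_sims, 52, 3),
              OH = rnorm(n_sims, 49, 3),
              FL = rnorm(n_sims, 50, 3))

# 2016 electoral votes for these states; the "safe" totals are made up.
ev           <- c(PA = 20, OH = 18, FL = 29)
safe_clinton <- 240
safe_trump   <- 538 - safe_clinton - sum(ev)

# Electoral votes won by Clinton in each simulated election.
wins       <- 1 * (sims > 50)                       # 0/1 matrix of state wins
clinton_ev <- safe_clinton + as.vector(wins %*% ev[colnames(sims)])

# Share of simulations in which Clinton reaches 270, with its standard error.
p_win <- mean(clinton_ev >= 270)
se    <- sqrt(p_win * (1 - p_win) / n_sims)         # binomial standard error
c(p_win = p_win, se = se)
```

The standard error shrinks with the square root of the number of accepted draws, so quadrupling `target_sim` would roughly halve it.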

This is a very simple and arguably dumb sampling algorithm. If your constraints are too tight, or your scenario is too far-fetched, the rejection rate will be extremely high. Stan would be much more efficient, as its sampling algorithm would learn which regions of the parameter space to focus on; but rejection sampling seems to work well enough for this application, unless you want to explore really complicated and unlikely combinations of scenarios, or impose extremely tight constraints.
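
A quick back-of-the-envelope calculation shows how fast the cost grows. The acceptance rates below are hypothetical; the only inputs taken from the function are the default `target_sim` of 1000 and the batch size of 100,000.

```r
target_sim  <- 1000
batch_size  <- 100000
accept_prob <- c(0.10, 0.01, 0.001, 0.00001)  # hypothetical acceptance rates

# On average, target_sim / accept_prob proposals are needed.
expected_draws   <- target_sim / accept_prob
expected_batches <- ceiling(expected_draws / batch_size)
data.frame(accept_prob, expected_draws, expected_batches)
```

A scenario that only one draw in 100,000 satisfies needs on the order of 100 million proposals, which is where a smarter sampler would start to pay off.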

If the rejection rate is too high and sampling looks like it would take too long, the function stops and issues a message inviting you to check if your constraints make sense and to consider relaxing them a bit.