3 Model selection and validation.Rmd

---
title: "3 Model Selection and Validation"
author: "Paul Hively"
date: "May 16, 2016"
output: html_document
---

Source can be viewed [on GitHub](https://github.com/phively/gsb-sols-proj/blob/master/3%20Model%20selection%20and%20validation.Rmd)

# Setup and loading the data

Begin by loading the data used for the visualizations in part 1.

```{r, message=F, warning=F, cache=F}
#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
source("f.Diagnostics.R")
source("f.JMod.R")
```
```{r, message=F, warning=F, cache=F}
#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
source("Rscript1b - derived variables.R")
```

* [Rscript0 - libraries.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript0%20-%20libraries.R)
* [f.Wrangle.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Wrangle.R)
* [f.Diagnostics.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Diagnostics.R)
* [f.JMod.R](https://github.com/phively/gsb-sols-proj/blob/master/f.JMod.R)
* [Rscript1a - modeling data appends.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript1a%20-%20modeling%20data%20appends.R)
* [Rscript1b - derived variables.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript1b%20-%20derived%20variables.R)

The data frame `mderived` now contains the fields that may be useful for modeling.

# Quick review

My goal is to find a way to accurately forecast fundraising revenue. The current model -- "JMod" -- discounts solicitations to a fixed percentage of their expected value based on their current stage and how long it is until the end of the fiscal year.

| Stage    | July-Feb| Mar-Apr |   May   |  June   |
|----------|:-------:|:-------:|:-------:|:-------:| 
| Plan     |  1/6    | 1/8     | 0       |   0     |
| Clear    |  1/3    | 1/6     | 1/8     |   0     |
| Ask      |  2/3    | 1/3     | 1/6     |   1/8   |
| Oral     |  2/3    | 1/3     | 1/6     |   1/8   |
| Paperwork|  1      | 1       | 1       |   1     |

This makes use of three pieces of information: the current solicitation stage $S_{i}$, and today's date $t_{i}$ compared to the expected close date. 

[Last time](http://phively.github.io/gsb-sols-proj/2_Training_and_variable_selection.html), I ran the data through random forests and found several additional predictive variables: `Solicitation.Type.Desc`, `Planned.Amt`, `Expected.Amt`, and `Ask.Amt`.

```{r, echo=F, message=F, warning=F, cache=F}
# Load the saved errors file from last time
errors <- data.frame(read.csv("classification.errors.txt", sep="\t", stringsAsFactors=F))
# Keep only the relevant rows
errors <- errors[errors$Model %in% c("GLM", "JMod", "Random forest 3"),]
rownames(errors) <- NULL
```

# Logistic regression, take two

## Main effects model, raw data

What happens throwing all the main effects in without a transformation?

```{r, message=F, warning=F, cache=F}
# Plan model, main effects only
glm.pln <- glm(FY.Plan.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Plan.Future + planning.fiscal.mo, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr <- glm(FY.Clear.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Clear.Future + clear.fiscal.mo, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask <- glm(FY.Ask.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Ask.Future + ask.fiscal.mo, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora <- glm(FY.Oral.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Oral.Future + oral.fiscal.mo, data=mderived, family=binomial())
# Put together the error rates
modnames <- c("Plan", "Clear", "Ask", "Oral")
confuse <- ConfusionMatrix(list(glm.pln, glm.clr, glm.ask, glm.ora), modnames=modnames, counts=F)
# Row to add to the error table
dat1 <- CalcErrorRates(confuse, model.name="GLM main effects notrans", modnames)
# Print the error table
kable(rbind(errors, dat1), digits=2)
```

That's a shockingly large improvement whether we use the original GLM or JMod as the baseline. It's not directly comparable to the out-of-bag Random forest 3 estimates, but I'll use a cross-validated error estimate after I'm happy with the final model form.

# Variable distributions

## Dollar amounts

Before looking at the data, I know that `Planned.Amt`, `Expected.Amt`, and `Ask.Amt` are continuous variables theoretically bounded by $[0,\infty)$, but expect them to be very unevenly distributed -- because of solicitation strategy they will tend to cluster around nice, round numbers like $100, $25,000, and so on. I don't want to use complex transformations (mixture basis, spline basis) to maintain interpretability, and because the lower bound is exactly 0, some simple transformations won't make sense.

Here's how the raw data look:

```{r, echo=F, message=F, warning=F, cache=F}
## Planned Ask Amount
ggplot(mderived[!is.na(mderived$Planned.Amt),], aes(x=Planned.Amt, y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_continuous(labels=scales::dollar) + labs(title="Planned Ask Amount")
## Expected Amount
ggplot(mderived[!is.na(mderived$Expected.Amt),], aes(x=Expected.Amt, y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_continuous(labels=scales::dollar) + labs(title="Expected Amount")
## Ask Amount
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=Ask.Amt, y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_continuous(labels=scales::dollar) + labs(title="Ask Amount")
```

There's a very small number of very large ask amounts. It's a little hard to see, so here's what the data look like on a $\log_{10}$ scale. So that we don't have to drop the 0s, define the transformation:

$$ x^\text{*} = \log_{10}(x + \alpha)$$

The smallest non-zero ask amount is:

```{r, message=F, warning=F, cache=F}
(alpha <- min(na.omit(mderived$Ask.Amt[mderived$Ask.Amt != 0])))
```

Plotting the data under this transformation.

```{r, echo=F, message=F, warning=F, cache=F}
## Planned Ask Amount
ggplot(mderived, aes(x=(Planned.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Planned Ask Amount", x=expression(paste(log[10],"(Planned.Amt + ",alpha,")")))
## Expected Amount
ggplot(mderived[!is.na(mderived$Expected.Amt),], aes(x=(Expected.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Expected Amount", x=expression(paste(log[10],"(Expected.Amt + ",alpha,")")))
## Ask Amount
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=(Ask.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Ask Amount", x=expression(paste(log[10],"(Ask.Amt + ",alpha,")")))
```

That's actually not too far from symmetric.

## Interactions

Are there interactions worth looking at between `Solicitation.Type.Desc` and the others?

```{r, message=F, warning=F, cache=F}
table(mderived$Solicitation.Type.Desc)
```

Okay, `Solicitation.Type.Desc` clearly needs to be aggregated, so let's go with Other, Outright Gift, and Standard Pledge. By default, the alphabetically first factor level is the baseline under the treatment constraint, so the model will show contrasts between Other and Outright Gift, and Other and Standard Pledge, which is nice for interpretability.

Look at `Ask.Amt` by `Sol.Type.Agg`:

```{r, echo=F, message=F, warning=F, cache=F}
## Ask Amount
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=(Ask.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.55) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Ask Amount", x=expression(paste(log[10],"(Ask.Amt + ",alpha,")"))) + facet_grid(Sol.Type.Agg ~ .)
## Ask Amount with result
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=(Ask.Amt + alpha), y=..density.., fill=FY.Ask.Book)) + geom_histogram(bins=60, alpha=.25, position="identity") + geom_density(alpha=.25) + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Ask Amount with result", x=expression(paste(log[10],"(Ask.Amt + ",alpha,")"))) + facet_grid(Sol.Type.Agg ~ .)
```

Other and Standard Pledge look more nearly flat than Outright Gift, so it could be worth checking out.

Also check `ask.fiscal.mo` by `Sol.Type.Agg`:

```{r, echo=F, message=F, warning=F, cache=F}
fiscal.month.name <- strtrim(month.name, 3)[c(7:12, 1:6)]
## Ask by month
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=ask.fiscal.mo, y=..density..)) + geom_histogram(bins=12, alpha=.5) + geom_density(color="gray") + labs(title="Asks by gift type") + facet_grid(Sol.Type.Agg ~ .) + scale_x_continuous(labels=fiscal.month.name, breaks=c(1:12))
## Ask by month with result
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=ask.fiscal.mo, y=..density.., fill=FY.Ask.Book)) + geom_histogram(bins=12, alpha=.25, position="identity") + geom_density(alpha=.25) + labs(title="Asks by gift type with result") + facet_grid(Sol.Type.Agg ~ .) + scale_x_continuous(labels=fiscal.month.name, breaks=c(1:12))
```

Outright gift asks have more success by far in December. Could be worth checking out.

Based on previous findings I expect the Annual Fund group is mostly responsible for plummeting closed solicitation rates toward the end of the fiscal year. Let's check.

```{r, echo=F, message=F, warning=F, cache=F}
## Ask by month with result
ggplot(mderived[!is.na(mderived$Ask.Amt) & !is.na(mderived$Ask.Band),], aes(x=ask.fiscal.mo, y=..density.., fill=FY.Ask.Book)) + geom_histogram(bins=12, alpha=.25, position="identity") + geom_density(alpha=.25) + labs(title="Asks by gift type") + facet_grid(Sol.Type.Agg ~ Ask.Band) + scale_x_continuous(labels=fiscal.month.name, breaks=c(1:12)) + theme(axis.text.x=element_text(angle=90, vjust=.4))
```

Confirmed -- the AF group gives on their schedule, not ours.

Additional considerations -- does it make sense to *remove* 0s before the $\log_{10}$ transformation as opposed to adding a constant $\alpha$?

* `Planned.Amt` = $0 is an error
* `Ask.Amt` = $0 is an error
* `Expected.Amt` = $0 is not an error; reflects what the solicitation manager thought was most likely at the time

I think it makes sense to keep the $0 values and transform them appropraitely. If necessary, `Ask.Amt` = $0 can be filled in with `Planned.Amt` and vice versa.

# Main effects model with transformed data

```{r, message=F, warning=F, cache=F}
# Function to transform dollar amounts as above
LogTrans <- function(x, a=alpha) {log10(x + a)}
# Function to turn fiscal month into a factor
FacMonth <- function(x) {factor(x, labels=fiscal.month.name)}
# New variables
mderived <- mderived %>% mutate(lt.Planned.Amt = LogTrans(Planned.Amt), lt.Expected.Amt = LogTrans(Expected.Amt), lt.Ask.Amt = LogTrans(Ask.Amt), fm.planning = FacMonth(planning.fiscal.mo), fm.clear = FacMonth(clear.fiscal.mo), fm.ask = FacMonth(ask.fiscal.mo), fm.oral = FacMonth(oral.fiscal.mo))
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future + fm.planning, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future + fm.clear, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future + fm.ask, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future + fm.oral, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat2 <- CalcErrorRates(confuse, model.name="GLM main effects trans", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2), digits=2)
```

This gives the same error rate as the untransformed data, interestingly. Let's look at each of these models.

```{r, message=F, warning=F, cache=F}
summary(glm.pln.t)
```

Planned and Expected seem to cancel each other out to some degree, but just including one increases the error rate: 

```{r, message=F, warning=F, cache=F}
# Error comparison
rbind(dat2, CalcErrorRates(ConfusionMatrix(list(
    update(glm.pln.t, . ~ . -lt.Expected.Amt),
    update(glm.clr.t, . ~ . -lt.Expected.Amt),
    update(glm.ask.t, . ~ . -lt.Expected.Amt),
    update(glm.ora.t, . ~ . -lt.Expected.Amt)
  ), modnames=modnames, counts=F
), model.name="No Expected.Amt", modnames))
```

So it looks like Month doesn't make much of a difference in the end, which is intellectally unsatisfying, but perhaps an inevitable result of optimizing for probability to close (too noisy). The final models can drop these.

```{r, message=F, warning=F, cache=F}
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat3 <- CalcErrorRates(confuse, model.name="GLM trans no month", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2, dat3), digits=2)
# Save the models
save(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t, file="logistic.models.Rdata")
```

# Cross-validated error estimate

In-sample error estimates are biased, so let's see what the 5-fold cross-validated error is. Note that it's important that the coefficients be calculated separately for each holdout sample.

```{r, message=F, warning=F, cache=F}
## Randomly shuffle the data
set.seed(57141)
k <- 5
rand.ind <- sample(1:nrow(mderived), nrow(mderived))
n.size <- ceiling(nrow(mderived)/k)
folds <- list()
# Grab next n.size shuffled row indices
for (i in 1:k) {folds[[i]] <- na.omit(rand.ind[(n.size * (i - 1) + 1):(n.size * i)])}
## Cross-validation 
# Vector to store the errors
error.list <- NULL
for (i in 1:k) {
  # Plan model, main effects only
  glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived[-folds[[i]],], family=binomial())
  # Clear model, main effects only
  glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived[-folds[[i]],], family=binomial())
  # Ask model, main effects only
  glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived[-folds[[i]],], family=binomial())
  # Oral model, main effects only
  glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived[-folds[[i]],], family=binomial())
  # Put together the error rates
  confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), testdata=mderived[folds[[i]],], modnames=modnames, counts=T)
  # Row to add to the error table
  error.list <- rbind(error.list, CalcErrorRates(confuse, model.name="GLM trans no month, xval", modnames))
}
## Add averaged error to the table
dat4 <- error.list %>% group_by(Model) %>% summarise(Plan = sum(Plan)/nrow(model.matrix(glm.pln)), Clear = sum(Clear)/nrow(model.matrix(glm.clr)), Ask = sum(Ask)/nrow(model.matrix(glm.ask)), Oral = sum(Oral)/nrow(model.matrix(glm.ora)))
kable(rbind(errors, dat3, dat4), digits=2)
```

The error actually comes out pretty much the same.

Of course, these error rates and models need to be compared to JMod in its native environment, namely predicted dollar amounts.

# Packages used
```{r}
session_info()
```