-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3 Model selection and validation.Rmd
283 lines (224 loc) · 16.2 KB
/
3 Model selection and validation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: "3 Model Selection and Validation"
author: "Paul Hively"
date: "May 16, 2016"
output: html_document
---
Source can be viewed [on GitHub](https://github.com/phively/gsb-sols-proj/blob/master/3%20Model%20selection%20and%20validation.Rmd)
# Setup and loading the data
Begin by loading the data used for the visualizations in part 1.
```{r, message=F, warning=F, cache=F}
#### Run Rscript0 to load useful packages and functions ----
source("Rscript0 - libraries.R")
source("f.Wrangle.R")
source("f.Diagnostics.R")
source("f.JMod.R")
```
```{r, message=F, warning=F, cache=F}
#### Model data cleanup ----
source("Rscript1a - modeling data appends.R")
source("Rscript1b - derived variables.R")
```
* [Rscript0 - libraries.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript0%20-%20libraries.R)
* [f.Wrangle.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Wrangle.R)
* [f.Diagnostics.R](https://github.com/phively/gsb-sols-proj/blob/master/f.Diagnostics.R)
* [f.JMod.R](https://github.com/phively/gsb-sols-proj/blob/master/f.JMod.R)
* [Rscript1a - modeling data appends.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript1a%20-%20modeling%20data%20appends.R)
* [Rscript1b - derived variables.R](https://github.com/phively/gsb-sols-proj/blob/master/Rscript1b%20-%20derived%20variables.R)
The data frame `mderived` now contains the fields that may be useful for modeling.
# Quick review
My goal is to find a way to accurately forecast fundraising revenue. The current model -- "JMod" -- discounts solicitations to a fixed percentage of their expected value based on their current stage and how long it is until the end of the fiscal year.
| Stage | July-Feb| Mar-Apr | May | June |
|----------|:-------:|:-------:|:-------:|:-------:|
| Plan | 1/6 | 1/8 | 0 | 0 |
| Clear | 1/3 | 1/6 | 1/8 | 0 |
| Ask | 2/3 | 1/3 | 1/6 | 1/8 |
| Oral | 2/3 | 1/3 | 1/6 | 1/8 |
| Paperwork| 1 | 1 | 1 | 1 |
This makes use of three pieces of information: the current solicitation stage $S_{i}$, and today's date $t_{i}$ compared to the expected close date.
[Last time](http://phively.github.io/gsb-sols-proj/2_Training_and_variable_selection.html), I ran the data through random forests and found several additional predictive variables: `Solicitation.Type.Desc`, `Planned.Amt`, `Expected.Amt`, and `Ask.Amt`.
```{r, echo=F, message=F, warning=F, cache=F}
# Load the saved errors file from last time
errors <- data.frame(read.csv("classification.errors.txt", sep="\t", stringsAsFactors=F))
# Keep only the relevant rows
errors <- errors[errors$Model %in% c("GLM", "JMod", "Random forest 3"),]
rownames(errors) <- NULL
```
# Logistic regression, take two
## Main effects model, raw data
What happens throwing all the main effects in without a transformation?
```{r, message=F, warning=F, cache=F}
# Plan model, main effects only
glm.pln <- glm(FY.Plan.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Plan.Future + planning.fiscal.mo, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr <- glm(FY.Clear.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Clear.Future + clear.fiscal.mo, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask <- glm(FY.Ask.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Ask.Future + ask.fiscal.mo, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora <- glm(FY.Oral.Book ~ Solicitation.Type.Desc + Planned.Amt + Expected.Amt + Ask.Amt + Oral.Future + oral.fiscal.mo, data=mderived, family=binomial())
# Put together the error rates
modnames <- c("Plan", "Clear", "Ask", "Oral")
confuse <- ConfusionMatrix(list(glm.pln, glm.clr, glm.ask, glm.ora), modnames=modnames, counts=F)
# Row to add to the error table
dat1 <- CalcErrorRates(confuse, model.name="GLM main effects notrans", modnames)
# Print the error table
kable(rbind(errors, dat1), digits=2)
```
That's a shockingly large improvement whether we use the original GLM or JMod as the baseline. It's not directly comparable to the out-of-bag Random forest 3 estimates, but I'll use a cross-validated error estimate after I'm happy with the final model form.
# Variable distributions
## Dollar amounts
Before looking at the data, I know that `Planned.Amt`, `Expected.Amt`, and `Ask.Amt` are continuous variables theoretically bounded by $[0,\infty)$, but expect them to be very unevenly distributed -- because of solicitation strategy they will tend to cluster around nice, round numbers like $100, $25,000, and so on. I don't want to use complex transformations (mixture basis, spline basis) to maintain interpretability, and because the lower bound is exactly 0, some simple transformations won't make sense.
Here's how the raw data look:
```{r, echo=F, message=F, warning=F, cache=F}
## Planned Ask Amount
ggplot(mderived[!is.na(mderived$Planned.Amt),], aes(x=Planned.Amt, y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_continuous(labels=scales::dollar) + labs(title="Planned Ask Amount")
## Expected Amount
ggplot(mderived[!is.na(mderived$Expected.Amt),], aes(x=Expected.Amt, y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_continuous(labels=scales::dollar) + labs(title="Expected Amount")
## Ask Amount
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=Ask.Amt, y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_continuous(labels=scales::dollar) + labs(title="Ask Amount")
```
There's a very small number of very large ask amounts. It's a little hard to see, so here's what the data look like on a $\log_{10}$ scale. So that we don't have to drop the 0s, define the transformation:
$$ x^\text{*} = \log_{10}(x + \alpha)$$
The smallest non-zero ask amount is:
```{r, message=F, warning=F, cache=F}
(alpha <- min(na.omit(mderived$Ask.Amt[mderived$Ask.Amt != 0])))
```
Plotting the data under this transformation.
```{r, echo=F, message=F, warning=F, cache=F}
## Planned Ask Amount
ggplot(mderived, aes(x=(Planned.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Planned Ask Amount", x=expression(paste(log[10],"(Planned.Amt + ",alpha,")")))
## Expected Amount
ggplot(mderived[!is.na(mderived$Expected.Amt),], aes(x=(Expected.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Expected Amount", x=expression(paste(log[10],"(Expected.Amt + ",alpha,")")))
## Ask Amount
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=(Ask.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.5) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Ask Amount", x=expression(paste(log[10],"(Ask.Amt + ",alpha,")")))
```
That's actually not too far from symmetric.
## Interactions
Are there interactions worth looking at between `Solicitation.Type.Desc` and the others?
```{r, message=F, warning=F, cache=F}
table(mderived$Solicitation.Type.Desc)
```
Okay, `Solicitation.Type.Desc` clearly needs to be aggregated, so let's go with Other, Outright Gift, and Standard Pledge. By default, the alphabetically first factor level is the baseline under the treatment constraint, so the model will show contrasts between Other and Outright Gift, and Other and Standard Pledge, which is nice for interpretability.
Look at `Ask.Amt` by `Sol.Type.Agg`:
```{r, echo=F, message=F, warning=F, cache=F}
## Ask Amount
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=(Ask.Amt + alpha), y=..density..)) + geom_histogram(bins=60, alpha=.55) + geom_density(color="gray") + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Ask Amount", x=expression(paste(log[10],"(Ask.Amt + ",alpha,")"))) + facet_grid(Sol.Type.Agg ~ .)
## Ask Amount with result
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=(Ask.Amt + alpha), y=..density.., fill=FY.Ask.Book)) + geom_histogram(bins=60, alpha=.25, position="identity") + geom_density(alpha=.25) + scale_x_log10(breaks=10^(0:7), labels=scales::dollar) + labs(title="Ask Amount with result", x=expression(paste(log[10],"(Ask.Amt + ",alpha,")"))) + facet_grid(Sol.Type.Agg ~ .)
```
Other and Standard Pledge look more nearly flat than Outright Gift, so it could be worth checking out.
Also check `ask.fiscal.mo` by `Sol.Type.Agg`:
```{r, echo=F, message=F, warning=F, cache=F}
fiscal.month.name <- strtrim(month.name, 3)[c(7:12, 1:6)]
## Ask by month
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=ask.fiscal.mo, y=..density..)) + geom_histogram(bins=12, alpha=.5) + geom_density(color="gray") + labs(title="Asks by gift type") + facet_grid(Sol.Type.Agg ~ .) + scale_x_continuous(labels=fiscal.month.name, breaks=c(1:12))
## Ask by month with result
ggplot(mderived[!is.na(mderived$Ask.Amt),], aes(x=ask.fiscal.mo, y=..density.., fill=FY.Ask.Book)) + geom_histogram(bins=12, alpha=.25, position="identity") + geom_density(alpha=.25) + labs(title="Asks by gift type with result") + facet_grid(Sol.Type.Agg ~ .) + scale_x_continuous(labels=fiscal.month.name, breaks=c(1:12))
```
Outright gift asks have more success by far in December. Could be worth checking out.
Based on previous findings I expect the Annual Fund group is mostly responsible for plummeting closed solicitation rates toward the end of the fiscal year. Let's check.
```{r, echo=F, message=F, warning=F, cache=F}
## Ask by month with result
ggplot(mderived[!is.na(mderived$Ask.Amt) & !is.na(mderived$Ask.Band),], aes(x=ask.fiscal.mo, y=..density.., fill=FY.Ask.Book)) + geom_histogram(bins=12, alpha=.25, position="identity") + geom_density(alpha=.25) + labs(title="Asks by gift type") + facet_grid(Sol.Type.Agg ~ Ask.Band) + scale_x_continuous(labels=fiscal.month.name, breaks=c(1:12)) + theme(axis.text.x=element_text(angle=90, vjust=.4))
```
Confirmed -- the AF group gives on their schedule, not ours.
Additional considerations -- does it make sense to *remove* 0s before the $\log_{10}$ transformation as opposed to adding a constant $\alpha$?
* `Planned.Amt` = $0 is an error
* `Ask.Amt` = $0 is an error
* `Expected.Amt` = $0 is not an error; reflects what the solicitation manager thought was most likely at the time
I think it makes sense to keep the $0 values and transform them appropraitely. If necessary, `Ask.Amt` = $0 can be filled in with `Planned.Amt` and vice versa.
# Main effects model with transformed data
```{r, message=F, warning=F, cache=F}
# Function to transform dollar amounts as above
LogTrans <- function(x, a=alpha) {log10(x + a)}
# Function to turn fiscal month into a factor
FacMonth <- function(x) {factor(x, labels=fiscal.month.name)}
# New variables
mderived <- mderived %>% mutate(lt.Planned.Amt = LogTrans(Planned.Amt), lt.Expected.Amt = LogTrans(Expected.Amt), lt.Ask.Amt = LogTrans(Ask.Amt), fm.planning = FacMonth(planning.fiscal.mo), fm.clear = FacMonth(clear.fiscal.mo), fm.ask = FacMonth(ask.fiscal.mo), fm.oral = FacMonth(oral.fiscal.mo))
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future + fm.planning, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future + fm.clear, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future + fm.ask, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future + fm.oral, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat2 <- CalcErrorRates(confuse, model.name="GLM main effects trans", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2), digits=2)
```
This gives the same error rate as the untransformed data, interestingly. Let's look at each of these models.
```{r, message=F, warning=F, cache=F}
summary(glm.pln.t)
```
Planned and Expected seem to cancel each other out to some degree, but just including one increases the error rate:
```{r, message=F, warning=F, cache=F}
# Error comparison
rbind(dat2, CalcErrorRates(ConfusionMatrix(list(
update(glm.pln.t, . ~ . -lt.Expected.Amt),
update(glm.clr.t, . ~ . -lt.Expected.Amt),
update(glm.ask.t, . ~ . -lt.Expected.Amt),
update(glm.ora.t, . ~ . -lt.Expected.Amt)
), modnames=modnames, counts=F
), model.name="No Expected.Amt", modnames))
```
So it looks like Month doesn't make much of a difference in the end, which is intellectally unsatisfying, but perhaps an inevitable result of optimizing for probability to close (too noisy). The final models can drop these.
```{r, message=F, warning=F, cache=F}
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived, family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived, family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived, family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived, family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), modnames=modnames, counts=F)
# Row to add to the error table
dat3 <- CalcErrorRates(confuse, model.name="GLM trans no month", modnames)
# Print the error table
kable(rbind(errors, dat1, dat2, dat3), digits=2)
# Save the models
save(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t, file="logistic.models.Rdata")
```
# Cross-validated error estimate
In-sample error estimates are biased, so let's see what the 5-fold cross-validated error is. Note that it's important that the coefficients be calculated separately for each holdout sample.
```{r, message=F, warning=F, cache=F}
## Randomly shuffle the data
set.seed(57141)
k <- 5
rand.ind <- sample(1:nrow(mderived), nrow(mderived))
n.size <- ceiling(nrow(mderived)/k)
folds <- list()
# Grab next n.size shuffled row indices
for (i in 1:k) {folds[[i]] <- na.omit(rand.ind[(n.size * (i - 1) + 1):(n.size * i)])}
## Cross-validation
# Vector to store the errors
error.list <- NULL
for (i in 1:k) {
# Plan model, main effects only
glm.pln.t <- glm(FY.Plan.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Plan.Future, data=mderived[-folds[[i]],], family=binomial())
# Clear model, main effects only
glm.clr.t <- glm(FY.Clear.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + Clear.Future, data=mderived[-folds[[i]],], family=binomial())
# Ask model, main effects only
glm.ask.t <- glm(FY.Ask.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Ask.Future, data=mderived[-folds[[i]],], family=binomial())
# Oral model, main effects only
glm.ora.t <- glm(FY.Oral.Book ~ Sol.Type.Agg + lt.Planned.Amt + lt.Expected.Amt + lt.Ask.Amt + Oral.Future, data=mderived[-folds[[i]],], family=binomial())
# Put together the error rates
confuse <- ConfusionMatrix(list(glm.pln.t, glm.clr.t, glm.ask.t, glm.ora.t), testdata=mderived[folds[[i]],], modnames=modnames, counts=T)
# Row to add to the error table
error.list <- rbind(error.list, CalcErrorRates(confuse, model.name="GLM trans no month, xval", modnames))
}
## Add averaged error to the table
dat4 <- error.list %>% group_by(Model) %>% summarise(Plan = sum(Plan)/nrow(model.matrix(glm.pln)), Clear = sum(Clear)/nrow(model.matrix(glm.clr)), Ask = sum(Ask)/nrow(model.matrix(glm.ask)), Oral = sum(Oral)/nrow(model.matrix(glm.ora)))
kable(rbind(errors, dat3, dat4), digits=2)
```
The error actually comes out pretty much the same.
Of course, these error rates and models need to be compared to JMod in its native environment, namely predicted dollar amounts.
# Packages used
```{r}
session_info()
```