forked from proback/BeyondMLR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04-Poisson-Regression.Rmd
1299 lines (978 loc) · 108 KB
/
04-Poisson-Regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Chapter 4"
subtitle: "Poisson Regression"
output:
pdf_document:
number_sections: yes
html_document: default
---
# Poisson Regression {#ch-poissonreg}
## Learning Objectives
After finishing this chapter, you should be able to:
- Describe why simple linear regression is not ideal for Poisson data.
- Write out a Poisson regression model and identify the assumptions for inference.
- Write out the likelihood for a Poisson regression and describe how it could be used to estimate coefficients for a model.
- Interpret estimated coefficients from a Poisson regression and construct confidence intervals for them.
- Use deviances for Poisson regression models to compare and assess models.
- Use an offset to account for varying effort in data collection.
- Fit and use a zero-inflated Poisson (ZIP) model.
## Introduction to Poisson Regression
Consider the following questions:
1. Are the number of motorcycle deaths in a given year related to a state's helmet laws?
2. Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
3. Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
4. Has the number of deformed fish in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade?
Each example involves predicting a response using one or more explanatory variables, although these examples have response variables that are counts per some unit of time or space. A Poisson random variable is often used to model counts; see Chapter \@ref(ch-distthry) for properties of the Poisson distribution. Since a Poisson random variable is a count, its minimum value is zero and, in theory, the maximum is unbounded. We'd like to model our main parameter $\lambda$, the average number of occurrences per unit of time or space, as a function of one or more covariates. For example, in the first question above, $\lambda_i$ represents the average number of motorcycle deaths in a year for state $i$, and we hope to show that state-to-state variability in $\lambda_i$ can be explained by state helmet laws.
For an OLS linear regression model, the parameter of interest is the average response, $\mu_i$, for subject $i$, and $\mu_i$ is modeled as a line in the case of one explanatory variable. By analogy, it might seem reasonable to try to model the Poisson parameter $\lambda_i$ as a linear function of an explanatory variable but there are some problems with this approach. In fact, a model like $\lambda_i=\beta_0+\beta_1x_i$ doesn't work well for Poisson data. A line is certain to yield negative values for certain $x_i$, but $\lambda_i$ can only take on values from 0 to $\infty$. In addition, the equal variance assumption in linear regression inference is violated because as the mean rate for a Poisson variable increases the variance also increases (recall from Chapter \@ref(ch-distthry) that, if $Y$ is the observed count, then $E(Y)=Var(Y)=\lambda$).
One way to avoid these problems is to model log($\lambda_i$) instead of $\lambda_i$ as a function of the covariates. The log($\lambda_i$) takes on values from $-\infty$ to $\infty$. We can also take into account the increase in the variance with an increasing mean using this approach. (Note that throughout *Broadening Your Statistical Horizons* we use log to represent the natural logarithm.) Thus, we will consider the **Poisson regression** model:
\begin{equation}
log(\lambda_i)=\beta_0+\beta_1 x_i
(\#eq:poisReg)
\end{equation}
where the observed values $Y_i \sim$ Poisson with $\lambda=\lambda_i$ for a given $x_i$. For example, each state $i$ can potentially have a different $\lambda$ depending on its value of $x_i$, where $x_i$ could represent presence or absence of a particular helmet law. Note that the Poisson regression model contains no separate error term like the $\epsilon$ we see in linear regression, because $\lambda$ determines both the mean and the variance of a Poisson random variable.
### Poisson Regression Assumptions
Much like OLS, using Poisson regression to make inferences requires model assumptions.
1. __Poisson Response__ The response variable is a count per unit of time or space, described by a Poisson distribution.
2. __Independence__ The observations must be independent of one another.
3. __Mean=Variance__ By definition, the mean of a Poisson random variable must be equal to its variance.
4. __Linearity__ The log of the mean rate, log($\lambda$), must be a linear function of x.
### A Graphical Look at Poisson Regression
```{r load_packageschp4, include=FALSE}
library(dplyr)
library(ggplot2) # enable ggplot
library(gridExtra) # enable multiple plotting
library(knitr) #for kable
library(mosaic)
library(xtable)
library(pscl) # needed for the zero inflation model
library(multcomp)
library(pander)
library(MASS)
library(readr)
library(forcats)
if(knitr::is_html_output()){options(knitr.table.format = "html")} else {options(knitr.table.format = "latex")}
```
```{r, OLSpois, fig.align="center",out.width="60%", fig.cap='Regession Models: Linear Regression (left) and Poisson Regression (right)',echo=FALSE, warning=FALSE, message=FALSE}
## Sample data for graph of OLS normality assumption
## Code from https://stackoverflow.com/questions/31794876/ggplot2-how-to-curve-small-gaussian-densities-on-a-regression-line?rq=1
set.seed(0)
dat <- data.frame(x=(x=runif(10000, 0, 50)),
y=rnorm(10000, 10*x, 100))
## breaks: where you want to compute densities
breaks <- seq(0, max(dat$x), len=5)
dat$section <- cut(dat$x, breaks)
## Get the residuals
dat$res <- residuals(lm(y ~ x, data=dat))
## Compute densities for each section, flip the axes, add means of sections
## Note: densities need to be scaled in relation to section size (2000 here)
dens <- do.call(rbind, lapply(split(dat, dat$section), function(x) {
d <- density(x$res, n=5000)
res <- data.frame(x=max(x$x)- d$y*1000, y=d$x+mean(x$y))
res <- res[order(res$y), ]
## Get some data for normal lines as well
xs <- seq(min(x$res), max(x$res), len=5000)
res <- rbind(res, data.frame(y=xs + mean(x$y),
x=max(x$x) - 1000*dnorm(xs, 0, sd(x$res))))
res$type <- rep(c("empirical", "normal"), each=5000)
res
}))
dens$section <- rep(levels(dat$section), each=10000)
ols_assume <- ggplot(dat, aes(x, y)) +
geom_point(size = 0.1) +
geom_smooth(method="lm", fill=NA, lwd=2) +
geom_path(data=dens[dens$type=="normal",], aes(x, y, group=section),
color="salmon", lwd=1.1) +
theme_bw() +
geom_vline(xintercept=breaks, lty=2)
# Now make Poisson regression picture
set.seed(0)
dat <- data.frame(x=(x=runif(1000, 0, 20)),
y=rpois(1000, exp(.1*x)))
## breaks: where you want to compute densities
breaks <- seq(2, max(dat$x), len=5)
dat$section <- cut(dat$x, breaks)
## Get the residuals
dat$res <- dat$y - .1*dat$x
## Compute densities for each section, flip the axes, add means of sections
## Note: densities need to be scaled in relation to section size
dens <- do.call(rbind, lapply(split(dat, dat$section), function(x) {
d <- density(x$res, n=500)
res <- data.frame(x=max(x$x)- d$y*10, y=d$x+mean(x$y))
res <- res[order(res$y), ]
## Get some data for poisson lines as well
xs <- seq(min(x$y), max(x$y), len=500)
res <- rbind(res, data.frame(y=xs,
x=max(x$x) - 10*dpois(round(xs), exp(.1*max(x$x)))))
res$type <- rep(c("empirical", "poisson"), each=500)
res
}))
dens$section <- rep(levels(dat$section), each=1000)
pois_assume <- ggplot(dat, aes(x, y)) +
geom_point(size = 0.1) +
geom_smooth(method="loess", fill=NA, lwd=2) +
geom_path(data=dens[dens$type=="poisson",], aes(x, y, group=section),
color="salmon", lwd=1.1) +
theme_bw() +
geom_vline(xintercept=breaks, lty=2)
grid.arrange(ols_assume, pois_assume, ncol = 2)
```
Figure \@ref(fig:OLSpois) illustrates a comparison of the OLS model for inference to Poisson regression using a log function of $\lambda$.
1. The graphic displaying the ordinary least squares (OLS) inferential model appears in the left panel of Figure \@ref(fig:OLSpois). It shows that for each level of X, the responses appear to be approximately normal. The panel on the right side of Figure \@ref(fig:OLSpois) depicts what a Poisson regression model looks like. For each level of X, the responses follow a Poisson distribution (Assumption 1). For Poisson regression, small values of $\lambda$ are associated with a distribution that is noticeably skewed with lots of small values and only a few larger ones. As $\lambda$ increases the distribution of the responses begins to look more and more like a normal distribution.
2. In the OLS regression model, the variation in $Y$ at each level of X, $\sigma^2$, is the same. For Poisson regression the responses at each level of X become more variable with increasing means, where variance=mean (Assumption 3).
3. In the case of OLS, the mean responses for each level of X, $\mu_{Y|X}$, fall on a line. In the case of the Poisson model, the mean values of $Y$ at each level of $X$, $\lambda_{Y|X}$, fall on a curve, not a line, although the logs of the means should follow a line (Assumption 4).
## Case Studies Overview
We take a look at the Poisson regression model in the context of three case studies. Each case study is based on real data and real questions. Modeling household size in the Philippines introduces the idea of regression with a Poisson response along with its assumptions. A quadratic term is added to a model to determine an optimal size per household, and methods of model comparison are introduced. The campus crime case study introduces two big ideas in Poisson regression modeling: offsets, to account for sampling effort, and overdispersion, when actual variability exceeds what is expected by the model. Finally, the weekend drinking example uses a modification of a Poisson model to account for more zeros than would be expected for a Poisson random variable. These three case studies also provide context for some of the familiar concepts related to modeling such as exploratory data analysis, estimation, and residual plots.
## Case Study: Household Size in the Philippines
How many other people live with you in your home? The number of people sharing a house differs from country to country and often from region to region. International agencies use household size when determining needs of populations, and the household sizes determine the magnitude of the household needs.
The Philippine Statistics Authority (PSA) spearheads the Family Income and Expenditure Survey (FIES) nationwide. The survey, which is undertaken every three years, is aimed at providing data on family income and expenditure, including levels of consumption by item of expenditure. Our [data](https://www.kaggle.com/issatingzon/exploratory-data-analysis/data), from the 2015 FIES, is a subset of 1500 of the 40,000 observations [@PSA]. Our dataset focuses on five regions: Central Luzon, Metro Manila, Ilocos, Davao, and Visayas (see Figure \@ref(fig:philippinesmap)).
```{r, philippinesmap, out.width='50%', fig.show='hold', fig.cap="Regions of the Philippines", echo=FALSE}
knitr::include_graphics("data/map_of_philippines.jpg")
```
At what age are heads of households in the Philippines most likely to find the largest number of people in their household? Is this association similar for poorer households (measured by the presence of a roof made from predominantly light/salvaged materials)? We begin by explicitly defining our response, $Y=$ number of people other than the head of the household. We then define the explanatory variables: age of the head of the household, type of roof (predominantly light/salvaged material or predominantly strong material), and location (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas). Note that predominantly light/salvaged materials are a combination of light material, mixed but predominantly light material, and mixed but predominantly salvaged material, and salvaged matrial. Our response is a count, so we consider a Poisson regression where the parameter of interest is $\lambda$, the average number of people, other than the head, per household. We will primarily examine the relationship between household size and age of the head of household, controlling for location and income.
```{r, include=FALSE}
fHH1 <- read_csv("data/fHH1.csv")
```
### Data Organization {#organizedata1}
The first five rows from our data set `fHH1.csv` are illustrated in Table \@ref(tab:fHH1table1). Each line of the data file refers to a household at the time of the survey:
- `age` = the age of the head of household
- `numLT5` = the number in the household under 5 years of age
- `total` = the number of people in the household other than the head
- `roof` = the type of roof in the household (either Predominantly Light/Salvaged Material, or Predominantly Strong Material, where stronger material can sometimes be used as a proxy for greater wealth)
- `location` = where the house is located (Central Luzon, Davao Region, Ilocos Region, Metro Manila, or Visayas)
```{r, fHH1table1, echo=FALSE, comment=NA}
library(kableExtra)
library(knitr)
headfHH1 <- fHH1 %>%
filter(row_number() < 6)
kable(headfHH1, booktabs=T,caption="The first five observations from the Philippines Household case study.")
```
### Exploratory Data Analyses {#explore}
```{r, include=FALSE}
mean(fHH1$total)
var(fHH1$total)
#sd(fHH1$total)
prop.table(table(fHH1$roof))
fHH1 %>% group_by(roof) %>%
summarise(mean=mean(total), sd=sd(total), var=var(total), n=n())
fHH1 %>% group_by(location) %>%
summarise(mean=mean(total), sd=sd(total), var=var(total), n=n())
```
For the rest of this case study, we will refer to the number of people in a household as the total number of people in that specific household __besides__ the head of household. The average number of people in a household is 3.68 (Var = 5.53), and there are anywhere from 0 to 16 people in the house. Over 11.1\% of these households are made from predominantly light and salvaged material. The mean number of people in a house for houses with a roof made from predominantly strong material is is 3.69 (Var=5.55), whereas houses with a roof made from predominantly light/salvaged material average 3.64 people (Var=5.41). Of the various locations, Visayas has the largest household size, on average, with a mean of 3.90 in the household, and the Davao Region has the smallest with a mean of 3.39.
```{r, nhouse, fig.align="center",out.width="60%", fig.cap='Philippines: Distribution of household size',echo=FALSE, warning=FALSE, message=FALSE}
ggplot(fHH1, aes(total)) + geom_histogram(binwidth = .25) +
xlab("Number in the house excluding head of household") +
ylab("Count of households")
```
Figure \@ref(fig:nhouse) reveals a fair amount of variability in the number in each house; responses range from 0 to 16 with many of the respondents reporting between 1 and 5 people in the house. Like many Poisson distributions, this graph is right skewed. It clearly does not suggest that the number of people in a household is a normally distributed response.
```{r, totalPoisByAge, fig.align="center", out.width="60%", fig.cap= 'Distribution of household sizes by age group of the household head.', echo=FALSE, warning = FALSE}
library(knitr) #for kable
cuts = cut(fHH1$age,breaks=c(15,20,25,30,35,40,45,50,55,60,65,70))
ageGrps <- data.frame(cuts,fHH1)
ggplot(data = ageGrps, aes(x = total)) +
geom_histogram(binwidth = .25) +
facet_wrap(cuts) +
xlab("Household size")
```
Figure \@ref(fig:totalPoisByAge) further shows that responses can be reasonably modeled with a Poisson distribution when grouped by a key explanatory variable: Age of the household head. These last two plots together suggest that Assumption 1 (Poisson Response) is satisfactory in this case study.
For Poisson random variables, the variance of $Y$ (i.e., the square of the standard deviation of $Y$), is equal to its mean, where $Y$ represents the size of an individual household. As the mean increases, the variance increases. So, if the response is a count and the mean and variance are approximately equal for each group of $X$, a Poisson regression model may be a good choice. In Table \@ref(tab:table1chp4) we display age groups by 5-year increments, to check to see if the empirical means and variances of the number in the house are approximately equal for each age group. This provides us one way in which to check the Poisson assumption.
```{r, table1chp4, echo=FALSE}
# Mean = Variance
table1chp4<- ageGrps %>% group_by(cuts) %>%
summarise(mnNum= mean(total),varNum=var(total),n=n())
kable(table1chp4, booktabs=T,
caption="Compare mean and variance of household size within each age group",
col.names = c("Age Groups", "Mean", "Variance", "n"))
```
If there is a problem with this assumption most often we see variances much larger than means. Here, as expected, we see more variability as age increases. However, it appears that the variance is smaller than the mean for lower ages, while the variance is greater than the mean for higher ages. Thus, there is some evidence of a violation of the mean=variance assumption (Assumption 3), although any violations are modest.
The Poisson regression model also implies that log($\lambda_i$), not the mean household size $\lambda_i$, is a linear function of age; i.e., $log(\lambda_i)=\beta_0+\beta_1\textrm{age}_i$. Therefore, to check the linearity assumption (Assumption 4) for Poisson regression we would like to plot log($\lambda_i$) by age. Unfortunately, $\lambda_i$ is unknown. Our best guess of $\lambda_i$ is the observed mean number in the household for each age (level of $X$). Because these means are computed for observed data, they are referred to as **empirical** means. Taking the logs of the empirical means and plotting by age provides a way to assess the linearity assumption. The smoothed curve added to Figure \@ref(fig:ageXnhouse) suggests that there is a curvilinear relationship between age and the log of the mean household size, implying that adding a quadratic term should be considered. This finding is consistent with the researchers' hypothesis that there is an age at which a maximum household size occurs. It is worth noting that we are not modeling the log of the empirical means, rather it is the log of the *true* rate that is modeled. Looking at empirical means, however, does provide an idea of the form of the relationship between log($\lambda)$ and $x_i$.
```{r, ageXnhouse,fig.align="center",out.width="60%",fig.cap= 'The log of the mean household sizes, besides the head of household, by age of the head of household, with loess smoother.', echo=FALSE, warning = FALSE}
## Checking linearity assumption: Empirical log of the means plot
sumStats <- fHH1 %>% group_by(age) %>%
summarise(mntotal = mean(total),logmntotal = log(mntotal),n=n())
ggplot(sumStats, aes(x=age, y=logmntotal)) +
geom_point()+
geom_smooth(method = "loess", size = 1.5)+
xlab("Age of head of the household") +
ylab("Log of the empirical mean number in the house")
```
We can extend Figure \@ref(fig:ageXnhouse) by fitting separate curves for each region (see Figure \@ref(fig:byregion)). This allows us to see if the relationship between mean household size and age is consistent across region. In this case, the relationships are pretty similar; if they weren't we could consider adding an age-by-region interaction to our eventual Poisson regression model.
```{r, byregion, fig.align="center", out.width="60%", fig.cap= 'Empirical log of the mean household sizes vs. age of the head of household, with loess smoother by region.', echo=FALSE, warning = FALSE}
sumStats2 <- fHH1 %>% group_by(age, location) %>%
summarise(mntotal = mean(total),logmntotal = log(mntotal), n=n())
ggplot(sumStats2, aes(x=age, y=logmntotal, color=location)) +
geom_point()+
geom_smooth(method = "loess", se=FALSE)+
xlab("Age of head of the household") +
ylab("Log empirical mean household size")
```
Finally, the independence assumption (Assumption 2) can be assessed using knowledge of the study design and the data collection process. In this case, we do not have enough information to assess the independence assumption with the information we are given. If each household was not selected individually in a random manner, but rather groups of households were selected from different regions with differing customs about living arrangements, the independence assumption would be violated. If this were the case, we could use a multilevel model like those discussed in later chapters with a village term.
### Estimation and Inference {#sec-PoisInference}
We first consider a model for which log($\lambda$) is linear in age. We then will determine whether a model with a quadratic term in age provides a significant improvement based on trends we observed in the exploratory data analysis.
R reports an estimated regression equation for the linear Poisson model as:
\[
\widehat{log(\lambda)} = 1.55 - 0.0047 (age)
\]
```{r, 4verb2, echo=FALSE, comment=NA}
# Modeling
modela = glm(total~age,family=poisson,data=fHH1)
tmp <- capture.output(summary(modela))
cat(tmp[c(3:4, 9:13, 16:20)], sep='\n')
#exp(modela$coefficients)
```
How can the coefficient estimates be interpreted in terms of this example? As done when interpreting slopes in the OLS models, we consider how the estimated mean number in the house, $\lambda$, changes as the age of the household head increases by an additional year. But in place of looking at change in the mean number in the house, with a Poisson regression we consider the log of the mean number in the house and then convert back to original units. For example, consider a comparison of two models---one for a given age ($x$) and one after increasing age by 1 ($x+1$)
\begin{align}
log(\lambda_X) &= \beta_0 + \beta_1X \\
log(\lambda_{X+1}) &= \beta_0 + \beta_1(X+1) \\
log(\lambda_{X+1})-log(\lambda_X) &= \beta_1 \\
log \left(\frac{\lambda_{X+1}}{\lambda_X}\right) &= \beta_1\\
\frac{\lambda_{X+1}}{\lambda_X} &= e^{\beta_1}
(\#eq:rateRatio)
\end{align}
These results suggest that by exponentiating the coefficient on age we obtain the multiplicative factor by which the mean count changes. In this case, the mean number in the house changes by a factor of $e^{-0.0047}=0.995$ or decreases by 0.5\% with each additional year older the household head is; or, we predict a 0.47\% increase in mean household size for a 1 year decrease in age of the household head ($1/.995=1.0047$). The quantity on the left hand side of Equation \@ref(eq:rateRatio) is referred to as a __rate ratio__ or __relative risk__, and it represents a percent change in the response for a unit change in X. In fact, for regression models in general, whenever a variable (response or explanatory) is logged, we make interpretations about multiplicative effects on that variable, while with unlogged variables we can reach our usual interpretations about additive effects.
```{r, 4verb2a, message=FALSE, include=FALSE}
# Wald type CI by hand
beta1hat <- summary(modela)$coefficients[2,1]
beta1se <- summary(modela)$coefficients[2,2]
beta1hat - 1.96*beta1se # lower bound
beta1hat + 1.96*beta1se # upper bound
exp(beta1hat - 1.96*beta1se)
exp(beta1hat + 1.96*beta1se)
```
Typically the standard errors for the estimated coefficients are included in Poisson regression output. Here the standard error for the estimated coefficient for age is 0.00094. We can use the standard error to construct a confidence interval for $\beta_1$. A 95\% CI provides a range of plausible values for the `age` coefficient and can be constructed:
\[\hat\beta_1-Z^*\cdot SE(\hat\beta_1), \hat\beta_1+Z^*\cdot SE(\hat\beta_1)\]
\[-0.0047-1.96*0.00094, -0.0047+1.96*0.00094\]
\[ (-0.0065, -0.0029).
\]
Exponentiating the endpoints yields a confidence interval for the relative risk; i.e., the percent change in household size for each additional year older. Thus $(e^{-0.0065},e^{-0.0029})=(0.993,0.997)$ suggests that we are 95\% confident that the mean number in the house decreases between 0.7\% and 0.3\% for each additional year older the head of household is. It is best to construct a confidence interval for the coefficient and then exponentiate the endpoints because the estimated coefficients more closely follow a normal distribution than the exponentiated coefficients. There are other approaches to constructing intervals in these circumstances, including profile likelihood, the delta method, and bootstrapping, and we will discuss some of those approaches later. In this case, for instance, the profile likelihood interval is nearly identical to the Wald-type (normal theory) confidence interval above.
```{r, message=FALSE}
# CI for betas using profile likelihood
confint(modela)
exp(confint(modela))
```
If there is no association between age and household size, there is no change in household size for each additional year, so $\lambda_X$ is equal to $\lambda_{X+1}$ and the ratio $\lambda_{X+1}/\lambda_X$ is 1. In other words, if there is no association between age and household size, then $\beta_1=0$ and $e^{\beta_1}=1$. Note that our interval for $e^{\beta_1}$, (0.993,0.997), does not include 1, so the model with age is preferred to a model without age; i.e., age is significantly associated with household size. Note that we could have similarly confirmed that our interval for $\beta_1$ does not include 0 to show the significance of age as a predictor of household size.
Another way to test the significance of the age term is to calculate a __Wald-type statistic__. A Wald-type test statistic is the estimated coefficient divided by its standard error. When the true coefficient is 0, this test statistic follows a standard normal distribution for sufficiently large $n$. The estimated coefficient associated with the linear term in age is ${\hat{\beta}_1}=-0.0047$ with standard error $SE(\hat{\beta}_1)=0.00094$. The value for the Wald test statistic is then $Z=\hat{\beta}_1/SE(\hat{\beta}_1)=-5.026$, where $Z$ follows a standard normal distribution if $\beta_1=0$. In this case, the two-sided p-value based on the standard normal distribution for testing $H_0:\beta_1=0$ is almost 0 ($p=0.000000501$). In conclusion, we have statistically significant evidence (Z = -5.026, p < .001) that average household size decreases as age of the head of household increases.
### Using Deviances to Compare Models {#sec-Devtocompare}
There is another way in which to assess how useful age is in the model. A __deviance__ is a way in which to measure how the observed data deviates from the model predictions; it will be defined more precisely in Section \@ref(sec-PoisResid), but it is similar to sum of squared errors (unexplained variability in the response) in OLS regression. Because we want models that minimize deviance, we calculate the __drop-in-deviance__ when adding age to the model with no covariates (the **null model**). The deviances for the null model and the model with age can be found in the model output. A residual deviance for the model with age is reported as 2337.1 with 1498 df. The output also includes the deviance and degrees of freedom for the null model (2362.5 with 1499 df). The drop-in-deviance is 25.4 (2362.5 - 2337.1) with a difference of only 1 df, so that the addition of one extra term (age) reduced unexplained variability by 25.4. If the null model were true, we would expect the drop-in-deviance to follow a $\chi^2$ distribution with 1 df. Therefore the p-value for comparing the null model to the model with age is found by determining the probability that the value for a $\chi^2$ random variable with one degree of freedom exceeds 25.4, which is essentially 0. Once again, we can conclude that we have statistically significant evidence ($\chi^2_{\text{df} =1}=25.4$, $p < .001$) that average household size decreases as age of the head of household increases.
```{r}
# p-value for test comparing the null and first order models
# drop.in.dev <- modela$null.deviance - modela$deviance; drop.in.dev
# diff.in.df <- modela$df.null - modela$df.residual; diff.in.df
# 1-pchisq(drop.in.dev, diff.in.df)
model0 = glm(total~1,family=poisson,data=fHH1)
anova(model0, modela, test = "Chisq")
```
More formally, we are testing:
$$Null \ (reduced) \ Model: \log(\lambda) = \beta_0 \ or \ \beta_1=0$$
$$Larger \ (full) \ Model: \log(\lambda) = \beta_0 + \beta_1(\textrm{age}) \ or \ \beta_1 \neq 0 $$
In order to use the drop-in-deviance test, the models being compared must be **nested**; i.e., all the terms in the smaller model must appear in the larger model. Here the smaller model is the null model with the single term $\beta_0$ and the larger model has $\beta_0$ and $\beta_1$, so the two models are indeed nested. For nested models, we can compare the models' residual deviances to determine whether the larger model provides a significant improvement.
Here, then, is a summary of these two approaches to hypothesis testing about terms in Poisson regression models:
<p align="center"> __Drop-in-deviance test to compare models__ </p>
- Compute the deviance for each model, then calculate: drop-in-deviance = residual deviance for reduced model - residual deviance for the larger model.
- When the reduced model is true, the drop-in-deviance $\sim \chi^2_d$
where d= the difference in the degrees of freedom associated with the two models (that is, the difference in the number of terms / coefficients).
- A large drop-in-deviance favors the larger model.
<p align="center"> __Wald test for a single coefficient__ </p>
- Wald-type statistic = estimated coefficient / standard error
- When the true coefficient is 0, for sufficiently large $n$, the test statistic $\sim$ N(0,1).
- If the magnitude of the test statistic is large, there is evidence that the true coefficient is not 0.
The drop-in-deviance and the Wald-type tests usually provide consistent results; however, if there is a discrepancy the drop-in-deviance is preferred. Not only does the drop-in-deviance test perform better in more cases, but it's also more flexible. If two models differ by one term, then the drop-in-deviance test essentially tests if a single coefficient is 0 like the Wald test does, while if two models differ by more than one term, the Wald test is no longer appropriate.
### Using Likelihoods to fit Poisson Regression Models (Optional) {#likelihood.sec}
Before continuing with model building, we take a short detour to see how coefficient estimates are determined in a Poisson regression model. The least squares approach requires a linear relationship between the parameter, $\lambda_i$ (the expected or mean response for observation $i$), and $x_i$ (the age for observation $i$). However, it is log$(\lambda_i)$, not $\lambda_i$, that is linearly related to X with the Poisson model. The assumptions of equal variance and normality also do not hold for Poisson regression. Thus, the method of least squares will not be helpful for inference in Poisson Regression. Instead of OLS, we employ the likelihood principle to find estimates of our model coefficients. We look for those coefficient estimates for which the likelihood of our data is maximized; these are the __maximum likelihood estimates__.
The likelihood for n *independent* observations is the product of the probabilities. For example, if we observe five households with household sizes of 4, 2, 8, 6, and 1 person beyond the head, the likelihood is:
\[ Likelihood = P(Y_1=4)*P(Y_2=2)*P(Y_3=8)*P(Y_4=6)*P(Y_5=1)\]
Recall that the probability of a Poisson response can be written
\[P(Y=y)=\frac{e^{-\lambda}\lambda^y}{y!}\] for $y = 0, 1, 2, ...$
So the likelihood can be written as
\begin{align*}
Likelihood&= \frac{ e^{-\lambda_1}\lambda_1^4 }{ 4! }*
\frac{ e^{-\lambda_2}\lambda_2^2 }{ 2! } *\frac{e^{-\lambda_3}\lambda_3^8}{8!}*
\frac{e^{-\lambda_4}\lambda_4^6}{6!}*\frac{e^{-\lambda_5}\lambda_5^1}{1!}\\
\end{align*}
where each $\lambda_i$ can differ for each household depending on a particular $x_i$. As in chapter \@ref(ch-beyondmost), it will be easier to find a maximum if we take the log of the likelihood and ignore the constant term resulting from the sum of the factorials:
\begin{align}
-logL& \propto \lambda_{1}-4log(\lambda_{1})+\lambda_{2}-2log(\lambda_{2}) \nonumber \\
& +\lambda_{3}-8log(\lambda_{3})+\lambda_{4}-6log(\lambda_{4}) \nonumber \\
& +\lambda_{5}-log(\lambda_{5})
(\#eq:poisLoglik)
\end{align}
Now if we had the age of the head of the household for each house (X), we consider the Poisson regression model:
\[
log(\lambda_i)=\beta_0+\beta_1x_i
\]
This implies that $\lambda$ differs for each age and can be determined using
\[\lambda_i=e^{\beta_0+\beta_1x_i.}\]
If the ages are $X=c(32,21,55,44,28)$ years, our loglikelihood can be written:
\begin{align}
logL & \propto [-e^{\beta_0+\beta_132}+4({\beta_0+\beta_132})]+
[-e^{\beta_0+\beta_121}+2({\beta_0+\beta_121})]+ \nonumber \\
& [-e^{\beta_0+\beta_155}+8({\beta_0+\beta_155})]+
[-e^{\beta_0+\beta_144}+6({\beta_0+\beta_144})]+ \nonumber \\
& [-e^{\beta_0+\beta_128}+({\beta_0+\beta_128})]
(\#eq:poisLoglik2)
\end{align}
To see this, match the terms in Equation \@ref(eq:poisLoglik) with those in Equation \@ref(eq:poisLoglik2) noting that $\lambda_i$ has been replaced with $e^{\beta_0+\beta_1x_i}$. It is Equation \@ref(eq:poisLoglik2) that will be used to estimate the coefficients $\beta_0$ and $\beta_1$. Although this looks a little more complicated than the loglikelihoods we saw in Chapter \@ref(ch-beyondmost), the fundamental ideas are the same. In theory, we try out different possible values of $\beta_0$ and $\beta_1$ until we find the two for which the loglikelihood is largest. Most statistical software packages have automated search algorithms to find those values for $\beta_0$ and $\beta_1$ that maximize the loglikelihood.
### Second Order Model
In Section \@ref(sec-Devtocompare), the Wald-type test and drop-in-deviance test both suggest that a linear term in age is useful. But our exploratory data analysis in Section \@ref(explore) suggests that a quadratic model might be more appropriate. A quadratic model would allow us to see if there exists an age where the number in the house is, on average, a maximum. The output for a quadratic model appears below.
```{r, 4verb3, echo=FALSE, comment=NA}
# Quadratic Model
fHH1 = fHH1 %>% mutate(age2 = age*age)
modela2 = glm(total~age+age2,family=poisson,data=fHH1)
tmp <- capture.output(summary(modela2))
cat(tmp[c(3:4, 9:14, 17:21)], sep='\n')
```
We can assess the importance of the quadratic term in two ways. First, the p-value for the Wald-type statistic for age$^2$ is statistically significant (Z = -11.058, p < 0.001). Another approach is to perform a drop-in-deviance test.
```{r, 4verb3a}
# p-value for test comparing the null and first order models
# drop.in.dev <- modela$deviance - modela2$deviance; drop.in.dev
# diff.in.df <- modela$df.residual - modela2$df.residual; diff.in.df
# 1-pchisq(104.78, 1)
anova(modela, modela2, test = "Chisq")
```
$H_0$: log($\lambda$)=$\beta_0+\beta_1age$ (reduced model)
$H_A:$ log($\lambda$)=$\beta_0+\beta_1age + \beta_2age^2$ (larger model)
The first order model has a residual deviance of 2337.1 with 1498 df and the second order model, the quadratic model, has a residual deviance of 2200.9 with 1497 df. The drop-in-deviance by adding the quadratic term to the linear model is 2337.1 - 2200.9 = 136.2 which can be compared to a $\chi^2$ distribution with one degree of freedom. The p-value is esentially 0, so the observed drop of 136.2 again provides significant support for including the quadratic term.
We now have an equation in age which yields the estimated log(mean number in the house).
\[
\textrm{log(mean numHouse)} = -0.333 + 0.071(\textrm{age}) - 0.00071 (\textrm{age}^2)
\]
```{r, include=FALSE}
# Finding the age where the number in the house is a maximum
coefa2 = modela2$coefficients[3]
coefa = modela2$coefficients[2]
coefi = modela2$coefficients[2]
estLogNumHouse.f <- function(age){
return(coefa2*(age)^2 + coefa*(age) + coefi)
}
optimize(estLogNumHouse.f, interval=c(20,70), maximum=TRUE)
```
As shown in the following, with calculus we can determine that the maximum estimated additional number in the house is $e^{1.608} = 4.996$ when the head of the household is 50.328 years old.
\begin{align}
\textrm{log(total)} & = -0.333 + 0.071(\textrm{age}) - 0.00071 (\textrm{age}^2) \\
\frac{d}{d\textrm{age}}\textrm{log(total)} & = 0 + 0.071 - 0.0014 (\textrm{age}) = 0 \\
(\textrm{age}) & = 50.04 \\
\textrm{max[log(total)]} & = -0.333 + 0.071 \times 50.04 - 0.00071 \times (50.04)^2 = 1.608
\end{align}
### Adding a covariate
We should consider other covariates that may be related to household size. By controlling for important covariates, we can obtain more precise estimates of the relationship between age and household size. In addition, we may discover that the relationship between age and household size may differ by levels of a covariate. One important covariate to consider is location. As described earlier in the case study, there are 5 different regions that are associated with the `Location` variable: Central Luzon, Metro Manila, Visayas, Davao Region, and Ilocos Region. Assessing the utility of including the covariate `Location` is, in essence, comparing two nested models; here the quadratic model is compared to the quadratic model plus terms for *Location*. Results from the fitted model appears below; note that the Central Luzon is the reference region that all other regions are compared to.
```{r, 4verb5, comment=NA, echo=FALSE}
modela2L = glm(total~age+age2+location,family=poisson,data=fHH1)
tmp <- capture.output(summary(modela2L))
cat(tmp[c(3:5, 10:19, 22:26)], sep='\n')
```
```{r, include=FALSE}
exp(modela2L$coefficients)
```
```{r, 4verb5a, include = FALSE}
# p-value for test comparing the null and first order models
drop.in.dev <- modela2$deviance - modela2L$deviance; drop.in.dev
diff.in.df <- modela2$df.residual - modela2L$df.residual; diff.in.df
1-pchisq(drop.in.dev, diff.in.df)
```
Our Poisson regression model now looks like:
\begin{align}
\textrm{log(total)} & = -0.384 + 0.070 \cdot \textrm{age} - 0.00070 \cdot \textrm{age}^2 +0.061 \cdot \textrm{IlocosRegion} + \\
& 0.054 \cdot\textrm{MetroManila} +0.112 \cdot\textrm{Visayas} - 0.019 \cdot \textrm{DavaoRegion}
\end{align}
Notice that because there are 5 different locations, we must represent the effects of different locations through 4 indicator variables. For example, $\hat{\beta}_6=-0.0194$ indicates that, after controlling for the age of the head of household, the log mean household size is 0.0194 lower for households in the Davao Region than for households in the reference location of Central Luzon. In more interpretable terms, mean household size is $e^{-0.0194}=0.98$ times "higher" (i.e., 2\% lower) in the Davao Region than in Central Luzon, when holding age constant.
```{r}
anova(modela2,modela2L, test="Chisq")
```
To test if the mean household size significantly differs by location, we must use a drop-in-deviance test, rather than a Wald-type test, because four terms (instead of just one) are added when including the `location` variable. From the Analysis of Deviance table above, adding the four terms corresponding to location to the quadratic model with age produces a statistically significant improvement $(\chi^2=13.144, df = 4, p=0.0106)$, so there is significant evidence that mean household size differs by location, after controlling for age of the head of household. Further modeling (not shown) shows that after controlling for location and age of the head of household, mean household size did not differ between the two types of roofing material.
```{r, 4morephil, comment=NA, echo=FALSE, include = FALSE}
modela4 = glm(total ~ age + age2 + location + roof,
family = poisson, data = fHH1)
summary(modela4)
```
### Residuals for Poisson Models (Optional) {#sec-PoisResid}
Residual plots may provide some insight into Poisson regression models, especially linearity and outliers, although the plots are not quite as useful here as they are for OLS. There are a few options for computing residuals and predicted values. Residuals may have the form of residuals for OLS models or the form of deviance residuals which, when squared, sum to the total deviance for the model. Predicted values can be estimates of the counts, $e^{\beta_0+\beta_1X}$, or log counts, $\beta_0+\beta_1X$. We will typically use the deviance residuals and predicted counts.
The residuals for OLS in simple linear regression have the form:
\begin{align}
OLS\hspace{2mm}residual_i &= obs_i - fit_i \nonumber \\
&={Y_i-\hat{\mu}_i} \nonumber \\
&= Y_i-(\hat{\beta}_0 +\hat{\beta}_1 X_i)
(\#eq:OLSresid)
\end{align}
Residual sum of squares (RSS) are formed by squaring and adding these residuals, and we generally seek to minimize RSS in model building. We have several options for creating residuals for Poisson regression models. One is to create residuals in much the same way as we do in OLS. For Poisson residuals, the predicted values are denoted by $\hat{\lambda}_i$ (in place of $\hat{\mu}_i$ in Equation \@ref(eq:OLSresid)); they are then standardized by dividing by the standard error, $\sqrt{\hat{\lambda}_i}$. These kinds of residuals are referred to as __Pearson residuals__.
\begin{equation}
\textrm{Pearson residual}_i = \frac{Y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}
(\#eq:pearson)
\end{equation}
Pearson residuals have the advantage that you are probably familiar with their meaning and the kinds of values you would expect. For example, after standardizing we expect most Pearson residuals to fall between -2 and 2. However, __deviance residuals__ have some useful properties that make them a better choice for Poisson regression.
First, we define a __deviance residual__ for an observation from a Poisson regression,
\begin{equation}
\textrm{deviance residual}_i = \textrm{sign}(Y_i-\hat{\lambda}_i)
\sqrt{
2 \left[Y_i log\left(\frac{Y_i}{\hat{\lambda}_i}\right)
-(Y_i - \hat{\lambda}_i) \right]}
(\#eq:deviance)
\end{equation}
Where $\textrm{sign}(x)$ is defined such that:
\[ \textrm{sign}(x) = \begin{cases} 1 & \textrm{if }\ x > 0 \\
-1 & \textrm{if }\ x < 0 \\
0 & \textrm{if }\ x = 0\end{cases}\]
As its name implies, a deviance residual describes how the observed data deviates from the fitted model. Squaring and summing the deviances for all observations produces the __residual deviance__ $=\sum (\textrm{deviance residual})^2_i$. Relatively speaking, observations for good fitting models will have small deviances; that is, the predicted values will deviate little from the observed. However, you can see that the deviance for an observation does not easily translate to a difference in observed and predicted responses as is the case with OLS models.
A careful inspection of the deviance formula reveals several places where the deviance compares $Y$ to $\hat{\lambda}$: the sign of the deviance is based on the difference between $Y$ and $\hat{\lambda}$, and under the radical sign we see the ratio $Y/\hat{\lambda}$ and the difference $Y -\hat{\lambda}$. When $Y = \hat{\lambda}$, that is, when the model fits perfectly, the difference will be 0 and the ratio will be 1 (so that its log will be 0). So like the residuals in OLS, an observation that fits perfectly will not contribute to the sum of the squared deviances. This definition of a deviance depends on the likelihood for Poisson models. Other models will have different forms for the deviance depending on their likelihood.
```{r, resid1, fig.align="center",out.width="60%",fig.cap= 'Residual plot for the Poisson model of household size by age of the household head', echo=FALSE}
# Residual plot for the first order model
## Log scale
lfitteda = predict(modela) # log scale
lresida = resid(modela) # linear model
lresid.df = data.frame(lfitteda,lresida)
ggplot(lresid.df,aes(x=lfitteda, y=lresida)) +
geom_point()+
geom_smooth(method = "loess", size = 1.5)+
geom_line(y=0, size=1.5, col="red")+
xlab("Fitted values") +
ylab("Deviance Residuals")
```
A plot (Figure \@ref(fig:resid1)) of the deviance residuals versus predicted responses for the first order model exhibits curvature, supporting the idea that the model may improved by adding a quadratic term. Other details related to residual plots can be found in a variety of sources including @McCullagh1989.
### Goodness-of-fit {#sec-PoisGOF}
The model residual deviance can be used to assess the degree to which the predicted values differ from the observed. When a model is true, we can expect the residual deviance to be distributed as a $\chi^2$ random variable with degrees of freedom equal to the model's residual degrees of freedom. Our model thus far, the quadratic terms for age plus the indicators for location, has a residual deviance of 2187.8 with 1493 df. The probability of observing a deviance this large if the model fits is esentially 0, saying that there is significant evidence of lack-of-fit.
```{r, gof1, comment=NA}
1-pchisq(modela2$deviance, modela2$df.residual) # GOF test
```
There are several reasons why lack-of-fit may be observed. We may be missing important covariates or interactions; a more comprehensive data set may be needed. There may be extreme observations that may cause the deviance to be larger than expected; however, our residual plots did not reveal any unusual points. Lastly, there may be a problem with the Poisson model. In particular, the Poisson model has only a single parameter, $\lambda$, for each combination of the levels of the predictors which must describe both the mean and the variance. This limitation can become manifest when the variance appears to be larger than the corresponding means. In that case, the response is more variable than the Poisson model would imply, and the response is considered to be overdispered.
## Least Squares Regression vs. Poisson Regression
\begin{gather*}
\underline{\textrm{Response}} \\
\mathbf{OLS:}\textrm{ Normal} \\
\mathbf{Poisson Regression:}\textrm{ counts} \\
\textrm{ } \\
\underline{\textrm{Variance}} \\
\mathbf{OLS:}\textrm{ Equal for each level of X} \\
\mathbf{Poisson Regression:}\textrm{ Equal to the mean for each level of X} \\
\textrm{ } \\
\underline{\textrm{Model Fitting}} \\
\mathbf{OLS:}\ \mu=\beta_0+\beta_1x \textrm{ using Least Squares}\\
\mathbf{Poisson Regression:}\ log(\lambda)=\beta_0+\beta_1x \textrm{ using Maximum Likelihood}\\
\textrm{ } \\
\underline{\textrm{EDA}} \\
\mathbf{OLS:}\textrm{ plot X vs. Y; add line} \\
\mathbf{Poisson Regression:}\textrm{ find }log(\bar{y})\textrm{ for several subgroups; plot vs. X} \\
\textrm{ } \\
\underline{\textrm{Comparing Models}} \\
\mathbf{OLS:}\textrm{ extra sum of squares F-tests; AIC/BIC} \\
\mathbf{Poisson Regression:}\textrm{ Drop in Deviance tests; AIC/BIC} \\
\textrm{ } \\
\underline{\textrm{Interpreting Coefficients}} \\
\mathbf{OLS:}\ \beta_1=\textrm{ change in }\mu_y\textrm{ for unit change in X} \\
\mathbf{Poisson Regression:}\ e^{\beta_1}=\textrm{ percent change in }\lambda\textrm{ for unit change in X}
\end{gather*}
## Case Study: Campus Crime
Students want to feel safe and secure when attending a college or university. In response to legislation, the US Department of Education seeks to provide data and reassurances to students and parents alike. All postsecondary institutions that participate in federal student aid programs are required by the Jeanne Clery Disclosure of Campus Security Policy and Campus Crime Statistics Act and the Higher Education Opportunity Act to collect and report data on crime occurring on campus to the Department of Education. In turn, this data is publicly available on the website of the Office of Postsecondary Education. We are interested in looking at whether there are regional differences in violent crime on campus controlling for differences in the type of school.
### Data Organization
Each row of `c_data.csv` contains crime information from a post secondary institution, either a college or university. The variables include:
- `type` = college (C) or university (U)
- `region` = region of the country (C = Central, MW = Midwest, NE = Northeast, SE = Southeast, SW = Southwest, and W = West)
- `nv` = the number of violent crimes for that institution for the given year
- `Enrollment` = enrollment at the school
- `enroll1000` = enrollment at the school, in thousands
- `nvrate` = number of violent crimes per 1000 students
```{r, include=FALSE}
#Getting started-Crime
# Crime data for Universities and Colleges
library(readr)
library(forcats)
c.data <- read_csv("data/c_data.csv")
names(c.data)
head(c.data)
summary(c.data)
with(c.data,round(prop.table(table(type)),3))
#creating dataset using 30:70 C:U
with(c.data,table(type,region))
#creating dataset using 30:70 C:U
ggplot(c.data, aes(x=nv)) +
geom_histogram() # looks Poisson
```
```{r, 4verb10, comment=NA, echo=FALSE}
head(c.data, n=10)
```
### Exploratory Data Analysis
```{r, nviolent, fig.align="center",out.width="60%", fig.cap='Histogram of number of violent crimes by institution',echo=FALSE, warning=FALSE, message=FALSE}
#This needs a caption as well
ggplot(c.data, aes(x=nv)) +
geom_histogram(bins = 15) +
xlab("Number of violent crimes")
```
A graph of the number of violent crimes, Figure \@ref(fig:nviolent), reveals the pattern often found with distributions of counts of rare events. Many schools reported no violent crimes or very few crimes. A few schools have a large number of crimes making for a distribution that appears to be far from normal. Therefore, Poisson regression should be used to model our data; Poisson random variables are often used to represent counts (e.g., number of violent crimes) per unit of time or space (e.g., one year).
Let's take a look at the covariates of interest for these schools: type of institution and region. In our data, the majority of institutions are universities (65\% of the 81 schools) and only 35\% are colleges. Interest centers on whether the different regions tend to have different crime rates. Table \@ref(tab:regions) contains the name of each region and each column represents the percentage of schools in that region which are colleges or universities. The proportion of colleges varies from a low of 20\% in the Southwest (SW) to a high of 50\% in the West (W).
```{r, regions, echo=FALSE}
table2chp4 <- with(c.data,round(prop.table(table(type,region),2),3))
kable(table2chp4, booktabs=T, caption = 'Proportion of colleges and universities within region in the campus crime data set')
```
While a Poisson regression model is a good first choice because the responses are counts per year, it is important to note that the counts are not directly comparable because they come from different size schools. This issue sometimes is referred to as the need to account for *sampling effort*; in other words, we expect schools with more students to have more reports of violent crime since there are more students who could be affected. We cannot compare the 30 violent crimes from the first school in the data set to no violent crimes for the second school when their enrollments are vastly different; 5,590 for school 1 versus 540 for school 2. We can take the differences in enrollments into account by including an __offset__ in our model, which we will discuss in the next section. For the remainder of the EDA, we examine the violent crime counts in terms of the rate per 1,000 enrolled $\frac{\textrm{number of violent crimes}}{\textrm{number enrolled}} \cdot 1000$.
Note that there is a noticeable outlier for a Southeastern school (5.4 violent crimes per 1000 students), and there is an observed rate of 0 for the Southwestern colleges which can lead to some computational issues. We therefore combined the SW and SE to form a single category of the South, and we also removed the extreme observation from the data set.
```{r, include=FALSE}
# Combining the southern colleges and universities
c.data <- c.data %>%
mutate(region, region = fct_recode(region, "S" = "SW", "S"="SE"))
```
```{r, table4ch4, echo=FALSE}
# Removing Outlier
c.data <- c.data %>%
filter(nvrate<5)
# Checking mean=variance assumption
table4ch4 <- c.data %>%
group_by(region, type) %>%
dplyr::summarize(MeanCount = mean(nv, na.rm=TRUE),
VarCount = var(nv, na.rm=TRUE),
MeanRate = mean(nvrate, na.rm=TRUE),
VarRate = var(nvrate, na.rm=TRUE),
n = n())
kable(table4ch4, booktabs=T, caption = 'The mean and variance of the violent crime rate by region and type of institution')
```
```{r, boxtyperegion, fig.align="center",out.width="60%", fig.cap='Boxplot of violent crime rate by region and type of institituion',echo=FALSE, warning=FALSE, message=FALSE}
#Insert boxplot without the outlier and combining S and SE
#This needs a caption
ggplot(c.data, aes(x = region, y = nvrate, fill = type)) +
geom_boxplot() +
ylab("Violent crimes per 1000 students")
```
Table \@ref(tab:table4ch4) and Figure \@ref(fig:boxtyperegion) display mean violent crime rates that are generally lower at the colleges within a region (with the exception of the Northeast). In addition, the regional pattern of rates at universities appears to differ from that of the colleges.
### Accounting for Enrollment
Although working with the observed rates (per 1000 students) is useful during the exploratory data analysis, we do not use these rates explicitly in the model. The counts (per year) are the Poisson responses when modeling, so we must take into account the enrollment in a different way. Our approach is to include a term on the right side of the model called an __offset__, which is the log of the enrollment, in thousands. There is an intuitive heuristic for the form of the offset. If we think of $\lambda$ as the mean number of violent crimes per year, then $\lambda/1000$ represents the number per 1000 students, so that the yearly count is adjusted to be comparable across schools of different sizes. Adjusting the yearly count by enrollment is equivalent to adding $log(\textrm{enroll1000})$ to the right hand side of the Poisson regression equation---essentially as a predictor with a fixed coefficient of 1:
\begin{align}
log(\frac{\lambda}{\textrm{enroll1000}} )= \beta_0 + \beta_1(\textrm{type}) \nonumber \\
log(\lambda)-log(\textrm{enroll1000}) = \beta_0 + \beta_1(\textrm{type}) \nonumber \\
log(\lambda) = \beta_0 + \beta_1(\textrm{type}) + log(\textrm{enroll1000})
\end{align}
While this heuristic is helpful, it is important to note that it is *not* $\frac{\lambda}{ \textrm{enroll1000}}$ that we are modeling. We are still modeling $log(\lambda)$, but we're adding an offset to adjust for differing enrollments, where the offset has the unusual feature that the coefficient is fixed at 1.0. As a result, no estimated coefficient for `enroll1000` will appear in the output.
## Modeling Assumptions
In Table \@ref(tab:table4ch4), we see that the variances are greatly higher than the mean counts in almost every group. Thus, we have reason to question the Poisson regression assumption of variability equal to the mean; we will have to return to this issue after some initial modeling. The fact that the variance of the rate of violent crimes per 1000 students tends to be on the same scale as the mean tells us that adjusting for enrollment may provide some help, although that may not completely solve our issues with excessive variance.
As far as other model assumptions, linearity with respect to $log(\lambda)$ is difficult to discern without continuous predictors, and it is not possible to assess independence without knowing how the schools were selected.
## Initial Models
We are interested primarily in differences in violent crime between institutional types controlling for difference in regions, so we fit a model with both region and our offset. Note that the central region is the reference level in our model.
```{r, 4verb11, comment=NA, echo=FALSE}
## Modeling
modeltr = glm(nv~type+region,family=poisson, offset=log(enroll1000),data=c.data)
tmp <- capture.output(summary(modeltr))
cat(tmp[c(3:5, 10:18, 21:25)], sep='\n')
```
From our model the Northeast and the South differ significantly from the Central region (p= 0.00000037 and p=0.0000924 respectively). The estimated coefficient of 0.778 means that the violent crime rate per 1,000 in the Northeast is nearly 2.2 ($e^{0.778}$) times that of the Central region controlling for the type of school. A wald-type confidence interval for this factor can be constructed by first calculating a CI for the coefficient (0.778 $\pm$ $1.96 \cdot 0.153$) and then exponentiating (1.61 to 2.94).
### Tukey's Honestly Significant Differences
Comparisons to regions other than the Central region can be accomplished by changing the reference region. If many comparisons are made, it would be best to adjust for multiple comparisons using a method such as **Tukey's Honestly Significant Differences**, which considers all pairwise comparisons among regions. This method helps control the large number of false positives that we would see if we ran multiple t-tests comparing groups. The honestly significant difference compares a standardized mean difference between two groups to a critical value from a studentized range distribution.
```{r, 4verb12, comment=NA, echo=FALSE}
tmp <- capture.output(summary(glht(modeltr, mcp(region="Tukey"))))
cat(tmp[c(4:5, 10:22, 24)], sep='\n')
```
In our case, Tukey's Honestly Significant Differences simultaneously evaluates all 10 mean differences between pairs of regions. We find that the Northeast has significantly higher rates of violent crimes than the Central, Midwest, and Western regions, while the South has significantly higher rates of violent crimes than the Central and the Midwest, controlling for the type of institution. The University indicator is significant and, after exponentiating the coefficient, can be interpreted as an approximately ($e^{0.280}$) 32\% increase in violent crime rate over colleges after controlling for region.
These results certainly suggest significant differences in regions and type of institution. However, the EDA findings suggest the effect of the type of institution may vary depending upon the region, so we consider a model with an interaction between region and type.
```{r, 4verb13, comment=NA, echo=FALSE}
modeli = glm(nv ~ type + region + region:type, family = poisson,
offset = log(enroll1000), data = c.data)
tmp <- capture.output(summary(modeli))
cat(tmp[c(3:5, 10:22, 25:29)], sep='\n')
```
These results provide convincing evidence of an interaction between the effect of region and the type of institution. A drop-in-deviance test like the one we carried out in the previous case study confirms the significance of the contribution of the interaction to this model. We have statistically significant evidence ($\chi^2=71.98, df=4, p<.001$) that the difference between colleges and universities in violent crime rate differs by region. For example, our model estimates that violent crime rates are 13.6 ($e^{.196+2.411}$) times higher in universities in the West compared to colleges, while in the Northeast we estimate that violent crime rates are 2.4 ($\frac{1}{e^{.196-1.070}}$) times higher in colleges.
```{r, 4verb14, comment=NA, echo=FALSE}
anova(modeltr,modeli,test="Chisq") # addition of intxn terms sign
```
The residual deviance (276.70 with 70 df) suggests significant lack of fit in the interaction model (p < .001). One possibility is that there are other important covariates that could be used to describe the differences in the violent crime rates. Without additional covariates to consider, we look for extreme observations, but we have already eliminated the most extreme of the observations.
In the absence of other covariates or extreme observations, we consider overdispersion as a possible explanation of the significant lack-of-fit.
## Overdispersion {#sec-overdispPois}
### Dispersion parameter adjustment
**Overdispersion** suggests that there is more variation in the response than the model implies. Under a Poisson model, we would expect the means and variances of the response to be about the same in various groups. Without adjusting for overdispersion, we use incorrect, artificially small standard errors leading to artificially small p-values for model coefficients. We may also end up with artificially complex models.
We can take overdispersion into account in several different ways. The simplest is to use an estimated dispersion factor to inflate standard errors. Another way is to use a negative-binomial regression model. We begin with using an estimate of the dispersion parameter.
We can estimate a dispersion parameter, $\phi$, by dividing the model deviance by its corresponding degrees of freedom; i.e., $\hat\phi=\frac{\sum(\textrm{Pearson residuals})^2}{n-p}$ where $p$ is the number of model parameters. It follows from what we know about the $\chi^2$ distribution that if there is no overdispersion, this estimate should be close to one. It will be larger than one in the presence of overdispersion. We inflate the standard errors by multiplying the variance by $\phi$ so that the standard errors are larger than the likelihood approach would imply; i.e., $SE_Q(\hat\beta)=\sqrt{\hat\phi}*SE(\hat\beta)$, where $Q$ stands for "quasipoisson" since multiplying variances by $\phi$ is an ad-hoc solution. Our process for model building and comparison is called **quasilikelihood**---similar to likelihood but without exact underlying distributions. If we choose to use a dispersion parameter with our model, we refer to the approach as quasilikelihood. The following output illustrates a quasipoisson approach to the interaction model:
```{r, 4verb15, comment=NA, echo=FALSE}
# Account for overdispersion with quasipoisson model
modeliq = glm(nv ~ type + region + region:type, family = quasipoisson,
offset = log(enroll1000), data = c.data)
tmp <- capture.output(summary(modeliq))
cat(tmp[c(3:5, 10:22, 25:29)], sep='\n')
```
In the absence of overdispersion, we expect the dispersion parameter estimate to be 1.0. The estimated dispersion parameter here is much larger than 1.0 (4.447) indicating overdispersion (extra variance) that should be accounted for. The larger estimated standard errors in the quasipoisson model reflect the adjustment. For example, the standard error for the West region term from a likelihood based approach is 0.7906, whereas the quasilikelihood standard error is $\sqrt{4.47}*0.7906$ or 1.6671. This term is no longer significant under the quasipoisson model. In fact, after adjusting for overdispersion (extra variation), none of the model coefficients in the quasipoisson model are significant at the .05 level! This is because standard errors were all increased by a factor of 2.1 ($\sqrt{\hat\phi}=\sqrt{4.447}=2.1$), while estimated coefficients remain unchanged.
Note that tests for individual parameters are now based on the t-distribution rather than a standard normal distribution, with test statistic $t=\frac{\hat\beta}{SE_Q(\hat\beta)}$ following an (approximate) t-distribution with $n-p$ degrees of freedom if the null hypothesis is true ($H_O:\beta=0$). Drop-in-deviance tests can be similarly adjusted for overdispersion in the quasipoisson model. In this case, you can divide the test statistic by the estimated dispersion parameter and compare the result to an F-distribution with the difference in the model degrees of freedom for the numerator and the degrees of freedom for the larger model in the denominator. That is, $F=\frac{\textrm{drop in deviance}}{\hat\phi}$ follows an (approximate) F-distribution when the null hypothesis is true ($H_O$: reduced model sufficient). The output below tests for an interaction between region and type of institution after adjusting for overdispersion (extra variance):
```{r, 4verb8, comment=NA}
phi <- sum(resid(modeli, type='pearson')^2) / modeli$df.residual
drop.in.dev <- modeltr$deviance - modeli$deviance
diff.in.df <- modeltr$df.residual - modeli$df.residual
Fstat <- drop.in.dev / summary(modeliq)$dispersion
Fstat
1-pf(Fstat, diff.in.df, modeli$df.residual)
```
Here, even after adjusting for overdispersion, we still have statistically significant evidence ($F=16.19, p<.001$) that the difference between colleges and universities in violent crime rate differs by region.
### Negative binomial modeling
Another approach to dealing with overdispersion is to model the response using a negative binomial instead of Poisson. An advantage of this approach is that it introduces another parameter in addition to $\lambda$ which gives the model more flexibility and, as opposed to the quasipoisson model, the negative binomial model assumes an explicit likelihood model. You may recall that negative binomial random variables take on nonnegative integer values which is consistent with modeling counts. This model posits selecting a $\lambda$ for each institution and then generating a count using a Poisson random variable with the selected $\lambda$. With this approach, the counts will be more dispersed than would be expected for observations based on a single Poisson variable with rate $\lambda$. (See Guided Exercises on the Gamma-Poisson mixture in Chapter \@ref(ch-distthry).)
Mathematically, you can think of the negative binomial model as a Poisson model where $\lambda$ is also random, following a gamma distribution. Specifically, if $Y|\lambda \textrm{~ Poisson}(\lambda)$ and $\lambda \textrm{~ gamma}(r,\frac{1-p}{p})$, then $Y \textrm{~ NegBinom}(r,p)$ where $E(Y)=\frac{pr}{1-p}=\mu$ and $Var(Y)=\frac{pr}{(1-p)^2}=\mu+\frac{\mu^2}{r}$. The overdispersion in this case is given by $\frac{\mu^2}{r}$, which approaches 0 as $r$ increases (so smaller values of $r$ indicate greater overdispersion).
Here is what happens if we apply a negative binomial regression model to the interaction model, which we've already established suffers from overdispersion issues under regular Poisson regression:
```{r, 4verb16, comment=NA, echo=FALSE}
# Account for overdispersion with negative binomial model
c.data2 <- c.data %>%
mutate(enroll1000 = ifelse(enroll1000 < 1, 1, enroll1000))
modelinb <- glm.nb(nv ~ type + region + region:type,
offset(log(enroll1000)), data = c.data2)
tmp <- capture.output(summary(modelinb))
cat(tmp[c(3:6, 11:23, 26:29)], sep='\n')
```
These results differ from the quasipoisson model. Several effects are now statistically significant at the .05 level: the effect of type of institution for the Central region ($Z=2.64, p=.008$), the difference between Northeast and Central regions for colleges ($Z=2.76, p=.006$), the difference between Northeast and Central regions in type effect ($Z=-2.01, p=.044$), and the difference between West and Central regions in type effect ($Z=2.71, p=.007$). Compared to the quasipoisson model, negative binomial coefficient estimates are generally in the same direction and similar in size, but negative binomial standard errors are somewhat smaller.
In summary, we explored the possibility of differences in the violent crime rate between colleges and universities, controlling for region. Our initial efforts seemed to suggest that there are indeed differences between colleges and universities, and the pattern of those differences depend upon the region. However, this model exhibited significant lack-of-fit which remained after the removal of an extreme observation. In the absence of additional covariates, we accounted for the lack-of-fit by using a quasilikelihood approach and a negative binomial regression, which provided slightly different conclusions. We may want to look for additional covariates and or more data.
## Case Study: Weekend drinking {#cs:drinking}
Sometimes when analyzing Poisson data, you may see many more zeros in your data set than you would expect for a Poisson random variable. For example, an informal survey of students in an introductory statistics course included the question, "How many alcoholic drinks did you consume last weekend?" This survey was conducted on a dry campus where no alcohol is officially allowed, even among students of drinking age, so we expect that some portion of the respondents never drink. The non-drinkers would thus always report zero drinks. However, there will also be students who are drinkers reporting zero drinks because they just did not happen to drink during the past weekend. Our zeros, then, are a mixture of responses from non-drinkers and drinkers who abstained. Ideally, we'd like to sort out the non-drinkers and drinkers when performing our analysis.
```{r, include=FALSE}
#Getting started-weekenddrinks
# File: weekendDrinks
zip.data <- read.csv("data/weekendDrinks.csv")
names(zip.data)
dim(zip.data)
## Observed data
obs.table <- tally(group_by(zip.data, drinks)) %>%
mutate(prop=round(n/sum(n),3))
obs.table # 47% reported 0 drinks last weekend
g.obs <- obs.table %>%
ggplot(aes(y=prop,x=drinks)) +
geom_bar(stat="identity") +
labs(x= "Number of drinks", y="Proportion") +
ggtitle("a) Observed")+
coord_cartesian(ylim = c(0, .5))
g.obs
## Poisson model
### lambda = mean number of drinks
sum1 <- zip.data %>%
summarise(lambda=mean(drinks),maxDrinks = max(drinks))
possible.values = with(sum1,0:maxDrinks)
model.prob <- with(sum1,dpois(possible.values,lambda))
pois.model <- data.frame(possible.values,model.prob)
g.model<- ggplot(pois.model,
aes(y=model.prob,x=possible.values)) +
geom_bar(stat="identity")+
labs(x= "Number of drinks", y="Probability") +
ggtitle("b) Poisson Model")+
coord_cartesian(ylim = c(0, .5))
g.model
```
### Research Question
The purpose of this survey is to explore factors related to drinking behavior on a dry campus. What proportion of students on this dry campus never drink? What factors, such as off campus living and sex, are related to whether students drink? Among those who do drink, to what extent is moving off campus associated with the number of drinks in a weekend? It is commonly assumed that males' alcohol consumption is greater than females'; is this true on this campus? Answering these questions would be a simple matter if we knew who was and was not a drinker in our sample. Unfortunately, the non-drinkers did not identify themselves as such, so we will need to use the data available with a model that allows us to estimate the proportion of drinkers and non-drinkers.
### Data Organization
Each line of `weekendDrinks.csv` contains data provided by a student in an introductory statistics course. In this analysis, the response of interest is the respondent's report of the number of alcoholic `drinks` they consumed the previous weekend, whether the student lives `off.campus`, and `sex`. We will also consider whether a student is likely a `firstYear` student based on the `dorm` he or she lives in. Here is a sample of observations from this dataset:
```{r, include=FALSE}
# predictors: sex, dorm, off campus
sex.table <- tally(group_by(zip.data, sex)) %>%
mutate(prop=round(n/sum(n),3))
sex.table
dorm.table <- tally(group_by(zip.data, dorm)) %>%
mutate(prop=round(n/sum(n),3))
dorm.table
zip.data <- zip.data %>%
mutate(off.campus=ifelse(dorm=="off campus",1,0))
off.table <- tally(group_by(zip.data, off.campus)) %>%
mutate(prop=round(n/sum(n),3))
off.table
# Grand Mean Model: no predictors
gmn.model=glm(drinks~1,family=poisson,data=zip.data)
summary(gmn.model)
exp(coef(gmn.model)) # same as mean number of drinks for this simple model
```
```{r, 4verb19, comment=NA, include=FALSE}
## Fitting a Zero Inflated Poisson (ZIP) model
zip.data <- zip.data %>%
mutate(firstYear=dorm%in%c("kildahl","mohn","kittlesby"))
fy.table <- tally(group_by(zip.data, firstYear)) %>%
mutate(prop=round(n/sum(n),3))
fy.table
```
```{r, 4verb17, comment=NA}
head(zip.data[2:5])
```
### Exploratory Data Analysis
As always we take stock of the amount of data; here there are 77 observations. Large sample sizes are preferred for the type of model we will consider, and n=77 is on the small side. We proceed with that in mind.
A premise of this analysis is that we believe that those responding zero drinks are coming from a mixture of non-drinkers and drinkers who abstained the weekend of the survey.
- __Non-drinkers__: respondents who never drink and would always reply with zero
- __Drinkers__: obviously this includes those responding with one or more drinks, but it also includes people who are drinkers but did not happen to imbibe the past weekend. These people reply zero but are not considered non-drinkers.
```{r, obsVmodel, fig.align="center",out.width="60%", fig.cap='Observed (a) versus Modeled (b) Number of Drinks',echo=FALSE, warning=FALSE, message=FALSE}
#obsDrinks and modelDrinks
obsVmodel <- grid.arrange(g.obs,g.model,ncol=1)
```
Beginning the EDA with the response, number of drinks, we find that over 46\% of the students reported no drinks during the past weekend. Figure \@ref(fig:obsVmodel)a portrays the observed number of drinks reported by the students. The mean number of drinks reported the past weekend is 2.013. Our sample consists of 74\% females and 26\% males, only 9\% of whom live off campus.
Because our response is a count it is natural to consider a Poisson regression model. You may recall that a Poisson distribution has only one parameter, $\lambda$, for its mean and variance. Here we will include an additional parameter, $\alpha$. We define $\alpha$ to be the true proportion of non-drinkers in the population.
The next step in the EDA is especially helpful if you suspect your data contains excess zeros. Figure \@ref(fig:obsVmodel)b is what we might expect to see under a Poisson model. Bars represent the probabilities for a Poisson distribution (using the Poisson probability formula) with $\lambda$ equal to the mean observed number of drinks, 2.013 drinks per weekend. Comparing this Poisson distribution to what we observed (Figure \@ref(fig:obsVmodel)a), it is clear that many more zeros have been reported by the students than you would expect to see if the survey observations were coming from a Poisson distribution. This doesn't surprise us because we had expected a subset of the survey respondents to be non-drinkers; i.e., they would not be included in this Poisson process. This circumstance actually arises in many Poisson regression settings. We will define $\lambda$ to be the mean number of drinks among those who drink, and $\alpha$ to be the proportion of non-drinkers ("true zeros"). Then, we will attempt to model $\lambda$ and $\alpha$ (or functions of $\lambda$ and $\alpha$) simultaneously using covariates like sex, first-year status, and off-campus residence. This type of model is referred to as a __zero-inflated Poisson model__ or __ZIP model__.
### Modeling
We first fit a simple Poisson model with the covariates `off.campus` and `sex`.
```{r, 4verb18, comment=NA, echo=FALSE}
# Models with Predictors
pois.m1=glm(drinks~off.campus+sex,family=poisson, data=zip.data)
tmp <- capture.output(summary(pois.m1))
cat(tmp[c(3:4, 9:14, 17:21)], sep='\n')
```
```{r}
# Exponentiated coefficients
exp(coef(pois.m1))
# Goodness-of-fit test
gof.ts = pois.m1$deviance
gof.pvalue = 1 - pchisq(gof.ts, pois.m1$df.residual)
gof.pvalue
```
Both covariates are statistically significant, but a goodness-of-fit test reveals that there remains significant lack-of-fit (residual deviance: 230.54 with only 74 df; p<.001 based on $\chi^2$ test with 74 df). In the absence of important missing covariates or extreme observations, this lack-of-fit may be explained by the presence of a group of non-drinkers.
A zero-inflated Poisson regression model to take non-drinkers into account consists of two parts:
- One part models the association, among drinkers, between number of drinks and the predictors of sex and off campus residence.
- The other part uses a predictor for first-year status to obtain an estimate of the proportion of non-drinkers based on the reported zeros.
The form for each part of the model follows. The first part looks like an ordinary Poisson regression model:
\[
log(\lambda)=\beta_0+\beta_1(\textrm{off.campus})+ \beta_2(\textrm{sex})
\]
where $\lambda$ is the mean number of drinks in a weekend *among those who drink*.
The second part has the form
\[
logit(\alpha)=\beta_0+\beta_1(\textrm{firstYear})
\]
where $\alpha$ is the probability of being in the non-drinkers group and $logit(\alpha) = log( \alpha/(1-\alpha))$. We'll provide more detail on the logit in the Logistic Regression chapter. There are many ways in which to structure this model; here we use different predictors in the two pieces, athough it would have been perfectly fine to use the same predictors for both pieces, or even no predictors for one of the pieces.
### Fitting a ZIP Model
How is it possible to fit such a model? We cannot observe whether a respondent is a drinker or not (which probably would've been good to ask). The ZIP model is a special case of a more general type of statistical model referred to as a __latent variable model__. More specifically, it is a type of a __mixture model__ where observations for one or more groups occur together and the group membership is unknown. Zero-inflated models are a particularly common example of a mixture model, but the response does not need to follow a Poisson distribution. Likelihood methods are at the core of this methodology, but fitting is an iterative process where it is necessary to start out with some guesses (or starting values). In general, it is important to know that models like ZIP exist, although we'll only explore interpretations and fitting options for a single case study here.
Here is the general idea of how ZIP models are fit. Imagine that the graph of the Poisson distribution Figure \@ref(fig:obsVmodel)b is removed from the observed data distribution in Figure \@ref(fig:obsVmodel)a. Some zero responses will remain. These would correspond to non-drinkers, and the proportion of all observations these zeros constitute might make a reasonable estimate for $\alpha$, the proportion of non-drinkers. The likelihood is used and some iterating in the fitting process is involved because the Poisson distribution in Figure \@ref(fig:obsVmodel)b is based on the mean of the observed data, which means it is the average among all students, not only among drinkers. Furthermore, the likelihood incorporates the predictors, `sex` and `off.campus`. So there is a little more to it than computing the proportion of zeros, but this heuristic should provide you a general idea of how these kinds of models are fit. The software package `R` along with the `library(pscl)` will fit a ZIP model for us using the function `zeroinfl`.
```{r, 4verb20, comment=NA, echo=FALSE}
zip.m2 <- zeroinfl(drinks ~ off.campus + sex | firstYear, data = zip.data)
tmp <- capture.output(summary(zip.m2))
cat(tmp[c(3:4, 9:20)], sep='\n')
```
```{r}
exp(coef(zip.m2)) # exponentiated coefficients
```
Our model uses `firstYear` to distinguish drinkers and non-drinkers ("Zero-inflation model coefficients") and `off.campus` and `sex` to help explain the differences in the number of drinks among drinkers ("Count model coefficients"). Again, we could have used the same covariates for the two pieces of a ZIP model, but neither `off.campus` or `sex` proved to be a useful predictor of drinkers vs. non-drinkers.
We'll first consider the "Count model coefficients," which provide information on how the sex and off-campus status of a student who is a drinker are related to the number of drinks reported by that student over a weekend. As we have done with previous Poisson regression models, we exponentiate each coefficient for ease of interpretation. Thus, for those who drink, the average number of drinks for males is $e^{1.0209}$ or 2.76 times the number for females (Z = 5.827, p < 0.001) given that you are comparing people who live in comparable settings, i.e., either both on or both off campus. Among drinkers, the mean number of drinks for students living off campus is $e^{0.4159}=1.52$ times that of students living on campus for those of the same sex (Z = 2.021, p = 0.0433).
The "Zero-inflation model coefficients" refer to separating drinkers from non-drinkers. An important consideration in separating drinkers from non-drinkers may be whether this is their first year, where `firstYear` is a 0/1 indicator variable.
We have
\[
log(\alpha/(1-\alpha)) =-0.6036+1.1364(\textrm{firstYear})
\]
However we are interested in $\alpha$, the proportion of non-drinkers. Exponentiating the coefficient for the first year term for this model yields 3.12. Here it is interpreted as the odds ($\frac{\alpha}{1-\alpha}$) that a first year is a non-drinker is 3.12 times the odds that an upper class student is a non-drinker. Furthermore, with a little algebra (solving the equation with $log(\alpha/(1-\alpha)$) for $\alpha$),
we have
\[
\hat{\alpha} =
\frac{exp^ {-0.6036+1.1364(\textrm{firstYear})}}
{1+e^{
-0.6036+1.1364(\textrm{firstYear})
}
}.
\]
The estimated chance that a first year student is a non-drinker is
\[
\frac{e^{0.533}}{1+e^{0.533}} = 0.630
\]
or 63.0\%, while for non-first years, the estimated probability of being a non-drinker in 0.354. If you have seen logistic regression, you'll recognize that this transformation is what is used to estimate a probability. More on this in the Logistic Regression chapter.
### Comparing ZIP to ordinary Poisson with Vuong Test (Optional)
Moving from ordinary Poisson to zero-inflated Poisson has helped us address additional research questions: What proportion of students are non-drinkers and what factors are associated with whether or not a student is a non-drinker? While a ZIP model seems more faithful to the nature and structure of this data, can we quantitatively show that a zero-inflated Poisson is better than an ordinary Poisson model?