---
title: "Chapter 7"
subtitle: 'Correlated Data'
output:
  html_document: default
  pdf_document:
    number_sections: yes
---
# Correlated Data {#ch-corrdata}
```{r loadpackages7, include=FALSE}
library(ggplot2)
library(gridExtra)
library(dplyr)
library(knitr)
library(kableExtra)
library(lme4)
library(readr)
library(tidyr)
library(ICC)
library(knitr)
if(knitr::is_html_output()){options(knitr.table.format = "html")} else {options(knitr.table.format = "latex")}
```
## Learning Objectives
After finishing this chapter, you should be able to:
- Given a data structure, recognize when there is a potential for correlation.
- Identify observational units at varying levels.
- Provide reasons why correlated observations may cause problems when modeling if they are not addressed.
- Describe overdispersion and why it might occur.
- Understand how correlation in data can be taken into account using random effects models.
- Describe differences in the kind of inferences that can be made using fixed versus random effects models.
- Use software to fit a model taking correlation into account.
## Introduction
Introductory statistics courses typically require responses which are approximately normal and independent of one another. We saw from the first chapters in this book that there are models for non-normal responses, so we have already broadened the types of applications we can handle. In this chapter, we relax an additional assumption from introductory statistics, that of independence. Here we introduce methods that allow correlation to be taken into account. When modeling, correlation can be considered for normal and non-normal responses. Taken together, we can handle many more applications than the average introductory student.
In this chapter, we first focus on recognizing data structures that may imply correlation, introducing new terminology for discussing correlated data and its effects. Next, we consider potential problems correlated outcomes may cause and why we need to take correlation into account when modeling. Models which take correlation into account are then described and fit with the help of R.
This chapter will feel different than other chapters in this book. In order to understand correlated data and contrast it with independent observations, we will be exploring several simulations and associated thought questions. In that way, you should develop a feel for correlated data which will carry through into the remaining chapters of this book. In addition, as you proceed through the simulations, fill out Table \@ref(tab:simulationTableCh7) below; this will help you see patterns as you compare analyses which account for correlation with those that do not.
```{r, simulationTableCh7, echo=FALSE}
#### one large table
tScenario <- c("1a", "", "1b", "", "",
"Scenario","2a", "", "2b", "")
tModel <- c("Binomial", "Quasibinomial", "Binomial", "Quasibinomial", "",
"Model", "Binomial", "Quasibinomial", "Binomial", "Quasibinomial")
tModelName <- ifelse(knitr::is_html_output(),
c("fit_1a_binom", "fit_1a_quasi", "fit_1b_binom", "fit_1b_quasi", "", "Model Name",
"fit_2a_binom","fit_2a_quasi", "fit_2b_binom", "fit_2b_quasi"),
c("fit\\_1a\\_binom", "fit\\_1a\\_quasi","fit\\_1b\\_binom","fit\\_1b\\_quasi", "",
"Model Name", "fit\\_2a\\_binom","fit\\_2a\\_quasi", "fit\\_2b\\_binom", "fit\\_2b\\_quasi"))
tBeta <- c("","","","","",
"$\\beta_1$", "","","","")
tSEBeta <- c("","","","","",
"SE $\\beta_1$", "","","","")
tTStat <- c("","","","","",
"$t$", "","","","")
tPVal <- c("","","","","",
"p value", "","","","")
tPhi <- c("","","","","",
"$\\phi$", "","","","")
tEst <- c("","","","","",
"Est odds ratio", "","","","")
tCI <- c("","","","","",
"CI odds ratio", "","","","")
tMeanCount <- c("","X","","X","",
"Mean count Dose=1", "","X","","X")
tSDCount <- c("","X","","X","",
"SD count Dose=1", "","X","","X")
tGOFP <- c("","X","","X","",
"GOF p value", "","X","","X")
scenarioSimTab <- tibble(tScenario, tModel, tModelName, tBeta, tSEBeta, tTStat, tPVal, tPhi, tEst, tCI, tMeanCount, tSDCount, tGOFP)
colnames(scenarioSimTab) <- c("Scenario", "Model", "Model Name", "$\\beta_0$", "SE $\\beta_0$", "$t$", "p value", "$\\phi$", "Est prob", "CI prob", "Mean count", "SD count", "GOF p value")
kable(scenarioSimTab, booktabs=T, caption="Summary of simulations for Dams and Pups case study", escape=F) %>%
kable_styling(font_size=12) %>%
row_spec(6, bold=T)
```
## Recognizing correlation
Correlated data is encountered in nearly every field. In education, student scores from a particular teacher are typically more similar than scores of other students who have had a different teacher. During a study measuring depression indices weekly over the course of a month, we usually find that four measures for the same patient tend to be more similar than depression indices from other patients. In political polling, opinions from members of the same household are usually more similar than opinions of members from other randomly selected households. The structure of these data sets suggests inherent patterns of similarities or correlation among outcomes. This kind of correlation specifically concerns correlation of observations *within the same teacher or patient or household* and is referred to as __intraclass correlation__.
Correlated data often takes on a multilevel structure. That is, population elements are grouped into aggregates, and we often have information on both the individual elements and the aggregated groups. For instance, students are grouped by teacher, weekly depression measures are grouped by patient, and survey respondents are grouped by household. In these cases, we refer to **levels** of measurement and observational units at each level. For example, students might represent **level one observational units** while teachers represent **level two observational units**, where **level one** is the most basic level of observation, and level one observations are aggregated to form **level two** observations. Then, if we are modeling a response such as test score, we may want to examine the effects of student characteristics such as sex and ethnicity, and teacher characteristics such as years of experience. Student characteristics would be considered **level one covariates**, while teacher characteristics would be **level two covariates**.
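To make intraclass correlation concrete, the short sketch below simulates hypothetical classroom data (10 students in each of 20 classrooms, not a data set used in this chapter) and estimates the ICC of scores within classrooms with `ICCbare()` from the __ICC__ package loaded above; values near 0 indicate little within-group similarity, while values near 1 indicate strong similarity.
```{r, eval=FALSE}
# Hypothetical illustration: classroom means vary with sd = 5 and students
# vary around their classroom mean with sd = 10, so the true ICC is
# 5^2 / (5^2 + 10^2) = 0.2.
set.seed(123)
teacher_means <- rnorm(20, mean = 70, sd = 5)
scores <- tibble(teacher = factor(rep(1:20, each = 10)),
                 score = rep(teacher_means, each = 10) + rnorm(200, sd = 10))
ICCbare(x = teacher, y = score, data = scores)   # estimated ICC, near 0.2
```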
## Case Study: Dams and pups
A __teratogen__ is a substance or exposure that can result in harm to a developing fetus. An experiment can be conducted to determine whether increasing levels of a potential teratogen results in increasing probability of defects in rat pups. We will simulate the results of an experiment in which 24 dams (mother rats) are randomized to four groups: 3 groups of 6 dams each are administered a potential teratogen in high (3mg), medium (2mg), or low (1mg) doses, and the fourth group serves as the control. Each of the 24 dams produces 10 rat pups, and each pup is examined for the presence of an anomaly or deformity.
## Sources of Variability
Before we analyze data from our simulated experiment, let's step back and look at the big picture. Statistics is all about analyzing and explaining variability, so let's consider what sources of variability we have in the dams and pups example. There are several reasons why the counts of the number of defective pups might differ from dam to dam, and it is helpful to explicitly identify these reasons in order to determine how the dose levels affect the pups while also accommodating correlation.
__Dose Effect__ The dams and pups experiment is being carried out to determine whether different dose levels affect the development of defects differently. Of particular interest is determining whether a **dose-response** effect is present. A dose-response effect is evident when dams receiving higher dose levels produce higher proportions of pups with defects. Knowing defect rates at specific dose levels is typically of interest within this experiment and beyond. Publishing the defect rates for each dose level in a journal paper, for example, would be of interest to other teratologists. For that reason, we refer to dose level effects as __fixed effects__.
__Dams (Litter) Effect__ In many settings like this, there is a litter effect as well. For example, some dams may exhibit a propensity to produce pups with defects while others rarely produce litters with defective pups. That is, observations on pups within the same litter are likely to be similar or correlated. Unlike the dose effect, teratologists reading experiment results are not interested in the estimated probability of defect for each dam in the study, and we would not report these estimated probabilities in a paper. However, there may be interest in the *variability* in litter-specific defect probabilities; accounting for dam-to-dam variability reduces the amount of unexplained variability and leads to more precise estimates of fixed effects like dose. Often this kind of effect is modeled using the idea that randomly selected dams produce __random effects__. This provides one way in which to model correlated data, in this case the correlation between pups from the same dam. We elaborate on this idea throughout the remainder of the text.
__Pup-to-pup variability__ The within-litter pup-to-pup differences reflect random, unexplained variation in the model.
## Scenario 1: No covariates
In Scenario 1, we will ignore dose and assume that dose has no effect on the probability of a deformed pup. You can follow along with the simulation (and modify it as desired) in the *Rmd* file for this chapter (note that some lines of code are run but not printed in the chapter text), and fill out Table \@ref(tab:simulationTableCh7).
First, we will consider Scenario 1a where each dam's probability of producing a deformed pup is $p = 0.5$; thus, each dam's log odds are 0. We then will compare this to Scenario 1b, where dams' probabilities of producing a deformed pup follow a beta distribution where $\alpha = \beta = 0.5$ (which has expected value $0.5$). So, in Scenario 1b, each dam has a different probability of producing a deformed pup, but the probabilities average out to 0.5, whereas in Scenario 1a each dam has an identical probability of producing a deformed pup (0.5). Figure \@ref(fig:scenario1ProbabilityPlot) illustrates these two scenarios; every dam in Scenario 1a has a probability of 0.5 (at the red line), while probabilities in Scenario 1b are selected from the black dashed distribution, so that probabilities near 0 and 1 are more likely than probabilities near 0.5. The histogram shows 24 randomly generated probabilities under one run of Scenario 1b.
The R code below produces a simulated number of deformed pups for each of the 24 dams under Scenario 1a and Scenario 1b, and Figure \@ref(fig:scenario1Plot) displays distributions of counts of deformed pups per dam for the two simulations. In Scenario 1a, the mean number of deformed pups is 5.17 with standard deviation 1.49. In Scenario 1b, the mean number of deformed pups is 5.67 with standard deviation 4.10.
```{r ch7Seed, include = FALSE}
set.seed(2) # to get the same simulated results as reported here
```
```{r damScenario1, message=FALSE}
pi_1a <- rep(0.5, 24)              # Scenario 1a: every dam has p = 0.5
count_1a <- rbinom(24, 10, pi_1a)  # number of deformed pups among 10 per dam
pi_1b <- rbeta(24, .5, .5)         # Scenario 1b: dam-specific p from Beta(0.5, 0.5)
count_1b <- rbinom(24, 10, pi_1b)  # number of deformed pups among 10 per dam
```
```{r scenario1ProbabilityPlot, fig.align="center",out.width="60%",fig.cap= "Dam Probabilities in Scenario 1", echo=FALSE, warning=FALSE, message=FALSE}
theoretical_pi <- tibble(x = 1:250000,
p1 = rbeta(x, 0.5, 0.5))
tibble(x = 1:24, pi_1b) %>%
ggplot() +
geom_histogram(bins = 5, aes(x = pi_1b, y = ..density..)) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p1), bw = 0.025, linetype = 3) +
geom_vline(xintercept = 0.5, color = "red", linetype = 2) +
labs(title = "", x = "Probability of Deformity")
```
```{r, scenario1Plot, fig.align="center",out.width="60%",fig.cap= "Counts of deformed pups per dam under Scenarios 1a and 1b", echo=FALSE, warning=FALSE, message=FALSE}
scenario_1 <- tibble(pi_1a, count_1a, pi_1b, count_1b) %>%
mutate(phat_1a = count_1a / 10, phat_1b = count_1b / 10)
hist_1a <- ggplot(data = scenario_1, aes(x = count_1a)) +
geom_histogram(bins = 5) + coord_cartesian(xlim = c(0, 10)) +
labs(title = "Scenario 1a: Binomial, p = 0.5",
x = "Count of deformed pups per dam")
hist_1b <- ggplot(data = scenario_1, aes(x = count_1b)) +
geom_histogram(bins = 5) + coord_cartesian(xlim = c(0, 10)) +
labs(title = "Scenario 1b: Binomial, p ~ Beta(0.5, 0.5)",
x = "Count of deformed pups per dam")
grid.arrange(hist_1a, hist_1b, ncol=1)
```
```{r scenario1Summary, include=FALSE}
scenario_1 %>%
summarise(mean_1a = mean(count_1a), sd_1a = sd(count_1a),
mean_1b = mean(count_1b), sd_1b = sd(count_1b) )
```
**Thought Questions**
> 1. Will the counts of deformed pups for dams in Scenario 1a behave like a binomial distribution with $n=10$ and $p=0.5$ (that is, like counting heads in 10 flips of a fair coin)? Why or why not?
> 2. Will the counts of deformed pups for dams in Scenario 1b behave like a binomial distribution with $n=10$ and $p=0.5$ (that is, like counting heads in 10 flips of a fair coin)? If not, extend the coin flipping analogy to Scenario 1b.
> 3. Is Scenario 1b realistic? Why might some dams have higher probabilities than others?
If we were to model the number of deformed pups per dam in Scenario 1a, we could ignore the potential of a dam effect (since all dams behave the same) and proceed with regular binomial regression as in Chapter \@ref(ch-logreg). Since we have no predictors, we would start with the model:
\begin{equation}
\log \bigg( \frac{\hat{p}}{1-\hat{p}} \bigg) = \hat{\beta}_0, \textrm{ where } \hat{\beta}_0 = 0.067
(\#eq:mod1aBinomial)
\end{equation}
which produces an estimated odds of deformity $\widehat{p}/(1-\widehat{p}) = e^{0.067} = 1.069$ and estimated probability $\widehat{p} = 0.517$. Creating 95\% confidence intervals using a profile likelihood approach, we get:
(ref:CI) 95% CI for
\[
\begin{alignedat}{2}
&\textrm{(ref:CI) } p/(1-p) &&= (0.830, 1.378) \\
&\textrm{(ref:CI) } p &&= (0.454, 0.579)
\end{alignedat}
\]
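These estimates and intervals are produced by an intercept-only binomial fit run in a hidden chunk of this *Rmd*; a brief sketch of the commands is shown below, where `plogis()` (the inverse logit, $e^x/(1+e^x)$) converts results from the log odds scale back to probabilities.
```{r, eval=FALSE}
fit_1a_binom <- glm(phat_1a ~ 1, family = binomial,
                    weights = rep(10, 24), data = scenario_1)
exp(coef(fit_1a_binom))        # estimated odds of deformity, about 1.069
plogis(coef(fit_1a_binom))     # estimated probability of deformity, about 0.517
exp(confint(fit_1a_binom))     # profile likelihood CI for the odds
plogis(confint(fit_1a_binom))  # profile likelihood CI for the probability
```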
```{r scenario1aBinom, include=FALSE}
fit_1a_binom <- glm(phat_1a ~ 1, family=binomial, weight=rep(10,24), data = scenario_1)
summary(fit_1a_binom)
exp(coef(fit_1a_binom)) # estimated odds of deformity
exp(coef(fit_1a_binom)) / (1+exp(coef(fit_1a_binom))) # estimated prob of deformity
confint(fit_1a_binom)
exp(confint(fit_1a_binom))
exp(confint(fit_1a_binom)) / (1 + exp(confint(fit_1a_binom)))
```
```{r scenario1aQuasi, include=FALSE}
fit_1a_quasi = glm(phat_1a ~ 1, family=quasibinomial, weight=rep(10,24), data=scenario_1)
summary(fit_1a_quasi)
exp(coef(fit_1a_quasi)) # estimated odds of deformity
exp(coef(fit_1a_quasi)) / (1+exp(coef(fit_1a_quasi))) # estimated prob of deformity
confint(fit_1a_quasi)
exp(confint(fit_1a_quasi))
exp(confint(fit_1a_quasi)) / (1 + exp(confint(fit_1a_quasi)))
```
However, we can account for potential overdispersion with a __quasibinomial model__, just as we did in Section \@ref(sec-logOverdispersion), in case the observed variance is larger than the variance under a binomial model. Quasibinomial regression yields the same estimate for $\beta_0$ as the binomial regression model ($\hat{\beta}_0 = 0.067$), but we now have overdispersion parameter $\widehat{\phi} = 0.894$. This gives us the following 95\% profile likelihood-based confidence intervals:
\[
\begin{alignedat}{2}
&\textrm{(ref:CI) } p/(1-p) &&= (0.841, 1.359) \\
&\textrm{(ref:CI) } p &&= (0.457, 0.576)
\end{alignedat}
\]
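The quasibinomial fit is obtained the same way, swapping in `family = quasibinomial`; the sketch below (mirroring another hidden chunk in this *Rmd*) shows where $\widehat{\phi}$ comes from in the fitted object.
```{r, eval=FALSE}
fit_1a_quasi <- glm(phat_1a ~ 1, family = quasibinomial,
                    weights = rep(10, 24), data = scenario_1)
summary(fit_1a_quasi)$dispersion   # estimated overdispersion parameter phi
exp(confint(fit_1a_quasi))         # profile likelihood CI for the odds
```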
Turning to Scenario 1b, where each dam has a unique probability of producing a pup with a deformity based on a beta distribution, we can fit binomial and quasibinomial models as well.
```{r scenario1bBinom, include=FALSE}
fit_1b_binom <- glm(phat_1b ~ 1, family=binomial, weight=rep(10,24), data = scenario_1)
summary(fit_1b_binom)
exp(coef(fit_1b_binom)) # estimated odds of deformity
exp(coef(fit_1b_binom)) / (1+exp(coef(fit_1b_binom))) # estimated prob of deformity
confint(fit_1b_binom)
exp(confint(fit_1b_binom))
exp(confint(fit_1b_binom)) / (1 + exp(confint(fit_1b_binom)))
```
A binomial model gives regression equation
\begin{equation}
\log\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = 0.268,
(\#eq:mod1bBinom)
\end{equation}
with associated profile likelihood 95\% confidence intervals:
\[
\begin{alignedat}{2}
&\textrm{(ref:CI) } p/(1-p) &&= (1.014, 1.691) \\
&\textrm{(ref:CI) } p &&= (0.504, 0.628)
\end{alignedat}
\]
```{r scenario1bQuasi, include=FALSE}
fit_1b_quasi = glm(phat_1b ~ 1, family=quasibinomial, weight=rep(10,24), data=scenario_1)
summary(fit_1b_quasi)
exp(coef(fit_1b_quasi)) # estimated odds of deformity
exp(coef(fit_1b_quasi)) / (1+exp(coef(fit_1b_quasi))) # estimated prob of deformity
confint(fit_1b_quasi)
exp(confint(fit_1b_quasi))
exp(confint(fit_1b_quasi)) / (1 + exp(confint(fit_1b_quasi)))
```
We could compare this to a quasibinomial model. With overdispersion parameter $\widehat{\phi} = 6.858$, we now have profile likelihood-based intervals:
\[
\begin{alignedat}{2}
&\textrm{(ref:CI) } p/(1-p) &&= (0.673, 2.594) \\
&\textrm{(ref:CI) } p &&= (0.402, 0.722)
\end{alignedat}
\]
**Thought Questions**
> 4. Describe how the quasibinomial analysis of Scenario 1b differs from the binomial analysis of the same simulated data. Refer to Table \@ref(tab:simulationTableCh7) when answering this question; you will need to run the R code in the R markdown file for this chapter to completely fill out the table. Do confidence intervals contain the true model parameters?
> 5. Why are differences between quasibinomial and binomial models of Scenario 1a less noticeable than the differences in Scenario 1b?
## Scenario 2: Dose effect
In Scenario 1, each dam's probability of producing a deformed pup was independent of their dosage of the teratogen. In Scenario 2 we allow for a dose effect. Recall that in this hypothetical experiment we have 24 total dams evenly split into 4 groups receiving either no dose (coded as `dose = 0` mg), a low dose (`dose = 1`), a medium dose (`dose = 2`), or a high dose (`dose = 3`) of the teratogen.
We will suppose that the true probability that a dam's pup has a deformity is related to the dose the dam received through this model:
\[ \log \bigg(\frac{p}{1-p} \bigg) = -2 + 1.33\; (\textrm{dose}) \]
That is, we assume that the log odds of a deformity is linearly related to dose through the equation above, and the odds of a deformity are 3.79 times greater ($e^{1.33}$) for each 1mg increase in dose.
In Scenario 2a, a dam who received a dose of $x$ would have probability
\[p = P(\textrm{deformity}\mid \textrm{dose} = x) = e^{-2+1.33x}/(1+e^{-2+1.33x}). \]
Thus, dams who received doses of 0, 1, 2, and 3 mg would have probabilities 0.12, 0.34, 0.66, and 0.88, respectively, under Scenario 2a.
```{r scenario2aPCalculations, include=FALSE}
x <- 0:3
p_2 <- exp(-2+4/3*x)/(1+exp(-2+4/3*x))
p_2
```
In Scenario 2b, each dam who received a dose of $x$ has probability of deformity randomly chosen from a beta distribution where $\alpha = 2p/(1-p)$ and $\beta = 2$. These beta distribution parameters ensure that, on average, dams with a dose $x$ in Scenario 2b have the same probability of a deformed pup as dams with dose $x$ in Scenario 2a. For example, dams receiving the 1mg dosage under Scenario 2b would have probabilities following a beta distribution with $\alpha = 2(0.34)/(1-0.34) = 1.03$ and $\beta = 2$, which has mean $\frac{\alpha}{\alpha + \beta}=0.34$. The big difference is that *all* dams receiving the 1mg dosage in Scenario 2a have probability 0.34 of a deformed pup, whereas dams receiving the 1mg dosage in Scenario 2b each have a unique probability of a deformed pup, but those probabilities average out to 0.34.
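As a quick arithmetic check of this parameterization (illustration only), the beta mean $\alpha/(\alpha+\beta)$ recovers the Scenario 2a probability exactly:
```{r, eval=FALSE}
p <- 0.34             # Scenario 2a probability for the 1 mg dose group
a <- 2 * p / (1 - p)  # alpha = 2p/(1-p), about 1.03
b <- 2                # beta
a / (a + b)           # mean of the beta distribution; returns 0.34
```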
Figure \@ref(fig:scenario2bPlot) displays histograms for each dosage group of each dam's probability of producing deformed pups under Scenario 2b as well as theoretical distributions of probabilities. A vertical line is displayed at each hypothetical distribution's mean; the vertical line represents the fixed probability of a deformed pup for all dams under Scenario 2a.
```{r damScenario2, message=FALSE}
set.seed(1)
dose <- c(rep(0,6), rep(1,6), rep(2,6), rep(3,6))  # 6 dams per dose group
pi_2a <- exp(-2+4/3*dose)/(1+exp(-2+4/3*dose))     # Scenario 2a: p depends only on dose
count_2a <- rbinom(24, 10, pi_2a)                  # deformed pups among 10 per dam
b <- 2
a <- b*pi_2a / (1-pi_2a)                           # beta parameters with mean pi_2a
pi_2b <- rbeta(24, a, b)                           # Scenario 2b: dam-specific p by dose
count_2b <- rbinom(24, 10, pi_2b)                  # deformed pups among 10 per dam
```
```{r, scenario2bPlot, fig.align="center",out.width="60%",fig.height=8.5,fig.cap= "Observed (histograms) and theoretical (density curves) distributions of dams' probabilities of producing deformed pups by dose group in Scenario 2b. The red vertical line represents the fixed probability of a deformed pup by dose group in Scenario 2a.", echo=FALSE, warning=FALSE, message=FALSE}
scenario_2 <- tibble(dose, pi_2a, count_2a, pi_2b, count_2b)
theoretical_pi <- tibble(x = 1:50000,
p1 = rbeta(x, shape1 = 2*p_2[1]/(1-p_2[1]), shape2 = 2),
p2 = rbeta(x, shape1 = 2*p_2[2]/(1-p_2[2]), shape2 = 2),
p3 = rbeta(x, shape1 = 2*p_2[3]/(1-p_2[3]), shape2 = 2),
p4 = rbeta(x, shape1 = 2*p_2[4]/(1-p_2[4]), shape2 = 2))
hist1 <- ggplot() +
geom_histogram(data = scenario_2[1:6,], bins = 5, aes(x = pi_2b, y = ..density..)) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p1), bw = 0.05, linetype = 3) +
#stat_function(fun = dbeta,
# args = list(shape1 = 2*0.119/(1-0.119), shape2 = 2),
# xlim = c(0.01,1)) +
geom_vline(xintercept = p_2[1], color = "red", linetype = 2) +
labs(title = "Dosage = 0mg", x = "Probability of Deformity")
hist2 <- ggplot() +
geom_histogram(data = scenario_2[7:12,], aes(x = pi_2b, y = ..density..), bins = 5) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p2), bw = 0.05, linetype = 3) +
geom_vline(xintercept = p_2[2], color = "red", linetype = 2) +
labs(title = "Dosage = 1mg", x = "Probability of Deformity")
hist3 <- ggplot() +
geom_histogram(data = scenario_2[13:18,], aes(x = pi_2b, y = ..density..), bins = 5) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p3), bw = 0.05, linetype = 3) +
geom_vline(xintercept = p_2[3], color = "red", linetype = 2) +
labs(title = "Dosage = 2mg", x = "Probability of Deformity")
hist4 <- ggplot() +
geom_histogram(data = scenario_2[19:24,], aes(x = pi_2b, y = ..density..), bins = 5) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p4), bw = 0.05, linetype = 3) +
geom_vline(xintercept = p_2[4], color = "red", linetype = 2) +
labs(title = "Dosage = 3mg", x = "Probability of Deformity")
grid.arrange(hist1, hist2, hist3, hist4, ncol=1)
```
```{r scenario2Summary, include=FALSE}
scenario_2 %>%
summarise(mean_2a = mean(count_2a), sd_2a = sd(count_2a),
mean_2b = mean(count_2b), sd_2b = sd(count_2b) )
```
```{r scenario2Tab, echo=FALSE}
scenario2Tab <- scenario_2 %>%
group_by(dose) %>%
summarise(mean_2a_pi = round(mean(pi_2a),3), sd_2a_pi = round(sd(pi_2a),3),
mean_2a_cnt = round(mean(count_2a),3), sd_2a_cnt = round(sd(count_2a),3),
mean_2b_pi = round(mean(pi_2b),3), sd_2b_pi = round(sd(pi_2b),3),
mean_2b_cnt = round(mean(count_2b),3), sd_2b_cnt = round(sd(count_2b),3)) %>%
as.data.frame()
colnames(scenario2Tab) <- c("Dosage","Mean p", "SD p","Mean Count", "SD Count",
"Mean p", "SD p","Mean Count", "SD Count")
kable(scenario2Tab, booktabs = T, caption="Summary Statistics of Scenario 2 by Dose") %>%
add_header_above(c(" " = 1, "Scenario 2a" = 4, "Scenario 2b" = 4))
```
**Thought Questions**
> 6. Compare and contrast the probabilities associated with the 24 dams under Scenarios 1a and 1b to the probabilities under Scenarios 2a and 2b.
> 7. In Scenario 2a, dams produced 4.79 deformed pups on average, with standard deviation 3.20. Scenario 2b saw an average of 4.67 with standard deviation 3.58. Explain why comparisons by dose are more meaningful than these overall comparisons. You might refer to the results in Table \@ref(tab:scenario2Tab).
> 8. In Table \@ref(tab:simulationTableCh7), predict what you'll see in the column headed "CI odds ratio". Among the 4 entries: What can you say about the center and the width of the confidence intervals? Which will be similar and why? Which will be different and how?
We first model Scenario 2a without adjusting for potential overdispersion. Binomial regression gives us the model:
\begin{equation}
\log\bigg( \frac{\hat{p}}{1-\hat{p}} \bigg) = -2.02 + 1.26\;(\textrm{dose})
(\#eq:mod2aBinom)
\end{equation}
Equation \@ref(eq:mod2aBinom) has associated odds ratio corresponding to a 1mg dose increase of $e^{\beta_1} = 3.54$ with 95\% profile likelihood confidence interval $(2.61, 4.96)$. We can be 95\% confident that odds of deformity are between 2.61 and 4.96 times higher for each 1 mg increase in dose.
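The fit itself is run in a hidden chunk of this *Rmd*; a sketch of how the dose coefficient is converted to this odds ratio and interval is below.
```{r, eval=FALSE}
# phat_2a = count_2a / 10, the proportion of deformed pups for each dam
fit_2a_binom <- glm(phat_2a ~ dose, family = binomial,
                    weights = rep(10, 24), data = scenario_2)
exp(coef(fit_2a_binom)["dose"])       # estimated odds ratio per 1 mg, about 3.54
exp(confint(fit_2a_binom)["dose", ])  # 95% profile likelihood CI for the odds ratio
```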
```{r scenario2aBinom, include=FALSE}
scenario_2 <- scenario_2 %>%
mutate(dose = dose,
phat_2a = count_2a / 10, phat_2b = count_2b / 10,
logit_2a = log ((count_2a + 0.5) / (10 - count_2a + 0.5)),
logit_2b = log ((count_2b + 0.5) / (10 - count_2b + 0.5)) )
fit_2a_binom = glm(phat_2a ~ dose, family=binomial, weight=rep(10,24), data=scenario_2)
summary(fit_2a_binom)
exp(confint(fit_2a_binom))
```
Alternatively, we can use a quasibinomial model to account for any overdispersion in the binomial model. With $\widehat{\phi} = 1.27$, we have the 95\% profile likelihood confidence interval for $e^{\beta_1}$: $(2.51, 5.19)$.
```{r scenario2aQuasi, include=FALSE}
fit_2a_quasi = glm(phat_2a ~ dose, family=quasibinomial, weight=rep(10,24),
data=scenario_2)
summary(fit_2a_quasi)
exp(confint(fit_2a_quasi))
```
Turning to Scenario 2b, where probabilities differed among dams receiving the same dosage, we have the binomial model
\begin{equation}
\log\bigg( \frac{\hat{p}}{1-\hat{p}} \bigg) = -2.41 + 1.46\;(\textrm{dose})
(\#eq:mod2bBinom)
\end{equation}
Generating a 95\% confidence interval for $e^{\beta_1}$, we have: $(3.09, 6.27)$.
```{r scenario2bBinom, include=FALSE}
fit_2b_binom = glm(phat_2b ~ dose, family=binomial, weight=rep(10,24), data=scenario_2)
summary(fit_2b_binom)
exp(confint(fit_2b_binom))
```
If we use a quasibinomial model, we find overdispersion parameter $\widehat{\phi} = 1.93$, yielding 95\% confidence interval for $e^{\beta_1}$: $(2.74, 7.35)$.
```{r scenario2bQuasi, include=FALSE}
fit_2b_quasi = glm(phat_2b ~ dose, family=quasibinomial, weight=rep(10,24),
data=scenario_2)
summary(fit_2b_quasi)
exp(confint(fit_2b_quasi))
```
**Thought Questions**
> 9. Describe how the quasibinomial analysis of Scenario 2b differs from the binomial analysis of the same simulated data. Refer to Table \@ref(tab:simulationTableCh7) when answering this question; you will need to run the R code in the R markdown file for this chapter to completely fill out the table. Do confidence intervals contain the true model parameters?
> 10. Why are differences between quasibinomial and binomial models of Scenario 2a less noticeable than the differences in Scenario 2b?
> 11. Why does Scenario 2b contain correlated data that we must account for, while Scenario 2a does not?
## Case Study: Tree Growth
A student research team at St. Olaf College contributed to the efforts of biologist Kathy Shea to investigate a rich data set concerning forestation in the surrounding land [@Eisinger2011]. Here is a paragraph from the introduction to their project report:
> Much of south-central Minnesota was comprised of maple-basswood forest prior to agricultural development and settlement. Currently, the forests that remain are highly fragmented and disturbed by human activities. Original land surveys describe the forest, also known as the Big Woods ecosystem, as dominated by elm, maple, oak, and basswood. In order to recover the loss of ecosystem services that forests provide, many new forested areas have been established through restoration.
Tubes were placed on trees in some locations or *transects* but not in others. One research question is whether tree growth in the first year is affected by the presence of tubes. This analysis has a structure similar to the dams and pups study; the two study designs are depicted below.
![Data structures in the Dams and Pups (left) and Tree Growth (right) case studies](data/DamsTreesStructure.png)
Some transects were assigned to have tubes on all of their trees and other transects had no tubes on any of their trees, just as every dam assigned to a certain group received the same dose. Within a transect, each tree's first year of growth was measured, much like the presence or absence of a defect was noted for every pup within a dam. Although the response in the tree tube study is continuous (and somewhat normally distributed) rather than binary as in the dams and pups study, we can use methods to account for correlation of trees within a transect, just as we accounted for correlation of pups within dams.
### Format of the data set
We will consider a subset of the full data set in `treetube.csv` for illustration purposes here: the 382 trees with heights recorded in both 1990 and 1991. Thus, we will consider the following variables:
- `id` = a unique identifier for each tree
- `transect` = a unique identifier for each transect containing several trees
- `species` = tree species
- `tubes` = an indicator variable for the presence or absence of tubes for a given transect
- `height91` = first year height for each tree in meters
- `height90` = baseline height for each tree in meters
- `growth_yr1` = height91 - height90, in meters
A sample of 10 observations is displayed in Table \@ref(tab:treeTubeTab).
```{r treeTubesSetup, include=FALSE}
treetubes_yr1 <- read_csv("data/treetube.csv") %>%
filter(YEAR == 1990 | YEAR == 1991) %>%
spread(key = YEAR, value = HEIGHT) %>%
mutate(growth_yr1 = `1991` - `1990`) %>%
filter(!is.na(growth_yr1)) %>%
rename(tubes = TUBEX, id = ID, transect = TRANSECT, species = SPECIES,
height91 = `1991`, height90 = `1990`) %>%
dplyr::select(id, transect, species, tubes, height91, height90, growth_yr1)
```
```{r treeTubeTab, echo=FALSE}
kable(treetubes_yr1[c(137:146),], booktabs = T, caption="A sample of 10 trees and their growth from 1990 to 1991")
```
This portion of the data indicates that the four trees in transect 18 have tubes, while the other six trees listed do not. The concern with this kind of data structure is that trees from the same transect may be more similar or correlated with one another in contrast to trees from other transects. This could be true for a variety of reasons: some transects may receive less sun than others, or irrigation of the soil may differ from transect to transect. These unmeasured but possibly influential factors may imply a correlation among trees within transects. In that case, we would not have independent pieces of information, so that the number of trees within a transect would overstate the amount of independent information. To prepare for an analysis of this potentially correlated data, we examine the sources of variability in first-year tree growth.
### Sources of variability
First year tree growth may vary because of:
__Tube effects__ A purpose of this analysis is to determine whether tubes affect first year tree growth. Differences in mean growth based on the presence or absence of tubes would be of interest to researchers, and they would be included in a publication for this analysis. For this reason, tube effects are referred to as __fixed effects__. This is analogous to the dose effect in the dams and pups example.
__Transect effects__ For some of the factors previously mentioned such as sun exposure or water availability, first year growth may vary by transect. Knowing which specific transects produce greater growth is not of interest and would not appear in a publication of this study. These __random effects__ are analogous to dam effects which were not of inherent interest, but which we nevertheless wished to account for.
__Tree-to-tree variability within transects__ There is inherent variability in tree growth even when trees are subject to the same transect and treatment effects. This variability remains unexplained in our model.
Data sets with this kind of structure are often referred to as __multilevel data__, and the remaining chapters delve into models for multilevel data in gory detail. With a continuous response variable, we will actually add random effects for transects to a more traditional OLS regression model rather than estimate an overdispersion parameter as with a binary response. Either way, if observations are really correlated, proper accounting will lead to larger standard errors for model coefficients and larger (but more appropriate) p-values for testing the significance of those coefficients.
### Analysis preview: accounting for correlation within transect
Attempting to model the effects of tubes on tree growth, we could use OLS regression, which yields the model:
\begin{equation}
\textrm{Growth} = 0.106 - 0.040\; (\textrm{Tube}) + \epsilon
\end{equation}
```{r, treeTubesLinear, echo=FALSE, comment=NA}
tube_linear <- lm(growth_yr1 ~ tubes, data=treetubes_yr1)
# print(summary(tube_linear), show.resids=FALSE)
tmp <- capture.output(summary(tube_linear))
cat(tmp[c(3, 8:12)], sep='\n')
```
However, the OLS regression model assumes that all observations are independent, including trees in the same transect. One way to account for potential correlation is to estimate an additional parameter for the transect-to-transect variance. In other words, we are allowing for a random effect that each transect contributes to the overall variability. The multilevel model below does just that.
```{r treeTubesMulti1, echo=FALSE, comment=NA}
tube_multi1 <- lmer(growth_yr1 ~ tubes + (1|transect),
data = treetubes_yr1)
# print(summary(tube_multi1), correlation=FALSE, show.resids=FALSE)
tmp <- capture.output(summary(tube_multi1))
cat(tmp[c(2, 10:20)], sep='\n')
```
As we saw in the case of the binary outcomes, the standard error for the coefficients is larger when we take correlation into account. The t-statistic for tubes is smaller, reducing our enthusiasm for the tubes effect. This conservative approach occurs because the observations within a transect are correlated and therefore not independent as assumed in the naive model. We will study these models in depth in the remaining chapters.
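For reference, the multilevel fit above was produced with `lmer()` from the __lme4__ package, adding a random intercept for each transect:
```{r, eval=FALSE}
# (1 | transect) adds a random intercept for each transect, which absorbs
# transect-to-transect variability in first-year growth
tube_multi1 <- lmer(growth_yr1 ~ tubes + (1 | transect), data = treetubes_yr1)
summary(tube_multi1)
```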
## Summary
The most important idea from this chapter is that structures of data sets may imply that outcomes are correlated. Correlated outcomes provide less information than independent outcomes, resulting in effective sample sizes that are less than the total number of observations. Neglecting to take into account correlation may lead to underestimating standard errors of coefficients, overstating significance and precision. Correlation is likely and should be accounted for if basic observational units (e.g., pups, trees) are aggregated in ways that would lead us to expect units within groups to be similar.
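To get a rough sense of this loss of information, a common back-of-the-envelope calculation divides the total sample size by the design effect $1 + (m-1)\rho$, where $m$ is the group size and $\rho$ is the intraclass correlation. The values below are purely hypothetical, not estimates from either case study.
```{r, eval=FALSE}
# Hypothetical values, not estimated from the chapter's data
n   <- 240    # total observations, e.g., 24 dams x 10 pups
m   <- 10     # observations per group
rho <- 0.3    # assumed intraclass correlation
n / (1 + (m - 1) * rho)   # effective sample size, about 65 independent observations
```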
We have mentioned two ways to account for correlation: incorporate a dispersion parameter or include random effects. In the following chapters we will primarily focus on models with random effects. In fact, there are even more ways to account for correlation, including inflating the variance using Huber-White estimators (aka Sandwich estimators), and producing corrected variances using bootstrapping. These are beyond the scope of this text.
## Exercises
### Conceptual Exercises
1. __Examples with Correlated Data__ For each of the following studies:
- Identify the most basic observational units
- Identify the grouping units (could be multiple levels of grouping)
- State the response(s) measured and variable type (normal, binary, Poisson, etc.)
- Write a sentence describing the within-group correlation.
- Identify fixed and random effects
a. *Nurse Stress Study.* Four wards were randomly selected at each of 25 hospitals and randomly assigned to offer a stress reduction program for nurses on the ward or to serve as a control. At the conclusion of the study period, a random sample of 10 nurses from each ward completed a test to measure job-related stress. Factors assumed to be related include nurse experience, age, hospital size and type of ward.
b. *Epilepsy Study.* Researchers conducted a randomized controlled study where patients were randomly assigned to either an anti-epileptic drug or a placebo. For each patient, the number of seizures at baseline was measured over a 2 week period. For four consecutive visits the number of seizures were determined over the past 2 week period. Patient age and gender along with visit number were recorded.
c. *Cockroaches!* For a study of cockroach infestation, traps were set up in the kitchen, bathroom and bedroom in a random sample of 100 NYC apartments. The goal is to estimate cockroach infestation levels given tenant income and age of the building.
d. *Prairie Restoration.* Researchers at a small Midwestern college decided to experimentally explore the underlying causes of variation in reconstruction projects in order to make future projects more effective. Introductory ecology classes were organized to collect weekly data on plants in pots containing soil samples. Data will be examined to compare:
- germination and growth of two species of prairie plants—leadplants (*Amorpha canescens*) and coneflowers (*Ratibida pinnata*)
- soil from a cultivated (agricultural) field, a natural prairie, and a restored (reconstructed) prairie.
- the effect of sterilization, since half of the sampled soil was sterilized to determine if rhizosphere differences were responsible for the observed variation.
e. *Radon in Minnesota.* Radon is a carcinogen – a naturally occurring radioactive gas whose decay products are also radioactive – known to cause lung cancer in high concentrations. The EPA sampled more than 80,000 homes across the US. Each house came from a randomly selected county and measurements were made on each level of each home. Uranium measurements at the county level were included to improve the radon estimates.
f. *Teen Alcohol Use.* @Curran1997 collected data on 82 adolescents at three time points starting at age 14 to assess factors that affect teen drinking behavior. Key variables in the data set `alcohol.csv` (source: @Singer2003) are as follows:
- `id` = numerical identifier for subject
- `age` = 14, 15, or 16
- `coa` = 1 if the teen is a child of an alcoholic parent; 0 otherwise
- `male` = 1 if male; 0 if female
- `peer` = a measure of peer alcohol use, taken when each subject was 14. This is the square root of the sum of two 6-point items about the proportion of friends who drink occasionally or regularly.
- `alcuse` = the primary response. Four items—(a) drank beer or wine, (b) drank hard liquor, (c) 5 or more drinks in a row, and (d) got drunk—were each scored on an 8-point scale, from 0="not at all" to 7="every day". Then `alcuse` is the square root of the sum of these four items.
Primary research questions included:
- do trajectories of alcohol use differ by parental alcoholism?
- do trajectories of alcohol use differ by peer alcohol use?
2. __More dams and pups__ Describe how to generalize the pup and dam example by allowing for different size litters.
### Guided Exercises
1. __Exploring Beta Distributions.__ In the Dams and Pups Case Study, we use the beta distribution to randomly select the probability that a dam produces a defective pup. While the beta distribution is described in Chapter \@ref(ch-distthry), it can be valuable to play with the parameters $a$ and $b$ to see what distribution ranges and shapes are possible. Here is some basic code for plotting a beta density curve:
```{r, eval=FALSE}
# inputs for the beta distribution must be between 0 and 1
p <- seq(0,1,by=.05)
# To plot a beta density use dbeta; here I selected a=5, b=1
density <- dbeta(p,5,1)
plot(p, density, type = "l")
```
a. What values do beta random variables take on?
b. What do these values represent for the dams and pups simulation?
c. Do the possible values depend on $a$ or $b$?
d. What is a feature of the beta density when $a=b$?
e. What happens to the density when $a \neq b$?
f. How does the magnitude of $a$ or $b$ affect the density?
g. How does the difference between $a$ and $b$ affect the density?
h. If you wanted to simulate dams with mostly low probabilities of defects and a few with very high probabilities, how would you do it?
i. If you wanted to simulate dams with mostly high probabilities of defects and a few with very low probabilities, how would you do it?
j. If you wanted to simulate a population of dams where half of the probabilities of defects are very high and half are very low, how would you do it?
k. How might you decide on values for $a$ and $b$ if you have run a preliminary experiment and gathered data on the number of dams with deformed pups?
2. **Dams and Pups (continued)**. Modify the dams and pups simulation in the following ways. In each case, produce plots and describe the results of your modified simulation.
a. Pick a different beta distribution for Scenario 1b.
b. Center the beta distributions in Scenarios 1a and 1b somewhere other than 0.5.
c. Repeat Scenario 2a with 3 doses and an underlying logistic model of your own choosing. Then create beta distributions as in Scenario 2b to match your 3 doses.
### Note on Correlated Binary Outcomes
The correlated binomial counts simulated in the Dams and Pups Case Study are in fact beta-binomial random variables like those simulated in the Guided Exercises from Chapter \@ref(ch-distthry). Indeed, we could use the form of a beta-binomial pdf to model overdispersed binomial variables. Unlike the more generic form of accounting for correlation using dispersion parameter estimates, beta-binomial models are more specific and highly parameterized. This approach involves more assumptions but may also yield more information than the quasi-likelihood approach. If the beta-binomial model is incorrect, however, our results may be misleading. That said, the beta-binomial structure is quite flexible and conforms to many situations.