-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathgimenez_practical_with_solutions.R
338 lines (306 loc) · 14.7 KB
/
gimenez_practical_with_solutions.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
#' ---
#' title: "Practicals (with solutions)"
#' author: "Olivier Gimenez"
#' output:
#' html_document: default
#' pdf_document: default
#' ---
#'
#' # Practical #1
#'
#' 1. Use simulation to generate a data set of 100 covariate values from $X \sim \text{Normal}(20, \sigma = 2)$. Use a histogram to explore the distribution of $X$.
#'
## ----------------------------------------------------------------------------
X <- rnorm(100, 20, 2)
X
hist(X)
#'
#' OK, looks normal to me, with mean 20 and some variability around it.
#'
#' 2. Generate a Poisson rate from these values according to the relationship lambda=exp(-3+0.2*X). Have a look graphically to the relationship between $\lambda$ and $X$. Hint: You might want to order the values of $X$ using the R function `sort()`.
#'
## ----------------------------------------------------------------------------
lambda <- exp(-3 + 0.2 * X)
plot(X,lambda, type="l")
#'
#' Something went wrong... Yes, we need to order the X values!
#'
## ----------------------------------------------------------------------------
X <- sort(X)
lambda <- exp(-3 + 0.2 * X)
plot(X,lambda, type="l")
#'
#' 3. Use simulation to generate a data set of 100 response values from a Poisson distribution according to this rate $Y \sim \text{Poisson}(\lambda(X))$. Explore the distribution of $Y$ using a bar plot (`?barplot`).
#'
## ----------------------------------------------------------------------------
Y <- rpois(100, lambda)
Y
#'
#' Explore the distribution of Y
#'
## ----------------------------------------------------------------------------
tab.Y <- table(Y)
tab.Y
barplot(tab.Y,col=rainbow(20),cex.axis=1.2,cex.names=1.2,main='')
#'
#'
#' 4. Construct a data frame containing the covariate (X) and response (Y) data. Have a look to the first and last rows.
#'
## ----------------------------------------------------------------------------
dat <- data.frame(cbind(X,Y))
head(dat)
tail(dat)
#'
#' 5. Use a GLM with the appropriate structure to try and retrieve the parameters -3 and 0.2 from question 2.
#'
## ----------------------------------------------------------------------------
m <- glm(Y~X, family=poisson, dat)
broom::tidy(m)
#'
#' It looks as though the intercept and slope estimates are close to the values -3 and 0.2 we used to simulate the data.
#'
#' 6. Increase sample size (multiply by 1000) and comment on the parameter estimates.
#'
## ----------------------------------------------------------------------------
N <- 100000
X <- sort(rnorm(N, 20, 2))
lambda <- exp(-3 + 0.2 * X)
Y <- rpois(N, lambda)
dat <- data.frame(cbind(X,Y))
m <- glm(Y~X, family=poisson, dat)
summary(m)
#'
#' With increasing sample size, the estimates get much closer to the values we used to generate the data (i.e. the truth). That's an appealing property of maximum likelihood estimates.
#'
#' 7. Overdispersion. We would like to do the same exercise with an overdispersed Poisson distribution. To do so, we need to generate data from a quasi-Poisson distribution. Interestingly, the negative-binomial (NB) distribution allows relaxing the Poisson assumption $E(Y)=V(Y)=\lambda$. There are several parameterizations for the NB distribution. Here, we will use $W \sim NB(\lambda,\phi)$ where $\lambda$ is the expected value as in the Poisson distribution and $\phi$ is the overdispersion parameter. From the NB properties, we have that the mean of $W$ is $\lambda$ while its variance is $\lambda + \lambda^2/\phi$. Therefore, a small value of $\phi$ means a large deviation from a Poisson distribution (the variance of $W$ is much larger than $\lambda$, which would also be the mean for a Poisson distribution), while as $\phi$ gets larger the NB looks more and more like a Poisson distribution (the term $\lambda^2/\phi$ tends to 0, and the mean and variance looks more and more alike). In R, we will specify:
#'
## ----------------------------------------------------------------------------
lambda = 2
phi = 5
n = 1000
# simulate the response
w_nb <- rnbinom(n,size=phi,mu=lambda)
# recover the NB parameters from its mean and variance
mean(w_nb) # lambda
mean(w_nb)^2/(var(w_nb)-mean(w_nb)) # phi
#'
#' Adopt the same approach as above to simulate data from a Poisson distribution with overdispersion parameter $\phi=0.1$. Inspect the residuals. Do they look ok to you? If not, fit a GLM with an appropriate structure. Have a look to the relationship between the response and the predictor for both models, without and with the overdispersion parameter (use the `visreg` from the package `visreg`). Is there any difference?
#'
## ----------------------------------------------------------------------------
N <- 1000
X <- sort(rnorm(N, 20, 2))
lambda <- exp(-3 + 0.2 * X)
phi <- 0.1
Y <- rnbinom(N,size=phi,mu=lambda)
dat <- data.frame(cbind(X,Y))
m <- glm(Y~X, family=poisson, dat)
DHARMa::simulateResiduals(m, plot = TRUE)
#'
#' Something is weird in the residuals, clearly the Poisson distribution does not seem to be appropriate for these data. Is it an overdispersion issue?
#'
## ----------------------------------------------------------------------------
simres <- DHARMa::simulateResiduals(m, refit = TRUE)
DHARMa::testOverdispersion(simres)
#'
#' Yes, overdispersion is significant. Let's fit a model with overdispersion (quasiPoisson distribution).
#'
## ----------------------------------------------------------------------------
m_quasi <- glm(Y~X, family=quasipoisson, dat)
#'
## ----pois_overdisp_eff1, echo=FALSE------------------------------------------
par(mfrow=c(1,2))
visreg::visreg(m, scale = "response",main='Poisson')
visreg::visreg(m_quasi, scale = "response",main='QuasiPoisson')
#'
#' The uncertainty is bigger when overdispersion is accounted for. Overdispersion is often related to pseudoreplication. In brief, it means that there is some dependence among the statistical units (e.g., individuals). As a consequence, while you might think you have, say, a hundred individuals, it is actually (much) less in a statistical sense because some of them bring the same information due to their dependence (ressemblance) with each other. In turn, the confidence intervals are narrower than they should be. This issue is addressed by accounting for overdispersion with a quasiPoisson distribution, therefore explaining the wider confidence intervals.
#'
#'
#' # Practical #2
#'
#' 1. Relationship between tree diameter and height and LMM. You have data on 1000 trees from 10 plots, with 4 up to 392 trees per plot and several measurements taken on each tree.
#'
## ----------------------------------------------------------------------------
# dbh is diameter at breast height (diameter of the trunk)
trees <- read.table("trees.txt", header=TRUE)
str(trees)
head(trees)
#'
#' Fit a linear model with height as the response variable and dbh as a predictor. Plot the data on a graph, and add the regression line (use `abline`).
#'
## ----------------------------------------------------------------------------
lm.simple <- lm(height ~ dbh, data = trees)
summary(lm.simple)
#'
## ----------------------------------------------------------------------------
plot(height ~ dbh, data=trees, las=1, xlab="DBH (cm)", ylab="Height (m)", ylim = c(0, 50), main = "Single intercept")
abline(lm(height ~ dbh, data=trees), lwd=4, col="red")
#'
#' Now we have only one intercept. What if allometry varies among plots? Fit a linear model with a different intercept for each plot, and inspect the results with `broom::tidy`. Try and represent the 20 regression lines on the same graph.
#'
## ----------------------------------------------------------------------------
lm.interc <- lm(height ~ factor(plot) + dbh, data = trees)
broom::tidy(lm.interc)
#'
## ----------------------------------------------------------------------------
plot(trees$dbh[trees$plot==1], trees$height[trees$plot==1],
pch=20, las=1, xlab="DBH (cm)", ylab="Height (m)", col=1,
ylim=c(0,50), main = "Different intercept for each plot")
abline(a=coef(lm.interc)[1], b=coef(lm.interc)[11], col=1, lwd=2) # plot 1
for(i in 2:10){
points(trees$dbh[trees$plot==i], trees$height[trees$plot==i], pch=20, col=i)
abline(a=coef(lm.interc)[1] + coef(lm.interc)[i], b=coef(lm.interc)[11], col=i, lwd=2) # plot 2-10
}
#'
#' How many parameters are estimated in the model pooling all plots together vs. the model considering a different intercept for each plot?
#'
#' The answer is 3 (intercept, slope, residual variance) vs. 12 (10 intercepts, 1 slope, 1 residual variance).
#'
#' Mixed models enable us to make a compromise between the two models by accounting for plot-to-plot variability.
#'
#' Recall that:
#' $$
#' \begin{aligned}
#' y_{ij} \sim N(\mu_{ij},\sigma^2) \\
#' \mu{ij} = a_{j} + b x_{i} \\
#' a_{j} \sim N\left( \mu_a,\tau^2 \right) \\
#' \end{aligned}
#' $$
#'
#' How does this GLMM formulation translate into in our example? In our example, it translates into:
#'
#' $$
#' \begin{aligned}
#' \text{Height}_{ij} \sim N(\mu_{ij},\sigma^2) \\
#' \mu_{ij} = a_{j} + b \;DBH_{i} \\
#' a_{j} \sim N\left( \mu_a,\tau^2 \right) \\
#' j = 1, \ldots, 10 \; \text{(plots)} \\
#' \end{aligned}
#' $$
#'
#' We have the gradient:
#'
#' - **complete pooling**: Single overall intercept.
#' - `lm (height ~ dbh)`
#'
#' - **no pooling**: One *independent* intercept for each plot.
#' - `lm (height ~ dbh + factor(plot))`
#'
#' - **partial pooling**: Inter-related intercepts.
#' - `lmer(height ~ dbh + (1 | plot))`
#'
#'
#' Fit model with plot as a random effect. Comment on the outputs.
#'
## ----------------------------------------------------------------------------
library(lme4)
mixed <- lmer(height ~ dbh + (1|plot), data = trees)
summary(mixed)
#'
#' The following are the important pieces:
#'
#' ```{}
#' Random effects:
#' Groups Name Variance Std.Dev.
#' plot (Intercept) 19.834 4.454
#' Residual 8.325 2.885
#' Number of obs: 1000, groups: plot, 10
#' ```
#'
#' This tells us the variance and standard deviation (Std.Dev. = sqrt(Variance)) of our random intercept on `plot` ($\tau^2$) and of the residual variation `Residual` ($\sigma^2$). We are also told how many rows of data we have `Number of obs: 1000` and how many random intercept groups we have `plot, 10`.
#'
#' ```{}
#' Fixed effects:
#' Estimate Std. Error t value
#' (Intercept) 14.79816 1.43742 10.29
#' dbh 0.60566 0.00704 86.03
#' ```
#'
#' This tells us our fixed effect estimates and their standard errors. The row with (Intercept) has to do with $\mu_a$, while that with dbh has to do with $b$.
#' Assuming normality, a 95% confidence interval on those coefficients can be obtained with their estimate +/- 1.96 the standard error. In other words approximately estimate +/- 2*SE.
#'
#' Visualise the effect using `allEffects` and/or `visreg`.
#'
## ----------------------------------------------------------------------------
effects::allEffects(mixed)
#'
## ----------------------------------------------------------------------------
plot(effects::allEffects(mixed))
#'
#'
## ----------------------------------------------------------------------------
visreg::visreg(mixed,xvar='dbh')
#'
#'
#'
#'
#'
#'
#'
#' 2. Longitudinal study on coral reef and GLMM. A survey of a coral reef uses 10 predefined linear transects covered by divers once every week. The response variable of interest is the abundance of a particular species of anemone as a function of water temperature. Counts of anemones are recorded at 20 regular line segments along the transect. The following piece of code will generate a data set with realistic properties according to the above design. Make sure you understand what it is doing. You might want to explain the script to the colleague next to you.
#'
## ----------------------------------------------------------------------------
transects <- 10
data <- NULL
for (tr in 1:transects){
ref <- rnorm(1,0,.5) # random effect (intercept)
t <- runif(1, 18,22) + runif(1,-.2,0.2)*1:20 # water temperature gradient
ans <- exp(ref -14 + 1.8 * t - 0.045 * t^2) # Anemone gradient (expected response)
an <- rpois(20, ans) # Actual observations: counts on 20 segments of the current transect
data <- rbind(data,cbind(rep(tr, 20), t, an))
}
#' str(data)
#' To try and make sense of the code of others, it is always good to plot and/or run small sections of the code.
## ----------------------------------------------------------------------------
ref <- rnorm(1,0,.5) # random effect (intercept)
t <- runif(1, 18,22) + runif(1,-.2,0.2)*1:20 # water temperature gradient
plot(t,type='l')
#'
## ----------------------------------------------------------------------------
ans <- exp(ref -14 + 1.8 * t - 0.045 * t^2) # Anemone gradient (expected response)
plot(t,log(ans),type='l')
#'
#' Generate a data set using the anemone code and fit to it the following mixed effects models:
#'
#' - m1: GLM with temperature
#'
#' - m2: GLM with quadratic formula in temperature
#'
#' - m3: GLMM with temperature and random intercept
#'
#' - m4: GLMM with quadratic temperature and random intercept
#'
#' Carry out model selection from this set of models. Inspect the temperature effect with `visreg`.
#'
## ----------------------------------------------------------------------------
data <- data.frame(Transect=data[,1],Temperature=data[,2],Anemones=data[,3])
plot(data$Temperature, data$Anemones)
#'
## ----------------------------------------------------------------------------
library(lme4)
m1 <- glm(Anemones ~ Temperature, data=data, family=poisson)
m2 <- glm(Anemones ~ Temperature + I(Temperature^2), data=data, family=poisson)
m3 <- glmer(Anemones ~ Temperature + (1 | Transect), data=data, family=poisson)
m4 <- glmer(Anemones ~ Temperature + I(Temperature^2) + (1 | Transect), data=data, family=poisson)
AIC(m1,m2,m3,m4)
#'
#' There seems to be issues of convergence, and the warning message is quite clear in that we should standardize our Temperature covariate. Let's do it then:
#'
## ----------------------------------------------------------------------------
data$Temp <- (data$Temperature - mean(data$Temperature))/sd(data$Temperature)
head(data)
#'
#' A column Temp has been added to our dataframe.
#'
## ----------------------------------------------------------------------------
library(lme4)
m1 <- glm(Anemones ~ Temp, data=data, family=poisson)
m2 <- glm(Anemones ~ Temp + I(Temp^2), data=data, family=poisson)
m3 <- glmer(Anemones ~ Temp + (1 | Transect), data=data, family=poisson)
m4 <- glmer(Anemones ~ Temp + I(Temp^2) + (1 | Transect), data=data, family=poisson)
AIC(m1,m2,m3,m4)
#'
#' The model ranking is the same, and we do no longer get warnings. Let's have a look to the relationship between the number of anemones and water temperature.
#'
## ----------------------------------------------------------------------------
visreg::visreg(m4,xvar='Temp')
#'