---
title: "Further analysis of pantropical spotted dolphins in the Gulf of Mexico"
output:
  html_document:
    toc: true
    toc_float: true
    theme: readable
    highlight: haddock
---
# Preamble
This exercise is based on the [Appendix of Miller et al. (2013)](http://distancesampling.org/R/vignettes/mexico-analysis.html). In this example we're ignoring all kinds of important things like detectability and availability, so this should not be treated as a serious analysis of these data! For a more complete treatment of detection-corrected abundance estimation via distance sampling and generalized additive models, see Miller et al. (2013).
From that appendix:
*The analysis is based on a dataset of observations of pantropical dolphins in the Gulf of Mexico (shipped with Distance 6.0 and later). For convenience the data are bundled in an `R`-friendly format, although all of the code necessary for creating the data from the Distance project files is available [on github](http://github.com/dill/mexico-data). The OBIS-SEAMAP page for the data may be found at the [SEFSC GoMex Oceanic 1996](http://seamap.env.duke.edu/dataset/25) survey page.*
## Doing these exercises
Probably the easiest way to do these exercises is to open this document in RStudio and work through the code blocks one by one (hitting the "play" button in the editor window), filling in code where necessary. You can then compile the document once you're done to check that everything works.
# Data format
The data are provided in the `data/mexdolphins` folder as the file `mexdolphins.RData`. Loading this we can see what is provided:
```{r loaddata}
load("data/mexdolphins/mexdolphins.RData")
ls()
```
- `mexdolphins` the `data.frame` containing the observations and covariates, used to fit the model.
- `pred_latlong` an `sp` object that has the shapefile for the prediction grid, used for fancy graphs
- `preddata` prediction grid without any fancy spatial stuff
Looking further into the `mexdolphins` frame we see:
```{r frameinspect}
str(mexdolphins)
```
A brief explanation of each entry:
- `Sample.Label` identifier for the effort "segment" (approximately square sampling area)
- `Transect.Label` identifier for the transect that this segment belongs to
- `longitude`, `latitude` location in lat/long of this segment
- `x`, `y` location in projected coordinates (projected using the [North American Lambert Conformal Conic projection](https://en.wikipedia.org/wiki/Lambert_conformal_conic_projection))
- `Effort` the length of the current segment
- `depth` the bathymetry at the segment's position
- `count` number of dolphins observed in this segment
- `segment.area` the area of the segment (`Effort` multiplied by the width of the segment)
- `off.set` the logarithm of the `segment.area` multiplied by a correction for detectability (see link to appendix above for more information on this)
# Modelling
Our objective here is to build a spatially explicit model of abundance of the dolphins. In some sense this is a kind of species distribution model.
Our possible covariates to model abundance are location and depth. These are fairly good predictors of the abundance (SPOILER ALERT), though we could probably improve the model further by including things like sea surface temperature and chlorophyll *a*.
## A simple model to start with
We can begin with the model we showed in the first lecture:
```{r simplemodel}
library(mgcv)
dolphins_depth <- gam(count ~ s(depth) + offset(off.set),
                      data = mexdolphins,
                      family = quasipoisson(),
                      method = "REML")
```
That is, we fit the counts as a function of depth, using the offset to account for effort. We use a quasi-Poisson response (i.e., modelling just the mean-variance relationship, such that the variance is proportional to the mean) and REML for smoothness selection.
We can check the assumptions of this model by using `gam.check`:
```{r simplemodel-check}
gam.check(dolphins_depth)
```
As is usual for count data, these plots are a bit tricky to interpret. For example, the residuals vs. linear predictor plot in the top right has that nasty line through it, which makes looking for patterns tricky. We can easily see that the line corresponds to the zero-count observations:
```{r zeroresids, fig.width=7, fig.height=7}
# code from the insides of mgcv::gam.check
resid <- residuals(dolphins_depth, type="deviance")
linpred <- napredict(dolphins_depth$na.action, dolphins_depth$linear.predictors)
plot(linpred, resid, main = "Resids vs. linear pred.",
     xlab = "linear predictor", ylab = "residuals")
# now add red dots corresponding to the zero counts
points(linpred[mexdolphins$count==0], resid[mexdolphins$count==0],
       pch=19, col="red", cex=0.5)
```
We can use randomised quantile residuals instead of deviance residuals to get around this in some cases (though not for quasi-Poisson, as we don't have a proper likelihood!). We'll try this out below, once we've fitted a model with a proper likelihood.
Ignoring the plots for now (as we'll address them in the next section), let's look at the text output. It seems that the `k` value we set (or rather, the default of 10) has been adequate.
We could increase the value of `k` by replacing the `s(...)` with, for example, `s(depth, k=25)` (for a possibly very wiggly function) or `s(depth, k=3)` (for a much less wiggly function). Making `k` big will create a bigger design matrix and penalty matrix.
**Exercise**
Look at the differences in the size of the design and penalty matrices by using `dim(model.matrix(...))` and `dim(model$smooth[[1]]$S[[1]])`, replacing `...` and `model` appropriately, for models with `k=3` and `k=30`.
```{r simplemodel-bigsmall}
dolphins_depth_bigk <- gam(count ~ s(depth, k=30) + offset(off.set),
                           data = mexdolphins,
                           family = quasipoisson(),
                           method = "REML")
dim(model.matrix(dolphins_depth_bigk))
dim(dolphins_depth_bigk$smooth[[1]]$S[[1]])

dolphins_depth_smallk <- gam(count ~ s(depth, k=3) + offset(off.set),
                             data = mexdolphins,
                             family = quasipoisson(),
                             method = "REML")
dim(model.matrix(dolphins_depth_smallk))
dim(dolphins_depth_smallk$smooth[[1]]$S[[1]])
```
(Don't worry about the many square brackets etc. needed to get at the penalty matrix!)
### Plotting
We can plot the smooth we fitted using `plot`.
**Exercise**
Compare the first model we fitted with the two using different `k` values above. Use `par(mfrow=c(1,3))` to put them all in one graphic. Look at `?plot.gam` and plot the confidence intervals as a filled "confidence band". Title the plots appropriately so you can check which is which.
```{r plotk}
par(mfrow=c(1,3))
plot(dolphins_depth, shade=TRUE, main="Default k")
plot(dolphins_depth_bigk, shade=TRUE, main="big k")
plot(dolphins_depth_smallk, shade=TRUE, main="small k")
```
## Count distributions
In general quasi-Poisson doesn't seem to do too great a job at modelling data with many zeros. Luckily we have a few tricks up our sleeves...
### Tweedie
Adding a smooth of `x` and `y` to our model with `s(x,y)`, we can then switch the `family=` argument to use `tw()` for a Tweedie distribution.
```{r tw}
dolphins_xy_tw <- gam(count ~ s(x,y) + s(depth) + offset(off.set),
                      data = mexdolphins,
                      family = tw(),
                      method = "REML")
```
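One nice feature of `tw()` is that the Tweedie power parameter is estimated as part of the fit, and the fitted family records it. As a quick aside, we can look it up like this (the exact formatting of the string may vary between mgcv versions):

```{r tw-power}
# the family name of the fitted model includes the estimated
# power parameter p (1 < p < 2)
dolphins_xy_tw$family$family
```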
More information on Tweedie distributions can be found in Foster & Bravington (2012) and Shono (2008).
### Negative binomial
**Exercise**
Now do the same using the negative binomial distribution (`nb()`).
```{r nb}
dolphins_xy_nb <- gam(count ~ s(x,y) + s(depth) + offset(off.set),
                      data = mexdolphins,
                      family = nb(),
                      method = "REML")
```
Looking at the quantile-quantile plots only in the `gam.check` output for these two models, which do you prefer? Why?
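For reference, mgcv's `qq.gam` draws just the Q-Q plot, so one way to put the two side by side (extra graphical parameters like `main` are passed through to the plot) is:

```{r qq-compare}
par(mfrow=c(1,2))
qq.gam(dolphins_xy_tw, main="Tweedie")
qq.gam(dolphins_xy_nb, main="negative binomial")
```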
*Looks like Tweedie is better here as the points are closer to the x=y line in the Q-Q plot. Also the histogram of residuals looks more (though not very) normal.*
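As promised earlier, here is a minimal sketch of randomised quantile residuals for the negative binomial model (which, unlike quasi-Poisson, has a proper likelihood). It assumes we can retrieve the estimated `theta` via the fitted family's `getTheta` function, which recent mgcv versions provide:

```{r rqr-nb}
# randomised quantile residuals: draw u uniformly between F(y-1) and F(y)
# under the fitted distribution, then map to standard normal quantiles
theta <- dolphins_xy_nb$family$getTheta(TRUE) # theta on the response scale
mu <- fitted(dolphins_xy_nb)
y <- mexdolphins$count
u <- runif(length(y),
           pnbinom(y - 1, size = theta, mu = mu),
           pnbinom(y, size = theta, mu = mu))
rqr <- qnorm(u)
qqnorm(rqr); qqline(rqr)
```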
Look at the results of calling `summary` on both models and note that there are differences in the resulting models, due to the differing mean-variance relationships.
## Smoothers
Now let's move onto using different bases for the smoothers. We have a couple of different options here.
### Thin plate splines with shrinkage
By default we use the `"tp"` basis. This is just plain thin plate regression splines (as defined in Wood, 2003). We can also use the `"ts"` basis, which is the same but with extra shrinkage on the usually unpenalised parts of the model. In the univariate case, this is the linear slope term of the smooth.
**Exercise**
Compare the results from one of the models above with a version using the thin plate with shrinkage using the `bs="ts"` argument to `s()` for both terms.
```{r tw-ts}
dolphins_xy_tw_ts <- gam(count ~ s(x,y, bs="ts") + s(depth, bs="ts") +
                           offset(off.set),
                         data = mexdolphins,
                         family = tw(),
                         method = "REML")
```
What are the differences (use `summary`)?
*EDFs are different for both terms*
What are the visual differences (use `plot`)?
*Not really much difference here!*
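One way to lay that comparison out, as a sketch:

```{r ts-compare}
summary(dolphins_xy_tw)
summary(dolphins_xy_tw_ts)
# plot both terms of each model on one device
par(mfrow=c(2,2))
plot(dolphins_xy_tw, shade=TRUE)
plot(dolphins_xy_tw_ts, shade=TRUE)
```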
### Soap film smoother
We can use a soap film smoother (Wood et al., 2008) to take into account a complex boundary, such as a coastline or islands.
Here I've built a simple coastline of the US states bordering the Gulf of Mexico (see the `soap_pred.R` file for how this was constructed). We can load up this boundary and the prediction grid from the following `RData` file:
```{r soapy}
load("data/mexdolphins/soapy.RData")
```
Now we need to build knots for the soap film. For this we simply create a grid, then keep the grid points that fall inside the boundary. We don't need too many of them.
```{r soapknots}
soap_knots <- expand.grid(x=seq(min(xy_bnd$x), max(xy_bnd$x), length.out=10),
                          y=seq(min(xy_bnd$y), max(xy_bnd$y), length.out=10))
x <- soap_knots$x; y <- soap_knots$y
ind <- inSide(xy_bnd, x, y)
rm(x,y)
soap_knots <- soap_knots[ind, ]
## inSide doesn't work perfectly, so if you get an error like:
## Error in crunch.knots(ret$G, knots, x0, y0, dx, dy) :
## knot 54 is on or outside boundary
## just remove the offending knot(s) as follows:
soap_knots <- soap_knots[-8, ]
soap_knots <- soap_knots[-54, ]
```
We can now fit our model. Note that we specify a basis via `bs=` and the boundary via `xt` (for e`xt`ra information) in the `s()` term. We also include the knots as a `knots=` argument to `gam`.
```{r soapmodel}
dolphins_soap <- gam(count ~ s(x, y, bs="so", xt=list(bnd=list(xy_bnd))) +
                       offset(off.set),
                     data = mexdolphins,
                     family = tw(),
                     knots = soap_knots,
                     method = "REML")
```
**Exercise**
Look at the `summary` output for this model and compare it to the other models.
```{r soap-summary}
summary(dolphins_soap)
```
The plotting function for soap film smooths looks much nicer by default than for other 2D smooths -- try it out.
```{r soap-plot}
plot(dolphins_soap)
```
# Predictions
As we saw in the intro to GAMs slides, `predict` is your friend when it comes to making predictions for the GAM.
We can do this very simply, calling predict as one would with a `glm`. For example:
```{r pred-qp}
pred_qp <- predict(dolphins_depth, preddata, type="response")
```
Now, this just gives a long vector of numbers for the predicted number of animals per cell. We can find the total abundance using `sum`.
**Exercise**
How many dolphins are there in the total area? What is the maximum in a given cell? What is the minimum?
```{r total-max-min}
sum(pred_qp)
range(pred_qp)
```
## Plotting predictions
*Note that this section requires quite a few additional packages to run the examples, so it may not run the first time. You can use* `install.packages` *to grab the packages you need.*
Plotting predictions in projected coordinate systems is tricky. I'll show two methods here, but not go into too much detail, as that's not the aim of this workshop.
For the non-soap models, we'll use the helper function below to put the predictions into a bunch of squares and then return an appropriate `ggplot2` object for us to plot:
```{r gridplotfn}
library(plyr)
# fill must be in the same order as the polygon data
grid_plot_obj <- function(fill, name, sp){
  # build a one-column data.frame to attach to the polygons
  names(fill) <- NULL
  row.names(fill) <- NULL
  data <- data.frame(fill)
  names(data) <- name

  spdf <- SpatialPolygonsDataFrame(sp, data)
  spdf@data$id <- rownames(spdf@data)
  spdf.points <- fortify(spdf, region="id")
  spdf.df <- join(spdf.points, spdf@data, by="id")

  # fortify seems to label the coordinates "long" and "lat"
  # even when they are projected
  spdf.df$x <- spdf.df$long
  spdf.df$y <- spdf.df$lat

  geom_polygon(aes_string(x="x", y="y", fill=name, group="group"),
               data=spdf.df)
}
```
Then we can plot the predicted abundance:
```{r plotpred}
library(ggplot2)
library(viridis)
library(sp)
library(rgeos)
library(rgdal)
library(maptools)
# projection string
lcc_proj4 <- CRS("+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
# need sp to transform the data
pred.polys <- spTransform(pred_latlong, CRSobj=lcc_proj4)
p <- ggplot() +
  # abundance
  grid_plot_obj(pred_qp, "N", pred.polys) +
  # survey lines
  geom_line(aes(x, y, group=Transect.Label), data=mexdolphins) +
  # observations
  geom_point(aes(x, y, size=count),
             data=mexdolphins[mexdolphins$count>0, ],
             colour="red", alpha=I(0.7)) +
  # make the coordinate system fixed
  coord_fixed(ratio=1, ylim=range(mexdolphins$y),
              xlim=range(mexdolphins$x)) +
  # use the viridis colour scheme, which is more colourblind-friendly
  scale_fill_viridis() +
  # labels
  labs(fill="Predicted\ndensity", x="x", y="y", size="Count") +
  # keep things simple
  theme_minimal()
print(p)
```
If you have the `maps` package installed, you can also try the following to plot the coastline too:
```{r maptoo}
library(maps)
map_dat <- map_data("usa")
map_sp <- SpatialPoints(map_dat[,c("long","lat")])
# give the sp object a projection
proj4string(map_sp) <- CRS("+proj=longlat +datum=WGS84")
# re-project
map_sp.t <- spTransform(map_sp, CRSobj=lcc_proj4)
map_dat$x <- map_sp.t$long
map_dat$y <- map_sp.t$lat
p <- p + geom_polygon(aes(x=x, y=y, group = group), fill = "#1A9850", data=map_dat)
print(p)
```
## Comparing predictions from soap and thin plate splines
The `soap_preddata` frame holds a larger prediction grid, covering a much wider area.
Predicting using the soap film smoother is exactly the same as for the other smoothers:
```{r soap-pred}
soap_pred_N <- predict(dolphins_soap, soap_preddata, type="response")
soap_pred_N <- cbind.data.frame(soap_preddata, N=soap_pred_N, model="soap")
```
Use the above as a template (along with `ggplot2::facet_wrap()`) to build a plot comparing the predictions of the soap film smoother and another model.
```{r xy-soap-pred}
dolphins_xy_q <- gam(count ~ s(x,y, bs="ts") + offset(off.set),
                     data = mexdolphins,
                     family = tw(),
                     method = "REML")
xy_pred_N <- predict(dolphins_xy_q, soap_preddata, type="response")
xy_pred_N <- cbind.data.frame(soap_preddata, N=xy_pred_N, model="xy")
all_pred_N <- rbind(soap_pred_N, xy_pred_N)
p <- ggplot() +
  # abundance
  geom_tile(aes(x=x, y=y, width=sqrt(area), height=sqrt(area), fill=N),
            data=all_pred_N) +
  # facet it!
  facet_wrap(~model) +
  # make the coordinate system fixed
  coord_fixed(ratio=1, ylim=range(mexdolphins$y),
              xlim=range(mexdolphins$x)) +
  # use the viridis colour scheme, which is more colourblind-friendly
  scale_fill_viridis() +
  # labels
  labs(fill="Predicted\ndensity", x="x", y="y") +
  # keep things simple
  theme_minimal() +
  # add the coastline from the previous plot
  geom_polygon(aes(x=x, y=y, group=group), fill="#1A9850", data=map_dat)
print(p)
```
## Plotting uncertainty
**Exercise**
We can use the `se.fit=TRUE` argument to get the per-cell standard errors, then divide these through by the abundances to get a coefficient of variation (CV) per cell. We can then plot that using the same technique as above. Try this out below.
Note that the resulting object from setting `se.fit=TRUE` is a list with two elements.
```{r CVmap}
pred_se_qp <- predict(dolphins_depth, preddata, type="response", se.fit=TRUE)
cv <- pred_se_qp$se.fit/pred_se_qp$fit
p <- ggplot() +
  # coefficient of variation
  grid_plot_obj(cv, "CV", pred.polys) +
  # survey lines
  geom_line(aes(x, y, group=Transect.Label), data=mexdolphins) +
  # make the coordinate system fixed
  coord_fixed(ratio=1, ylim=range(mexdolphins$y),
              xlim=range(mexdolphins$x)) +
  # use the viridis colour scheme, which is more colourblind-friendly
  scale_fill_viridis() +
  # labels
  labs(fill="CV", x="x", y="y") +
  # keep things simple
  theme_minimal()
print(p)
```
Compare the uncertainties of the different models you've fitted so far.
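As a sketch of one such comparison (this assumes `preddata` also contains the `x` and `y` columns that the spatial model needs):

```{r cv-compare}
pred_se_tw <- predict(dolphins_xy_tw, preddata, type="response", se.fit=TRUE)
cv_tw <- pred_se_tw$se.fit / pred_se_tw$fit
# compare the distributions of per-cell CVs for the two models
summary(cv)
summary(cv_tw)
```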
## `lpmatrix` magic
Now for some `lpmatrix` magic. We said before that the `lpmatrix` maps the parameters onto the predictions. Let's show that's true:
```{r lppred}
# make the Lp matrix
lp <- predict(dolphins_depth, preddata, type="lpmatrix")
# get the linear predictor
lin_pred <- lp %*% coef(dolphins_depth)
# apply the inverse link and multiply by the area, since the lpmatrix ignores the offset
pred <- preddata$area * exp(lin_pred)
# all the same?
all.equal(pred[,1], as.numeric(pred_qp), check.attributes=FALSE)
```
What else can we do? We can also grab the uncertainty for the sum of the predictions:
```{r lpNvar}
# extract the variance-covariance matrix
vc <- vcov(dolphins_depth)
# project the parameter var-covar matrix onto the linear predictor scale
lin_pred_var <- tcrossprod(lp %*% vc, lp)
# pre and post multiply by the derivatives of the link, evaluated at the
# predictions -- since the link is exp, we just use the predictions
pred_var <- matrix(pred,nrow=1) %*% lin_pred_var %*% matrix(pred,ncol=1)
# we can then calculate a total CV
sqrt(pred_var)/sum(pred)
```
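From that total standard error we can also sketch an approximate interval for the total abundance. A simple normal approximation on the response scale looks like the following (distance sampling analyses more commonly use a lognormal interval, but the idea is the same):

```{r lpNci}
N_hat <- sum(pred)
N_se <- drop(sqrt(pred_var))
# approximate 95% confidence interval for the total abundance
c(lower = N_hat - 1.96 * N_se, estimate = N_hat, upper = N_hat + 1.96 * N_se)
```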
As you can see, the `lpmatrix` can be very useful!
# Extra credit
- Experiment with `vis.gam` (watch out for the `view=` option) and plot the 2D smooths you've fitted. Check out the `too.far=` argument; a starting point is sketched after this list.
- Use the randomised quantile residuals plot to inspect the models you fitted above. What differences do you see between them and the deviance residuals? Which model would you choose?
- Redo the side-by-side soap film and another smoother plot, but showing the coefficient of variation. What difference does the soap film make?
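As a starting point for the first item, a minimal `vis.gam` call for one of the spatial models might look like this (the argument values here are just illustrative; see `?vis.gam` for the full set of options):

```{r visgam}
# contour plot of the spatial smooth, blanking out regions more than
# 10% of the covariate range away from any observation
vis.gam(dolphins_xy_tw, view=c("x", "y"), plot.type="contour", too.far=0.1)
```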
# References
- Foster, S. D., & Bravington, M. V. (2012). A Poisson–Gamma model for analysis of ecological non-negative continuous data. Environmental and Ecological Statistics, 20(4), 533–552. http://doi.org/10.1007/s10651-012-0233-0
- Miller, D. L., Burt, M. L., Rexstad, E. A., & Thomas, L. (2013). Spatial models for distance sampling data: recent developments and future directions. Methods in Ecology and Evolution, 4(11), 1001–1010. http://doi.org/10.1111/2041-210X.12105
- Shono, H. (2008). Application of the Tweedie distribution to zero-catch data in CPUE analysis. Fisheries Research, 93(1-2), 154–162. http://doi.org/10.1016/j.fishres.2008.03.006
- Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1), 95–114.
- Wood, S. N., Bravington, M. V., & Hedley, S. L. (2008). Soap film smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 931–955. http://doi.org/10.1111/j.1467-9868.2008.00665.x