-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtwo_way_anova.qmd
455 lines (342 loc) · 18.8 KB
/
two_way_anova.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
# Two-way ANOVA
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Overview
In the last chapter, we learnt how to use and interpret the general
linear model when the *x* variable was categorical with two or more groups.
This procedure is known as one-way ANOVA. We will now incorporate a *second*
categorical variable. This is often known as the two-way or two-factor ANOVA
(**an**alysis **o**f **var**iance). It might also be described as a "factorial
design".
In the last chapter we conducted a one-way ANOVA to test whether there was an
of media on the diameter of bacterial colonies. Suppose we had run this
experiment on more than one bacterial species. We would then have two
explanatory variables: media and species. Perhaps we should do two one-way
ANOVAs, one for each explanatory variable? This would tell us whether each
explanatory variable had an effect on the response variable. However, it would
not tell us whether there was an **interaction** between the two variables. An
interaction is when the effect of one variable depends on the level of another
and in some ways it is the most interesting result we could reveal. For example,
we might find that the optimal media for one species was not the optimal media
for the other species. A two-way ANOVA allows us to test for the effects of each
explanatory variable *and* whether they interact. A two-way ANOVA has three null
hypotheses:
1. There is no effect of media on diameter, or there is no difference between
the mean diameter of colonies grown on different media.
2. There is no effect of species on diameter, or there is no difference between
bacteria in colony diameter.
3. The effect of media is the same for all species, or the effect of media does
not depend on species.
As a General linear model, two-way ANOVA is a parametric test with assumptions
based on the normal distribution which need to be met for the *p*-values
generated to be accurate. We use `lm()` to fit the model, `anova()` to test
the significance of the explanatory variables and `emmeans()` and pairs() for
post-hoc testing.
### Model assumptions
We examine the model assumptions in the same way as we did for single linear
regression, two-sample tests and one-way ANOVA: we apply the general linear
model with `lm()` and *then* check the assumptions using diagnostic plots and
the Shapiro-Wilk normality test. We also use common sense:
- the response should be continuous (or nearly continuous, see
[Ideas about data: Theory and practice](ideas_about_data.html#theory-and-practice))
and we consider whether we would expect the response to be normally
distributed.
- we expect decimal places and few repeated values.
### Reporting
In reporting the result of two-way ANOVA, we
include:
1. the significance of each of the explanatory variables and the interaction
between them.
- The *F*-statistic and *p*-value for each explanatory variable and the
interaction term.
2. the direction of effect - which of the means is greater
- Post-hoc test
3. the magnitude of effect - how big is the difference between the
means
- the means and standard errors for each group
Figures should reflect what you have said in the statements. Ideally
they should show both the raw data and the statistical model: means and standard errors
We will explore all of these ideas with some examples.
## 🎬 Your turn!
If you want to code along you will need to start a new [RStudio
project](workflow_rstudio.html#rstudio-projects), add a `data-raw`
folder and open a new script. You will also need to load the
**`tidyverse`** package [@tidyverse].
## Two-way ANOVA
Researchers have collected live specimens of two species of periwinkle (See @fig-periwinkles) from sites in northern England in the Spring and Summer. They took a measure of the gut parasite load by examining a slide of gut contents. The data are in [periwinkle.txt](data-raw/periwinkle.txt). The data were collected to determine whether there was an effect of season or species on parasite load and whether these effects were independent.
![Periwinkles are marine gastropod molluscs (slugs and snails). A) *Littorina brevicula* (PD files - Public Domain, https://commons.wikimedia.org/w/index.php?curid=30577419) B) *Littorina littorea*. (photographed by Guttorm Flatabø (user:dittaeva). - Photograph taken with an Olympus Camedia C-70 Zoom digital camera. Metainformation edited with Irfanview, possibly cropped with jpegcrop., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=324769](images/Littorina.jpg){#fig-periwinkles fig-alt=""}
### Import and explore
Import the data:
```{r}
periwinkle <- read_delim("data-raw/periwinkle.txt", delim = "\t")
```
```{r}
#| echo: false
knitr::kable(periwinkle) |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```
The Response variable is parasite load and it appears to be continuous. The Explanatory variables are species and season and each has two levels. It is known “two-way ANOVA” or “two-factor ANOVA” because there are two explanatory variables.
These data are in tidy format [@Wickham2014-nl] - all the parasite load
values are in one column (`para`) with the other columns indicating the `species` and the `season`. This means they are well formatted for analysis and plotting.
In the first instance it is sensible to create a rough plot of our data.
This is to give us an overview and help identify if there are any issues
like missing or extreme values. It also gives us idea what we are
expecting from the analysis which will make it easier for us to identify
if we make some mistake in applying that analysis.
Violin plots (`geom_violin()`, see @fig-periwinkle-rough), box plots
(`geom_boxplot()`) or scatter plots (`geom_point()`) all make good
choices for exploratory plotting and it does not matter which of these
you choose.
```{r}
#| label: fig-periwinkle-rough
#| fig-cap: "The parasite load for two species of Littorina indicated by the fill colour, in the Spring and Summer. Parasite load seems to be higher for both species in the summer and that effect looks bigger in L.brevicula - it has the lowest spring mean but the highest summer mean."
ggplot(data = periwinkle,
aes(x = season, y = para, fill = species)) +
geom_violin()
```
R will order the groups alphabetically by default.
The figure suggests that parasite load is higher for both species in the summer and that effect looks bigger in *L.brevicula* - it has the lowest spring mean but the highest summer mean.
Summarising the data for each species-season combination is the next sensible step. The most
useful summary statistics are the means, standard deviations, sample
sizes and standard errors. I recommend the `group_by()` and
`summarise()` approach:
```{r}
peri_summary <- periwinkle |>
group_by(season, species) |>
summarise(mean = mean(para),
sd = sd(para),
n = length(para),
se = sd / sqrt(n))
```
We have save the results to `peri_summary` so that we can use the
means and standard errors in our plot later.
```{r}
peri_summary
```
The summary confirms both species have a higher mean in the summer and that the difference between the species is reversed - *L.brevicula* minus *L.littorea* is `r peri_summary$mean[1]-peri_summary$mean[2]` in the spring but `r peri_summary$mean[3]-peri_summary$mean[4]` in summer.
### Apply `lm()`
We can create a two-way ANOVA model like this:
```{r}
mod <- lm(data = periwinkle, para ~ species * season)
```
And examine the model with:
```{r}
summary(mod)
```
The Estimates in the Coefficients table give:
- `(Intercept)` known as $\beta_0$. This the mean of the group with the first
level of *both* explanatory variables, *i.e.*, the mean of *L.brevicula*
group in Spring. Just as the intercept is the
value of *y* (the response) when the value of *x* (the explanatory) is zero
in a simple linear regression, this is the value of `para` when both
`season` and `species` are at their first levels. The order of the levels
is alphabetical by default.
- `speciesLittorina littorea` known as $\beta_1$. This is what needs to be
added to the *L.brevicula*
Spring mean (the intercept) when the `species` goes from its first level
to its second. That is, it is the difference between the *L.brevicula*
and *L.littorea* means in Spring. The
`speciesLittorina littorea` estimate is positive so the the *L.littorea*
mean is higher than the *L.brevicula* mean in Spring. If we had more
species we would have more estimates beginning `species...` and all would
be comparisons to the intercept.
- `seasonSummer` known as $\beta_2$. This is what needs to be added to the
*L.brevicula* Spring mean
(the intercept) when the `season` goes from its first level to its second.
That is, it is the difference between the Spring and Summer means for
*L.brevicula*. The `seasonSummer` estimate
is positive so the the Summer mean is higher than the Spring mean. If we had
more seasons we would have more estimates beginning `season...` and all would
be comparisons to the intercept.
- `speciesLittorina littorea:seasonSummer` known as $\beta_3$. This is
interaction effect. It is an additional effect. Going from *L.brevicula* to
*L.littorea* adds $\beta_1$ to the intercept. Going from Spring to Summer
adds $\beta_2$ to the intercept. Going from *L.brevicula* in Spring to
*L.littorea* in Summer adds $\beta_1 + \beta_2 + \beta_3$ to the intercept.
If $\beta_3$ is zero then the effect of `species` is the same in both
`season`s. If $\beta_3$ is not zero then the effect of `species` is different
in the two `season`s.
The *p*-values on each line are tests of whether that coefficient is
different from zero.
The *F* value and *p*-value in the last line are a test of whether the
model as a whole explains a significant amount of variation in the
response variable. The model of season and species overall explains a
significant amount of the variation in parasite load (`p-value: 3.043e-06`).
To see which of the three effects are significant we can use the `anova()`
function on our model.
Determine which effects are significant:
```{r}
anova(mod)
```
The parasite load is significantly greater in Summer (*F* = 25.9; *d.f.* = 1, 96; *p* < 0.0001) but this effect differs between species (*F* = 6.2; *d.f.* = 1,96; *p* = 0.014) with a greater increase in parasite load in *L.brevicula* than in *L.littorea*.
We need a post-hoc test to see which comparisons are significant and can again use then **`emmeans`** [@emmeans] package.
Load the package
```{r}
library(emmeans)
```
Carry out the post-hoc test
```{r}
emmeans(mod, ~ species * season) |> pairs()
```
Each row is a comparison between the two means in the 'contrast' column. The 'estimate' column is the difference between those means and the 'p.value' indicates whether that difference is significant.
A plot can be used to visualise the result of the post hoc which can be especially useful when there are very many comparisons.
Plot the results of the post-hoc test:
```{r}
emmeans(mod, ~ species * season) |> plot()
```
We have significant differences between:
- *L.brevicula* in the Spring and Summer `p <.0001`
- *L.brevicula* in the Spring and *L.littorea* in the Summer `p = 0.0004`
- *L.littorea* in the Spring *L.brevicula* in the Summer `p = 0.0172`
### Check assumptions
Check the assumptions: All general linear models assume the "residuals"
are normally distributed and have "homogeneity" of variance.
Our first check of these assumptions is to use common sense: diameter is
a continuous and we would expect it to be normally distributed thus we
would expect the residuals to be normally distributed thus we would
expect the residuals to be normally distributed
We then proceed by plotting residuals. The `plot()` function can be used
to plot the residuals against the fitted values (See @fig-anova2-plot1).
This is a good way to check for homogeneity of variance.
```{r}
#| label: fig-anova2-plot1
#| fig-cap: "A plot of the residuals against the fitted values shows whether the points are distributed similarly in each group. Any difference seems small but perhaps the residuals are less variable for the lowest mean."
plot(mod, which = 1)
```
We can also use a histogram to check for normality (See @fig-anova2-plot2).
```{r}
#| label: fig-anova2-plot2
#| fig-cap: "A histogram of residuals is symetrical and seems consistent with a normal distribution. This is a good sign for the assumption of normally distributed residuals."
ggplot(mapping = aes(x = mod$residuals)) +
geom_histogram(bins = 10)
```
Finally, we can use the Shapiro-Wilk test to test for normality.
```{r}
shapiro.test(mod$residuals)
```
The p-value is greater than 0.05 so this test of the normality
assumption is not significant.
Taken together, these results suggest that the assumptions of normality
and homogeneity of variance are probably not violated.
### Report
We might report this result as:
The parasite load was significantly greater in Summer (*F* = 25.9; *d.f.* = 1,
96; *p* < 0.0001) but this effect differed between species (*F* = 6.2;
*d.f.* = 1,96; *p* = 0.014) with a greater increase in parasite load in
*L.brevicula* (from $\bar{x} \pm s.e$: 57.0 $\pm$ 1.77 units to 73.5 $\pm$ 2.24
units) than in *L.littorea* (from 64.2 $\pm$ 2.38 units to 69.9 $\pm$ 2.30
units). See @fig-para.
::: {#fig-para}
```{r}
#| code-fold: true
ggplot() +
geom_point(data = periwinkle, aes(x = season,
y = para,
shape = species),
position = position_jitterdodge(dodge.width = 1,
jitter.width = 0.4,
jitter.height = 0),
size = 3,
colour = "gray50") +
geom_errorbar(data = peri_summary,
aes(x = season, ymin = mean - se, ymax = mean + se, group = species),
linewidth = 0.4, size = 1,
position = position_dodge(width = 1)) +
geom_errorbar(data = peri_summary,
aes(x = season, ymin = mean, ymax = mean, group = species),
linewidth = 0.3, size = 1,
position = position_dodge(width = 1) ) +
scale_x_discrete(name = "Season") +
scale_y_continuous(name = "Number of parasites",
expand = c(0, 0),
limits = c(0, 140)) +
scale_shape_manual(values = c(19, 1),
name = NULL,
labels = c(bquote(italic("L.brevicula")),
bquote(italic("L.littorea")))) +
# *L.brevicula* in the Spring and Summer `p <0.0001`
annotate("segment",
x = 0.75, xend = 1.75,
y = 115, yend = 115,
colour = "black") +
annotate("text",
x = 1.25, y = 119,
label = "p < 0.0001") +
# # *L.brevicula* in the Spring and *L.littorea* in the Summer `p = 0.0004`
annotate("segment",
x = 0.75, xend = 2.25,
y = 125, yend = 125,
colour = "black") +
annotate("text", x = 1.5, y = 129,
label = "p = 0.0004") +
# *L.littorea* in the Spring *L.brevicula* in the Summer `p = 0.0172`
annotate("segment",
x = 1.25, xend = 1.75,
y = 105, yend = 105,
colour = "black") +
annotate("text", x = 1.5, y = 109,
label = "p = 0.0172") +
theme_classic() +
theme(legend.title = element_blank(),
legend.position = c(0.85, 0.15))
```
**Parasite load is greater in Summer especially for *L.brevicula*.** Live
specimens of two species of periwinkle, *L.brevicula* and *L.littorea* were
collected from sites in northern England in the Spring and Summer and gut
parasite load was determined. Error bars are means $\pm$ 1 standard error.
Parasite load was significantly greater in Summer (*F* = 25.9; *d.f.* = 1, 96;
*p* < 0.0001) but this effect differed between species (*F* = 6.2;
*d.f.* = 1,96; *p* = 0.014) with a greater increase in parasite load in
*L.brevicula* than in *L.littorea*. Data analysis was conducted in R [@R-core]
with tidyverse packages [@tidyverse]. Post-hoc analysis was carried out with
Tukey's Honestly Significant Difference test [@tukey1949].
:::
## Summary
1. A linear model with two explanatory variable with two or more groups
each is also known as a **two-way ANOVA**.
2. We estimate the **coefficients** (also called the **parameters**) of
the model. For a two-way ANOVA with two groups in each explanatory
variable these are
- the mean of the group with the first level of
both explanatory variables $\beta_0$,
- what needs to be added to $\beta_0$ when one of the explanatory
variables goes from its first level to its second, $\beta_1$
- what needs to be added to $\beta_0$ when the other of the
explanatory variables goes from its first level to its second,
$\beta_2$
- $\beta_3$, the interaction effect: what additionally needs to
be added to $\beta_0$ + $\beta_1$ + $\beta_2$ when both of the
explanatory variables go from their first level to their second
3. We can use `lm()` to two-way ANOVA in R.
4. In the output of `lm()` the coefficients are listed in a table in the
Estimates column. The *p*-value for each coefficient is in the test
of whether it differs from zero. At the bottom of the output there
is an $F$ test of the model *overall*. The R-squared value is the
proportion of the variance in the response variable that is
explained by the model.
5. To see which of the three effects are significant we can use the
`anova()` function on our model.
6. To find out which means differ, we need a **post-hoc** test. Here
we use Tukey’s HSD applied with the `emmeans()` and `pairs()`
functions from the **`emmeans`** package. Post-hoc tests make
adjustments to the *p*-values to account for the fact that we are
doing multiple tests.
7. The assumptions of the general linear model are that the residuals
are normally distributed and have homogeneity of variance. A residual
is the difference between the predicted value and the observed value.
8. We examine a histogram of the residuals and use the Shapiro-Wilk
normality test to check the normality assumption. We check the
variance of the residuals is the same for all fitted values with
a residuals vs fitted plot.
9. When reporting the results of a test we give the significance,
direction and size of the effect. We use means and
standard errors for parametric tests like two-way ANOVA. We also
give the test statistic, the degrees of freedom (parametric) or
sample size (non-parametric) and the p-value. We annotate our
figures with the p-value, making clear which comparison it applies
to.