---
title: "Linear models and statistical modelling"
author: 'Lindsay Coome and Zoe Humphries'
---
# Lesson preamble
> ### Lesson objectives
>
> - Learn how to apply and interpret linear regression for a variety of data
> - Understand probability and the importance of presenting confidence intervals
> - Learn the importance of visualizing your data when doing any analyses or
> statistics
>
> ### Lesson outline
>
> - Basic descriptive and inferential statistics (20 min)
> - Generalized linear models (50-60 min)
> - Linear regression
> - ANOVA
> - Logistic regression
>
# Setup
Load appropriate libraries
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(car)
library(multcomp)
```
If you need to re-download the survey data
```{r, eval=FALSE}
# download.file("https://ndownloader.figshare.com/files/2292169", "survey.csv")
# survey <- read_csv("survey.csv")
```
Load the data and change all character vectors into factors
```{r}
survey <- read_csv("data/survey.csv.gz")
# mutate_if is a very handy function!
survey <- survey %>%
  mutate_if(is.character, as.factor)
```
# Statistics and probability
Theoretical models are powerful tools for explaining and understanding the
world. However, they are limited in that the real world often doesn't
perfectly fit these models. The real world is messy and noisy. We can't
blindly trust our data, as it contains inherent biases and errors: measuring,
collecting, recording, and inputting the data are all possible sources of
error and bias. We use statistics and probability to determine whether the way
we understand and conceptualize the world (as models) matches reality, even
given that error and bias.
One reason we use the statistical methods built into R, rather than writing
out the formulas and equations ourselves, is that we can focus on answering
questions without worrying about whether we are doing the math or computation
wrong. This of course depends on the type of research questions you are
interested in (e.g. for more theoretical questions, doing the math and
computation yourself is probably a goal!) and on the type of data you are
using or collecting. A lot of complexity has already been taken care of in the
available R packages and functions. For example, the function `lm()` that we
will use in this lesson uses the ordinary least squares (OLS) method, a common
method for fitting linear models. That way, you can answer your research
questions without worrying too much about the exact math involved, and instead
worry about the specifics of your field (e.g. Are you measuring the right
thing? Are you collecting the right data? Are you asking the right questions?
Is there an ecological or biological aspect you are missing in your analysis?).
# Descriptive statistics
This is simply a way to numerically describe patterns in data, very similar to
the visual descriptions we made last lecture. It summarises information and can
facilitate comparisons among and within experimental conditions. The
distribution properties we discussed (averages, spread, skew, kurtosis) fall
under this umbrella.
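R has built-in functions for the averages and spread (`mean()`, `median()`,
`sd()`, `IQR()`), but none for skew or kurtosis. As a minimal sketch, the
moment-based sample skewness can be computed by hand (packages such as `e1071`
or `moments` provide ready-made versions):
```{r}
# Moment-based sample skewness, written out by hand
skewness <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values first
  mean((x - mean(x))^3) / sd(x)^3   # third central moment / sd cubed
}
skewness(survey$hindfoot_length)    # > 0 indicates a right-skewed distribution
```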
Often, the dplyr function `summarise()` is very helpful for creating these
descriptions.
## Count occurrences
Base R's `table()` function displays counts of identical observations for
either a single data vector or a data frame. With the survey data, we can see
how many records there are for a given genus (such as _Dipodomys_). This is
the 'single data vector' case.
```{r}
# Count records for all genera
table(survey$genus)
# Count the number of records for Dipodomys
sum(survey$genus == "Dipodomys") # most straightforward
```
We can also provide two different variables from the same data set to
_cross tabulate_ the number of records that have each combination of values.
This looks for the intersection of values: how many records have the genus
"Baiomys" and are also labelled with the taxa "rodent"?
```{r}
table(survey$genus, survey$taxa) # note: first variable is set as the rows
```
### Check your understanding
1. Would you expect to see one genus with counts in two different taxa columns?
Why or why not?
2. Create a plot that shows the number of records for each genus.
3. There are some unexpected genus values in the table: what are they?
4. How can you count the number of records for each genus using `dplyr`?
```{r, include=FALSE}
## Answer 1
# No, each genus belongs to a single taxon, so its counts should all fall in
# one taxa column.
## Answer 2
ggplot(survey, aes(x=genus)) +
  geom_bar()
## Answer 3
# some of them do not correspond to actual genus names: "lizard", "sparrow",
# "rodent"
## Answer 4
survey %>% count(genus)
survey %>%
  group_by(genus) %>%
  tally()
```
## Averages
The dplyr `summarise()` function is extremely customizable. You can use it to
summarise the averages (mean, median, and mode), standard deviation,
interquartile range, and counts of a data set. It can also provide you with
the min and max values. Its documentation page lists its many outputs:
https://dplyr.tidyverse.org/reference/summarise.html
Several of these summary functions (`mean()`, `sd()`, etc. that you can apply
within your call to `summarise()`) will not work as you expect if there are
any missing values. You may need to pass `na.rm = TRUE` in order to obtain the
summary value.
```{r}
survey %>%
  summarise(mean=mean(hindfoot_length, na.rm=TRUE),
            count=n(),
            unique=n_distinct(hindfoot_length),
            median=median(hindfoot_length, na.rm=TRUE),
            min=min(hindfoot_length, na.rm=TRUE),
            max=max(hindfoot_length, na.rm=TRUE),
            sd=sd(hindfoot_length, na.rm=TRUE),
            IQR=IQR(hindfoot_length, na.rm=TRUE))
```
When you think that categorical factors (such as sex, genus) will have an impact
on your dependent variable, it's important to `group_by()` those factors so you
can compare the variable's value among each of the categories.
```{r}
# To avoid na.rm calls, let's temporarily ignore any missing weight data
weightHere <- survey %>%
  filter(!is.na(weight))
weightHere %>%
  group_by(genus) %>%
  summarise(mean=mean(weight),
            count=n(),
            sd=sd(weight))
```
### Check your understanding
1. Create a plot that visualizes the median and IQR of the summarised weightHere
data (grouped by genus).
2. Recall another way we dealt with missing values. Use this method to create a
newWeightHere table and compare the values.
3. Create another version of Q1's plot with the newWeightHere data and compare.
Explain why the plots differ. Hint: try incorporating taxa into your plot.
4. Based on Q3, what kinds of predictions can you make with this data?
```{r, include=FALSE}
## Answer 1
ggplot(weightHere, aes(x=genus, y=weight)) +
  geom_boxplot()
## Answer 2
newWeightHere <- survey %>%
  mutate(adjWeight =
           if_else(is.na(weight), median(weight, na.rm=TRUE), weight))
## Answer 3
ggplot(newWeightHere, aes(x=genus, y=adjWeight, colour=taxa)) +
  geom_boxplot()
# Only the rodents have weight values, so all of the other genera just equal
# the median.
## Answer 4
# We can't ask any weight-based questions about differences among taxa, only
# within the rodent taxa! We can ask about sex/genus/species differences in
# this category.
```
# Inferential statistics
Descriptive statistics describe the data you collected from a
**representative sample** of your **true population**. If you want to ask
questions about weight differences between _Perognathus flavus_ males and
females, you can't just stop at the point where you've generated medians for
the two groups: that is only a statement about the samples you've taken.
Inferential statistics let us generalize from the sample to the population
itself.
## Classic t-test
You've indubitably learned about t-tests in your past stats class(es).
Hopefully, the _extreme importance_ of validating the **assumptions** of a given
statistical test has been etched upon your brain in indelible ink. Because
t-tests are parametric, it's important that your data is normally distributed.
### One-sample t-test
With a one-sample t-test, you're asking whether the mean of your sample differs
significantly from a mean value that you expect it to have. You might want to
use this if you have an expected population mean value from the literature. You
may have even amassed your own dataset that you think is representative of a
population's true mean & you want to compare a newly collected sample.
With our survey data, we might want to test whether the 2002 mean weight of
_Perognathus flavus_ differs from what we believe the true population mean to
be. Our null hypothesis is that the 2002 mean does not differ significantly from
the population mean. Our alternative hypothesis is that it is significantly
different.
```{r}
# Filter the survey data for only the Perognathus flavus data
flavusOnly <- survey %>%
  filter(species=="flavus") %>%
  # create an adjusted weight variable that holds the calculated median
  # weight if the unadjusted weight is NA, and the original weight otherwise
  mutate(adjWeight =
           if_else(is.na(weight), median(weight, na.rm=TRUE), weight))
# verify that you didn't get a different genus with the same species name
table(flavusOnly$genus) # remember you can use table() to view unique values
# Test whether the weight data is normally distributed
# quick visual check using a histogram
ggplot(flavusOnly, aes(x=adjWeight)) +
  geom_histogram()
# Shapiro-Wilk test: does the distribution differ significantly from normal?
shapiro.test(flavusOnly$adjWeight)
shapiro.test(flavusOnly$weight)
```
With a p-value << 0.05, we can reject the null hypothesis of the Shapiro-Wilk
test, so our weight data is not normally distributed. We're going to ignore
that for the purposes of instruction, but _do not do this at home_!!
```{r}
# Calculate the estimated true population mean (from the years before 2002)
flavusOnly %>%
  filter(year<2002) %>%
  summarise(mean=mean(adjWeight)) # evaluates to 7.9 (sig figs!)
# Extract the measurements taken during 2002 and put in a new data frame
newest <- flavusOnly %>%
  filter(year==2002)
# Test against the estimated population mean
t.test(newest$adjWeight, mu=7.9) # note: mu is set with a single =, not ==
```
With p = 0.075 > 0.05, we _fail to reject_ the null hypothesis. It is
important to report your results with appropriate notation so your readers can
understand how much credence to lend to your conclusions.
The weight of _Perognathus flavus_ individuals sampled in 2002 does not differ
significantly from the general population (one-sample t~17~ = -1.89,
p = 0.075).
#### Check your understanding
1. What are the assumptions of a one-sample t-test? Recall from past classes or
ask the almighty Google.
2. How can you extract the year==2002 weight values and perform a t.test using
one line of code?
3. Test whether the flavus hindfoot_length data from 2001 differs from your
calculated population mean (be sure to test for normality!). Report your results
using appropriate notation.
```{r, include=FALSE}
## Answer 1
# The observations are independent, the variable is continuous, and the data
# are (approximately) normally distributed.
## Answer 2
t.test( subset(flavusOnly, year==2002)$adjWeight, mu=7.9 )
t.test( flavusOnly$adjWeight[flavusOnly$year==2002], mu=7.9 )
## Answer 3
flavusOnly <- survey %>%
  filter(species=="flavus") %>%
  mutate(foot = if_else(is.na(hindfoot_length),
                        median(hindfoot_length, na.rm=TRUE),
                        hindfoot_length))
shapiro.test(flavusOnly$foot)
shapiro.test(flavusOnly$hindfoot_length)
# reject the null - it's not normally distributed
estMean <- mean( filter(flavusOnly, year<2002)$foot )
t.test( subset(flavusOnly, year==2002)$foot, mu=estMean )
# Report whether the 2002 hind foot length differs significantly from the
# general population, quoting the t statistic and p-value from the output in
# the same notation as above.
```
### Unpaired two-sample t-test
Compare two _independent_ sample means against each other. With our survey data,
we might want to test whether males have a significantly different mean weight
than females.
There are two assumptions we have to test now (once we've determined that the
two samples we're looking at are independent):
1. The dependent variable must be normally distributed in both samples
2. The variance of the dependent variable must be approximately equal between
the two samples
```{r}
# shapiro.test() only handles 3 to 5000 observations, so we check normality
# within the flavus subset rather than the full survey
shapiro.test( subset(flavusOnly, sex=="M")$weight ) # test male weight
shapiro.test( subset(flavusOnly, sex=="F")$weight ) # test female weight
# F-test to test for homogeneity in variances
var.test(weight ~ sex, data=survey)
```
Our data fails both of these tests, but we're going to pretend everything is
fine... again, do NOT do this at home!! (One saving grace: `t.test()` defaults
to Welch's t-test, which does not assume equal variances; pass
`var.equal=TRUE` if you want the classic Student's t-test.)
```{r}
t.test(weight~sex, data=survey)
```
#### Check your understanding
1. Would you want to compare the sample rodent mean weight against the mean
sample weight of _Perognathus flavus_? What about rodent mean weight against
rodent mean hindfoot_length?
2. Repeat this t-test between males and females, but compare mean hind foot
length instead of mean weight (be sure to test for normality!). Report your
results using appropriate notation.
```{r, include=FALSE}
## Answer 1
# NO! Neither pair is independent: flavus is a subset of the rodents, and
# weight and hindfoot_length are measured on the same animals.
## Answer 2
shapiro.test( subset(flavusOnly, sex=="M")$hindfoot_length )
shapiro.test( subset(flavusOnly, sex=="F")$hindfoot_length )
var.test(hindfoot_length ~ sex, data=survey)
t.test(hindfoot_length ~ sex, data=survey)
```
### Paired sample t-test
Compare the mean values of a dependent variable between two categories that were
paired _by design_. You can't do this for males and females in our survey data
set because the measurements weren't taken at the same time, in the same place,
from one male and one female that were ostensibly similar in all other respects.
However, this test would be appropriate if you wanted to compare the mean
reaction time of the right and left hands from a sample of undergrads.
```{r}
#t.test(object1, object2, paired=TRUE)
```
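As a sketch with made-up numbers (the survey has no paired measurements, so
the reaction times below are simulated purely for illustration):
```{r}
# Simulated reaction times (ms) for the left and right hands of 30 undergrads
set.seed(1)
left <- rnorm(30, mean = 310, sd = 25)
right <- left - rnorm(30, mean = 8, sd = 10) # right hand slightly faster
# Paired test: each left value is matched to the right value from the same
# (simulated) person
t.test(left, right, paired = TRUE)
```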
## Correlations
Bivariate correlation ($r$) gives us the strength and direction of the linear
relationship between two variables. Say we find that the correlation between
hindfoot length and weight is $r = .609$. This means that $.609^2 = .371$ of
the variance in $y$ (hindfoot length) is common to the variance in $x$
(weight). Alternatively, we can say that these two variables share 37.1% of
their variance in common. In the case of the Pearson correlation, this will be
true whether we consider weight or hindfoot length to be the dependent
variable.
```{r}
cor.test(survey$weight, survey$hindfoot_length)
```
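The shared-variance figure quoted above is just the squared correlation, which
you can compute directly (`use = "complete.obs"` drops the missing values
first):
```{r}
# r and r^2; r^2 is the proportion of variance the two variables share
r <- cor(survey$weight, survey$hindfoot_length, use = "complete.obs")
r^2
```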
## General linear model
A version of the GLM that uses continuous $y$ values is called linear
regression, which is what I'm going to focus on. The formula for linear
regression (or the GLM in general) is:
$$ Y = \alpha + X\beta + \varepsilon $$
Or, a simplified, alphabetized version is:
$$ y = a + Xb + e $$
Where $a$ is the intercept (where the line crosses the vertical, i.e. y, axis of
the graph), $X$ is the predictor variable, $b$ is the slope/coefficient, and $e$
is the error term.
We construct these regression models for several reasons. Sometimes we want
to infer how some variables ($x$) cause or influence another variable ($y$).
Or maybe we know that $y$ has a lot of measurement error or is difficult to
measure, so we want to derive a formula in order to predict $y$ from more
accurately measured variables. In R we can run linear regression using either
`lm()` or `glm()`. `lm()` assumes a Gaussian (normal) distribution of your
data. When you use `glm()`, on the other hand, you can specify the data's
distribution with the `family =` parameter. This allows you to construct a
linear model when your data don't follow a Gaussian distribution.
Regression requires a predictor (independent) variable and a predicted
(dependent) variable. Changing which variable is the predictor/predicted gives
you a different regression line with a different regression equation. The
function we are going to use today, `lm()`, uses the OLS method by default, as
mentioned above. The least squares line is the line chosen by `lm()` to fit
your data. The goal of OLS is to choose the line that minimizes prediction
error; with any other line, the errors of prediction would be greater. Note
that the best-fitting model given your data (i.e. OLS) does *not* equal the
best model, period. We must pay attention to fit statistics like $R^2$, the
amount (%) of variance in the outcome explained by our predictor (i.e., our
model), to determine how well our model is doing.
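A quick way to see the least squares line against the raw data (a sketch;
`geom_smooth(method = "lm")` fits the same model that `lm()` would):
```{r, warning=FALSE}
# Scatterplot with the OLS regression line; the shaded band is the
# confidence interval around the fitted line
ggplot(survey, aes(x = hindfoot_length, y = weight)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")
```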
So how do we use these functions? In R's formula syntax, the dependent
variable is separated from its predictors by a tilde (`~`).
The formula to regress $y$ on $x$ is `y ~ x`:
```{r}
result <- lm(weight~sex, data=survey)
summary(result)
```
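To map the output back onto the $y = a + Xb + e$ notation above, you can pull
out the estimated intercept and coefficient directly, along with (a habit
worth building) their confidence intervals:
```{r}
coef(result)    # a (intercept) and b (coefficient) estimates
confint(result) # 95% confidence intervals for each
```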
```{r}
#multiple predictors with interaction terms
result <- lm(weight~sex*hindfoot_length, data=survey)
summary(result)
```
```{r}
#use + for main effect of predictor only
result <- lm(weight~sex + hindfoot_length, data=survey)
summary(result)
```
## Analysis of Variance
ANOVA is simply a different way of evaluating explained variance in linear
modelling; ANOVA is a special case of the linear model. You must always wrap
the `anova()` function around an `lm()` call.
```{r}
anova(lm(weight~sex*genus, data=survey))
```
However, base R's `anova()` uses Type I (sequential) sums of squares by
default. `Anova()` from the `car` package can give you Type II or Type III
sums of squares. This matters when you have more than one predictor (e.g. taxa
x sex). Asking for Type III sums of squares will match what you get from SPSS
or SAS.
```{r}
result <- lm(weight~sex*genus, data=survey)
Anova(result, type=3)
```
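One caveat: Type III tests are only meaningful when the factors use
sum-to-zero contrasts, which is what SPSS and SAS use internally. R's default
treatment contrasts will give different (misleading) Type III results, so a
refit along these lines is needed first:
```{r}
# Refit with sum-to-zero contrasts so the Type III tests match SPSS/SAS
result <- lm(weight~sex*genus, data=survey,
             contrasts = list(sex = "contr.sum", genus = "contr.sum"))
Anova(result, type = 3)
```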
## Post hoc tests
R comes with a default pairwise t-test (`pairwise.t.test()`). However, the
`multcomp` package offers more post hoc options than base R:
```{r}
result <- lm(weight~genus, data=survey)
postHocs <- glht(result, linfct = mcp(genus = "Tukey"))
# the summary function gives the results of the multiple comparisons
summary(postHocs)
```
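For comparison, the base R option mentioned above looks like this (the choice
of `p.adjust.method` is ours; other methods such as "holm" are available):
```{r}
# Pairwise t-tests with Bonferroni-adjusted p-values
pairwise.t.test(survey$weight, survey$genus, p.adjust.method = "bonferroni")
```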
## Logistic regression
The assumption of a normally distributed dependent variable is violated in
logistic regression, where we want to predict a binary outcome. So, you must
use the `glm()` function with `family=binomial` rather than `lm()`. We only
have one binary variable in our dataset: sex.
```{r}
summary(glm(sex ~ weight * hindfoot_length, data = survey, family = binomial))
```
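The coefficients are on the log-odds scale; exponentiating them gives odds
ratios, which are usually easier to interpret (a quick sketch; `logitModel` is
just our name for the refitted model):
```{r}
# exp() of a logistic coefficient gives the multiplicative change in the odds
# of the outcome per one-unit increase in that predictor
logitModel <- glm(sex ~ weight * hindfoot_length, data = survey,
                  family = binomial)
exp(coef(logitModel))
```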