# Association: Correlation and Contingency
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Overview
These methods are used when there are no explanatory variables: we ask whether
two variables are associated, not whether one explains the other.
### Correlation
- two continuous variables
- neither is an explanatory variable, i.e., there is not a causal relationship
between the two variables
- you want to know if there is a correlation between the two variables
- the correlation coefficient is a measure of the strength and direction of
the linear association between two variables and ranges from -1 to 1
- A correlation coefficient of 1 indicates a perfect positive association
  between the variables meaning the highest scores in one variable are
  associated with the *highest* scores in the other.
- A correlation coefficient of -1 indicates a perfect negative association
between the variables meaning the highest scores in one variable are
associated with the *lowest* scores in the other.
TODO Add a figure of the different correlation coefficients
- we estimate the population correlation coefficient, $\rho$ (pronounced rho),
  with the sample correlation coefficient, $r$, and test whether it
  is significantly different from zero because a correlation coefficient of
  0 means no association.
- Pearson's correlation coefficient is a parametric measure which
  assumes the variables are normally distributed and that any correlation
  is linear.
- Spearman's rank correlation coefficient is a non-parametric measure
  which does not assume the variables are normally distributed.
- we use `cor.test()` in R for both, as sketched below.
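
A minimal sketch of the call, assuming two numeric columns `x` and `y` in a
data frame `dat` (hypothetical names):

```{r}
#| eval: false
# Pearson's correlation (the default method)
cor.test(data = dat, ~ x + y)

# Spearman's rank correlation (non-parametric alternative)
cor.test(data = dat, ~ x + y, method = "spearman")
```

We will see `cor.test()` used on real data later in this chapter.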
### Contingency Chi-squared
- two categorical variables
- neither is an explanatory variable, i.e., there is not a causal relationship
between the two variables
- we count the number of observations in each category of each variable
- you want to know if there is an association between the two variables
- another way of describing this is that we test whether the proportion of
  observations falling into each category of one variable is the same for
  each category of the other variable.
- we use a chi-squared test to test whether the observed counts are significantly
different from the expected counts if there was no association between the
variables.
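
The expected counts come from the row and column totals. A minimal sketch with
made-up counts (the values in `obs` are purely illustrative):

```{r}
#| eval: false
# made-up 2 x 2 table of counts, for illustration only
obs <- matrix(c(10, 20,
                30, 40),
              nrow = 2, byrow = TRUE)

# expected count for a cell = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# chisq.test() calculates the same expected counts internally
chisq.test(obs)$expected
```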
### Reporting
1. the significance of the effect - whether the association is significantly
different from zero
2. the direction of effect
    - for correlation, whether $r$ is positive or negative
    - for contingency, which categories have higher counts than expected
3. the magnitude of effect
We do not put a line of best fit on a scatter plot accompanying a correlation
because such a line implies a causal relationship.
We will explore all of these ideas with examples.
## 🎬 Your turn!
If you want to code along you will need to start a new [RStudio
project](workflow_rstudio.html#rstudio-projects), add a `data-raw`
folder and open a new script. You will also need to load the
**`tidyverse`** package [@tidyverse].
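
If you are coding along, loading the package is a single call:

```{r}
#| eval: false
library(tidyverse)
```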
## Correlation
High quality images of the internal structure of wheat seeds were taken with a
soft X-ray technique. Seven measurements were determined from the images:
- Area
- Perimeter
- Compactness
- Kernel length
- Kernel width
- Asymmetry coefficient
- Length of kernel groove
Researchers were interested in the correlation between compactness and
kernel width.
The data are in [seeds-dataset.xlsx](data-raw/seeds-dataset.xlsx).
### Import and explore
The data are in an Excel file so we will need the **`readxl`** [@readxl] package.
Load the package:
```{r}
library(readxl)
```
Find out the names of the sheets in the excel file:
```{r}
excel_sheets("data-raw/seeds-dataset.xlsx")
```
There is only one sheet called `seeds_dataset`.
Import the data:
```{r}
seeds <- read_excel("data-raw/seeds-dataset.xlsx",
sheet = "seeds_dataset")
```
Note that we could have omitted the `sheet` argument because there is only one
sheet in the Excel file. However, it is good practice to be explicit.
```{r}
#| echo: false
knitr::kable(seeds) |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```
These data are in a useful format for correlation analysis and plotting.
Each row is a different seed and the columns are the variables we are interested in.
This means that the first values in each column are related - they are all from
the same seed.
In the first instance, it is sensible to create a rough plot of our data
(See @fig-seeds-rough). Plotting data early helps us in multiple ways:
- it helps identify whether there are extreme values
- it allows us to see if the association is roughly linear
- it tells us whether any association is positive or negative
Scatter plots (`geom_point()`) are a good choice for exploratory
plotting with data like these.
```{r}
#| label: fig-seeds-rough
#| fig-cap: "A default scatter plot of compactness and kernel width in seeds demonstrates an approximately linear positive association between the variables."
ggplot(data = seeds,
aes(x = compactness, y = kernel_width)) +
geom_point()
```
The figure suggests that there is a positive correlation between `compactness` and
`kernel_width` and that the correlation is roughly linear.
### Check assumptions
We can see from @fig-seeds-rough that the relationship between `compactness` and
`kernel_width` is roughly linear. This is a good sign for using Pearson's
correlation coefficient.
Our next check is to use common sense: both variables are continuous and
we would expect them to be normally distributed. We can then plot histograms
to examine the distributions (See @fig-hist).
```{r}
#| label: fig-hist
#| layout-ncol: 2
#| fig-cap: "The distributions of the two variables are slightly skewed but do not seem too different from a normal distribution. We will continue with the Pearson's correlation coefficient."
#| fig-subcap:
#| - "Compactness"
#| - "Kernel width"
ggplot(data = seeds, aes(x = compactness)) +
geom_histogram(bins = 12)
ggplot(data = seeds, aes(x = kernel_width)) +
geom_histogram(bins = 12)
```
The distributions of the two variables are slightly skewed but do not seem too
different from a normal distribution.
Finally, we can use the Shapiro-Wilk test to test for normality.
```{r}
shapiro.test(seeds$compactness)
```
```{r}
shapiro.test(seeds$kernel_width)
```
The *p*-values are greater than 0.05 so these tests of the normality
assumption are not significant. Note that "not significant" means not
significantly different from a normal distribution. It does not mean
definitely normally distributed.
It is not unreasonable to continue with the Pearson's correlation coefficient.
However, later we will also use Spearman's rank correlation coefficient, a
non-parametric method which has fewer assumptions. Spearman's rank correlation
is a more conservative approach.
### Do a Pearson's correlation test with `cor.test()`
We can carry out a Pearson's correlation test with `cor.test()` like this:
```{r}
cor.test(data = seeds, ~ compactness + kernel_width)
```
Neither variable is a response, so the formula has no left-hand side: both
variables go on the right of the `~`. Pearson's is the default method so we do
not need to specify it.
What do all these results mean?
The last line gives the correlation coefficient, $r$, that has been estimated
from the data. Here, $r$ = `r cor.test(data = seeds, ~ compactness + kernel_width)$estimate |> round(2)`
which is a moderate positive correlation.
The fifth line tells you what test has been done:
`alternative hypothesis: true correlation is not equal to 0`. That is, $H_0$ is
that $\rho = 0$.
The fourth line gives the test result:
`t = 7.3738, df = 68, p-value = 2.998e-10`
The $p$-value is very much less than 0.05 so we can reject the null hypothesis
and conclude there is a significant positive correlation between `compactness`
and `kernel_width`.
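
For reporting, it can be convenient to store the result and extract the values
rather than retyping them. A minimal sketch (the object name `cor_result` is
just a choice):

```{r}
#| eval: false
# store the test result so values can be extracted for reporting
cor_result <- cor.test(data = seeds, ~ compactness + kernel_width)

cor_result$estimate   # the correlation coefficient, r
cor_result$statistic  # the t value
cor_result$parameter  # the degrees of freedom
cor_result$p.value    # the p-value
```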
### Report
There was a significant positive correlation ($r$ = `r cor.test(data = seeds, ~ compactness + kernel_width)$estimate |> round(2)`) between compactness and kernel width (Pearson's: *t* = 7.374; *df* = 68; *p*-value < 0.0001). See @fig-correlation-pears.
::: {#fig-correlation-pears}
```{r}
#| code-fold: true
ggplot(data = seeds,
aes(x = compactness, y = kernel_width)) +
geom_point() +
scale_y_continuous(name = "Diameter (mm)") +
scale_x_continuous(name = "Compactness") +
annotate("text", x = 0.85, y = 3.6,
label = expression(italic(r)~"= 0.67; "~italic(p)~"< 0.001")) +
theme_classic()
```
**Correlation between compactness and kernel width of wheat seeds**. High
quality images of the internal structure of wheat seeds were taken with a
soft X-ray technique and the compactness and kernel width of the seeds were
determined. There was a significant positive correlation ($r$ = `r cor.test(data = seeds, ~ compactness + kernel_width)$estimate |> round(2)`) between compactness
and kernel width (Pearson's: *t* = 7.374; *df* = 68; *p*-value < 0.0001). Note:
axes do not start at 0. Data analysis was conducted in R [@R-core] with
tidyverse packages [@tidyverse].
:::
### Spearman's rank correlation coefficient
TODO
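
A minimal sketch of running Spearman's rank correlation on the same data, using
the `method` argument of `cor.test()`:

```{r}
#| eval: false
# Spearman's correlates the ranks of the values rather than the values
# themselves, so normality is not assumed
cor.test(data = seeds, ~ compactness + kernel_width, method = "spearman")
```

If there are tied values, R will warn that an exact *p*-value cannot be
computed and will use an approximation instead.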
## Contingency Chi-squared test
Researchers were interested in whether different pig breeds had the same
food preferences. They offered individuals of three breeds, Welsh, Tamworth
and Essex, a choice of three foods: cabbage, sugar beet and swede, and recorded
the number of individuals that chose each food. The data are shown in @tbl-food-pref.
```{r}
#| echo: false
# create the data
food_pref <- matrix(c(11, 19, 22,
21, 16, 8,
7, 12, 11),
nrow = 3,
byrow = TRUE)
# make a list object to hold two vectors
# in a list the vectors can be of different lengths
vars <- list(food = c("cabbage",
"sugarbeet",
"swede"),
breed = c("welsh",
"tamworth",
"essex"))
dimnames(food_pref) <- vars
```
```{r}
#| echo: false
#| label: tbl-food-pref
knitr::kable(food_pref,
caption = "Food preferences of three pig breeds") |>
kableExtra::kable_styling()
```
We don't know what proportion of each food is expected to be preferred but we
do expect it to be the same for each breed if there is no association between
breed and food preference. The null hypothesis is that the proportion of foods
taken by each breed is the same.
For a contingency chi-squared test, the inbuilt `chisq.test()` function can be
used but we need to structure our data as a 3 x 3 table. The `matrix()` function
is useful here and we can label the rows and columns to help us interpret the
results.
Put the data into a matrix:
```{r}
# create the data
food_pref <- matrix(c(11, 19, 22,
21, 16, 8,
7, 12, 11),
nrow = 3,
byrow = TRUE)
food_pref
```
The `byrow` and `nrow` arguments allow us to lay out the data in the matrix as
we need.
To name the rows and columns we can use the `dimnames()` function. We need
to create a "list" object to hold the names of the rows and columns and then
assign this to the matrix object. The names of the rows and columns are called
the "dimension names" in a matrix.
Make a list for the two vectors of names:
```{r}
# make a list object to hold the two vectors of names
vars <- list(food = c("cabbage",
"sugarbeet",
"swede"),
breed = c("welsh",
"tamworth",
"essex"))
```
The vectors can be of different lengths in a list which would be important if
we had four breeds and only two foods, for example.
Now assign the list to the dimension names in the matrix:
```{r}
dimnames(food_pref) <- vars
food_pref
```
The data are now in a form that can be used in the `chisq.test()` function:
```{r}
chisq.test(food_pref)
```
The test is significant since the *p*-value is less than 0.05. We have evidence
of a preference for particular foods by different breeds. But in what way? We need to know the "direction of the effect", *i.e.*, who likes what?
The object returned by `chisq.test()` contains the residuals. These are the
standardised differences between the observed and expected values. The expected
values are the values that would be expected if there was no association
between the rows and columns.
```{r}
chisq.test(food_pref)$residuals
```
Where the residuals are positive, the observed value is greater than the
expected value and where they are negative, the observed value is less than the
expected value. Our results show the Welsh pigs much prefer sugarbeet and
strongly dislike cabbage, the Essex pigs prefer cabbage and dislike sugarbeet,
and the Tamworth pigs slightly prefer swede but have less strong likes and
dislikes.
The degrees of freedom are (rows - 1) * (cols - 1) = 2 * 2 = 4.
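
The expected counts that the residuals are based on are also stored in the
result, as is the degrees of freedom. A minimal sketch of viewing them:

```{r}
#| eval: false
# expected counts under the null hypothesis of no association
chisq.test(food_pref)$expected

# degrees of freedom used by the test
chisq.test(food_pref)$parameter
```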
### Report
Different pig breeds showed a significant preference for the different
food types ($\chi^2$ = 10.64; *df* = 4; *p* = 0.031) with Essex much preferring
cabbage and disliking sugarbeet, Welsh showing a strong preference for
sugarbeet and a dislike of cabbage and Tamworth showing no clear preference.
## Summary
TODO