# Two-sample tests
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Overview
In the last chapter we learned about single linear regression, a
technique used when the explanatory variable is continuous. We now turn
our attention to cases where our explanatory variable is categorical and
has two groups. For example, we might want to know if there is a
difference in mass between two subspecies of chaffinch, or if marks in
two subjects are the same.
We use `lm()` or `wilcox.test()` to carry out a two-sample test
depending on whether the assumptions of `lm()` are met. The general
linear models applied with `lm()` are based on the normal distribution
and are known as parametric tests because they use the parameters of the
normal distribution (the mean and standard deviation) to determine if an
effect is significant. Null hypotheses are about a mean or difference
between means. The assumptions need to be met for the *p*-values
generated to be accurate.
If the assumptions are not met, we can use alternatives known as
non-parametric tests. Non-parametric tests are based on the ranks of
values rather than the actual values themselves. Null hypotheses are
about the mean rank rather than the mean. These tests have fewer
assumptions and can be used in more situations but the downside is they
tend to be less powerful. This means they are less able to detect a
difference where one exists.
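To see what "based on the ranks" means, here is a tiny illustration with made-up numbers (not one of this chapter's datasets): only the ordering of the values matters, not their size.

```{r}
# 200 is the largest value so it gets rank 4 - exactly the rank it
# would get if it were 20. Extreme values have limited influence.
rank(c(12, 15, 11, 200))
```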
### Independent samples and paired samples
An important consideration in conducting tests is whether the values in
one group are independent of the values in another group.
Non-independence occurs when the two measures are linked in some way.
The link could be that they are from the same individual, the same time
or the same location. For example, if we want to evaluate a treatment
for high blood pressure we might measure blood pressure before and after
the treatment on the same individuals. This would mean the before and
after measures were not independent. If pairs of observations in
different groups have something in common that makes them more similar to
each other than to other observations, then those observations are not
independent. We use a different testing approach for independent and
non-independent samples.
### *T*-tests
A linear model with one explanatory variable with two groups is also
known as a **two-sample *t*-test** when the samples are independent and
as a **paired-samples *t*-test** when they are not. R does have a
`t.test()` function which allows you to fit a linear model with just two
groups. However, here we teach you to use and interpret the `lm()`
function because it is more generalisable. You can use `lm()` when you
have three or more groups or additional explanatory variables. The
output of `lm()` is also in the same form as many other statistical
functions in R. This means what you learn in performing *t*-tests with
`lm()` will help you learn other methods more easily. However, it is
definitely not wrong to use `t.test()` rather than `lm()` for two-group
situations - the procedures are identical and the *p*-values will be the
same.
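If you want to see this equivalence for yourself, here is a minimal sketch using simulated data (not one of this chapter's datasets). Note `var.equal = TRUE`, which makes `t.test()` use the same equal-variance assumption as `lm()`:

```{r}
# two made-up groups of 10 values each
set.seed(1)
sim <- data.frame(group = rep(c("A", "B"), each = 10),
                  y = c(rnorm(10, mean = 10), rnorm(10, mean = 12)))

# the t statistic and p-value from t.test() ...
t.test(y ~ group, data = sim, var.equal = TRUE)

# ... match those for the groupB coefficient from lm()
summary(lm(y ~ group, data = sim))
```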
### Model assumptions
The assumptions of the general linear model are that the residuals are
normally distributed and have homogeneity of variance. A residual is the
difference between the predicted value and the observed value.
If we have a continuous response and a categorical explanatory variable
with two groups, we usually apply the general linear model with `lm()`
and *then* check the assumptions. However, we can sometimes tell before
fitting the model that a non-parametric test would be more appropriate:
- Use common sense - consider whether you would expect the response to
  be continuous (or nearly continuous, see [Ideas about data: Theory
  and practice](ideas_about_data.html#theory-and-practice)).
- For a continuous response, we expect decimal places and few repeated
  values.
To examine the assumptions after fitting the linear model, we plot the
residuals and test them against the normal distribution in the same way
as we did for single linear regression.
### Reporting
In reporting the result of a two-sample test we give:
1. the significance of effect - whether there is a difference between
the groups
    - parametric: whether there is a difference between the group
      means
    - non-parametric: whether there is a difference between the group
      medians
2. the direction of effect - which of the means/medians is greater
3. the magnitude of effect - how big the difference between the
means/medians is
    - parametric: the means and standard errors for each group or the
      mean difference for paired samples
    - non-parametric: the medians for each group or the median
      difference for paired samples
Figures should reflect what you have said in the statements. Ideally
they should show both the raw data and the statistical model:
- parametric: means and standard errors
- non-parametric: boxplots with medians and interquartile range
We will explore all of these ideas with some examples.
## 🎬 Your turn!
If you want to code along you will need to start a new [RStudio
project](workflow_rstudio.html#rstudio-projects), add a `data-raw`
folder and open a new script. You will also need to load the
**`tidyverse`** package [@tidyverse].
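Loading the package at the top of your script looks like this:

```{r}
#| eval: false
library(tidyverse)
```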
## Two independent samples, parametric
A number of [subspecies of the common chaffinch, *Fringilla coelebs*](https://en.wikipedia.org/wiki/Common_chaffinch), have been described based principally on differences in the pattern and colour of the adult male plumage [@suárez2009]. Two groups of these subspecies are:
- the "coelebs group" (@fig-coelebs) that occurs in Europe and Asia
- the "canariensis group" (@fig-canariensis) that occurs on the Canary Islands.
::: {#fig-subspecies layout-ncol="2"}
![*F. c.
coelebs*](images/512px-Chaffinch_(Fringilla_coelebs).jpg){#fig-coelebs}
![*F. c.
palmae*](images/512px-Fringilla_coelebs_palmae_-_Los_Tilos.jpg){#fig-canariensis}
Adult male *Fringilla coelebs* of the coelebs group on the left (Andreas
Trepte, CC BY-SA 2.5 <https://creativecommons.org/licenses/by-sa/2.5>,
via Wikimedia Commons) and of the canariensis group on the right (H.
Zell, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via
Wikimedia Commons).
:::
The data in [chaff.txt](data-raw/chaff.txt) give the masses of twenty
individuals from each subspecies. We want to know if the subspecies
differ in mass. These groups are independent - there is no link between
values in one group and any value in the other group. In this scenario
our null hypothesis, $H_0$, is that there is no difference between the
two subspecies in mass or that subspecies has no effect on mass.
This is written as: $H_0: \beta_1 = 0$.
### Import and explore
Import the data:
```{r}
chaff <- read_table("data-raw/chaff.txt")
```
```{r}
#| echo: false
knitr::kable(chaff) |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```
These data are in tidy format [@Wickham2014-nl] - all the mass values
are in one column with another column indicating the subspecies. This
means they are well formatted for analysis and plotting.
In the first instance, it is always sensible to create a rough plot of
our data. This is to give us an overview and help identify if there are
any issues like missing or extreme values. It also gives us an idea of
what to expect from the analysis, making it easier to notice if we make
a mistake in applying that analysis.
Violin plots (`geom_violin()`), box plots (`geom_boxplot()`, see
@fig-chaff-rough) or scatter plots (`geom_point()`) all make good
choices for exploratory plotting and it does not matter which of these
you choose.
```{r}
#| label: fig-chaff-rough
#| fig-cap: "The mass of two subspecies of chaffinch. A boxplot is a useful way to get an overview of the data and helps us identify any issues such as missing or extreme values. It also tells us what to expect from the analysis."
ggplot(data = chaff,
aes(x = subspecies, y = mass)) +
geom_boxplot()
```
R will order the groups alphabetically by default.
The figure suggests that the canariensis group is heavier than the
coelebs group.
Summarising the data for each subspecies group is the next sensible
step. The most useful summary statistics are the means, standard
deviations, sample sizes and standard errors. I recommend the
`group_by()` and `summarise()` approach:
```{r}
chaff_summary <- chaff |>
group_by(subspecies) |>
summarise(mean = mean(mass),
std = sd(mass),
n = length(mass),
se = std/sqrt(n))
```
We have saved the results to `chaff_summary` so that we can use the means
and standard errors in our plot later.
```{r}
chaff_summary
```
### Apply `lm()`
We can create a two-sample model like this:
```{r}
mod <- lm(data = chaff, mass ~ subspecies)
```
And examine the model with:
```{r}
summary(mod)
```
The Estimates in the Coefficients table give:
- `(Intercept)` known as $\beta_0$. The mean of the canariensis group
(@fig-two-sample-lm-model). Just as the intercept is the value of
the *y* (the response) when the value of *x* (the explanatory) is
zero in a simple linear regression, this is the value of `mass` when
the `subspecies` is at its first level. The order of the levels is
alphabetical by default.
- `subspeciescoelebs`, known as $\beta_1$, is what needs to be added
to the mean of the canariensis group to get the mean of the coelebs
group (@fig-two-sample-lm-model). Just as the slope is the amount of
*y* that needs to be added for each unit of *x* in a simple linear
regression, this is the amount of `mass` that needs to be added when
the `subspecies` goes from its first level to its second level
(*i.e.*, one unit). The `subspeciescoelebs` estimate is negative so
the coelebs group mean is lower than the canariensis group mean.
The *p*-values on each line are tests of whether that coefficient is
different from zero. Thus it is:
`subspeciescoelebs -1.7950 0.6781 -2.647 0.0118 *`
that tells us the difference between the means is significant.
The *F* value and *p*-value in the last line are a test of whether the
model as a whole explains a significant amount of variation in the
response variable. For a two-sample test, just like a regression, this
is exactly equivalent to the test of the slope against zero and the two
*p*-values will be the same.
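If you would like to verify this equivalence, `anova()` on the same model reports the *F*-statistic for the model as a whole; its *p*-value matches the one given for the `subspeciescoelebs` coefficient.

```{r}
# the overall F-test for a model with a single two-level explanatory
# variable is the same test as that of the group coefficient
anova(mod)
```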
```{r}
#| echo: false
#| label: fig-two-sample-lm-model
#| fig-cap: "In a two-sample linear model, the first estimate is the intercept which is the mean of the first group. The second estimate is the 'slope' which is what has to added to the intercept to get the second group mean. Note that y axis starts at 15 to create more space for the annotations."
ggplot() +
geom_errorbar(data = chaff_summary,
aes(x = subspecies, ymin = mean, ymax = mean),
colour = pal3[1], linewidth = 1,
width = 1) +
scale_x_discrete(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0),
limits = c(15, 30),
name = "Mass (g)") +
geom_segment(aes(x = 0.8,
xend = -Inf,
y = mod$coefficients[1] + 3,
yend = mod$coefficients[1]),
colour = pal3[2]) +
annotate("text",
x = 1.1,
y = mod$coefficients[1] + 4,
label = glue::glue("Intercept (β0) is mean of\n{ chaff_summary$subspecies[1] }: { mod$coefficients[1] |> round(2) }"),
colour = pal3[2],
size = 4) +
geom_segment(aes(x = 1.5,
xend = 1.5,
y = mod$coefficients[1],
yend = mod$coefficients[1] + mod$coefficients[2]),
colour = pal3[3],
arrow = arrow(length = unit(0.03, "npc"),
ends = "both")) +
geom_segment(aes(x = 1.7,
xend = 1.5,
y = mod$coefficients[1] + 3,
yend = mod$coefficients[1] - 1),
colour = pal3[2]) +
annotate("text",
x = 2,
y = mod$coefficients[1] + 5,
label = glue::glue("subspeciescoelebs (β1) is the\ndifference between mean of\n{ chaff_summary$subspecies[1] } and { chaff_summary$subspecies[2] }: { mod$coefficients[2] |> round(2) }"),
colour = pal3[2],
size = 4) +
theme_classic()
```
### Check assumptions
Check the assumptions: All general linear models assume the "residuals"
are normally distributed and have "homogeneity" of variance.
Our first check of these assumptions is to use common sense: mass is a
continuous variable and we would expect it to be normally distributed
thus we would also expect the residuals to be normally distributed.
We then plot the residuals. The `plot()` function can be used to plot
the residuals against the fitted values (See @fig-lm-plot1). This is a
good way to check for homogeneity of variance.
```{r}
#| label: fig-lm-plot1
#| fig-cap: "A plot of the residuals against the fitted values shows the points are distributed similarly in each group. This is a good sign for the assumption of homogeneity of variance."
plot(mod, which = 1)
```
We can also use a histogram to check for normality (See @fig-lm-plot2).
```{r}
#| label: fig-lm-plot2
#| fig-cap: "A histogram of residuals is symetrical and seems consistent with a normal distribution. This is a good sign for the assumption of normally distributed residuals."
ggplot(mapping = aes(x = mod$residuals)) +
geom_histogram(bins = 10)
```
Finally, we can use the Shapiro-Wilk test to test for normality.
```{r}
shapiro.test(mod$residuals)
```
The *p*-value is greater than 0.05, so this test of the normality
assumption is not significant.
Taken together, these results suggest that the assumptions of normality
and homogeneity of variance are not violated.
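A normal quantile-quantile (Q-Q) plot is another common way to check the normality assumption: if the residuals are normally distributed, the points lie close to the straight line.

```{r}
# the second diagnostic plot for an lm object is a Q-Q plot of the
# residuals against normal quantiles
plot(mod, which = 2)
```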
### Report
Canariensis chaffinches ($\bar{x} \pm s.e$: 22.48 $\pm$ 0.48) were
significantly heavier than Coelebs (20.28 $\pm$ 0.48) (*t* = 2.65;
*d.f.* = 38; *p* = 0.012). See @fig-chaff.
::: {#fig-chaff}
```{r}
#| code-fold: true
ggplot() +
geom_point(data = chaff, aes(x = subspecies, y = mass),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = chaff_summary,
aes(x = subspecies, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = chaff_summary,
aes(x = subspecies, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Mass (g)",
limits = c(0, 30),
expand = c(0, 0)) +
scale_x_discrete(name = "Subspecies",
labels = c("Canariensis", "Coelebs")) +
annotate("segment", x = 1, xend = 2,
y = 28, yend = 28,
colour = "black") +
annotate("text", x = 1.5, y = 29,
label = expression(italic(p)~"= 0.012")) +
theme_classic()
```
**Canariensis chaffinches are heavier than Coelebs chaffinches**. The
mean mass of 20 randomly sampled males from each subspecies was
determined. Error bars are $\pm$ 1 standard error. Canariensis
chaffinches were significantly heavier than Coelebs (*t* = 2.65; *d.f.*
= 38; *p* = 0.012). Data analysis was conducted in R [@R-core] with
tidyverse packages [@tidyverse].
:::
## Two independent samples, non-parametric
The non-parametric equivalent of the linear model with two independent
samples is the "Wilcoxon rank sum test" [@wilcoxon1945]. It is commonly
also known as the Mann-Whitney or Wilcoxon–Mann–Whitney.
The general question you have about your data - are these two groups
different - is the same, but one or more of the following is true:
- the response variable is not continuous
- the residuals are not normally distributed
- the sample size is too small to tell if they are normally
distributed.
- the variance is not homogeneous
The test is applied in R with the `wilcox.test()` function.
The data in [arabidopsis.txt](data-raw/arabidopsis.txt) give the number
of leaves on eight wildtype and eight mutant *Arabidopsis thaliana*
plants. We want to know if the two types of plants have differing
numbers of leaves. These are counts, so they are not continuous and the
sample sizes are quite small. A non-parametric test is a safer option.
### Import and explore
```{r}
arabidopsis <- read_table("data-raw/arabidopsis.txt")
```
These data are in tidy format [@Wickham2014-nl] - the numbers of leaves
are in one column with another column indicating whether the observation
comes from a wildtype or mutant *Arabidopsis*. This means they are well
formatted for analysis and plotting.
Create a quick plot of the data:
```{r}
#| label: fig-arabid-rough
#| fig-cap: "The number of leaves on mutant and wildtype plants. A boxplot is a useful way to get an overview of the data and helps us identify any issues such as missing or extreme values. It also tells us what to expect from the analysis."
ggplot(data = arabidopsis,
aes(x = type, y = leaves)) +
geom_boxplot()
```
Our rough plot shows that the mutant plants have fewer leaves than the
wildtype plants.
Summarising the data using the median and interquartile range is more
aligned to the type of data and the type of analysis than using means
and standard deviations:
```{r}
arabidopsis_summary <- arabidopsis |>
group_by(type) |>
summarise(median = median(leaves),
interquartile = IQR(leaves),
n = length(leaves))
```
View the results:
```{r}
arabidopsis_summary
```
### Apply `wilcox.test()`
We pass the dataframe and variables to `wilcox.test()` in the same way
as we did for `lm()`. We give the data argument and a "formula" which
says `leaves ~ type` meaning "explain leaves by type".
```{r}
wilcox.test(data = arabidopsis, leaves ~ type)
```
The warning message "Warning: cannot compute exact p-value with ties" is
not something to worry about too much. It is a warning rather than an
indication that your results are incorrect. It means the *p*-value is
based on an approximation rather than being exact because there are ties
(some values are the same).
The result of the test is given on this line:
`W = 5, p-value = 0.005051`. `W` is the test statistic. The *p*-value is
less than 0.05 meaning there is a significant difference in the number
of leaves on wildtype and mutant plants.
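The warning can also be avoided by requesting the normal approximation explicitly; with ties this is what `wilcox.test()` does anyway, so the result is unchanged.

```{r}
# exact = FALSE requests the normal approximation directly and
# avoids the "cannot compute exact p-value with ties" warning
wilcox.test(data = arabidopsis, leaves ~ type, exact = FALSE)
```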
### Report
There are significantly more leaves on wildtype (median = 8.5) than
mutant (median = 5) plants (Wilcoxon rank sum test: *W* = 5, $n_1$ = 8,
$n_2$ = 8, *p* = 0.005). See @fig-arabid.
::: {#fig-arabid}
```{r}
#| code-fold: true
ggplot(data = arabidopsis,
aes(x = type, y = leaves)) +
geom_boxplot() +
scale_y_continuous(name = "Number of leaves",
limits = c(0, 12),
expand = c(0, 0)) +
scale_x_discrete(name = "",
labels = c("Mutatnt", "Wildtype")) +
annotate("segment", x = 1, xend = 2,
y = 10.5, yend = 10.5,
colour = "black") +
annotate("text", x = 1.5, y = 11,
label = expression(italic(p)~"= 0.005")) +
theme_classic()
```
**Mutant *Arabidopsis thaliana* have fewer leaves**. There are
significantly more leaves on wildtype than mutant plants (Wilcoxon rank
sum test: *W* = 5, $n_1$ = 8, $n_2$ = 8, *p* = 0.005). The heavy lines
indicate the median number of leaves, boxes indicate the interquartile range
and whiskers the range. Data analysis was conducted in R [@R-core] with
tidyverse packages [@tidyverse].
:::
## Two paired-samples, parametric
The data in [marks.csv](data-raw/marks.csv) give the marks for ten
students in two subjects: Data Analysis and Biology. These data are
paired because we have two marks from one student so that a mark in one
group has a closer relationship with one of the marks in the other group
than with any of the other values. We want to know if students do equally
well in both subjects.
### Import and explore
Import the data:
```{r}
marks <- read_csv("data-raw/marks.csv")
```
Since these data are paired, it makes sense to highlight how the marks
differ for each student. One way of doing that is to draw a line linking
their marks in each subject. This is known as a spaghetti plot. We can
use two geoms: `geom_point()` and `geom_line()`. To join a student's
marks, we need to set the `group` aesthetic to `student`.[^two_sample_tests-1]
[^two_sample_tests-1]: You might like to try removing
`aes(group = student)` to see what ggplot does when the lines are
*not* grouped by student.
```{r}
ggplot(data = marks, aes(x = subject, y = mark)) +
geom_point() +
geom_line(aes(group = student))
```
Summarise the data so that we can use the means in plots later:
```{r}
marks_summary <- marks |>
group_by(subject) |>
summarise(mean = mean(mark))
```
A paired test requires us to take into account the variation between
students.
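One way to see what the pairing does: a paired test is equivalent to a one-sample test on the per-student differences. Here is a minimal sketch, assuming the subject levels are `Biology` and `DataAnalysis` (as the `lm()` coefficient names below suggest):

```{r}
# reshape to one row per student, then test whether the mean
# per-student difference is zero (column names come from the
# assumed subject levels)
marks_wide <- marks |>
  pivot_wider(names_from = subject, values_from = mark)

t.test(marks_wide$DataAnalysis - marks_wide$Biology)
```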
### Apply `lm()`
We can create a paired-sample model with the `lm()`
function[^two_sample_tests-2] like this:
[^two_sample_tests-2]: This is not the only way to apply a paired test.
When there are only two groups and no other explanatory variables, we
can use `t.test(data = marks, mark ~ subject, paired = TRUE)`. A
more general method that works when you have two or more
non-independent values (e.g., more than two subjects) or additional
explanatory variables is to create a "linear mixed model" with
`lmer()`.
```{r}
mod <- lm(mark ~ subject + factor(student), data = marks)
```
- `mark` is the dependent variable (response).
- `subject` is the independent variable (explanatory factor).
- `factor(student)` accounts for the pairing by treating `student` as
  another explanatory variable. We have used `factor()` because the
  values in `student` are the numbers 1 to 10 and we want `student` to
  be treated as a category, not a number.
```{r}
summary(mod)
```
The coefficient for `(Intercept)` gives the mean Biology mark and that
for `subjectDataAnalysis` is the amount by which Data Analysis marks are
higher than Biology marks on average. The *p*-value tests whether this
difference is significantly different from zero. The rest of the output
considers how students differ; you can ignore it here.
### Check assumptions
We might expect marks to be normally distributed. However, this is a
very small sample, and choosing a non-parametric test instead would be
reasonable. Nevertheless, we will continue with this example to
demonstrate how to interpret and report on the result of a parametric
paired-samples test (paired-samples *t*-test).
A plot of the residuals against the fitted values (`plot(mod, which = 1)`)
is not useful for a paired test. The normality of the residuals should
be checked.
```{r}
ggplot(mapping = aes(x = mod$residuals)) +
geom_histogram(bins = 3)
```
We only have 10 values, so the distribution is never going to look
smooth. We can't draw strong conclusions from this, but we do at least
have a peak at 0. Similarly, a normality test is likely to be
non-significant simply because the small sample size gives the test
little power. A non-significant result is therefore not strong
evidence that the residuals follow a normal distribution:
```{r}
shapiro.test(mod$residuals)
```
### Report
Individual students score significantly higher in Data Analysis than in
Biology (*t* = 2.47; *d.f.* = 9; *p* = 0.0355) with an average
difference of 6.5%. See @fig-marks.
::: {#fig-marks}
```{r}
#| code-fold: true
ggplot(data = marks, aes(x = subject, y = mark)) +
geom_point(pch = 1, size = 3) +
geom_line(aes(group = student), linetype = 3) +
geom_point(data = marks_summary,
aes(x = subject, y = mean),
size = 3) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Mark",
expand = c(0, 0),
limits = c(0, 110)) +
annotate("segment", x = 1, xend = 2,
y = 105, yend = 105,
colour = "black") +
annotate("text", x = 1.5, y = 108,
label = expression(italic(p)~"= 0.0355")) +
theme_classic()
```
**Students score higher in Data Analysis than in Biology**. Open circles indicate an individual student's marks, with dotted lines joining their marks in the two subjects. The filled circles indicate the mean mark for each subject. Individual students score significantly higher in Data Analysis than in
Biology (*t* = 2.47; *d.f.* = 9; *p* = 0.0355) with an average difference of 6.5%. Data analysis was conducted in R [@R-core] with tidyverse packages [@tidyverse].
:::
## Two paired-samples, non-parametric
We have the marks for just 10 students. This sample is too small for us to judge whether the marks are normally distributed. We will use a non-parametric test instead. The "Wilcoxon signed-rank" test is the non-parametric equivalent of the paired-samples *t*-test. This is often referred to as the paired-sample
Wilcoxon test, or just the Wilcoxon test.
The test is also applied in R with the `wilcox.test()` function but we add the `paired = TRUE` argument. We can also use the tidy format of the data, i.e., the long format with marks in one column and another column indicating the subject.
### Apply `wilcox.test()`
To apply a paired test with `wilcox.test()` we need to use the long
format data (tidy) and add the `paired = TRUE` argument.
```{r}
wilcox.test(data = marks, mark ~ subject, paired = TRUE)
```
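Note: in recent versions of R the formula method of `wilcox.test()` no longer accepts `paired = TRUE`. If the line above errors for you, an equivalent approach is to pass the two sets of marks as vectors, keeping them in the same student order. This is a sketch assuming the subject levels are `Biology` and `DataAnalysis`, as in the `lm()` output above:

```{r}
#| eval: false
# extract each subject's marks, ordered by student so the pairing
# is preserved
biology <- marks |>
  filter(subject == "Biology") |>
  arrange(student) |>
  pull(mark)
data_analysis <- marks |>
  filter(subject == "DataAnalysis") |>
  arrange(student) |>
  pull(mark)

wilcox.test(data_analysis, biology, paired = TRUE)
```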
### Report
Individual students score significantly higher in Data Analysis than in
Biology (Wilcoxon signed rank test: *V* = 6.55; $n$ = 10; *p* = 0.036).
See @fig-marks.
## Summary
1. A linear model with one explanatory variable with two groups and
one continuous response is "a two-sample test".
2. If pairs of observations in the groups have something in common
that makes them more similar to each other than to other
observations, then those observations are not independent. A
**paired-samples test** is used when the observations are not
independent.
3. A linear model with one explanatory variable with two groups and
one continuous response is also known as a **two-sample *t*-test**
when the samples are independent and as a **paired-samples *t*-test**
when they are not.
4. We can use `lm()` to do two-sample and paired sample tests. We can
also use `t.test()` for these but using `lm()` helps us understand
tests with more groups and/or more variables where we will have to
use `lm()`. The output of `lm()` is also more typical of the output
of statistical functions in R.
5. We estimate the **coefficients** (also called the **parameters**) of
the model. For a two-sample test these are the mean of the first
group, $\beta_0$ (which might also be called the intercept) and the
difference between the means of the first and second groups,
$\beta_1$ (which might also be called the slope). For a
paired-sample test there is just one parameter, the mean difference
between pairs of values, $\beta_0$ (which might also be called the
intercept). We test whether the parameters differ significantly from
zero.
6. We can use `lm()` to do a linear regression.
7. In the output of `lm()` the coefficients are listed in a table in
the Estimate column. The *p*-value for each coefficient is a test of
whether it differs from zero. At the bottom of the output there is a
test of the model *overall*. In this case, this is exactly the same as
the test of $\beta_1$ and the *p*-values are identical. The R-squared
value is the proportion of the variance in the response variable that
is explained by the model.
8. The assumptions of the general linear model are that the residuals
are normally distributed and have homogeneity of variance. A residual
is the difference between the predicted value and the observed value.
9. We examine a histogram of the residuals and use the Shapiro-Wilk
normality test to check the normality assumption. We check the
variance of the residuals is the same for all fitted values with
a residuals vs fitted plot.
10. If the assumptions are not met, we can use alternatives known as
non-parametric tests. These are applied with `wilcox.test()` in R.
11. When reporting the results of a test we give the significance,
direction and size of the effect. Our figures and the values we give
should reflect the type of test we have used. We use means and
standard errors for parametric tests and medians and interquartile
ranges for non-parametric tests. We also give the test statistic, the
degrees of freedom (parametric) or sample size (non-parametric) and
the *p*-value. We annotate our figures with the *p*-value, making clear
which comparison it applies to.