assignment-04.Rmd

---
title: 'Assignment 4: Exploration, linear and mixed-effects models'
output:
    html_document:
        toc: false
---

*To submit this assignment, upload the full document on blackboard,
including the original questions, your code, and the output. Submit
you assignment as a knitted `.pdf` (prefered) or `.html` file.*

1. Visualization (3 marks)

    Import the tidyverse library. We will be using the same beaver1 dataset that
    we used in last week's assignment.

    ```{r message=FALSE, warning=FALSE}
    library(tidyverse)
    ```

    a. Create a histogram to visualize the distribution of the beavers' body
    temperatures, separating the temperature data based on the beaver's activity level.
    (after transforming it into a categorical variable the way you did for your
    last assignment). Describe the properties of the distribution. When
    creating this plot for the purpose of evaluating temperature, what
    argument did you adjust and why? (1 mark)

    b. What type of variables are temperature and time of day? With this in
    mind, create a visualization that will help you get a better understanding
    of the relationship between these variables. (0.5 mark)

    c. Create a single box plot to simultaneously visualise temperature,
     activity, and day. (0.5 mark)
    
    d. What is one prediction you might make about the relationships among your
    variables (based on the patterns you observed)? Create a visualization that
    illustrates your prediction, improving on your previous plots in at least one
    way. State why this plot is an improvement. (1 mark)

2. Outliers (2 marks)

    a. In the beaver1 dataset, there are some particularly high/low body
    temperature measurements. Give an example of a systematic or random error
    (state which) that could have influenced these values. (0.5 marks)

    b. Consider whether these values would affect your ability to test whether
    temperature varies by activity level. You should generate plots and/or
    perform statistical tests with and without these points, and make an
    informed decision about whether they should be kept or dropped (Hint: you
    may want to either create a second data set or get creative with colour.)
    State whether you would remove the points and why. (1.5 marks)
    
3. Linear models (3 marks)

Run the following code to load the CO2 dataset.

    ```{r}
    co2_df <- as_data_frame(as.matrix(CO2)) %>% 
        mutate(conc = as.integer(conc),
               uptake = as.numeric(uptake))
    ```

    a. Look through the help documentation (?CO2) to understand what each
    variable means. Imagine you were running a statistical model to assess the
    effects of chilling on plant CO2 uptake. What would the $y$ and $x$
    variables be in such a model? What about if you were trying to assess the
    relationship between ambient CO~2~ concentrations and plant uptake? Briefly
    defend these choices. (1 mark)

    b. How much does `uptake` change if `conc` goes up by 10 mL/L? Write out the
    interpretation as a simple statement of this contribution of `conc` on
    `uptake`. How much CO2 would you predict plants to uptake if atmospheric
    concentrations were 2,450 mL/L?. Show your work. (2 marks)
        
4. Linear mixed-effects models (4 marks).

    Santangelo _et al._ (2018) were interested in understanding how plant
    defenses, herbivores, and pollinators influence the expression of plant
    floral traits (e.g. flower size). Their experiment had 3 treatments, each
    with 2 levels: Plant defense (2 levels: defended vs. undefended), herbivory
    (2 levels: reduced vs. ambient) and pollination (2 levels: open vs.
    supplemental). These treatments were fully crossed for a total of 8
    treatment combinations. In each treatment combination, they grew 4
    individuals from each of 25 plant genotypes for a total of 800 plants (8
    treatment combinations x 25 genotypes x 4 individuals per genotype). Plants
    were grown in a common garden at the Koffler Scientific Reserve (UofT's field
    research station) and 6 floral traits were measured on all plants throughout
    the summer. We will analyze how the treatments influenced one of these
    traits in this exercise. Run the code chunk below to download the data,
    which includes only a subset of the columns from the full dataset:
    
    ```{r}
    library(tidyverse)
    
    plant_data <- "https://uoftcoders.github.io/rcourse/data/Santangelo_JEB_2018.csv"
    download.file(plant_data, "Santangelo_JEB_2018.csv")
    plant_data <- read_csv("Santangelo_JEB_2018.csv", 
                           col_names = TRUE)
    glimpse(plant_data)
    head(plant_data)
    ```

    You can see that the data contain 792 observations (i.e. plants, 8 died
    during the experiment). There are 50 genotypes across 3 treatments:
    Herbivory, Pollination, and HCN (i.e. hydrogen cyanide, a plant defense).
    There are 6 plant floral traits: Number of days to first flower, banner
    petal length, banner petal width, plant biomass, number of flowers, and
    number of inflorescences. Finally, since plants that are closer in space in
    the common garden may have similar trait expression due to more similar
    environments, the authors included 6 spatial "blocks" to account for this
    environmental variation (i.e. Plant from block A "share" an environment and
    those from block B "share" an environment, etc.). Also keep in mind that
    each treatment combination contains 4 individuals of each genotype, which
    are likely to have similar trait expression due simply to shared genetics.
    
    a. Use the `lme4` and `lmerTest` R packages to run a linear mixed-effects
    model examining how herbivores (`Herbivory`), Pollinators (`Pollination`),
    plant defenses (`HCN`) _and all interactions_ influences the width of
    banner petals (`Avg.Bnr.Wdth`) produced by plants while accounting for
    variation due to spatial block and plant genotype. Also allow the intercept
    for `Genotype` to vary across the levels of the herbivory treatment. (1
    mark: 0.5 for correct fixed effects specification and 0.5 for correct random
    effects structure). You only need to specify the model for this part of the
    question.

    b. Summarize (i.e. get the output) the model that you ran in part (a). Did
    any of the treatments have a significant effect on banner petal length? If
    so, which ones? Based on your examination of the model output, how can you
    tell which level of the significant treatments resulted in longer or shorter
    mean banner petal widths? Make a statement for each significant **main**
    effects in the model (i.e. not interactions). If none of the main effects
    are significant, then simply write "there are no significant main effects in
    the model" (0.5 marks).

    c. Using `dplyr` and `gglot2`, plot the mean banner width for one of the
    significant interactions in the model above (whichever you choose). The idea
    is to show how both treatments interact to influence the mean length of
    banner petals using a combination of different colours, linetypes, shapes,
    etc. on the same plot (i.e., no faceting). Avoid overlapping points in the
    figure. Also include error bars/bands with one standard error around the
    mean. As a reminder, I have included the formula to calculate the standard
    error of the mean below. (1.5 marks).
    
    $$ SE = \frac{sd}{\sqrt{n}}  $$

    d. After accounting for the fixed effects, what percentage of the variation
    in banner petal width was explained by each of the random effects in the
    model? Show yor work. (0.5 marks).

    e. Descibe the pattern you see in the figure generated in part (c). Why do
    you think the interaction you plotted was significant in the model? (0.5 marks)