---
title: "Data Visualization 1"
author: "Joel Ledford"
date: "Winter 2019"
output:
  html_document:
    theme: spacelab
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
---
## Where have we been, and where are we going?
At this point you should feel reasonably comfortable working in RStudio and using dplyr and tidyr. You also know how to produce statistical summaries of data and deal with NAs. It is OK if you need to go back through the labs and find bits of code that work for you, but try to write new chunks on your own.
## Group Project
Meet with your group and decide on a data set that you will use for your project. Be prepared to discuss these data, where you found them, and what you hope to learn.
## Resources
- [ggplot2 cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
- [R for Data Science](https://r4ds.had.co.nz/)
- [R Cookbook](http://www.cookbook-r.com/)
## Learning Goals
*At the end of this exercise, you will be able to:*
1. Understand and apply the syntax of building plots using `ggplot2`.
2. Build a boxplot using `ggplot2`.
3. Build a scatterplot using `ggplot2`.
4. Build a barplot using `ggplot2` and show the difference between `stat="count"` and `stat="identity"`.
## Load the libraries
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(skimr)
```
## Grammar of Graphics
The ability to quickly produce and edit beautiful graphs and charts is a strength of R. These data visualizations are produced by the package `ggplot2`, which is a core part of the tidyverse. The syntax for using ggplot is specific and common to all plot types. Hadley Wickham describes it as a layered [Grammar of Graphics](http://vita.had.co.nz/papers/layered-grammar.pdf). The "gg" in `ggplot` stands for grammar of graphics.
## Philosophy
What makes a good chart? In my opinion a good chart is elegant in its simplicity. It provides a clean, clear visual of the data without being overwhelming to the reader. This can be hard to do and takes some careful thinking. Always keep in mind that the reader will almost never know the data as well as you do so you need to be mindful about presenting the facts.
## Data Types
While this isn't a statistics class, we need to define some of the data types we will use to build plots.
+ `discrete`: quantitative data that only contains integers
+ `continuous`: quantitative data that can take any numerical value
+ `categorical`: qualitative data that can take on a limited number of values
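
As a quick check of these definitions, here is a minimal sketch that uses the built-in `iris` data (chosen only for illustration) to see how R stores continuous and categorical variables.

```{r}
# Inspect how R stores each column of the built-in iris data
sapply(iris, class)

# Sepal.Length is continuous and stored as numeric
is.numeric(iris$Sepal.Length)

# Species is categorical and stored as a factor with a fixed set of levels
is.factor(iris$Species)
levels(iris$Species)
```
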
## Basics
The syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers. In short, **plot= data + geom_ + aesthetics**.
We start by calling the ggplot function, identifying the data, and specifying the axes. We then add the `geom` type to describe which type of plot we want to make. Each `geom_` works with specific types of data and R is capable of building plots of single variables, multiple variables, and even maps. Lastly, we add aesthetics.
ggplot works best with tidy data, so it is sometimes necessary to tidy data before plotting. We will start with tidy data for simplicity.
## Example
To make things easy, let's start with some built-in data.
```{r}
?iris
names(iris)
```
To make a plot, we first need to specify the data and map the aesthetics. The aesthetics describe how each variable in our dataset will be used. In the example below, I am using the `aes()` function to identify the x and y variables in the plot.
```{r}
ggplot(data = iris, mapping = aes(x = Species, y = Petal.Length))
```
Notice that we have a nice background, labeled axes, and even values for our variables, but no plot. This is because we need to tell ggplot what type of plot we want to make. This is called the geometry, and it is set with a `geom_` function. There are many types of `geom_`; see the ggplot [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf).
Here we specify that we want a boxplot, indicated by `geom_boxplot()`.
```{r}
ggplot(data = iris, mapping = aes(x = Species, y = Petal.Length)) +
  geom_boxplot()
```
## Practice
Take a moment to practice. Use the iris data to build a scatterplot that compares sepal length vs. sepal width. Use the cheatsheet to find the correct `geom_` for a scatterplot.
```{r}
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
```
## Scatterplots, barplots, and boxplots
Now that we have a general idea of the syntax, let's start by working with two standard plots: 1) scatterplots and 2) barplots.
## Data
For the following examples, I am going to use data about vertebrate home range sizes.
**Database of vertebrate home range sizes.**
Reference: Tamburello N, Cote IM, Dulvy NK (2015) Energy and the scaling of animal space use. The American Naturalist 186(2):196-211. http://dx.doi.org/10.1086/682070.
Data: http://datadryad.org/resource/doi:10.5061/dryad.q5j65/1
```{r message=FALSE, warning=FALSE}
homerange <-
  readr::read_csv("data/Tamburelloetal_HomeRangeDatabase.csv")
```
As a follow-up to lab 4, view the `homerange` data frame by opening it in RStudio (click on it in the Environment tab). Are there grey italicized NA values? Missing values in the csv file are coded as NA, which `read_csv()` recognizes by default. If our data file used a different code for missing values, we would need to specify it with the `na` argument in the call above.
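
To illustrate the last point, here is a minimal sketch of reading a file that uses a different missing-value code. The temporary file, its column names, and the `-999` code are all hypothetical and are not part of the home range data.

```{r}
library(readr)

# Hypothetical file in which missing values are coded as -999
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,mass", "a,12.5", "b,-999", "c,NA"), tmp)

# List every string that should be read as NA
example_data <- read_csv(tmp, na = c("", "NA", "-999"), col_types = "cd")
example_data

# Both the -999 and the NA entries are now treated as missing
sum(is.na(example_data$mass))
```
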
### 1. Scatter Plots
Scatter plots are good at revealing relationships that are not readily visible in the raw data. For now, we will not add regression lines or calculate any r^2^ values.
In the case below, we are exploring whether or not there is a relationship between animal mass and home range size. We are using the log10-transformed values because mass and home range size differ by orders of magnitude among the species in the data.
```{r}
ggplot(data = homerange, mapping = aes(x = log10.mass, y = log10.hra)) +
  geom_point()
```
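
To see why a log transformation helps with values like these, here is a minimal sketch using hypothetical masses (not taken from the home range data) that span six orders of magnitude.

```{r}
# Hypothetical body masses in grams, from very small to very large animals
mass_g <- c(5, 50, 5000, 5000000)

# On the raw scale the largest value dwarfs the rest
range(mass_g)

# On the log10 scale, each step of 1 is a tenfold change in mass
log10(mass_g)
```
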
In big data sets with lots of similar values, overplotting can be an issue. `geom_jitter()` is similar to `geom_point()` but it helps with overplotting by adding some random noise to the data and separating some of the individual points.
```{r}
ggplot(data = homerange, mapping = aes(x = log10.mass, y = log10.hra)) +
  geom_jitter()
```
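
The amount of noise can be controlled. The sketch below uses the built-in `iris` data and `position_jitter()`, with `width`, `height`, and `seed` values picked arbitrarily for illustration. Jitter is applied in both directions, so each point moves by at most the stated amount along each axis.

```{r}
library(ggplot2)

# width and height set the maximum jitter in data units;
# seed makes the random noise reproducible between runs
p_jitter <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.05, seed = 123))
p_jitter
```
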
You want to see the regression line, right?
```{r}
ggplot(data = homerange, mapping = aes(x = log10.mass, y = log10.hra)) +
  geom_jitter() +
  geom_smooth(method = lm, se = FALSE) # adds the regression line; `se = TRUE` adds a confidence band around it
```
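
With `method = lm`, `geom_smooth()` fits a straight line of y on x using `lm()`. As a sketch of what that fit looks like outside of a plot, the example below fits the same kind of model to two columns of the built-in `iris` data (chosen only for illustration).

```{r}
# Fit a straight line: Petal.Width as a function of Petal.Length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Intercept and slope of the fitted line
coef(fit)
```
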
### Practice
1. What is the relationship between log10.hra and log10.preymass? What do you notice about how ggplot treats NAs?
```{r}
ggplot(data = homerange, mapping = aes(x = log10.hra, y = log10.preymass)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```
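
ggplot drops rows with missing values and prints a warning that reports how many were removed. Here is a minimal sketch with a small hypothetical data frame (`toy` and its values are made up for illustration).

```{r}
library(ggplot2)

# Hypothetical data with one missing y value
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, NA, 3.9, 5.2))

# The incomplete row is dropped and a warning is printed
ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

# na.rm = TRUE drops the row silently instead
ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

# Number of rows that cannot be plotted
sum(!complete.cases(toy))
```
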
### 2A. Bar Plot: `stat="count"`
When making a bar graph, the default is to count the number of observations in the specified column. This is best for categorical data. Here, I want to know how many carnivores vs. herbivores are in the data.
Notice that we can use pipes! Also, the `mapping =` argument name is implied when `aes()` is passed directly, so it is often left out.
```{r}
homerange %>%
  ggplot(aes(x = trophic.guild)) +
  geom_bar(stat = "count")
```
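
The bar heights are the same numbers that `count()` returns. The sketch below checks this with the built-in `iris` data (used here only for illustration); `layer_data()` shows the values ggplot computed for the bars.

```{r}
library(ggplot2)
library(dplyr)

# Number of observations in each category
species_counts <- count(iris, Species)
species_counts

# geom_bar() uses stat = "count" by default, so no y variable is needed
p_count <- ggplot(data = iris, mapping = aes(x = Species)) +
  geom_bar()
p_count

# The computed bar heights match the counts above
layer_data(p_count)$count
```
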
### 2B. Bar Plot: `stat="identity"`
`stat="identity"` allows us to map a variable to the y axis so that we aren't restricted to counts. In this example, I start by summarizing mean body weight by taxonomic class and then use pipes to build the plot.
```{r}
homerange %>%
  group_by(class) %>%
  summarize(mean_body_wt = mean(log10.mass)) %>%
  ggplot(aes(x = class, y = mean_body_wt)) +
  geom_bar(stat = "identity")
```
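
`geom_col()` is shorthand for `geom_bar(stat = "identity")`. Here is a minimal sketch of the same summarize-then-plot pattern, using the built-in `iris` data and a made-up column name (`mean_petal_length`) for illustration.

```{r}
library(ggplot2)
library(dplyr)

# One row per category, with the value to plot on the y axis
species_means <- iris %>%
  group_by(Species) %>%
  summarize(mean_petal_length = mean(Petal.Length))

# geom_col() draws each bar at the height given by y
p_col <- ggplot(data = species_means, mapping = aes(x = Species, y = mean_petal_length)) +
  geom_col()
p_col
```
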
## Practice
1. Filter the `homerange` data to include `mammals` only.
```{r}
mammals <-
  homerange %>%
  filter(taxon == "mammals")
```
2. Are there more herbivores or carnivores in mammals? Make a bar plot that shows their relative numbers.
```{r}
mammals %>%
  ggplot(aes(x = trophic.guild)) +
  geom_bar(stat = "count")
```
3. Make a bar plot that shows the masses of the 10 smallest mammals. Be sure to use `stat="identity"`.
```{r}
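mammals %>%
  top_n(-10, log10.mass) %>% # a negative n keeps the 10 smallest values
  ggplot(aes(x = reorder(common.name, log10.mass), y = log10.mass)) + # order bars by mass
  geom_bar(stat = "identity") +
  coord_flip() # flip the axes so the species names are readable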
```
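
A negative `n` in `top_n()` selects the smallest values. As a sketch with hypothetical data (`toy_mass` and its values are made up), the example below shows this next to `slice_min()`, which does the same job and assumes dplyr 1.0.0 or later.

```{r}
library(dplyr)

# Hypothetical masses for five animals
toy_mass <- tibble(name = c("a", "b", "c", "d", "e"),
                   mass = c(12, 3, 45, 7, 30))

# Keep the 2 rows with the smallest mass (rows stay in their original order)
smallest_top_n <- toy_mass %>% top_n(-2, mass)
smallest_top_n

# Same rows, returned in order of increasing mass
smallest_slice <- toy_mass %>% slice_min(mass, n = 2)
smallest_slice
```
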
## That's it, let's take a break!
--> On to [part 2](https://jmledford3115.github.io/datascibiol/lab5_2.html)