---
title: "Lecture 2: Summary and visualization"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
```
# Summarizing Data
## Summary Statistics
`R` has built-in functions for a large number of summary statistics. For numeric variables, we can summarize data with measures of center and spread. We'll again look at the `mpg` dataset from the `ggplot2` package.
```{r, message = FALSE, warning = FALSE}
library(ggplot2)
```
### Central Tendency {-}
| Measure | `R` | Result |
|---------|-------------------|---------------------|
| Mean | `mean(mpg$cty)` | `r mean(mpg$cty)` |
| Median | `median(mpg$cty)` | `r median(mpg$cty)` |
### Spread {-}
| Measure | `R` | Result |
|--------------------|------------------|--------------------|
| Variance | `var(mpg$cty)` | `r var(mpg$cty)` |
| Standard Deviation | `sd(mpg$cty)` | `r sd(mpg$cty)` |
| IQR | `IQR(mpg$cty)` | `r IQR(mpg$cty)` |
| Minimum | `min(mpg$cty)` | `r min(mpg$cty)` |
| Maximum | `max(mpg$cty)` | `r max(mpg$cty)` |
| Range | `range(mpg$cty)` | `r range(mpg$cty)` |
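Note that `range()` returns two numbers, the minimum and the maximum, rather than a single measure of spread. The chunk below is a minimal sketch, assuming the width of the range (maximum minus minimum) is what is wanted; the names `cty_range` and `cty_width` are illustrative only.

```{r}
library(ggplot2)  # provides the mpg dataset

cty_range <- range(mpg$cty)   # length-2 vector: minimum first, then maximum
cty_width <- diff(cty_range)  # maximum minus minimum, a single number

cty_range
cty_width
```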
### Categorical {-}
For categorical variables, counts and proportions can be used as summaries.
```{r}
table(mpg$drv)
table(mpg$drv) / nrow(mpg)
```
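Dividing the table by `nrow(mpg)` gives proportions. As a small sketch, base `R`'s `prop.table()` does the same division, and multiplying by 100 turns the result into percentages. The variable names here are illustrative.

```{r}
library(ggplot2)  # provides the mpg dataset

drv_counts <- table(mpg$drv)
drv_props <- prop.table(drv_counts)   # each count divided by the total count
drv_pct <- round(100 * drv_props, 1)  # percentages, rounded to one decimal

drv_props
drv_pct
```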
## Plotting
Now that we have some data to work with, and we have learned about the data at the most basic level, our next task is to visualize the data. Often, a proper visualization can illuminate features of the data that can inform further analysis.
We will look at four methods of visualizing data that we will use throughout the course:
- Histograms
- Barplots
- Boxplots
- Scatterplots
<!-- Stem and Leaf Plots,
Stacked and Grouped Bar Charts and Mosaic Plots,
pie -->
### Histograms
When visualizing a single numerical variable, our go-to tool will be a **histogram**, which can be created in `R` using the `hist()` function.
```{r}
hist(mpg$cty)
```
The histogram function has a number of parameters which can be changed to make our plot look much nicer. Use the `?` operator to read the documentation for `hist()` to see a full list of these parameters.
```{r}
hist(mpg$cty,
     xlab   = "Miles Per Gallon (City)",
     main   = "Histogram of MPG (City)",
     breaks = 12,
     col    = "dodgerblue",
     border = "darkorange")
```
Importantly, you should always be sure to label your axes and give the plot a title. The argument `breaks` is specific to `hist()`. Entering an integer will give a suggestion to `R` for how many bars to use for the histogram. By default `R` will attempt to intelligently guess a good number of `breaks`, but as we can see here, it is sometimes useful to modify this yourself.
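To see that an integer `breaks` is only a suggestion, one option is to ask `hist()` for its computed values without drawing anything. The sketch below assumes bins of width 5 are wanted for the second histogram; `h_suggested`, `h_fixed`, `lower`, and `upper` are illustrative names.

```{r}
library(ggplot2)  # provides the mpg dataset

# plot = FALSE returns the histogram object without drawing it
h_suggested <- hist(mpg$cty, breaks = 12, plot = FALSE)
h_suggested$breaks          # the cut points R actually chose
length(h_suggested$counts)  # number of bars, not necessarily 12

# A numeric vector of breaks is used as given; it must span the data
lower <- 5 * floor(min(mpg$cty) / 5)
upper <- 5 * ceiling(max(mpg$cty) / 5)
h_fixed <- hist(mpg$cty, breaks = seq(lower, upper, by = 5), plot = FALSE)
h_fixed$breaks
```

Passing a vector of cut points is useful when the bin edges need to fall on specific values rather than wherever `R` places them.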
### Barplots
Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.
```{r}
barplot(table(mpg$drv))
```
```{r}
barplot(table(mpg$drv),
        xlab   = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
        ylab   = "Frequency",
        main   = "Drivetrains",
        col    = "dodgerblue",
        border = "darkorange")
```
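The same approach works for a numeric variable that takes only a few distinct values. As a sketch, the `cyl` column of `mpg` (number of cylinders) can be tabulated and plotted in the same way; the labels chosen here are only suggestions.

```{r}
library(ggplot2)  # provides the mpg dataset

cyl_counts <- table(mpg$cyl)  # one count per distinct value of cyl
barplot(cyl_counts,
        xlab = "Number of Cylinders",
        ylab = "Frequency",
        main = "Cylinders")
```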
### Boxplots
To visualize the relationship between a numerical and a categorical variable, we will use a **boxplot**. In the `mpg` dataset, the `drv` variable takes a small, finite number of values. A car can only be front-wheel drive, four-wheel drive, or rear-wheel drive.
```{r}
unique(mpg$drv)
```
First note that we can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. To do so in `R`, we use the `boxplot()` function.
```{r}
boxplot(mpg$hwy)
```
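The numbers behind a boxplot can be inspected directly. The sketch below uses base `R`'s `fivenum()` and `boxplot.stats()`; the box edges and median match the five-number summary, while the whiskers stop at the most extreme points that are not flagged as outliers. The name `hwy_box` is illustrative.

```{r}
library(ggplot2)  # provides the mpg dataset

fivenum(mpg$hwy)  # minimum, lower hinge, median, upper hinge, maximum

hwy_box <- boxplot.stats(mpg$hwy)
hwy_box$stats     # whisker ends, box edges, and median as drawn
hwy_box$out       # points plotted individually beyond the whiskers
```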
However, more often we will use boxplots to compare a numerical variable for different values of a categorical variable.
```{r}
boxplot(hwy ~ drv, data = mpg)
```
Here we used the `boxplot()` command to create side-by-side boxplots. However, since we are now dealing with two variables, the syntax has changed. The `R` syntax `hwy ~ drv, data = mpg` reads "Plot the `hwy` variable against the `drv` variable using the dataset `mpg`." We see the use of a `~` (which specifies a formula) and also a `data = ` argument. This will be a syntax that is common to many functions we will use in this course.
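As an illustration of how general this syntax is, the sketch below stores the formula in a variable (the name `f` is arbitrary) and passes it to `aggregate()`, another base `R` function that accepts a formula and a `data` argument. The last line shows one formula-free way to draw the same side-by-side boxplots.

```{r}
library(ggplot2)  # provides the mpg dataset

f <- hwy ~ drv  # a formula is an ordinary R object
class(f)

# The same formula and data arguments work in other base R functions,
# here giving the median of hwy within each level of drv
aggregate(f, data = mpg, FUN = median)

# Without a formula, the groups have to be split by hand
boxplot(split(mpg$hwy, mpg$drv))
```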
```{r}
boxplot(hwy ~ drv, data = mpg,
        xlab   = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
        ylab   = "Miles Per Gallon (Highway)",
        main   = "MPG (Highway) vs Drivetrain",
        pch    = 20,
        cex    = 2,
        col    = "darkorange",
        border = "dodgerblue")
```
Again, `boxplot()` has a number of additional arguments that can make our plot more visually appealing.
### Scatterplots
Lastly, to visualize the relationship between two numeric variables we will use a **scatterplot**. This can be done with the `plot()` function and the `~` syntax we just used with a boxplot. (The function `plot()` can also be used more generally; see the documentation for details.)
```{r}
plot(hwy ~ displ, data = mpg)
```
```{r}
plot(hwy ~ displ, data = mpg,
     xlab = "Engine Displacement (in Liters)",
     ylab = "Miles Per Gallon (Highway)",
     main = "MPG (Highway) vs Engine Displacement",
     pch  = 20,
     cex  = 2,
     col  = "dodgerblue")
```
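One detail worth checking is the order of the variables. In the formula `hwy ~ displ`, the variable on the left goes on the vertical axis, whereas the non-formula form of `plot()` takes the horizontal variable first. The sketch below illustrates both; `mf` is an illustrative name.

```{r}
library(ggplot2)  # provides the mpg dataset

# In a formula the response (vertical axis) comes first
mf <- model.frame(hwy ~ displ, data = mpg)
names(mf)

# With plain vectors the horizontal variable comes first instead
plot(mpg$displ, mpg$hwy,
     xlab = "Engine Displacement (in Liters)",
     ylab = "Miles Per Gallon (Highway)")
```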