---
title: "R Notebook"
output: html_notebook
---
# Chapter 12 Robust summaries
```{r}
library(tidyverse)
library(dslabs)
```
# 12.1 Outliers
Boxplots show outliers, but we have not yet provided a precise definition.
Outliers are very common in data science. Data recording can be complex and it is common to observe data points generated in error.
How do we distinguish an outlier from measurements that were too big or too small simply due to expected variability?
Suppose a colleague is charged with collecting demography data for a group of males. The data report height in feet and are stored in the object:
```{r}
library(dslabs)
data(outlier_example)
str(outlier_example)
```
Our colleague uses the fact that heights are usually well approximated by a normal distribution and summarizes the data with average and standard deviation:
```{r}
mean(outlier_example)
sd(outlier_example)
```
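The interval implied by adding and subtracting two standard deviations can be computed directly (a quick check):
```{r}
# mean +/- 2 SD: the implied 95% interval for heights in feet
mean(outlier_example) + c(-2, 2) * sd(outlier_example)
```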
and writes a report on the interesting fact that this group of males is much taller than usual. The average height is over six feet tall! Using your data science skills, however, you notice something else that is unexpected: the standard deviation is over 7 feet. Adding and subtracting two standard deviations, you note that 95% of this population will have heights between -9.489 and 21.697 feet, which does not make sense. A quick plot reveals the problem:
```{r}
qplot(outlier_example)
```
There appears to be at least one value that is nonsensical, since we know that a height of 180 feet is impossible. The boxplot detects this point as an outlier:
```{r}
boxplot(outlier_example)
```
# 12.2 Median
When we have an outlier like this, the average can become very large. Mathematically, we can make the average as large as we want by simply changing one number: with 500 data points, we can increase the average by any amount Δ by adding Δ×500 to a single number. The median, defined as the value for which half the values are smaller and the other half are bigger, is robust to such outliers. No matter how large we make the largest point, the median remains the same.
With this data the median is:
```{r}
median(outlier_example)
```
which is about 5 feet and 9 inches.
The median is what boxplots display as a horizontal line.
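A quick simulated illustration of this robustness (not from the book): corrupting a single entry of 500 shifts the mean by a chosen amount but leaves the median essentially unchanged.
```{r}
# Hypothetical example: add 500*Delta to one entry to raise the mean by Delta
set.seed(1)
y <- rnorm(500, mean = 69, sd = 3)
y_bad <- y
y_bad[1] <- y_bad[1] + 500 * 10
mean(y_bad) - mean(y)      # close to 10
median(y_bad) - median(y)  # essentially 0
```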
# 12.3 The interquartile range (IQR)
The box in a boxplot is defined by the first and third quartiles. These are meant to provide an idea of the variability in the data: 50% of the data is within this range. __The difference between the 3rd and 1st quartile (or 75th and 25th percentiles) is referred to as the interquartile range (IQR).__ We can do some math to see that for normally distributed data, IQR / 1.349 approximates the standard deviation the data would have had, had an outlier not been present. We can see that this works well in our example since we get a standard deviation estimate of:
```{r}
IQR(outlier_example) / 1.349
```
which is about 3 inches.
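We can convince ourselves of the 1.349 rule with simulated normal data (an illustrative check, not from the book):
```{r}
# For normal data, IQR/1.349 should be close to the true SD
set.seed(1)
z <- rnorm(100000, mean = 0, sd = 2)
IQR(z) / 1.349
sd(z)
```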
# 12.4 Tukey's definition of an outlier
In R, points falling outside the whiskers of the boxplot are referred to as outliers. This definition of outlier was introduced by Tukey. The top whisker ends at the 75th percentile plus 1.5×IQR. Similarly the bottom whisker ends at the 25th percentile minus 1.5×IQR. If we define the first and third quartiles as Q1 and Q3 respectively, then an outlier is anything outside the range:
$[Q_1 - 1.5 \times (Q_3 - Q_1), \; Q_3 + 1.5 \times (Q_3 - Q_1)]$
When the data is normally distributed, the standard units of these values are:
```{r}
q3<-qnorm(0.75)
q1<-qnorm(0.25)
iqr<-q3-q1
r<-c(q1-1.5*iqr, q3+1.5*iqr)
r
```
Using the `pnorm` function, we see that 99.3% of the data falls in this interval.
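The claim can be verified with the values just computed:
```{r}
# Proportion of normally distributed data inside the Tukey whiskers
pnorm(r[2]) - pnorm(r[1])
```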
Keep in mind that this is not such an extreme event: if we have 1000 data points that are normally distributed, we expect to see about 7 outside of this range. But these would not be outliers since we expect to see them under the typical variation.
If we want an outlier to be rarer, we can increase the 1.5 to a larger number. Tukey also used 3 and called these *far out* outliers. With a normal distribution, essentially 100% of the data falls in this interval: this translates into about a 2 in a million chance of being outside the range. In the `geom_boxplot` function, the multiplier can be controlled with the `coef` argument, which defaults to 1.5.
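The same computation for the far out version (quantities recomputed here so the chunk is self-contained):
```{r}
# With the multiplier set to 3, almost all normal data falls inside
q3 <- qnorm(0.75)
q1 <- qnorm(0.25)
iqr <- q3 - q1
1 - (pnorm(q3 + 3*iqr) - pnorm(q1 - 3*iqr))  # about 2 in a million
```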
The 180 feet measurement is well beyond the range of the height data:
```{r}
max_height<-quantile(outlier_example, 0.75)+3*IQR(outlier_example)
max_height
```
If we take this value out, we can see that the data is in fact normally distributed as expected:
```{r}
x<-outlier_example[outlier_example < max_height]
qqnorm(x)
qqline(x)
```
# 12.5 Median absolute deviation
Another way to robustly estimate the standard deviation in the presence of outliers is to use the median absolute deviation (MAD). To compute the MAD, we first compute the median, and then for each value we compute the distance between that value and the median. The MAD is defined as the median of these distances. For technical reasons not discussed here, this quantity needs to be multiplied by 1.4826 to ensure it approximates the actual standard deviation. The `mad` function already incorporates this correction. For the height data, we get a MAD of:
```{r}
mad(outlier_example)
```
which is about 3 inches.
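The `mad` shortcut matches the definition given above; computing it by hand as an illustrative check:
```{r}
# Median of distances to the median, times the 1.4826 correction
median(abs(outlier_example - median(outlier_example))) * 1.4826
mad(outlier_example)
```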
# 12.6 Exercises
We are going to use the HistData package.
Load the height data set and create a vector `x` with just the male heights used in Galton’s data on the heights of parents and their children from his historic research on heredity.
```{r}
library(HistData)
data(Galton)
x<-Galton$child
```
1. Compute the average and median of these data.
```{r}
mean(x)
median(x)
```
2. Compute the median and median absolute deviation of these data.
```{r}
median(x)
mad(x)
```
3. Now suppose Galton made a mistake when entering the first value and forgot to use the decimal point. You can imitate this error by typing:
```{r}
x_with_error <- x
x_with_error[1]<-x_with_error[1]*10
```
How many inches does the average grow after this mistake?
```{r}
mean(x_with_error)-mean(x)
```
4. How many inches does the SD grow after this mistake?
```{r}
sd(x_with_error)-sd(x)
```
5. How many inches does the median grow after this mistake?
```{r}
median(x_with_error)-median(x)
```
6. How many inches does the MAD grow after this mistake?
```{r}
mad(x_with_error)-mad(x)
```
7. How could you use exploratory data analysis to detect that an error was made?
A. Since it is only one value out of many, we will not be able to detect this.
B. We would see an obvious shift in the distribution.
__C. A boxplot, histogram, or qq-plot would reveal a clear outlier.__
D. A scatter plot would show high levels of measurement error.
8. How much can the average accidentally grow with mistakes like this? Write a function called `error_avg` that takes a value `k` and returns the average of the vector `x` after the first entry changed to `k`. Show the results for `k=10000` and `k=-10000`.
```{r}
error_avg<-function(k){
x_with_error <- x
x_with_error[1]<-k
mean(x_with_error)
}
```
```{r}
error_avg(10000)
error_avg(-10000)
```
# 12.7 Case study: self-reported student heights
The heights we have been looking at are not the original heights reported by students. The original reported heights are also included in the dslabs package and can be loaded like this:
```{r}
library(dslabs)
data("reported_heights")
head(reported_heights)
```
`height` is a character vector, so we create a new column with the numeric version:
```{r}
reported_heights <- reported_heights %>%
mutate(original_heights = height, height = as.numeric(height))
```
Note that we get a warning about NAs. This is because some of the self-reported heights were not numbers. We can see why we get these:
```{r}
reported_heights %>%
filter(is.na(height)) %>%
head()
```
Some students self-reported their heights using feet and inches rather than just inches. Others used centimeters and others were just trolling. For now we will remove these entries:
```{r}
reported_heights<-filter(reported_heights, !is.na(height))
```
If we compute the average and standard deviation, we notice that we obtain strange results. The average and standard deviation are different from the median and MAD:
```{r}
reported_heights %>%
group_by(sex) %>%
summarise(average=mean(height), sd=sd(height), median= median(height), MAD=mad(height))
```
This suggests that we have outliers, which is confirmed by creating a boxplot:
```{r}
reported_heights %>%
ggplot(aes(sex, height))+
geom_boxplot()
```
We can see some rather extreme values. To see what these values are, we can quickly look at the largest values using the `arrange` function:
```{r}
reported_heights %>%
arrange(desc(height)) %>%
top_n(10, height)
```
The first seven entries look like strange errors. However, the next few look like they were entered as centimeters instead of inches. Since 184 cm is equivalent to about six feet, we suspect that 184 was actually meant to be 72 inches.
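The centimeter-to-inch arithmetic behind this guess:
```{r}
# 184 cm converted to inches (2.54 cm per inch)
184 / 2.54
```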
We can review all the nonsensical answers by looking at the data considered to be far out by Tukey:
```{r}
max_height<-quantile(reported_heights$height, .75) + 3*IQR(reported_heights$height)
min_height <- quantile(reported_heights$height, .25) - 3*IQR(reported_heights$height)
c(min_height, max_height)
```
```{r}
reported_heights %>%
filter(!between(height, min_height, max_height)) %>%
select(original_heights) %>%
head(n=10)
```
Examining these heights carefully, we see two common mistakes: entries in centimeters, which turn out to be too large, and entries of the form `x.y` with `x` and `y` representing feet and inches respectively, which turn out to be too small. Some of the even smaller values, such as 1.6, could be entries in meters.
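A minimal sketch of how an `x.y` entry could be converted (a hypothetical helper, not the book's wrangling code; the real techniques come later):
```{r}
# Convert a "feet.inches" string such as "5.11" (5 feet 11 inches) to inches
convert_ft_in <- function(s){
  parts <- as.numeric(strsplit(s, ".", fixed = TRUE)[[1]])
  parts[1]*12 + parts[2]
}
convert_ft_in("5.11") # 71
```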
In the Data Wrangling part of this book we will learn techniques for correcting these values and converting them into inches. Here we were able to detect this problem using careful data exploration to uncover issues with the data: the first step in the great majority of data science projects.