# Statistical Learning
## Conceptual
### Question 1
> For each of parts (a) through (d), indicate whether we would generally expect
> the performance of a flexible statistical learning method to be better or
> worse than an inflexible method. Justify your answer.
>
> a. The sample size $n$ is extremely large, and the number of predictors $p$ is
> small.
Flexible is best: with a very large $n$ and small $p$, a flexible method can
capture subtle patterns without overfitting (the opposite of scenario b).
> b. The number of predictors $p$ is extremely large, and the number of
> observations $n$ is small.
Inflexible is best: with many predictors and few observations, a flexible
method is likely to overfit, chasing predictors that are only randomly
associated with the response.
> c. The relationship between the predictors and response is highly
> non-linear.
Flexible is best: an inflexible method cannot capture the non-linearity and
therefore suffers high bias.
> d. The variance of the error terms, i.e. $\sigma^2 = Var(\epsilon)$, is
> extremely high.
Inflexible is best: with very noisy data, a flexible method would fit the
noise, increasing variance (the opposite of scenario c).
### Question 2
> Explain whether each scenario is a classification or regression problem, and
> indicate whether we are most interested in inference or prediction. Finally,
> provide $n$ and $p$.
>
> a. We collect a set of data on the top 500 firms in the US. For each firm
> we record profit, number of employees, industry and the CEO salary. We are
> interested in understanding which factors affect CEO salary.
$n=500$, $p=3$, regression, inference.
> b. We are considering launching a new product and wish to know whether
> it will be a success or a failure. We collect data on 20 similar products
> that were previously launched. For each product we have recorded whether it
> was a success or failure, price charged for the product, marketing budget,
> competition price, and ten other variables.
$n=20$, $p=13$, classification, prediction.
> c. We are interested in predicting the % change in the USD/Euro exchange
> rate in relation to the weekly changes in the world stock markets. Hence we
> collect weekly data for all of 2012. For each week we record the % change
> in the USD/Euro, the % change in the US market, the % change in the British
> market, and the % change in the German market.
$n=52$, $p=3$, regression, prediction.
### Question 3
> We now revisit the bias-variance decomposition.
>
> a. Provide a sketch of typical (squared) bias, variance, training error,
> test error, and Bayes (or irreducible) error curves, on a single plot, as
> we go from less flexible statistical learning methods towards more flexible
> approaches. The x-axis should represent the amount of flexibility in the
> method, and the y-axis should represent the values for each curve. There
> should be five curves. Make sure to label each one.
>
> b. Explain why each of the five curves has the shape displayed in
> part (a).
* (squared) bias: Decreases with increasing flexibility (Generally, more
flexible methods result in less bias).
* variance: Increases with increasing flexibility (In general, more flexible
statistical methods have higher variance).
* training error: Decreases with model flexibility (More complex models will
better fit the training data).
* test error: Decreases initially, while the drop in bias outweighs the rise
  in variance, then increases once variance dominates (overfitting).
* Bayes (irreducible) error: Fixed, since it does not depend on the model.
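
As a sketch for part (a), the following code draws stylized versions of the
five curves (the shapes are illustrative, not computed from any data):

```{r}
# Stylized sketch: how the five quantities typically behave as flexibility
# increases. The functional forms below are chosen purely for illustration.
flexibility <- seq(0, 1, length.out = 100)
bias_sq  <- (1 - flexibility)^2        # squared bias falls with flexibility
variance <- flexibility^2              # variance rises with flexibility
bayes    <- rep(0.2, 100)              # irreducible error is constant
test     <- bias_sq + variance + bayes # expected test error is their sum
train    <- 0.8 * bias_sq              # training error keeps decreasing
plot(flexibility, test, type = "l", col = "red", ylim = c(0, 1.5),
  xlab = "Flexibility", ylab = "Error")
lines(flexibility, bias_sq, col = "blue")
lines(flexibility, variance, col = "orange")
lines(flexibility, train, col = "green")
lines(flexibility, bayes, col = "grey", lty = 2)
legend("top",
  legend = c("test error", "(squared) bias", "variance",
             "training error", "Bayes error"),
  col = c("red", "blue", "orange", "green", "grey"),
  lty = c(1, 1, 1, 1, 2), cex = 0.8
)
```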
### Question 4
> You will now think of some real-life applications for statistical learning.
>
> a. Describe three real-life applications in which classification might
> be useful. Describe the response, as well as the predictors. Is the goal of
> each application inference or prediction? Explain your answer.
* Was the coffee machine cleaned? Response: cleaned yes/no; predictors: day of
  week, person assigned. Goal: inference (which factors explain whether it
  gets cleaned).
* Is a flight delayed? Response: delayed/on time; predictors: airline,
  airport, etc. Goal: inference.
* Beer type. Response: type (IPA, pilsner, etc.); predictors: measurable
  characteristics of the beer. Goal: prediction.
> b. Describe three real-life applications in which regression might be
> useful. Describe the response, as well as the predictors. Is the goal of
> each application inference or prediction? Explain your answer.
* Amount of bonus paid. Response: bonus amount; predictors: profitability,
  client feedback. Goal: prediction.
* A person's height. Response: height; predictors: e.g. age, sex, parents'
  heights. Goal: prediction.
* House price. Response: sale price; predictors: e.g. location, size,
  condition. Goal: inference (which factors drive prices).
> c. Describe three real-life applications in which cluster analysis might be
> useful.
* RNAseq tumour gene expression data.
* SNPs in human populations.
* Frequencies of mutations (with base pair context) in somatic mutation data.
### Question 5
> What are the advantages and disadvantages of a very flexible (versus a less
> flexible) approach for regression or classification? Under what circumstances
> might a more flexible approach be preferred to a less flexible approach? When
> might a less flexible approach be preferred?
An inflexible approach is more interpretable and requires fewer observations,
but it can be badly biased if its assumed form is far from the truth. A
flexible approach can capture non-linear relationships, but it risks
overfitting (high variance) and is harder to interpret. A more flexible
approach is preferred when $n$ is large or the underlying relationship is
non-linear; a less flexible approach is preferred when $n$ is small or when
interpretable inference is the goal.
### Question 6
> Describe the differences between a parametric and a non-parametric statistical
> learning approach. What are the advantages of a parametric approach to
> regression or classification (as opposed to a non-parametric approach)? What
> are its disadvantages?
A parametric approach assumes a functional form for $f$ and reduces the
problem to estimating a fixed set of parameters. Its advantages are
interpretability and the need for relatively few observations. Its
disadvantage is that the assumed model may be far from the truth, in which
case estimates will be poor; choosing a more flexible parametric form
mitigates this, but at the cost of more parameters and a greater risk of
overfitting. A non-parametric approach makes no explicit assumption about the
form of $f$, so it can fit a much wider range of shapes, but because it does
not reduce the problem to a small number of parameters, it typically needs
many more observations to obtain accurate estimates.
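
To make this concrete, here is a minimal sketch on simulated data (the
data-generating process is made up for the example): a parametric linear fit
misses the non-linear signal that a non-parametric smoother recovers.

```{r}
# Simulated example: the true relationship is non-linear.
set.seed(42)
x_sim <- runif(200, 0, 10)
y_sim <- sin(x_sim) + rnorm(200, sd = 0.3)
fit_param <- lm(y_sim ~ x_sim)       # parametric: assumes a linear form
fit_nonparam <- loess(y_sim ~ x_sim) # non-parametric: no assumed form
ord <- order(x_sim)
plot(x_sim, y_sim, cex = 0.5, xlab = "x", ylab = "y")
lines(x_sim[ord], fitted(fit_param)[ord], col = "red", lwd = 2)
lines(x_sim[ord], fitted(fit_nonparam)[ord], col = "blue", lwd = 2)
legend("topright",
  legend = c("linear model (parametric)", "loess (non-parametric)"),
  col = c("red", "blue"), lwd = 2, cex = 0.8
)
```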
### Question 7
> The table below provides a training data set containing six observations,
> three predictors, and one qualitative response variable.
>
> | Obs. | $X_1$ | $X_2$ | $X_3$ | $Y$ |
> |------|-------|-------|-------|-------|
> | 1 | 0 | 3 | 0 | Red |
> | 2 | 2 | 0 | 0 | Red |
> | 3 | 0 | 1 | 3 | Red |
> | 4 | 0 | 1 | 2 | Green |
> | 5 | -1 | 0 | 1 | Green |
> | 6 | 1 | 1 | 1 | Red |
>
> Suppose we wish to use this data set to make a prediction for $Y$ when
> $X_1 = X_2 = X_3 = 0$ using $K$-nearest neighbors.
>
> a. Compute the Euclidean distance between each observation and the test
> point, $X_1 = X_2 = X_3 = 0$.
```{r}
dat <- data.frame(
"x1" = c(0, 2, 0, 0, -1, 1),
"x2" = c(3, 0, 1, 1, 0, 1),
"x3" = c(0, 0, 3, 2, 1, 1),
"y" = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Euclidean distance between points and c(0, 0, 0)
dist <- sqrt(dat[["x1"]]^2 + dat[["x2"]]^2 + dat[["x3"]]^2)
signif(dist, 3)
```
> b. What is our prediction with $K = 1$? Why?
```{r}
# Predict the class of the test point c(0, 0, 0) by majority vote among the
# k nearest observations (ties are broken by which.max, which picks the
# first level in table order).
knn <- function(k) {
  names(which.max(table(dat[["y"]][order(dist)[1:k]])))
}
knn(1)
```
Green (based on data point 5 only)
> c. What is our prediction with $K = 3$? Why?
```{r}
knn(3)
```
Red (based on data points 2, 5, 6)
> d. If the Bayes decision boundary in this problem is highly non-linear, then
> would we expect the best value for $K$ to be large or small? Why?
Small. A small $K$ gives a flexible, highly non-linear decision boundary,
whereas a large $K$ averages over many neighbours, smoothing the boundary
towards something close to linear.
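
As a quick check with the `knn()` helper defined above, we can list the
prediction for every possible $K$:

```{r}
# Majority-vote prediction for each possible value of K.
sapply(1:6, knn)
```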
## Applied
### Question 8
> This exercise relates to the `College` data set, which can be found in
> the file `College.csv`. It contains a number of variables for 777 different
> universities and colleges in the US. The variables are
>
> * `Private` : Public/private indicator
> * `Apps` : Number of applications received
> * `Accept` : Number of applicants accepted
> * `Enroll` : Number of new students enrolled
> * `Top10perc` : New students from top 10% of high school class
> * `Top25perc` : New students from top 25% of high school class
> * `F.Undergrad` : Number of full-time undergraduates
> * `P.Undergrad` : Number of part-time undergraduates
> * `Outstate` : Out-of-state tuition
> * `Room.Board` : Room and board costs
> * `Books` : Estimated book costs
> * `Personal` : Estimated personal spending
> * `PhD` : Percent of faculty with Ph.D.'s
> * `Terminal` : Percent of faculty with terminal degree
> * `S.F.Ratio` : Student/faculty ratio
> * `perc.alumni` : Percent of alumni who donate
> * `Expend` : Instructional expenditure per student
> * `Grad.Rate` : Graduation rate
>
> Before reading the data into `R`, it can be viewed in Excel or a text
> editor.
>
> a. Use the `read.csv()` function to read the data into `R`. Call the loaded
> data `college`. Make sure that you have the directory set to the correct
> location for the data.
```{r}
college <- read.csv("data/College.csv")
```
> b. Look at the data using the `View()` function. You should notice that the
> first column is just the name of each university. We don't really want `R`
> to treat this as data. However, it may be handy to have these names for
> later. Try the following commands:
>
> ```r
> rownames(college) <- college[, 1]
> View(college)
> ```
>
> You should see that there is now a `row.names` column with the name of
> each university recorded. This means that R has given each row a name
> corresponding to the appropriate university. `R` will not try to perform
> calculations on the row names. However, we still need to eliminate the
> first column in the data where the names are stored. Try
>
> ```r
> college <- college [, -1]
> View(college)
> ```
>
> Now you should see that the first data column is `Private`. Note that
> another column labeled `row.names` now appears before the `Private` column.
> However, this is not a data column but rather the name that R is giving to
> each row.
```{r}
rownames(college) <- college[, 1]
college <- college[, -1]
```
> c.
> i. Use the `summary()` function to produce a numerical summary of the
> variables in the data set.
> ii. Use the `pairs()` function to produce a scatterplot matrix of the
> first ten columns or variables of the data. Recall that you can
> reference the first ten columns of a matrix A using `A[,1:10]`.
> iii. Use the `plot()` function to produce side-by-side boxplots of
> `Outstate` versus `Private`.
> iv. Create a new qualitative variable, called `Elite`, by _binning_ the
> `Top10perc` variable. We are going to divide universities into two
> groups based on whether or not the proportion of students coming from
> the top 10% of their high school classes exceeds 50%.
>
> ```r
> > Elite <- rep("No", nrow(college))
> > Elite[college$Top10perc > 50] <- "Yes"
> > Elite <- as.factor(Elite)
> > college <- data.frame(college, Elite)
> ```
>
> Use the `summary()` function to see how many elite universities there
> are. Now use the `plot()` function to produce side-by-side boxplots of
> `Outstate` versus `Elite`.
> v. Use the `hist()` function to produce some histograms with differing
> numbers of bins for a few of the quantitative variables. You may find
> the command `par(mfrow=c(2,2))` useful: it will divide the print
> window into four regions so that four plots can be made
> simultaneously. Modifying the arguments to this function will divide
> the screen in other ways.
> vi. Continue exploring the data, and provide a brief summary of what you
> discover.
```{r}
summary(college)
college$Private <- college$Private == "Yes"
pairs(college[, 1:10], cex = 0.2)
plot(college$Outstate ~ factor(college$Private), xlab = "Private", ylab = "Outstate")
college$Elite <- factor(ifelse(college$Top10perc > 50, "Yes", "No"))
summary(college$Elite)
plot(college$Outstate ~ college$Elite, xlab = "Elite", ylab = "Outstate")
par(mfrow = c(2, 2))
for (n in c(5, 10, 20, 50)) {
hist(college$Enroll, breaks = n, main = paste("n =", n), xlab = "Enroll")
}
chisq.test(college$Private, college$Elite)
```
The chi-squared test indicates that `Private` and `Elite` status are
associated: whether a college is private and whether it is elite are not
independent.
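
The underlying contingency table (using the columns as prepared above) shows
how elite status splits across private (`TRUE`) and public (`FALSE`) colleges:

```{r}
# Counts of colleges by private status and elite status.
table(college$Private, college$Elite)
```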
### Question 9
> This exercise involves the Auto data set studied in the lab. Make sure
> that the missing values have been removed from the data.
```{r}
x <- read.table("data/Auto.data", header = TRUE, na.strings = "?")
x <- na.omit(x)
```
> a. Which of the predictors are quantitative, and which are qualitative?
```{r}
sapply(x, class)
# Treat the double-typed columns as quantitative; the integer-coded columns
# (cylinders, year, origin) and the character column (name) are arguably
# qualitative.
numeric <- which(sapply(x, class) == "numeric")
names(numeric)
```
> b. What is the range of each quantitative predictor? You can answer this using
> the `range()` function.
```{r}
# Min and max of each quantitative column, and the width of that range.
sapply(x[, numeric], range)
sapply(x[, numeric], function(i) diff(range(i)))
```
> c. What is the mean and standard deviation of each quantitative predictor?
```{r}
library(tidyverse)
library(knitr)
x[, numeric] |>
pivot_longer(everything()) |>
group_by(name) |>
summarise(
Mean = mean(value),
SD = sd(value)
) |>
kable()
```
> d. Now remove the 10th through 85th observations. What is the range, mean, and
> standard deviation of each predictor in the subset of the data that
> remains?
```{r}
x[-(10:85), numeric] |>
pivot_longer(everything()) |>
group_by(name) |>
summarise(
Range = diff(range(value)),
Mean = mean(value),
SD = sd(value)
) |>
kable()
```
> e. Using the full data set, investigate the predictors graphically, using
> scatterplots or other tools of your choice. Create some plots highlighting
> the relationships among the predictors. Comment on your findings.
```{r}
pairs(x[, numeric], cex = 0.2)
cor(x[, numeric]) |>
kable()
heatmap(cor(x[, numeric]), cexRow = 1.1, cexCol = 1.1, margins = c(8, 8))
```
Many of the variables appear to be highly (positively or negatively) correlated
with some relationships being non-linear.
> f. Suppose that we wish to predict gas mileage (`mpg`) on the basis of the
> other variables. Do your plots suggest that any of the other variables
> might be useful in predicting `mpg`? Justify your answer.
Yes: several variables are strongly correlated with `mpg`, so they are likely
to be useful predictors. However, `horsepower`, `weight` and `displacement`
are highly correlated with one another, so they carry overlapping information.
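
For example, we can rank the quantitative variables by their correlation with
`mpg`:

```{r}
# Correlation of each quantitative variable with mpg.
sort(cor(x[, numeric])[, "mpg"])
```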
### Question 10
> This exercise involves the `Boston` housing data set.
>
> a. To begin, load in the `Boston` data set. The `Boston` data set is part of
> the `ISLR2` library in R.
> ```r
> > library(ISLR2)
> ```
> Now the data set is contained in the object `Boston`.
> ```r
> > Boston
> ```
> Read about the data set:
> ```r
> > ?Boston
> ```
> How many rows are in this data set? How many columns? What do the rows and
> columns represent?
```{r}
library(ISLR2)
dim(Boston)
```

There are 506 rows and 13 columns. Each row represents a census tract of
Boston, and each column a recorded characteristic of that tract (e.g. crime
rate, average number of rooms, median home value).
> b. Make some pairwise scatterplots of the predictors (columns) in this data
> set. Describe your findings.
```{r, message = FALSE, warning = FALSE}
# tidyverse loads ggplot2 along with the other packages used below
library(tidyverse)
```
```{r}
ggplot(Boston, aes(nox, rm)) +
geom_point()
ggplot(Boston, aes(ptratio, rm)) +
geom_point()
heatmap(cor(Boston, method = "spearman"), cexRow = 1.1, cexCol = 1.1)
```
> c. Are any of the predictors associated with per capita crime rate? If so,
> explain the relationship.
Yes. For example, `crim` is positively correlated with `rad` (accessibility
to radial highways) and `tax` (property-tax rate), and negatively correlated
with `medv` (median home value) and `dis` (distance to employment centres).
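
We can rank the predictors by their correlation with `crim`:

```{r}
# Correlation of each column with per-capita crime rate, strongest first.
sort(cor(Boston)[, "crim"], decreasing = TRUE)
```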
> d. Do any of the census tracts of Boston appear to have particularly high
> crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each
> predictor.
```{r}
Boston |>
pivot_longer(cols = 1:13) |>
filter(name %in% c("crim", "tax", "ptratio")) |>
ggplot(aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~name, scales = "free", ncol = 1)
```
Yes, particularly crime and tax rates: crime is heavily right-skewed (most
tracts are near zero, but a few are very high), while tax rates show a
distinct group of tracts with very high values.
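
The ranges confirm this:

```{r}
# Min and max of the three predictors discussed above.
sapply(Boston[, c("crim", "tax", "ptratio")], range)
```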
> e. How many of the census tracts in this data set bound the Charles river?
```{r}
sum(Boston$chas)
```
> f. What is the median pupil-teacher ratio among the towns in this data set?
```{r}
median(Boston$ptratio)
```
> g. Which census tract of Boston has lowest median value of owner-occupied
> homes? What are the values of the other predictors for that census tract,
> and how do those values compare to the overall ranges for those predictors?
> Comment on your findings.
```{r}
Boston[Boston$medv == min(Boston$medv), ] |>
kable()
sapply(Boston, quantile) |>
kable()
```
> h. In this data set, how many of the census tracts average more than seven
> rooms per dwelling? More than eight rooms per dwelling? Comment on the
> census tracts that average more than eight rooms per dwelling.
```{r}
sum(Boston$rm > 7)
sum(Boston$rm > 8)
```
Let's compare median statistics for those census tracts with more than eight
rooms per dwelling on average, with the statistics for those with fewer.
```{r}
Boston |>
mutate(
`log(crim)` = log(crim),
`log(zn)` = log(zn)
) |>
select(-c(crim, zn)) |>
pivot_longer(!rm) |>
mutate(">8 rooms" = rm > 8) |>
ggplot(aes(`>8 rooms`, value)) +
geom_boxplot() +
facet_wrap(~name, scales = "free")
```
Census tracts with large average dwellings (more than eight rooms) have a
higher median home value (`medv`), a lower proportion of non-retail business
acres (`indus`), a lower pupil-teacher ratio (`ptratio`), and a lower
percentage of lower-status population (`lstat`), among other differences.