# Linear Model Selection and Regularization
## Conceptual
### Question 1
> We perform best subset, forward stepwise, and backward stepwise selection on
> a single data set. For each approach, we obtain $p + 1$ models, containing
> $0, 1, 2, ..., p$ predictors. Explain your answers:
>
> a. Which of the three models with $k$ predictors has the smallest *training*
> RSS?
Best subset selection considers the most models: for each size $k$ it evaluates
every possible combination of the $p$ predictors, including every model that
forward or backward stepwise selection could reach. Its $k$-predictor model
therefore has the smallest (or joint-smallest) training RSS. In practice, all
three approaches often select similar, and sometimes identical, models; a small
demonstration follows below.
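Below is a minimal sketch on simulated data (the data and variable names are
ours, not part of the question) comparing the training RSS of the $k$-variable
models chosen by each approach with `leaps::regsubsets()`:

```{r, message = FALSE}
# Simulated comparison (not part of the question): training RSS of the
# k-variable models chosen by best subset, forward and backward selection.
library(leaps)
set.seed(1)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + 0.5 * x[, 3] + rnorm(n)
dat <- data.frame(x, y)
rss <- sapply(c("exhaustive", "forward", "backward"), function(m) {
  summary(regsubsets(y ~ ., data = dat, nvmax = p, method = m))$rss
})
# Best subset is never worse than the stepwise methods at any size k
matplot(rss, type = "l", lty = 1, col = 1:3, xlab = "k", ylab = "Training RSS")
legend("topright", colnames(rss), col = 1:3, lty = 1)
```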
> b. Which of the three models with $k$ predictors has the smallest *test* RSS?
We cannot tell which model will have the smallest test RSS. Although best
subset selection achieves the lowest training RSS, any of the three models
could perform best on test data; the answer depends on the balance between
fitting the training data and overfitting.
> c. True or False:
> i. The predictors in the $k$-variable model identified by forward stepwise
> are a subset of the predictors in the ($k+1$)-variable model identified
> by forward stepwise selection.
True. Forward stepwise selection retains all features identified in previous
models as $k$ is increased.
> ii. The predictors in the $k$-variable model identified by backward stepwise
> are a subset of the predictors in the $(k+1)$-variable model identified
> by backward stepwise selection.
True. Backward stepwise selection removes features one by one as $k$ is
decreased.
> iii. The predictors in the $k$-variable model identified by backward
> stepwise are a subset of the predictors in the $(k+1)$-variable model
> identified by forward stepwise selection.
False. Forward and backward stepwise selection can identify different
combinations of variables due to differing algorithms.
> iv. The predictors in the $k$-variable model identified by forward stepwise
> are a subset of the predictors in the $(k+1)$-variable model identified
> by backward stepwise selection.
False. Forward and backward stepwise selection can identify different
combinations of variables due to differing algorithms.
> v. The predictors in the $k$-variable model identified by best subset are a
> subset of the predictors in the $(k+1)$-variable model identified by best
> subset selection.
False. Best subset selection can identify different combinations of variables
for each $k$ by considering all possible models.
### Question 2
> For parts (a) through (c), indicate which of i. through iv. is correct.
> Justify your answer.
>
> a. The lasso, relative to least squares, is:
> i. More flexible and hence will give improved prediction accuracy when its
> increase in bias is less than its decrease in variance.
> ii. More flexible and hence will give improved prediction accuracy when its
> increase in variance is less than its decrease in bias.
> iii. Less flexible and hence will give improved prediction accuracy when its
> increase in bias is less than its decrease in variance.
> iv. Less flexible and hence will give improved prediction accuracy when
> its increase in variance is less than its decrease in bias.
iii. By shrinking coefficients (setting some exactly to zero), the lasso is
less flexible than least squares. This introduces some bias, since the fitted
model only approximates the true relationship, but we hope that this increase
in bias is small and is outweighed by a large reduction in variance (i.e. how
much the fit would change across different training sets).
> b. Repeat (a) for ridge regression relative to least squares.
iii. The same is true of ridge regression---shrinkage results in a less
flexible model and can reduce variance.
> c. Repeat (a) for non-linear methods relative to least squares.
ii. Non-linear methods are more flexible than least squares. They will give
improved prediction accuracy when the resulting increase in variance is smaller
than the decrease in bias.
### Question 3
> Suppose we estimate the regression coefficients in a linear regression model
> by minimizing:
>
> $$
> \sum_{i=1}^n\left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2
> \textrm{subject to} \sum_{j=1}^p|\beta_j| \le s
> $$
>
> for a particular value of $s$. For parts (a) through (e), indicate
> which of i. through v. is correct. Justify your answer.
>
> a. As we increase $s$ from 0, the training RSS will:
> i. Increase initially, and then eventually start decreasing in an
> inverted U shape.
> ii. Decrease initially, and then eventually start increasing in a U shape.
> iii. Steadily increase.
> iv. Steadily decrease.
> v. Remain constant.
iv. As $s$ increases, the constraint relaxes (the coefficients are allowed a
larger absolute sum), so the model becomes more flexible. Training RSS
therefore decreases steadily: every model available at a smaller $s$ remains
available, so the training fit can only improve.
> b. Repeat (a) for test RSS.
ii. With more flexible models, test RSS will decrease (as the fit improves) and
will then increase due to overfitting (high variance).
> c. Repeat (a) for variance.
iii. As $s$ increases, the model becomes more flexible so variance will
increase.
> d. Repeat (a) for (squared) bias.
iv. As $s$ increases, the model becomes more flexible so bias will decrease.
> e. Repeat (a) for the irreducible error.
v. The irreducible error is unchanged.
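To illustrate part (a) empirically, here is a small sketch on simulated data
(not part of the question; the variables are ours). We trace the lasso solution
path with `glmnet` and plot the training RSS against the $\ell_1$ norm of the
fitted coefficients, which plays the role of the budget $s$:

```{r, message = FALSE}
# Simulated sketch: as the budget s (the L1 norm of the coefficients) grows,
# the lasso fit becomes more flexible and the training RSS falls.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- as.numeric(x %*% rnorm(10) + rnorm(100))
fit <- glmnet(x, y, alpha = 1)             # lasso over a grid of lambda values
s <- colSums(abs(coef(fit)[-1, ]))         # L1 norm of coefficients at each lambda
rss <- colSums((y - predict(fit, x))^2)    # training RSS at each lambda
plot(s, rss, type = "l", xlab = "s (L1 norm of coefficients)", ylab = "Training RSS")
```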
### Question 4
> Suppose we estimate the regression coefficients in a linear regression model
> by minimizing
>
> $$
> \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2 +
> \lambda\sum_{j=1}^p\beta_j^2
> $$
>
> for a particular value of $\lambda$. For parts (a) through (e), indicate which
> of i. through v. is correct. Justify your answer.
>
> a. As we increase $\lambda$ from 0, the training RSS will:
> i. Increase initially, and then eventually start decreasing in an
> inverted U shape.
> ii. Decrease initially, and then eventually start increasing in a U shape.
> iii. Steadily increase.
> iv. Steadily decrease.
> v. Remain constant.
iii. As $\lambda$ is increased, more weight is placed on the sum of squared
coefficients and so the model becomes less flexible. As a result, training RSS
must increase.
> b. Repeat (a) for test RSS.
ii. As $\lambda$ increases, flexibility decreases so test RSS will decrease
(variance decreases) but will then increase (as bias increases).
> c. Repeat (a) for variance.
iv. Steadily decrease.
> d. Repeat (a) for (squared) bias.
iii. Steadily increase.
> e. Repeat (a) for the irreducible error.
v. The irreducible error is unchanged.
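Similarly, for part (a) here, a small simulated sketch (again our own data, not
part of the question) with ridge regression (`alpha = 0` in `glmnet`) shows the
training RSS rising steadily as $\lambda$ grows:

```{r, message = FALSE}
# Simulated sketch: with ridge regression, training RSS rises steadily as
# lambda increases and the fit becomes less flexible.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- as.numeric(x %*% rnorm(10) + rnorm(100))
fit <- glmnet(x, y, alpha = 0)             # ridge over a grid of lambda values
rss <- colSums((y - predict(fit, x))^2)    # training RSS at each lambda
plot(log(fit$lambda), rss, type = "l", xlab = "log(lambda)", ylab = "Training RSS")
```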
### Question 5
> It is well-known that ridge regression tends to give similar coefficient
> values to correlated variables, whereas the lasso may give quite different
> coefficient values to correlated variables. We will now explore this property
> in a very simple setting.
>
> Suppose that $n = 2, p = 2, x_{11} = x_{12}, x_{21} = x_{22}$. Furthermore,
> suppose that $y_1 + y_2 =0$ and $x_{11} + x_{21} = 0$ and
> $x_{12} + x_{22} = 0$, so that the estimate for the intercept in a least
> squares, ridge regression, or lasso model is zero: $\hat{\beta}_0 = 0$.
>
> a. Write out the ridge regression optimization problem in this setting.
We are trying to minimize:
$$
\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2 +
\lambda\sum_{j=1}^p\beta_j^2
$$
We can ignore $\beta_0$ and expand the sums since there are only two terms.
Additionally, we define $x_1 = x_{11} = x_{12}$ and
$x_2 = x_{21} = x_{22}$. We then need to minimize
\begin{align}
f = & (y_1 - \beta_1x_1 - \beta_2x_1)^2 +
      (y_2 - \beta_1x_2 - \beta_2x_2)^2 +
      \lambda\beta_1^2 + \lambda\beta_2^2 \\
  = & y_1^2 - 2y_1\beta_1x_1 - 2y_1\beta_2x_1 + \beta_1^2x_1^2 + 2\beta_1\beta_2x_1^2 + \beta_2^2x_1^2 + \\
    & y_2^2 - 2y_2\beta_1x_2 - 2y_2\beta_2x_2 + \beta_1^2x_2^2 + 2\beta_1\beta_2x_2^2 + \beta_2^2x_2^2 + \\
    & \lambda\beta_1^2 + \lambda\beta_2^2
\end{align}
> b. Argue that in this setting, the ridge coefficient estimates satisfy
> $\hat{\beta}_1 = \hat{\beta}_2$
We can find when the above is minimized with respect to each of $\beta_1$ and
$\beta_2$ by partial differentiation.
$$
\frac{\partial}{\partial{\beta_1}} =
- 2y_1x_1 + 2\beta_1x_1^2 + 2\beta_2x_1^2
- 2y_2x_2 + 2\beta_1x_2^2 + 2\beta_2x_2^2
+ 2\lambda\beta_1
$$
$$
\frac{\partial}{\partial{\beta_2}} =
- 2y_1x_1 + 2\beta_1x_1^2 + 2\beta_2x_1^2
- 2y_2x_2 + 2\beta_1x_2^2 + 2\beta_2x_2^2
+ 2\lambda\beta_2
$$
A minimum can be found when these are set to 0.
$$
\lambda\beta_1 = y_1x_1 + y_2x_2 - \beta_1x_1^2 - \beta_2x_1^2 - \beta_1x_2^2 - \beta_2x_2^2 \\
\lambda\beta_2 = y_1x_1 + y_2x_2 - \beta_1x_1^2 - \beta_2x_1^2 - \beta_1x_2^2 - \beta_2x_2^2
$$
Therefore $\lambda\beta_1 = \lambda\beta_2$, and since $\lambda > 0$ this
forces $\beta_1 = \beta_2$: the ridge solution assigns the two (identical)
predictors equal coefficients.
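As a quick numerical sanity check (a minimal sketch with arbitrary values of
our own choosing that satisfy the constraints in the question), minimizing the
ridge objective directly gives equal coefficients:

```{r}
# Minimal numerical check (values chosen so that x_11 = x_12, x_21 = x_22,
# y_1 + y_2 = 0 and x_11 + x_21 = 0): the minimizer has beta_1 = beta_2.
x1 <- 2; x2 <- -2
y1 <- 3; y2 <- -3
lambda <- 1
ridge <- function(b) {
  (y1 - b[1] * x1 - b[2] * x1)^2 + (y2 - b[1] * x2 - b[2] * x2)^2 +
    lambda * sum(b^2)
}
optim(c(0, 0), ridge)$par  # both coordinates agree up to numerical tolerance
```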
> c. Write out the lasso optimization problem in this setting.
We are trying to minimize:
$$
\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij}\right)^2 +
\lambda\sum_{j=1}^p |\beta_j|
$$
As above (and defining $x_1 = x_{11} = x_{12}$ and $x_2 = x_{21} = x_{22}$) we simplify to
$$
(y_1 - \beta_1x_1 - \beta_2x_1)^2 +
(y_2 - \beta_1x_2 - \beta_2x_2)^2 +
\lambda|\beta_1| + \lambda|\beta_2|
$$
> d. Argue that in this setting, the lasso coefficients $\hat{\beta}_1$ and
> $\hat{\beta}_2$ are not unique---in other words, there are many possible
> solutions to the optimization problem in (c). Describe these solutions.
We will consider the alternate form of the lasso optimization problem
$$
(y_1 - \hat{\beta_1}x_1 - \hat{\beta_2}x_1)^2 + (y_2 - \hat{\beta_1}x_2 - \hat{\beta_2}x_2)^2 \quad \text{subject to} \quad |\hat{\beta_1}| + |\hat{\beta_2}| \le s
$$
Since $x_1 + x_2 = 0$ and $y_1 + y_2 = 0$, this is equivalent to minimizing
$2(y_1 - (\hat{\beta_1} + \hat{\beta_2})x_1)^2$,
which is minimized (at zero) whenever $\hat{\beta_1} + \hat{\beta_2} = y_1/x_1$.
Geometrically, this is a $45^\circ$ backwards sloping line in the
($\hat{\beta_1}$, $\hat{\beta_2}$) plane.
The constraint $|\hat{\beta_1}| + |\hat{\beta_2}| \le s$ defines a diamond in
the same plane, whose edges also slope at $45^\circ$, centered at the origin
and intersecting the axes at a distance $s$ from the origin.
Thus, points along two edges of the diamond
($\hat{\beta_1} + \hat{\beta_2} = s$ and $\hat{\beta_1} + \hat{\beta_2} = -s$)
become solutions to the lasso optimization problem.
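A small numerical sketch of this (same arbitrary values as above, with
$s = y_1/x_1$): every point on the edge $\hat{\beta_1} + \hat{\beta_2} = s$
with non-negative coefficients attains the same RSS, so the constrained problem
has no unique solution.

```{r}
x1 <- 2; x2 <- -2
y1 <- 3; y2 <- -3
rss <- function(b1, b2) {
  (y1 - b1 * x1 - b2 * x1)^2 + (y2 - b1 * x2 - b2 * x2)^2
}
s <- y1 / x1
b1 <- seq(0, s, length.out = 5)
cbind(b1 = b1, b2 = s - b1, rss = rss(b1, s - b1))  # identical RSS along the edge
```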
### Question 6
> We will now explore (6.12) and (6.13) further.
>
> a. Consider (6.12) with $p = 1$. For some choice of $y_1$ and $\lambda > 0$,
> plot (6.12) as a function of $\beta_1$. Your plot should confirm that
> (6.12) is solved by (6.14).
Equation 6.12 is:
$$
\sum_{j=1}^p(y_j - \beta_j)^2 + \lambda\sum_{j=1}^p\beta_j^2
$$
Equation 6.14 is:
$$
\hat{\beta}_j^R = y_j/(1 + \lambda)
$$
where $\hat{\beta}_j^R$ is the ridge regression estimate.
```{r}
lambda <- 0.7
y <- 1.4
fn <- function(beta) {
  (y - beta)^2 + lambda * beta^2
}
plot(seq(0, 2, 0.01), fn(seq(0, 2, 0.01)), type = "l", xlab = "beta", ylab = "6.12")
abline(v = y / (1 + lambda), lty = 2)
```
> b. Consider (6.13) with $p = 1$. For some choice of $y_1$ and $\lambda > 0$,
> plot (6.13) as a function of $\beta_1$. Your plot should confirm that
> (6.13) is solved by (6.15).
Equation 6.13 is:
$$
\sum_{j=1}^p(y_j - \beta_j)^2 + \lambda\sum_{j=1}^p|\beta_j|
$$
Equation 6.15 is:
$$
\hat{\beta}_j^L = \begin{cases}
y_j - \lambda/2 &\mbox{if } y_j > \lambda/2; \\
y_j + \lambda/2 &\mbox{if } y_j < -\lambda/2; \\
0 &\mbox{if } |y_j| \le \lambda/2;
\end{cases}
$$
For $\lambda = 0.7$ and $y = 1.4$, the top case applies.
```{r}
lambda <- 0.7
y <- 1.4
fn <- function(beta) {
  (y - beta)^2 + lambda * abs(beta)
}
plot(seq(0, 2, 0.01), fn(seq(0, 2, 0.01)), type = "l", xlab = "beta", ylab = "6.13")
abline(v = y - lambda / 2, lty = 2)
```
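As a further sanity check (not required by the question), we can minimize both
objectives numerically with `optimize()` and compare against the closed-form
solutions (6.14) and (6.15); the function names below are our own.

```{r}
lambda <- 0.7
y <- 1.4
ridge_obj <- function(beta) (y - beta)^2 + lambda * beta^2
lasso_obj <- function(beta) (y - beta)^2 + lambda * abs(beta)
# numerical minimizer vs. closed-form solution
c(numeric = optimize(ridge_obj, c(-2, 2))$minimum, closed_form = y / (1 + lambda))
c(numeric = optimize(lasso_obj, c(-2, 2))$minimum, closed_form = y - lambda / 2)
```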
### Question 7
> We will now derive the Bayesian connection to the lasso and ridge regression
> discussed in Section 6.2.2.
>
> a. Suppose that $y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i$
> where $\epsilon_1, ..., \epsilon_n$ are independent and identically
> distributed from a $N(0, \sigma^2)$ distribution. Write out the likelihood
> for the data.
\begin{align*}
\mathcal{L}
&= \prod_i^n \mathcal{N}(\epsilon_i \mid 0, \sigma^2) \\
&= \prod_i^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right) \\
&= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2\right)
\end{align*}
> b. Assume the following prior for $\beta$: $\beta_1, ..., \beta_p$ are
> independent and identically distributed according to a double-exponential
> distribution with mean 0 and common scale parameter b: i.e.
> $p(\beta) = \frac{1}{2b}\exp(-|\beta|/b)$. Write out the posterior for
> $\beta$ in this setting.
The posterior can be calculated by multiplying the prior and likelihood
(up to a proportionality constant).
\begin{align*}
p(\beta|X,Y)
&\propto \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2\right) \prod_j^p\frac{1}{2b}\exp\left(-\frac{|\beta_j|}{b}\right) \\
&\propto \left(\frac{1}{2b}\right)^p \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2 -\sum_j^p\frac{|\beta_j|}{b}\right)
\end{align*}
> c. Argue that the lasso estimate is the _mode_ for $\beta$ under this
> posterior distribution.
Let us find the maximum of the posterior distribution (the mode). Maximizing
the posterior probability is equivalent to maximizing its log which is:
$$
\log(p(\beta|X,Y)) \propto \log\left[ \left(\frac{1}{2b}\right)^p \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \right] - \left(\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2 + \sum_j^p\frac{|\beta_j|}{b}\right)
$$
Since the first term is independent of $\beta$, the posterior is maximized when
the second term is minimized.
\begin{align*}
\DeclareMathOperator*{\argmin}{arg\,min} % Jan Hlavacek
\argmin_\beta \left(\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2 + \sum_j^p\frac{|\beta_j|}{b}\right)
&= \argmin_\beta \left(\frac{1}{2\sigma^2} \right ) \left( \sum_i^n \epsilon_i^2 +\frac{2\sigma^2}{b}\sum_j^p|\beta_j|\right) \\
&= \argmin_\beta \left( \sum_i^n \epsilon_i^2 +\frac{2\sigma^2}{b}\sum_j^p|\beta_j|\right)
\end{align*}
Note that $RSS = \sum_i^n \epsilon_i^2$, so if we set $\lambda =
\frac{2\sigma^2}{b}$, finding the mode corresponds to the lasso optimization problem:
$$
\argmin_\beta RSS + \lambda\sum_j^p|\beta_j|
$$
> d. Now assume the following prior for $\beta$: $\beta_1, ..., \beta_p$ are
> independent and identically distributed according to a normal distribution
> with mean zero and variance $c$. Write out the posterior for $\beta$ in
> this setting.
The posterior is now:
\begin{align*}
p(\beta|X,Y)
&\propto \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2\right) \prod_j^p\frac{1}{\sqrt{2\pi c}}\exp\left(-\frac{\beta_j^2}{2c}\right) \\
&\propto
\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n
\left(\frac{1}{\sqrt{2\pi c}}\right)^p
\exp\left(-\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2 - \frac{1}{2c}\sum_j^p\beta_j^2\right)
\end{align*}
> e. Argue that the ridge regression estimate is both the _mode_ and the _mean_
> for $\beta$ under this posterior distribution.
To show that the ridge estimate is the mode we can again find the maximum by
maximizing the log of the posterior. The log is
$$
\log{p(\beta|X,Y)}
\propto
\log{\left[\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \left(\frac{1}{\sqrt{2\pi c}}\right)^p \right]}
- \left(\frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2 + \frac{1}{2c}\sum_j^p\beta_j^2 \right)
$$
We can maximize with respect to $\beta$ by ignoring the first term and
minimizing the second term, i.e. we minimize:
$$
\argmin_\beta \left( \frac{1}{2\sigma^2} \sum_i^n \epsilon_i^2 + \frac{1}{2c}\sum_j^p\beta_j^2 \right)\\
= \argmin_\beta \left( \frac{1}{2\sigma^2} \left( \sum_i^n \epsilon_i^2 + \frac{\sigma^2}{c}\sum_j^p\beta_j^2 \right) \right)
$$
As above, if $RSS = \sum_i^n \epsilon_i^2$ and we set $\lambda =
\frac{\sigma^2}{c}$, we can see that the mode corresponds to ridge optimization.
Furthermore, the exponent of this posterior is quadratic in $\beta$, so the
posterior is itself a (multivariate) normal distribution; the mean of a normal
distribution equals its mode, hence the ridge estimate is also the posterior mean.
## Applied
### Question 8
> In this exercise, we will generate simulated data, and will then use this
> data to perform best subset selection.
>
> a. Use the `rnorm()` function to generate a predictor $X$ of length $n = 100$,
> as well as a noise vector $\epsilon$ of length $n = 100$.
```{r, message = FALSE, warning = FALSE}
library(ISLR2)
library(glmnet)
library(leaps)
library(pls)
```
```{r}
set.seed(42)
x <- rnorm(100)
ep <- rnorm(100)
```
> b. Generate a response vector $Y$ of length $n = 100$ according to the model
> $$Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon,$$
> where $\beta_0, \beta_1, \beta_2,$ and $\beta_3$ are constants of your
> choice.
```{r}
y <- 2 + 3 * x - 2 * x^2 + 0.5 * x^3 + ep
```
> c. Use the `regsubsets()` function to perform best subset selection in order
> to choose the best model containing the predictors $X, X^2, ..., X^{10}$.
> What is the best model obtained according to $C_p$, BIC, and adjusted
> $R^2$? Show some plots to provide evidence for your answer, and report the
> coefficients of the best model obtained. Note you will need to use the
> `data.frame()` function to create a single data set containing both $X$ and
> $Y$.
```{r}
dat <- data.frame(x, y)
summary(regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat))
```
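The question also asks which model is best according to $C_p$, BIC and adjusted
$R^2$. A short sketch extracting and plotting these criteria from the
`regsubsets` summary (the object names `best_ss` and `summ` are ours):

```{r}
best_ss <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat)
summ <- summary(best_ss)
par(mfrow = c(1, 3))
plot(summ$cp, xlab = "Number of predictors", ylab = "Cp", type = "o")
plot(summ$bic, xlab = "Number of predictors", ylab = "BIC", type = "o")
plot(summ$adjr2, xlab = "Number of predictors", ylab = "Adjusted R^2", type = "o")
# model size chosen by each criterion
c(cp = which.min(summ$cp), bic = which.min(summ$bic), adjr2 = which.max(summ$adjr2))
# coefficients of the model preferred by BIC
coef(best_ss, which.min(summ$bic))
```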
> d. Repeat (c), using forward stepwise selection and also using backwards
> stepwise selection. How does your answer compare to the results in (c)?
```{r}
summary(regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, method = "forward"))
summary(regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat, method = "backward"))
```
> e. Now fit a lasso model to the simulated data, again using
> $X, X^2, ..., X^{10}$ as predictors. Use cross-validation to select the
> optimal value of $\lambda$. Create plots of the cross-validation error as a
> function of $\lambda$. Report the resulting coefficient estimates, and
> discuss the results obtained.
```{r}
res <- cv.glmnet(poly(dat$x, 10, raw = TRUE), dat$y, alpha = 1)
(best <- res$lambda.min)
plot(res)
out <- glmnet(poly(dat$x, 10, raw = TRUE), dat$y, alpha = 1, lambda = res$lambda.min)
predict(out, type = "coefficients", s = best)
```
When fitting the lasso, the model that minimizes the cross-validation MSE uses
three predictors, matching the simulation. The estimated coefficients
(approximately 2.9, -1.9 and 0.5) are close to those used to simulate the data.
> f. Now generate a response vector $Y$ according to the model
> $$Y = \beta_0 + \beta_7X^7 + \epsilon,$$ and perform best subset selection
> and the lasso. Discuss the results obtained.
```{r}
dat$y <- 2 - 2 * x^2 + 0.2 * x^7 + ep
summary(regsubsets(y ~ poly(x, 10, raw = TRUE), data = dat))
res <- cv.glmnet(poly(dat$x, 10, raw = TRUE), dat$y, alpha = 1)
(best <- res$lambda.min)
plot(res)
out <- glmnet(poly(dat$x, 10, raw = TRUE), dat$y, alpha = 1, lambda = best)
predict(out, type = "coefficients", s = best)
```
When fitting the lasso, the selected model does not perfectly recover the
simulation: non-zero coefficients are retained for powers of $x$ that were not
used to generate the data.
### Question 9
> In this exercise, we will predict the number of applications received using
> the other variables in the `College` data set.
>
> a. Split the data set into a training set and a test set.
```{r}
set.seed(42)
train <- sample(nrow(College), nrow(College) * 2 / 3)
test <- setdiff(seq_len(nrow(College)), train)
mse <- list()
```
> b. Fit a linear model using least squares on the training set, and report the
> test error obtained.
```{r}
fit <- lm(Apps ~ ., data = College[train, ])
(mse$lm <- mean((predict(fit, College[test, ]) - College$Apps[test])^2))
```
> c. Fit a ridge regression model on the training set, with $\lambda$ chosen by
> cross-validation. Report the test error obtained.
```{r}
mm <- model.matrix(Apps ~ ., data = College[train, ])
fit2 <- cv.glmnet(mm, College$Apps[train], alpha = 0)
p <- predict(fit2, model.matrix(Apps ~ ., data = College[test, ]), s = fit2$lambda.min)
(mse$ridge <- mean((p - College$Apps[test])^2))
```
> d. Fit a lasso model on the training set, with $\lambda$ chosen by cross-
> validation. Report the test error obtained, along with the number of
> non-zero coefficient estimates.
```{r}
mm <- model.matrix(Apps ~ ., data = College[train, ])
fit3 <- cv.glmnet(mm, College$Apps[train], alpha = 1)
p <- predict(fit3, model.matrix(Apps ~ ., data = College[test, ]), s = fit3$lambda.min)
(mse$lasso <- mean((p - College$Apps[test])^2))
```
> e. Fit a PCR model on the training set, with $M$ chosen by cross-validation.
> Report the test error obtained, along with the value of $M$ selected by
> cross-validation.
```{r}
fit4 <- pcr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(fit4, val.type = "MSEP")
p <- predict(fit4, College[test, ], ncomp = 17)
(mse$pcr <- mean((p - College$Apps[test])^2))
```
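Note that `ncomp = 17` uses all 17 components, in which case PCR is equivalent
to ordinary least squares. As a sketch of how $M$ could instead be chosen from
the cross-validation results (assuming the cross-validated error curve returned
by `RMSEP()` from the **pls** package is indexed as shown, with the
intercept-only model first), one might pick the number of components that
minimizes the cross-validated RMSEP:

```{r}
cv_rmsep <- RMSEP(fit4, estimate = "CV")
# the first element of the error curve is the intercept-only model, hence the -1
(m <- which.min(cv_rmsep$val[1, 1, ]) - 1)
mean((predict(fit4, College[test, ], ncomp = m) - College$Apps[test])^2)
```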
> f. Fit a PLS model on the training set, with $M$ chosen by cross-validation.
> Report the test error obtained, along with the value of $M$ selected by
> cross-validation.
```{r}
fit5 <- plsr(Apps ~ ., data = College[train, ], scale = TRUE, validation = "CV")
validationplot(fit5, val.type = "MSEP")
p <- predict(fit5, College[test, ], ncomp = 12)
(mse$pls <- mean((p - College$Apps[test])^2))
```
> g. Comment on the results obtained. How accurately can we predict the number
> of college applications received? Is there much difference among the test
> errors resulting from these five approaches?
```{r}
barplot(unlist(mse), ylab = "Test MSE", horiz = TRUE)
```
Ridge and lasso give the lowest test errors, with ridge regression performing
best in this particular case (for this train/test split and seed).
### Question 10
> We have seen that as the number of features used in a model increases, the
> training error will necessarily decrease, but the test error may not. We will
> now explore this in a simulated data set.
>
> a. Generate a data set with $p = 20$ features, $n = 1,000$ observations, and
> an associated quantitative response vector generated according to the model
> $Y =X\beta + \epsilon$, where $\beta$ has some elements that are exactly
> equal to zero.
```{r}
set.seed(42)
dat <- matrix(rnorm(1000 * 20), nrow = 1000)
colnames(dat) <- paste0("b", 1:20)
beta <- rep(0, 20)
beta[1:4] <- c(5, 4, 2, 7)
y <- colSums((t(dat) * beta)) + rnorm(1000)
dat <- data.frame(dat)
dat$y <- y
```
> b. Split your data set into a training set containing 100 observations and a
> test set containing 900 observations.
```{r}
train <- dat[1:100, ]
test <- dat[101:1000, ]
```
> c. Perform best subset selection on the training set, and plot the training
> set MSE associated with the best model of each size.
```{r}
fit <- regsubsets(y ~ ., data = train, nvmax = 20)
summary(fit)
plot(summary(fit)$rss / 100, ylab = "MSE", type = "o")
```
> d. Plot the test set MSE associated with the best model of each size.
```{r}
# Helper to generate predictions from a regsubsets fit for the model of size `id`
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
mse <- sapply(1:20, function(i) mean((test$y - predict(fit, test, i))^2))
plot(mse, ylab = "MSE", type = "o", pch = 19)
```
> e. For which model size does the test set MSE take on its minimum value?
> Comment on your results. If it takes on its minimum value for a model
> containing only an intercept or a model containing all of the features,
> then play around with the way that you are generating the data in (a) until
> you come up with a scenario in which the test set MSE is minimized for an
> intermediate model size.
```{r}
which.min(mse)
```
The minimum test MSE occurs at a model size of 4, which matches the simulated
data, generated with four non-zero coefficients.
```{r}
set.seed(42)
dat <- matrix(rnorm(1000 * 20), nrow = 1000)
colnames(dat) <- paste0("b", 1:20)
beta <- rep(0, 20)
beta[1:9] <- c(5, 4, 2, 7, 0.01, 0.001, 0.05, 0.1, 0.5)
y <- colSums((t(dat) * beta)) + rnorm(1000)
dat <- data.frame(dat)
dat$y <- y
train <- dat[1:100, ]
test <- dat[101:1000, ]
fit <- regsubsets(y ~ ., data = train, nvmax = 20)
summary(fit)
mse <- sapply(1:20, function(i) mean((test$y - predict(fit, test, i))^2))
plot(mse, ylab = "MSE", type = "o", pch = 19)
which.min(mse)
```
> f. How does the model at which the test set MSE is minimized compare to the
> true model used to generate the data? Comment on the coefficient values.
The minimum test MSE now occurs at a model size of 5, even though the data were
simulated with 9 non-zero coefficients.
```{r}
coef(fit, id = 5)
```
The large coefficients are estimated accurately, but the predictors with very
small coefficients are dropped from the selected model.
> g. Create a plot displaying $\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta}{}_j^r)^2}$
> for a range of values of $r$, where $\hat{\beta}{}_j^r$ is the $j$th
> coefficient estimate for the best model containing $r$ coefficients. Comment
> on what you observe. How does this compare to the test MSE plot from (d)?
```{r}
names(beta) <- paste0("b", 1:20)
b <- data.frame(id = names(beta), b = beta)
out <- sapply(1:20, function(i) {
  # note: only coefficients present in the size-i model enter the sum
  c <- coef(fit, id = i)[-1]
  c <- data.frame(id = names(c), c = c)
  m <- merge(b, c)
  sqrt(sum((m$b - m$c)^2))
})
plot(out, ylab = "Root squared coefficient error", type = "o", pch = 19)
```
The error of the coefficient estimates is minimized when model size is 5. This
corresponds to the point when test MSE was minimized.
### Question 11
> We will now try to predict per capita crime rate in the `Boston` data set.
>
> a. Try out some of the regression methods explored in this chapter, such as
> best subset selection, the lasso, ridge regression, and PCR. Present and
> discuss results for the approaches that you consider.
```{r}
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) * 2 / 3)
test <- setdiff(seq_len(nrow(Boston)), train)
hist(log(Boston$crim))
```
> b. Propose a model (or set of models) that seem to perform well on this data
> set, and justify your answer. Make sure that you are evaluating model
> performance using validation set error, cross-validation, or some other
> reasonable alternative, as opposed to using training error.
We will try to fit models to `log(Boston$crim)` which is closer to a normal
distribution.
```{r}
fit <- lm(log(crim) ~ ., data = Boston[train, ])
mean((predict(fit, Boston[test, ]) - log(Boston$crim[test]))^2)
mm <- model.matrix(log(crim) ~ ., data = Boston[train, ])
fit2 <- cv.glmnet(mm, log(Boston$crim[train]), alpha = 0)
p <- predict(fit2, model.matrix(log(crim) ~ ., data = Boston[test, ]), s = fit2$lambda.min)
mean((p - log(Boston$crim[test]))^2)
mm <- model.matrix(log(crim) ~ ., data = Boston[train, ])
fit3 <- cv.glmnet(mm, log(Boston$crim[train]), alpha = 1)
p <- predict(fit3, model.matrix(log(crim) ~ ., data = Boston[test, ]), s = fit3$lambda.min)
mean((p - log(Boston$crim[test]))^2)
fit4 <- pcr(log(crim) ~ ., data = Boston[train, ], scale = TRUE, validation = "CV")
validationplot(fit4, val.type = "MSEP")
p <- predict(fit4, Boston[test, ], ncomp = 8)
mean((p - log(Boston$crim[test]))^2)
fit5 <- plsr(log(crim) ~ ., data = Boston[train, ], scale = TRUE, validation = "CV")
validationplot(fit5, val.type = "MSEP")
p <- predict(fit5, Boston[test, ], ncomp = 6)
mean((p - log(Boston$crim[test]))^2)
```
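To compare the five approaches side by side (a short sketch re-using the fits
above and mirroring the comparison in Question 9), we can collect the
validation-set MSEs:

```{r}
mse <- c(
  lm = mean((predict(fit, Boston[test, ]) - log(Boston$crim[test]))^2),
  ridge = mean((predict(fit2, model.matrix(log(crim) ~ ., data = Boston[test, ]),
    s = fit2$lambda.min) - log(Boston$crim[test]))^2),
  lasso = mean((predict(fit3, model.matrix(log(crim) ~ ., data = Boston[test, ]),
    s = fit3$lambda.min) - log(Boston$crim[test]))^2),
  pcr = mean((predict(fit4, Boston[test, ], ncomp = 8) - log(Boston$crim[test]))^2),
  pls = mean((predict(fit5, Boston[test, ], ncomp = 6) - log(Boston$crim[test]))^2)
)
barplot(sort(mse), ylab = "Validation MSE", horiz = TRUE)
```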
In this case the lasso (`alpha = 1`) seems to perform very slightly better than
unpenalized least squares. Some coefficients have been shrunk to zero:
```{r}
coef(fit3, s = fit3$lambda.min)
```
> c. Does your chosen model involve all of the features in the data set? Why or
> why not?
No. The lasso penalty shrinks some coefficients exactly to zero, so those
features are excluded from the selected model.