# Applied Linear Algebra {#sec-matrix-algebra}
```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dslabs)
if(!exists("mnist")) mnist <- read_mnist()
x <- mnist$train$images[1:300,]
y <- mnist$train$labels[1:300]
```
## Transpose of a matrix
If
$$
\mathbf{X} =
\begin{pmatrix}
x_{1,1}&x_{1,2}&\dots & x_{1,p} \\
x_{2,1}&x_{2,2}&\dots & x_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1}&x_{n,2}&\dots & x_{n,p}
\end{pmatrix}
$$
then the transpose is
$$
\mathbf{X}^\top =
\begin{pmatrix}
x_{1,1}&x_{2,1}&\dots & x_{n,1} \\
x_{1,2}&x_{2,2}&\dots & x_{n,2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1,p}&x_{2,p}&\dots & x_{n,p}
\end{pmatrix}
$$
If we are writing out a column, such as $\mathbf{x}_1$ defined above, in a sentence, we often use the notation $\mathbf{x}_1 = (x_{1,1}, \dots, x_{n,1})^\top$ to avoid wasting vertical space in the text.
In R, we compute the transpose using the function `t`:
```{r}
dim(x)
dim(t(x))
```
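As a small illustration (the matrix below is made up for demonstration), entry $(i,j)$ of a matrix becomes entry $(j,i)$ of its transpose:
```{r}
small <- matrix(1:6, nrow = 2)  ## a made-up 2 x 3 matrix
small
t(small)                        ## the transpose is 3 x 2
identical(small[1, 3], t(small)[3, 1])
```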
## Matrix multiplication
Linear algebra was born from mathematicians developing systematic ways to solve systems of linear equations, for example
$$
\begin{align}
x + 3 y - 2 z &= 5\\
3x + 5y + 6z &= 7\\
2x + 4y + 3z &= 8.
\end{align}
$$
To explain matrix multiplication, define two matrices $\mathbf{A}$ and $\mathbf{B}$
$$
\mathbf{A} =
\begin{pmatrix}
a_{11}&a_{12}&\dots&a_{1n}\\
a_{21}&a_{22}&\dots&a_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
a_{m1}&a_{m2}&\dots&a_{mn}
\end{pmatrix}, \,
\mathbf{B} = \begin{pmatrix}
b_{11}&b_{12}&\dots&b_{1p}\\
b_{21}&b_{22}&\dots&b_{2p}\\
\vdots&\vdots&\ddots&\vdots\\
b_{n1}&b_{n2}&\dots&b_{np}
\end{pmatrix}
$$
and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. Using R code we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows:
```{r, eval=FALSE}
m <- nrow(A)
p <- ncol(B)
C <- matrix(0, m, p)
for(i in 1:m){
  for(j in 1:p){
    C[i,j] <- sum(A[i,] * B[,j])
  }
}
```
Because this operation is so common, R includes a mathematical operator `%*%` for matrix multiplication:
```{r, eval=FALSE}
C <- A %*% B
```
Using mathematical notation $\mathbf{C} = \mathbf{A}\mathbf{B}$ looks like this:
$$
\begin{pmatrix}
a_{11}b_{11} + \dots + a_{1n}b_{n1}&
a_{11}b_{12} + \dots + a_{1n}b_{n2}&
\dots&
a_{11}b_{1p} + \dots + a_{1n}b_{np}\\
a_{21}b_{11} + \dots + a_{2n}b_{n1}&
a_{21}b_{12} + \dots + a_{2n}b_{n2}&
\dots&
a_{21}b_{1p} + \dots + a_{2n}b_{np}\\
\vdots&\vdots&\ddots&\vdots\\
a_{m1}b_{11} + \dots +a_{mn}b_{n1}&
a_{m1}b_{12} + \dots + a_{mn}b_{n2}&
\dots&
a_{m1}b_{1p} + \dots + a_{mn}b_{np}\\
\end{pmatrix}
$$
Note that this definition implies that the multiplication $\mathbf{A}\mathbf{B}$ is only possible when the number of columns of $\mathbf{A}$ matches the number of rows of $\mathbf{B}$.
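To see this rule in action, here is a quick sketch with two made-up matrices; the chunk uses the knitr option `error=TRUE` so the failing multiplication displays its error instead of stopping the rendering:
```{r, error=TRUE}
A <- matrix(1:6, nrow = 2)  ## a made-up 2 x 3 matrix
B <- matrix(1:6, nrow = 3)  ## a made-up 3 x 2 matrix
dim(A %*% B)  ## works: the 3 columns of A match the 3 rows of B
A %*% A       ## fails: the 3 columns of A do not match its 2 rows
```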
So how does this definition of matrix multiplication help solve systems of equations? First, any system of equations with unknowns $x_1, \dots, x_n$
$$
\begin{align}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1n}x_n &= b_1\\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2n}x_n &= b_2\\
\vdots\\
a_{n1} x_1 + a_{n2} x_2 + \dots + a_{nn}x_n &= b_n\\
\end{align}
$$
can now be represented as matrix multiplication by defining the following matrices:
$$
\mathbf{A} =\begin{pmatrix}
a_{11}&a_{12}&\dots&a_{1n}\\
a_{21}&a_{22}&\dots&a_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
a_{n1}&a_{n2}&\dots&a_{nn}
\end{pmatrix}
,\,
\mathbf{b} =
\begin{pmatrix}
b_1\\
b_2\\
\vdots\\
b_n
\end{pmatrix}
,\, \mbox{ and }
\mathbf{x} =
\begin{pmatrix}
x_1\\
x_2\\
\vdots\\
x_n
\end{pmatrix}
$$
and rewriting the equation simply as
$$
\mathbf{A}\mathbf{x} = \mathbf{b}
$$
Linear algebra algorithms, such as Gaussian elimination, provide a way to compute the *inverse* matrix $\mathbf{A}^{-1}$ that solves the equation for $\mathbf{x}$:
$$
\mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{x} = \mathbf{A}^{-1} \mathbf{b}
$$
In R, we can use the function `solve` to solve the first system of equations we wrote out above:
```{r, eval=FALSE}
A <- matrix(c(1, 3, -2, 3, 5, 6, 2, 4, 3), 3, 3, byrow = TRUE)
b <- matrix(c(5, 7, 8))
solve(A, b)
```
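As a quick check, assuming `A` and `b` as defined in the chunk above, multiplying `A` by the returned solution should recover `b`:
```{r, eval=FALSE}
sol <- solve(A, b)
A %*% sol  ## should match b, up to rounding error
```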
## The identity matrix
The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the same matrix.
$$
\mathbf{I}\mathbf{x} = \mathbf{x}
$$
If you do some math with the definition of matrix multiplication, you will realize that $\mathbf{I}$ is a matrix with the same number of rows and columns (referred to as a square matrix) with 0s everywhere except on the diagonal, which contains 1s:
$$
\mathbf{I}=\begin{pmatrix}
1&0&\dots&0\\
0&1&\dots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\dots&1
\end{pmatrix}
$$
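In R, the function `diag` creates an identity matrix. As a small illustration (the vector below is made up), multiplying by it returns the original vector:
```{r}
I <- diag(3)      ## 3 x 3 identity matrix
v <- c(2, -1, 5)  ## a made-up vector
I %*% v           ## returns v, as a single-column matrix
```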
The definition of the inverse matrix also implies that
$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}
$$
Because the default for the second argument in `solve` is an identity matrix, if we simply type `solve(A)`, we obtain the inverse $\mathbf{A}^{-1}$. This means we can also obtain a solution to our system of equations with:
```{r, eval=FALSE}
solve(A) %*% b
```
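We can also confirm, assuming `A` as defined above, that `solve(A)` behaves like an inverse: multiplying it by `A` gives back the identity matrix, up to rounding error:
```{r, eval=FALSE}
round(solve(A) %*% A, digits = 10)  ## identity matrix, up to rounding error
```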
## Distance
To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can be represented in polar coordinates as:
```{r, echo=FALSE, fig.asp=0.7}
draw.circle <- function(angle, start = 0, center = c(0,0), r = 0.25){
th <- seq(start, start + angle, length.out = 100)
x <- center[1] + r*cos(th)
y <- center[2] + r*sin(th)
lines(x, y)
}
rafalib::mypar()
rafalib::nullplot(-0.25, 1.75,-0.25, 1.05, axes = FALSE)
abline(h=0,v=0, col= "grey")
theta <- pi/6
arrows(0, 0, 0.975*cos(theta), 0.975*sin(theta), length = 0.1)
points(cos(theta), sin(theta), pch = 16)
text(0.3*cos(theta/2), 0.3*sin(theta/2), expression(theta), font = 2)
text(cos(theta), sin(theta), expression('(' * x[1] * ',' * x[2] * ') = (' * phantom(.) * 'r' * phantom(.) * 'cos('*theta*'), r' * phantom(.) * 'sin('*theta*')' * phantom(.) * ')' ),
pos=4)
draw.circle(theta)
```
with $\theta = \arctan{\frac{x_2}{x_1}}$ and $r = \sqrt{x_1^2 + x_2^2}$.
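As a quick numerical check of these formulas, with an arbitrarily chosen point:
```{r}
p <- c(1, 1)                        ## a made-up point (x_1, x_2)
r <- sqrt(sum(p^2))                 ## the norm of the point
theta <- atan(p[2]/p[1])
c(r*cos(theta), r*sin(theta))       ## recovers the original point
```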
* If we think of the point as a two-dimensional vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$.
* The norm can be thought of as the _size_ of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not.
* The point of defining the norm is that we can extrapolate the concept of _size_ to higher dimensions.
* Specifically, we write the norm for any vector $\mathbf{x}$ as:
$$
||\mathbf{x}|| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}
$$
* We can use the linear algebra concepts we have learned to define the norm like this:
$$
||\mathbf{x}||^2 = \mathbf{x}^\top\mathbf{x}
$$
* To define distance, suppose we have two two-dimensional points $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using the Euclidean distance.
```{r, echo=FALSE, fig.asp=0.7}
rafalib::mypar()
rafalib::nullplot(-0.5, 4,-0.5, 2, axes = FALSE)
abline(h=0,v=0, col= "grey")
x1 <- 1; x2 <- 1; y1 <- 3; y2 <- 1.5
points(x1, x2, pch = 16)
text(x1-0.35, x2,expression('(' * x[11] * ',' * x[21] * ')'))
points(y1, y2, pch = 16)
text(y1, y2+0.15,expression('(' * x[12] * ',' * x[22] * ')'))
lines(c(x1,y1),c(x2,y2))
pBrackets::brackets(x1, x2-0.05, y1, x2-0.05, h=-0.1)
text((x1+y1)/2, x1-0.2, expression(x[11]-x[12]))
pBrackets::brackets(y1+0.05, x2, y1+0.05, y2, h=-0.1)
text(y1+0.25,(x2+y2)/2, srt = 270, expression(x[21]-x[22]))
```
* We know that the distance is equal to the length of the hypotenuse:
$$
\sqrt{(x_{11} - x_{12})^2 + (x_{21} - x_{22})^2}
$$
* The reason we introduced the norm is that this distance is the size of the vector connecting the two points, and this idea extends to any number of dimensions.
* The distance between two points, regardless of the dimensions, is defined as the norm of the difference:
$$
|| \mathbf{x}_1 - \mathbf{x}_2||.
$$
* If we use the digit data, the distance between the first and second observations is computed using all 784 features:
$$
|| \mathbf{x}_1 - \mathbf{x}_2 || = \sqrt{ \sum_{j=1}^{784} (x_{1,j}-x_{2,j })^2 }
$$
* To demonstrate, let's pick the features for three digits:
```{r}
x_1 <- x[6,]
x_2 <- x[17,]
x_3 <- x[16,]
```
* We can compute the distances between each pair using the definitions we just learned:
```{r}
c(sum((x_1 - x_2)^2), sum((x_1 - x_3)^2), sum((x_2 - x_3)^2)) |> sqrt()
```
* In R, the function `crossprod(x)` is convenient for computing norms: it multiplies `t(x)` by `x`:
```{r}
c(crossprod(x_1 - x_2), crossprod(x_1 - x_3), crossprod(x_2 - x_3)) |> sqrt()
```
* Note that `crossprod` takes a matrix as its first argument, so the vectors used here are coerced into single-column matrices.
* Also note that `crossprod(x, y)` multiplies `t(x)` by `y`.
* We can see that the distance is smaller between the first two.
* This is in agreement with the fact that the first two are 2s and the third is a 7.
```{r}
y[c(6, 17, 16)]
```
* We can also compute **all** the distances at once relatively quickly using the function `dist`, which computes the distance between each pair of rows and produces an object of class `dist`:
```{r}
d <- dist(x)
class(d)
```
* To access the entries using row and column indices, we need to coerce it into a matrix.
* We can see the distance we calculated above like this:
```{r}
as.matrix(d)[c(6, 17, 16), c(6, 17, 16)]
```
* We can quickly see an image of these distances using this code:
```{r distance-image, fig.width = 4, fig.height = 4, eval=FALSE}
image(as.matrix(d))
```
* If we order these distances by the digit labels, we can see yellowish squares near the diagonal.
* This is because observations from the same digit tend to be closer to each other than to observations from different digits:
```{r}
image(as.matrix(d)[order(y), order(y)])
```
## Spaces {#sec-predictor-space}
* *Predictor space* is a concept that is often used to describe machine learning algorithms.
* The term *space* has a precise mathematical definition; here we provide a simplified explanation to help understand the term *predictor space* as it is used in the context of machine learning algorithms.
* We can think of all predictors $(x_{i,1}, \dots, x_{i,p})^\top$ for all observations $i=1,\dots,n$ as $n$ $p$-dimensional points.
* A *space* can be thought of as the collection of all possible points that should be considered for the data analysis in question.
* This includes points we could see, but have not been observed yet.
* In the case of the handwritten digits, we can think of the predictor space as any point $(x_{1}, \dots, x_{p})^\top$ as long as each entry $x_i, i = 1, \dots, p$ is between 0 and 255.
* Some machine learning algorithms also define subspaces.
* A common approach is to define neighborhoods of points that are close to a *center*.
* We can do this by selecting a center $\mathbf{x}_0$, a minimum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy
$$
|| \mathbf{x} - \mathbf{x}_0 || \leq r.
$$
* We can think of this as a multidimensional sphere since, as with a circle or sphere, the points on its boundary are all the same distance from the center (a short code sketch after this list shows how to find such a neighborhood in our digit data).
* Other machine learning algorithms partition the predictor space into non-overlapping regions and then make different predictions for each region using the data in the region.
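Here is a minimal sketch of finding such a neighborhood in the digit data loaded above; the center (the first observation) and the radius are arbitrary choices made for illustration:
```{r}
x_0 <- x[1,]                           ## use the first observation as the center (arbitrary)
r <- 2500                              ## an arbitrary radius
d_0 <- sqrt(colSums((t(x) - x_0)^2))   ## distance from every observation to the center
which(d_0 <= r)                        ## observations inside the neighborhood
```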
## Exercises
(@) Generate two matrices, `A` and `B`, containing randomly generated, normally distributed numbers. The dimensions of these two matrices should be $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produces the same result as:
```{r, eval=FALSE}
m <- nrow(A)
p <- ncol(B)
C <- matrix(0, m, p)
for(i in 1:m){
  for(j in 1:p){
    C[i,j] <- sum(A[i,] * B[,j])
  }
}
```
(@) Solve the following system of equations using R:
$$
\begin{align}
x + y + z + w &= 10\\
2x + 3y - z - w &= 5\\
3x - y + 4z - 2w &= 15\\
2x + 2y - 2z - 2w &= 20\\
\end{align}
$$
(@) Define `x`
```{r}
#| eval: false
mnist <- read_mnist()
x <- mnist$train$images[1:300,]
y <- mnist$train$labels[1:300]
```
and compute the distance matrix
```{r}
#| eval: false
d <- dist(x)
class(d)
```
Generate a boxplot showing the distances for the second row of `d` stratified by digit. Do not include the distance to itself, which we know is 0. Can you predict which digit is represented by the second row of `x`?
(@) Use the `apply` function and matrix algebra to compute the distance between the fourth digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate a boxplot as in the previous exercise and predict which digit the fourth row represents.
```{r}
## distance from the fourth digit to every other digit
my_dist <- apply(mnist$train$images[-4,], 1, function(x){
  sqrt(sum((mnist$train$images[4,] - x)^2))
})
boxplot(my_dist ~ mnist$train$labels[-4])
```
(@) Compute the distance between each feature and the feature representing the middle pixel (row 14, column 14). Create an image plot in which the distance to the middle pixel is shown with color at each pixel's position.
```{r}
X <- mnist$train$images
i <- 13*28 + 14   ## index of the pixel in row 14, column 14
x_0 <- X[, i]     ## the feature (column) for the middle pixel
## the squared distance between column j and x_0 expands to
## sum(X[,j]^2) - 2*sum(X[,j]*x_0) + sum(x_0^2), which we compute with matrix algebra:
my_dist <- sqrt(colSums(X^2) - 2*as.vector(crossprod(X, x_0)) + sum(x_0^2))
my_dist[i]  ## check that the distance to itself is 0
mat <- matrix(my_dist, 28, 28)
image(mat[, 28:1])
```