
<!--
kate: indent-width 4; word-wrap-column 74; default-dictionary en_AU
Copyright (C) 2020-2021, Marek Gagolewski <https://www.gagolewski.com>
This material is licensed under the Creative Commons BY-NC-ND 4.0 License.
-->

# Multiple Regression

```{r chapter-header-motd,echo=FALSE,results="asis"}
cat(readLines("chapter-header-motd.md"), sep="\n")
```

<!-- TODO

standardisation of variables and interpretability

show how to "derive" the original model from the "standardised" one

-->


## Introduction

### Formalism


Let $\mathbf{X}\in\mathbb{R}^{n\times p}$ be an input matrix
that consists of $n$ points in a $p$-dimensional space.

In other words, we have a database on $n$ objects, each of which
is described by means of $p$ numerical features.

\[
\mathbf{X}=
\left[
\begin{array}{cccc}
x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,p} \\
\end{array}
\right]
\]


Recall that in supervised learning,
apart from $\mathbf{X}$, we are also given the corresponding $\mathbf{y}$;
with each input point $\mathbf{x}_{i,\cdot}$ we associate the desired output $y_i$.

In this chapter we are still interested in **regression** tasks;
hence, we assume that each $y_i$
is a real number, i.e., $y_i\in\mathbb{R}$.

Hence, our dataset is $[\mathbf{X}\ \mathbf{y}]$ --
where each object is represented as a row vector
$[\mathbf{x}_{i,\cdot}\ y_i]$, $i=1,\dots,n$:

\[
[\mathbf{X}\ \mathbf{y}]=
\left[
\begin{array}{ccccc}
x_{1,1} & x_{1,2} & \cdots & x_{1,p} & y_1\\
x_{2,1} & x_{2,2} & \cdots & x_{2,p} & y_2\\
\vdots & \vdots & \ddots & \vdots    & \vdots\\
x_{n,1} & x_{n,2} & \cdots & x_{n,p} & y_n\\
\end{array}
\right].
\]


### Simple Linear Regression - Recap


In a simple regression task, we have assumed that $p=1$ -- there is only
one independent variable,
denoted $x_i=x_{i,1}$.

We restricted ourselves to linear models of the form $Y=f(X)=a+bX$
that minimised the  sum of squared residuals (SSR), i.e.,


\[
\min_{a,b\in\mathbb{R}} \sum_{i=1}^n \left(
a+bx_i-y_i
\right)^2.
\]

The solution is:

\[
\left\{
\begin{array}{rl}
b  = & \dfrac{
n \displaystyle\sum_{i=1}^n x_i y_i - \displaystyle\sum_{i=1}^n  y_i \displaystyle\sum_{i=1}^n x_i
}{
n \displaystyle\sum_{i=1}^n x_i x_i -   \displaystyle\sum_{i=1}^n x_i\displaystyle\sum_{i=1}^n x_i
}\\
a = & \dfrac{1}{n}\displaystyle\sum_{i=1}^n  y_i - b  \dfrac{1}{n} \displaystyle\sum_{i=1}^n x_i  \\
\end{array}
\right.
\]
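
The closed-form solution above is easy to verify numerically.
Here is a minimal sketch; the built-in `cars` data set (stopping distance
vs. speed) is merely an assumption made to keep the example self-contained,
and the names `sx`, `sy`, `sn`, `a_hat`, and `b_hat` are ours,
introduced for illustration only.

```{r sketch-closed-form}
# Illustrative data (an assumption): the built-in `cars` data set
sx <- cars$speed
sy <- cars$dist
sn <- length(sx)
# the slope and the intercept computed directly from the formulas above
b_hat <- (sn*sum(sx*sy) - sum(sy)*sum(sx)) /
    (sn*sum(sx*sx) - sum(sx)*sum(sx))
a_hat <- mean(sy) - b_hat*mean(sx)
c(a_hat, b_hat)
coef(lm(sy~sx)) # lm() should report the same two values
```

Both approaches should agree up to numerical round-off errors.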


Fitting in R can be performed by calling the `lm()` function:

```{r simple_recap1}
library("ISLR") # Credit dataset
X <- as.numeric(Credit$Balance[Credit$Balance>0])
Y <- as.numeric(Credit$Rating[Credit$Balance>0])
f <- lm(Y~X) # Y~X is a formula, read: Y is a function of X
print(f)
```


Figure \@ref(fig:simple-recap2) gives the scatter plot
of Y vs. X together with the fitted simple linear model.

```{r simple-recap2,fig.cap="Fitted regression line for the Credit dataset"}
plot(X, Y, xlab="X (Balance)", ylab="Y (Rating)")
abline(f, col=2, lwd=3)
```


## Multiple Linear Regression

### Problem Formulation


Let's now generalise the above to the case of many variables
$X_1, \dots, X_p$.

We wish to model the dependent variable as a function of $p$ independent variables.
\[
Y = f(X_1,\dots,X_p)   \qquad (+\varepsilon)
\]

Restricting ourselves to the class of **linear models**, we have
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p.
\]

Above we studied the case where $p=1$, i.e., $Y=a+bX_1$ with $\beta_0=a$ and $\beta_1=b$.


The above equation defines:

- $p=1$ --- a line (see Figure \@ref(fig:simple-recap2)),
- $p=2$ --- a plane (see Figure \@ref(fig:scatterplot3dexample)),
- $p\ge 3$ --- a hyperplane (well, most people find it difficult
to imagine objects in high dimensions,
but we are lucky to have this thing called maths).


```{r scatterplot3dexamplefit,echo=FALSE}
X1 <- as.numeric(Credit$Balance[Credit$Balance>0])
X2 <- as.numeric(Credit$Income[Credit$Balance>0])
Y  <- as.numeric(Credit$Rating[Credit$Balance>0])
f <- lm(Y~X1+X2)
```

```{r scatterplot3dexample,echo=FALSE,fig.height=4,fig.cap="Fitted regression plane for the Credit dataset"}
par(ann=FALSE) # to disable our plot.window trickery
library("scatterplot3d")
s3d <- scatterplot3d(X1, X2, Y,
    angle=60, # change angle to reveal more
    highlight.3d=TRUE, xlab="Balance", ylab="Income",
    zlab="Rating")
s3d$plane3d(f, lty.box="solid")
```


### Fitting a Linear Model in R


`lm()` accepts a formula of the form `Y~X1+X2+...+Xp`.

It finds the least squares fit, i.e., solves
\[
\min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}}
\sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_i \right) ^2
\]

```{r fig_scatterplot3dexamplefit}
<<scatterplot3dexamplefit>>
f$coefficients # beta0, beta1, beta2
```
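
The least squares problem above can also be stated in matrix form:
if $\mathbf{Z}$ denotes the input matrix with an additional leading column
of ones, the optimal coefficients satisfy the so-called normal equations,
$(\mathbf{Z}^T\mathbf{Z})\boldsymbol\beta=\mathbf{Z}^T\mathbf{y}$.
The sketch below illustrates this on the built-in `mtcars` data set
(an assumption made to keep the example self-contained; `Z`, `mpg`,
and `beta_hat` are illustrative names). Note that `lm()` itself relies
on a numerically more stable QR decomposition, not on this formula.

```{r sketch-normal-equations}
# Illustrative data (an assumption): model mpg as a function of wt and hp
Z <- cbind(1, mtcars$wt, mtcars$hp) # design matrix with a column of 1s
mpg <- mtcars$mpg
# solve the normal equations (Z^T Z) beta = Z^T y
beta_hat <- solve(crossprod(Z), crossprod(Z, mpg))
drop(beta_hat)
coef(lm(mpg~wt+hp, data=mtcars)) # the same up to round-off errors
```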


By the way, the 3D scatter plot in Figure \@ref(fig:scatterplot3dexample)
was generated by calling:

```{r fig_scatterplot3dexamplefit_show_code,fig.keep='none',eval=FALSE,echo=-1}
<<scatterplot3dexample>>
```

(`s3d` is an R list; one of its elements, named `plane3d`, is a function object, which is perfectly legal.)


## Finding the Best Model

### Model Diagnostics

<!-- more metrics:
https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
-->


Here is Rating ($Y$) as a function of Balance ($X_1$, left-hand side of Figure \@ref(fig:x12-y))
and Income ($X_2$, right-hand side of Figure \@ref(fig:x12-y)).

```{r x12-y,echo=FALSE,fig.cap="Scatter plots of $Y$ vs. $X_1$ and $X_2$"}
par(mfrow=c(1,2))
plot(X1, Y, xlab="X1 (Balance)", ylab="Y (Rating)")
f1  <- lm(Y~X1)    # Rating ~ Balance
abline(f1, col=2,lwd=3)
plot(X2, Y, xlab="X2 (Income)", ylab="Y (Rating)")
f2  <- lm(Y~X2)    # Rating ~ Income
abline(f2, col=2, lwd=3)
```

Moreover, Figure \@ref(fig:x12-ycolmap) depicts
(in a hopefully readable manner) both $X_1$ and $X_2$ with Rating $Y$
encoded with a colour (low ratings are green, high ratings are red;
some rating values are explicitly printed out within the plot).

```{r  x12-ycolmap,echo=FALSE,fig.cap="A heatmap for Rating as a function of Balance and Income; greens represent low credit ratings, whereas reds -- high ones"}
library("RColorBrewer")
pal <- paste0(rev(brewer.pal(11, "RdYlGn")), "77")
C <- pal[cut(Y, seq(min(Y), max(Y), length.out=length(pal)-1))]
plot(X1, X2, col=C, pch=16, cex=3, xlab="X1 (Balance)", ylab="X2 (Income)")
set.seed(123)
which_y <- order(Y)
which_y <- as.numeric(sapply(split(which_y, cut(Y[which_y], seq(min(Y), max(Y), length.out=length(pal)-1))), range))
text(X1[which_y], X2[which_y], Y[which_y])
```

Consider the three following models.


Formula                   |  Equation
--------------------------|----------------------------------------------------
Rating ~ Balance + Income |  $Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2$
Rating ~ Balance          |  $Y=a + b X_1$ ($\beta_0=a, \beta_1=b, \beta_2=0$)
Rating ~ Income           |  $Y=a + b X_2$ ($\beta_0=a, \beta_1=0, \beta_2=b$)

```{r}
f12 <- lm(Y~X1+X2) # Rating ~ Balance + Income
f12$coefficients
f1  <- lm(Y~X1)    # Rating ~ Balance
f1$coefficients
f2  <- lm(Y~X2)    # Rating ~ Income
f2$coefficients
```

Which of the three models is the best?
Of course, before using the word "best",
we need to answer the question: best with respect to what kind of measure?


So far we have been fitting w.r.t. the SSR.
As the multiple regression model generalises the two simple ones,
the former must yield an SSR that is not worse.
This is because in the case of $Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2$,
setting $\beta_1$ to 0 (just one of uncountably many possible $\beta_1$s,
if it happens to be the *best* one, good for us)
gives $Y=a + b X_2$
whereas by setting $\beta_2$ to 0 we obtain $Y=a + b X_1$.


```{r ssrs}
sum(f12$residuals^2)
sum(f1$residuals^2)
sum(f2$residuals^2)
```

We get that, in terms of SSRs, $f_{12}$ is better than $f_{2}$,
which in turn is better than $f_{1}$.
However, these error values per se (sheer numbers)
are hard to interpret.


Remark.

: Interpretability in ML has always been an important issue; think of the EU
General Data Protection Regulation (GDPR), amongst others.


#### SSR, MSE, RMSE and MAE

The quality of fit can be assessed by performing some *descriptive
statistical analysis of the residuals*, $\hat{y}_i-y_i$,
for $i=1,\dots,n$.


I know how to summarise data on the residuals!
Of course I should compute their arithmetic mean and I'm done with that task!
Interestingly, the mean of the residuals (this can be shown analytically)
in the least squares fit is always equal to $0$:
\[
 \frac{1}{n} \sum_{i=1}^n (\hat{y}_i-y_i)=0.
\]
Therefore, we need a different metric.


{ BEGIN exercise }
(\*) A proof of this fact is left as an exercise to the curious;
assume $p=1$ just as in the previous chapter and note that $\hat{y}_i=a+bx_i$.
{ END exercise }


```{r meanresiduals}
mean(f12$residuals) # almost zero numerically
all.equal(mean(f12$residuals), 0)
```


We noted that the sum of squared residuals (SSR) is not interpretable,
but the mean of the squared residuals
(MSR) -- also called the mean squared error (MSE) regression loss -- is a little better.
Recall that the mean is defined as the sum divided by the number of samples.

\[
 \mathrm{MSE}(f) = \frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_{i,\cdot})-y_i)^2.
\]


```{r mse}
mean(f12$residuals^2)
mean(f1$residuals^2)
mean(f2$residuals^2)
```

This tells us how much we err *per sample*,
so at least this measure does not depend on $n$ anymore.
However, if the original $Y$s are, say, in metres $[\mathrm{m}]$,
MSE is expressed in metres squared $[\mathrm{m}^2]$.

To account for that, we may consider the root mean squared error (RMSE):
\[
 \mathrm{RMSE}(f) = \sqrt{\frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_{i,\cdot})-y_i)^2}.
\]
This is just like with the sample variance  vs. standard deviation --
recall the latter is defined as the square root of the former.


```{r rmse}
sqrt(mean(f12$residuals^2))
sqrt(mean(f1$residuals^2))
sqrt(mean(f2$residuals^2))
```

The interpretation of the RMSE is rather quirky;
it is some-sort-of-averaged *deviation* from the true rating
(which is on the scale 0--1000, hence we see that the first model is not
that bad). Recall that the square function is sensitive to large observations,
hence, it penalises notable deviations more heavily.

As we still have a problem with finding something easily interpretable
(your non-technical boss or client may ask you: but what do these numbers mean??),
we suggest here that the mean absolute error (MAE;
also called mean absolute deviations, MAD)
might be a better idea than the above:
\[
 \mathrm{MAE}(f) = \frac{1}{n} \sum_{i=1}^n |f(\mathbf{x}_{i,\cdot})-y_i|.
\]

```{r mae}
mean(abs(f12$residuals))
mean(abs(f1$residuals))
mean(abs(f2$residuals))
```

With the above we may say "On average, the predicted rating differs from the
observed one by...". That is good enough.


Remark.

:   (\*) You may ask why we fit models by minimising the RMSE
    rather than the MAE (note that minimising the RMSE is the same as
    minimising the SSR: one is a strictly monotone transformation of the other,
    which does not affect the solution). Well, minimising the MAE is possible.
    It turns out, however, that minimising the MAE is more computationally expensive
    and the solution may be numerically unstable.
    So it's rarely an analyst's first choice (assuming they are well-educated
    enough to know about the MAD regression task). However, it may be worth
    trying it out sometimes.

    Sometimes we might prefer MAD regression to the classic one
    if our data is heavily contaminated by outliers. But
    in such cases it is worth checking if proper data cleansing does
    the trick.
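
To illustrate the above remark, here is a minimal sketch of a regression
line fitted by minimising the MAE. We assume the built-in `cars` data set
with one artificial outlier added, and we use the general-purpose
numerical minimiser `optim()`, started from the least squares solution.
This is only an illustration, not an efficient nor robust implementation;
all the variable names below are ours.

```{r sketch-mae-regression}
# Illustrative data (an assumption): `cars` plus one artificial outlier
lad_x <- c(cars$speed, 25)
lad_y <- c(cars$dist, 500)
mae <- function(beta) mean(abs(beta[1] + beta[2]*lad_x - lad_y))
ls_fit <- lm(lad_y~lad_x)           # classic least squares fit
lad_fit <- optim(coef(ls_fit), mae) # numerically minimise the MAE
rbind(least_squares=coef(ls_fit), least_absolute=lad_fit$par)
c(mae(coef(ls_fit)), lad_fit$value) # MAE of the two fits
```

As the search starts at the least squares coefficients, the MAE
of the second fit cannot be greater than that of the first one.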


#### Graphical Summaries of Residuals

If we are not happy with a single numerical aggregate of the residuals
or their absolute values, we can (and should) always compute a whole
bunch of descriptive statistics:

```{r summaryresiduals}
summary(f12$residuals)
summary(f1$residuals)
summary(f2$residuals)
```

The outputs generated by `summary()` include:

- `Min.` -- sample minimum
- `1st Qu.` -- 1st quartile == 25th percentile == quantile of order 0.25
- `Median` -- median == 50th percentile == quantile of order 0.5
- `3rd Qu.` -- 3rd quartile == 75th percentile == quantile of order 0.75
- `Max.` --  sample maximum

For example, the 1st quartile is the observation $q$ such that
25\% of the values are $\le q$ and 75\% of the values are $\ge q$,
see `?quantile` in R.

Graphically, it is nice to summarise the empirical distribution
of the residuals on a **box and whisker plot**.
Here is the key to decipher Figure \@ref(fig:boxplot-explained):

* IQR == Interquartile range == Q3$-$Q1 (box width)
* The box contains 50\% of the "most typical" observations
* Box and whiskers altogether have width $\le$ 4 IQR
* Outliers == observations potentially worth inspecting (is it a bug or a feature?)


```{r boxplot-explained,echo=FALSE,fig.cap="An example boxplot"}
x <- f1$residuals
boxplot(x, horizontal=TRUE, xlim=c(-0.3, 2.3), ylim=c(-300, 300), col="white")


q1 <- quantile(x, 0.25)
q2 <- quantile(x, 0.5)
q3 <- quantile(x, 0.75)
iqr <- q3-q1

segments(q2, 0, q2, 1, lty=3)
text(q2, 0, "Median", pos=1)

segments(q1, 2, q1, 1, lty=3)
text(q1, 2, "1st Qu. (Q1)", pos=3)

segments(q3, 1.8, q3, 1, lty=3)
text(q3, 1.8, "3rd Qu. (Q3)", pos=3)


segments(min(x), 2, min(x), 1, lty=3)
text(min(x), 2, "Min.", pos=3)


segments(max(x), 1.8, max(x), 1, lty=3)
text(max(x), 1.8, "Max.", pos=3)

segments(q3+1.5*iqr, 2, sort(x[x>q3+1.5*iqr], decreasing=TRUE)[c(1, 5, 10)], 1, lty=3)
text(q3+1.5*iqr, 2, "(potential) outliers", pos=3)

segments(q3+1.5*iqr, 0, q3+1.5*iqr, 1, lty=3)
text(q3+1.5*iqr, 0, "Q3 + 1.5 IQR", pos=1)

segments(q1-1.5*iqr, 0, q1-1.5*iqr, 1, lty=3)
text(q1-1.5*iqr, 0, "Q1 - 1.5 IQR", pos=1)
```
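
The quantities marked in the above figure can be computed by hand.
Below is a minimal sketch, assuming the built-in `cars` data set as an
example (the names `bx`, `bq`, `biqr`, and `fences` are ours).
Note that R's `boxplot()` determines the box based on the so-called hinges,
which may differ slightly from the quartiles returned by `quantile()`.

```{r sketch-boxplot-stats}
# Illustrative data (an assumption): stopping distances from `cars`
bx <- cars$dist
bq <- quantile(bx, c(0.25, 0.75)) # Q1 and Q3
biqr <- unname(bq[2] - bq[1])     # interquartile range, Q3-Q1
fences <- c(bq[1] - 1.5*biqr, bq[2] + 1.5*biqr)
fences
bx[bx < fences[1] | bx > fences[2]] # potential outliers
```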


Figure \@ref(fig:boxplot-residuals) is worth a thousand words:

```{r boxplot-residuals,fig.cap="Box plots of the residuals for the three models studied"}
boxplot(horizontal=TRUE, xlab="residuals", col="white",
  list(f12=f12$residuals, f1=f1$residuals, f2=f2$residuals))
abline(v=0, lty=3)
```

Figure \@ref(fig:violinplot-residuals) gives  a *violin plot* -- a blend of a box plot and a (kernel) density estimator (histogram-like):

```{r violinplot-residuals,message=FALSE,fig.cap="Violin plots of the residuals for the three models studied"}
library("vioplot")
vioplot(horizontal=TRUE, xlab="residuals", col="white",
  list(f12=f12$residuals, f1=f1$residuals, f2=f2$residuals))
abline(v=0, lty=3)
```


We can also take a look at the absolute values of the residuals.
Here are some descriptive statistics:

```{r absresiduals}
summary(abs(f12$residuals))
summary(abs(f1$residuals))
summary(abs(f2$residuals))
```

Figure \@ref(fig:absresiduals-boxplot) is worth \$1000:

```{r absresiduals-boxplot,fig.cap="Box plots of the absolute values of the residuals for the three models studied"}
boxplot(horizontal=TRUE, col="white", xlab="abs(residuals)",
  list(f12=abs(f12$residuals), f1=abs(f1$residuals),
       f2=abs(f2$residuals)))
abline(v=0, lty=3)
```

#### Coefficient of Determination (R-squared)

If we didn't know the range of the dependent variable
(in our case we do know that the credit rating is on the scale 0--1000),
the RMSE or MAE would be hard to interpret.

It turns out that there is a popular *normalised* (unit-less) measure
that is fairly easy to interpret with no domain-specific knowledge
of the modelled problem.
Namely, the (unadjusted) **$R^2$ score** (the coefficient of determination)
is given by:

\[
R^2(f) = 1 - \frac{\sum_{i=1}^{n} \left(y_i-f(\mathbf{x}_{i,\cdot})\right)^2}{\sum_{i=1}^{n} \left(y_i-\bar{y}\right)^2},
\]
where $\bar{y}$ is the arithmetic mean $\frac{1}{n}\sum_{i=1}^n y_i$.

```{r rsquared}
(r12 <- summary(f12)$r.squared)
1 - sum(f12$residuals^2)/sum((Y-mean(Y))^2) # the same
(r1 <- summary(f1)$r.squared)
(r2 <- summary(f2)$r.squared)
```

The coefficient of determination gives the proportion of variance of the
dependent variable explained by independent variables in the model;
$R^2(f)=1$ indicates a perfect fit, and values close to $1$ a very good one.
The first model is a very good one, the simple models are
"more or less okay".


Unfortunately, $R^2$ tends to automatically increase as the number of independent variables
increases (recall that the more variables in the model,
the lower the SSR; at the very least, it cannot get higher).
To correct for this phenomenon, we sometimes consider the **adjusted $R^2$**:

\[
\bar{R}^2(f) = 1 - (1-{R}^2(f))\frac{n-1}{n-p-1}
\]

```{r adj-rsquared}
summary(f12)$adj.r.squared
n <- length(Y); 1 - (1 - r12)*(n-1)/(n-2-1) # the same (here p=2)
summary(f1)$adj.r.squared
summary(f2)$adj.r.squared
```

In other words, the adjusted $R^2$ penalises more complex models.


Remark.

: (\*) Side note -- results of some statistical tests (e.g., significance of coefficients)
are reported by calling `summary(f12)` etc. --- refer to a more advanced source to obtain more information.
These, however, require the verification of some assumptions regarding the input data
and the residuals.


```{r summary_f12}
summary(f12)
```


#### Residuals vs. Fitted Plot

We can also create scatter plots of the residuals
(predicted $\hat{y}_i$ minus
true $y_i$) as a function of the predicted
$\hat{y}_i=f(\mathbf{x}_{i,\cdot})$, see Figure \@ref(fig:resid-vs-fitted).

```{r resid-vs-fitted,fig.cap="Residuals vs. fitted outputs for the three regression models"}
Y_pred12 <- f12$fitted.values # predict(f12, data.frame(X1, X2))
Y_pred1  <- f1$fitted.values  # predict(f1, data.frame(X1))
Y_pred2  <- f2$fitted.values  # predict(f2, data.frame(X2))
par(mfrow=c(1, 3))
plot(Y_pred12, Y_pred12-Y)
plot(Y_pred1,  Y_pred1 -Y)
plot(Y_pred2,  Y_pred2 -Y)
```

Ideally (provided that the hypothesis that the dependent variable
is indeed a linear function of the independent variable(s) is true),
we would expect to see a point cloud that spreads around $0$ in a
very much disorderly fashion.

<!--
homoskedastic
-->


### Variable Selection


Okay, up to now we've been considering the problem of modelling
the `Rating` variable as a function of `Balance` and/or `Income`.
However, in the `Credit` data set there are other variables
possibly worth inspecting.

Consider all quantitative (numeric-continuous) variables in the `Credit` data set.

```{r Csubset}
C <- Credit[Credit$Balance>0,
    c("Rating", "Limit", "Income", "Age",
      "Education", "Balance")]
head(C)
```

Obviously there are many  possible combinations of the variables
upon which regression models can be constructed
(precisely, for $p$ variables there are $2^p$ such models).
How do we choose the *best* set of inputs?


Remark.

: We should already be suspicious at this point:
wait... *best* requires some sort of criterion, right?


First, however, let's draw a  matrix of scatter plots for
every pair of variables
-- so as to get an impression of how individual variables
interact with each other, see Figure \@ref(fig:pairs).

```{r pairs,fig.height=6,fig.cap="Scatter plot matrix for the Credit dataset"}
pairs(C)
```


It seems like `Rating` depends on `Limit` almost linearly...
We have a tool to actually quantify the degree of linear dependence
between a pair of variables --
Pearson's $r$ -- the linear correlation coefficient:

\[
r(\boldsymbol{x},\boldsymbol{y}) = \frac{
    \sum_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y})
}{
    \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} \sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}
}.
\]

It holds $r\in[-1,1]$, where:

* $r=1$ -- positive linear dependence ($y$ increases as $x$ increases)
* $r=-1$ -- negative linear dependence ($y$ decreases as $x$ increases)
* $r\simeq 0$ -- uncorrelated or non-linearly dependent

Figure \@ref(fig:pearson-interpret) gives an illustration of the above.


```{r pearson-interpret,echo=FALSE,fig.height=6,fig.cap="Different datasets and the corresponding Pearson's $r$ coefficients"}
par(mfrow=c(2,2))
set.seed(123)
x <- runif(25)

y <- 7*x+3+rnorm(25,0,0.5)
plot(x, y, axes=FALSE, ann=FALSE)
legend("topleft", bg="white", legend=c(sprintf("r=%.2f", cor(x,y)), "positive correlation"))
box()

y <- -7*x+3+rnorm(25,0,0.5)
plot(x, y, axes=FALSE, ann=FALSE)
legend("topright", bg="white", legend=c(sprintf("r=%.2f", cor(x,y)), "negative correlation"))
box()

y <- runif(25)
plot(x, y, axes=FALSE, ann=FALSE)
legend("topleft", bg="white", legend=c(sprintf("r=%.2f", cor(x,y)), "no correlation"))
box()

x <- runif(25)
y <- abs((x-0.5)*100)+rnorm(length(x))
plot(x, y, axes=FALSE, ann=FALSE)
legend("top", bg="white", legend=c(sprintf("r=%.2f", cor(x,y)), "non-linear correlation"))
box()

```


To compute Pearson's $r$ between all pairs of variables, we call:

```{r pearson_cor}
round(cor(C), 3)
```


`Rating` and `Limit` are almost perfectly linearly correlated,
and both seem to describe the same thing.

For practical purposes, we'd rather model `Rating` as a function of the other variables.
For simple linear regression models, we'd choose either `Income` or `Balance`.
How about multiple regression though?


The best model:

* has high predictive power,
* is simple.

These two criteria are often at odds with each other.


<!--
Tension between prediction and description, between technology and science

"what companies want" -- more money (models that increase revenue, even slightly, at all cost)


-->


Which variables should be included in the optimal model?


Again, the definition of the "best" object needs a *fitness* function.

For fitting a single model to data, we use the SSR.

We need a metric that takes the number of independent variables into account.


Remark.

: (\*) Unfortunately, the adjusted $R^2$, despite its interpretability,
is not really suitable for this task. It does not penalise complex models
heavily enough to be really useful.


Here we'll be using **the Akaike Information Criterion** (AIC).

For a model $f$ with $p'$ independent variables:
\[
\mathrm{AIC}(f) = 2(p'+1)+n\log(\mathrm{SSR}(f))-n\log n
\]

Our task is to find the combination of independent variables
that minimises the AIC.
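
The AIC formula above can be evaluated directly. The sketch below
assumes an example model fitted to the built-in `mtcars` data set
(the `aic_*` names are ours) and compares the result with the value
returned by `extractAIC()`, which is what `step()` relies on.
Note that R also features the `AIC()` function, which for linear models
yields a value shifted by a term that does not depend on the choice
of variables.

```{r sketch-aic}
# Illustrative model (an assumption): mpg ~ wt + hp on `mtcars`
aic_fit <- lm(mpg~wt+hp, data=mtcars)
aic_n <- nrow(mtcars)
aic_p <- 2 # number of independent variables in the model
aic_ssr <- sum(aic_fit$residuals^2)
2*(aic_p+1) + aic_n*log(aic_ssr) - aic_n*log(aic_n)
extractAIC(aic_fit)[2] # should be the same
```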


Remark.

:   (\*\*) Note that this is a bi-level optimisation problem -- for every
    considered combination of variables (which we look for),
    we must solve another problem of finding the best model
    involving these variables -- the one that minimises the SSR.
    \[
    \min_{s_1,s_2,\dots,s_p\in\{0, 1\}}
    \left(
    \begin{array}{l}
    2\left(\displaystyle\sum_{j=1}^p s_j +1\right)+\\
    n\log\left(
    \displaystyle\min_{\beta_0,\beta_1,\dots,\beta_p\in\mathbb{R}}
    \sum_{i=1}^n \left(
    \beta_0 + s_1\beta_1 x_{i,1} + \dots + s_p\beta_p x_{i,p}
    -y_i
    \right)^2
    \right)
    \end{array}
    \right)
    \]
    We dropped the $n\log n$ term, because it is always constant
    and hence doesn't affect the solution.
    If $s_j=0$, then the $s_j\beta_j x_{i,j}$ term is equal to
    $0$, and hence is not considered in the model.
    This plays the role of including ($s_j=1$) or omitting ($s_j=0$) the $j$-th
    variable in the model building exercise.


For $p$ variables, the number of their possible
combinations is equal to $2^p$
(grows exponentially with $p$).
For large $p$ (think big data), an exhaustive search is impractical
(in our case we could get away with this though -- left as an exercise
to a slightly more advanced reader).
Therefore, to find the variable combination minimising the AIC,
we often rely on one of the two following greedy heuristics:

- forward selection:

    1. start with an empty model
    2. find an independent variable whose addition to the current model would yield the highest decrease in the AIC and add it to the model
    3. repeat step 2 as long as the AIC keeps decreasing

- backward elimination:

    1. start with the full model
    2. find an independent variable whose removal from the current model would decrease the AIC the most and eliminate it from the model
    3. repeat step 2 as long as the AIC keeps decreasing
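
The forward selection heuristic listed above can be sketched by hand
as follows. This is an illustration only: we assume a few columns
of the built-in `mtcars` data set, with `mpg` as the dependent variable,
and all the `fs_*` names are ours. In practice, we would call the
`step()` function instead.

```{r sketch-forward-selection}
# Illustrative data (an assumption): a few columns of `mtcars`
fs_data <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
fs_candidates <- setdiff(names(fs_data), "mpg")
fs_selected <- character(0)
fs_aic <- extractAIC(lm(mpg~1, data=fs_data))[2] # step 1: empty model
repeat {
    fs_remaining <- setdiff(fs_candidates, fs_selected)
    if (length(fs_remaining) == 0) break
    # step 2: the AIC of the current model extended by each variable
    fs_trial <- sapply(fs_remaining, function(v)
        extractAIC(lm(reformulate(c(fs_selected, v), response="mpg"),
            data=fs_data))[2])
    if (min(fs_trial) >= fs_aic) break # step 3: no improvement, stop
    fs_selected <- c(fs_selected, names(which.min(fs_trial)))
    fs_aic <- min(fs_trial)
}
fs_selected # variables chosen, in the order of inclusion
fs_aic      # the AIC of the final model
```

Backward elimination can be implemented analogously: start with all
the candidate variables and try removing them one by one.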


Remark.

: (\*\*) The above bi-level optimisation problem
can be tackled by implementing a genetic algorithm -- see a further chapter for more details.


Remark.

: (\*) There are of course many other methods which also perform
some form of variable selection, e.g., lasso regression.
But these minimise a different objective.


First, a forward selection example.
We need a data sample to work with:

```{r step1a}
C <- Credit[Credit$Balance>0,
    c("Rating", "Income", "Age",
      "Education", "Balance")]
```

Then, a formula that represents a model with no variables
(model from which we'll start our search):

```{r step1b}
(model_empty <- Rating~1)
```

Last, we need a model that includes all the variables.
We're too lazy to list all of them manually, therefore,
we can use the `model.frame()` function to generate
a corresponding formula:

```{r step1c}
(model_full <- formula(model.frame(Rating~., data=C))) # all variables
```

Now we are ready.

```{r step1d}
step(lm(model_empty, data=C), # starting model
    scope=model_full,         # gives variables to consider
    direction="forward")
formula(lm(Rating~., data=C))
```

The full model has been selected.

<!-- TODO: detailed explanation what's happening here -->



And now for something completely different --
a backward elimination example:

```{r step2}
step(lm(model_full, data=C), # from
     scope=model_empty,      # to
     direction="backward")
```


The full model is considered the best again.

<!-- TODO: detailed explanation what's happening here -->


Forward selection example -- full dataset:

```{r step3}
C <- Credit[,  # do not restrict to Credit$Balance>0
    c("Rating", "Income", "Age",
      "Education", "Balance")]
step(lm(model_empty, data=C),
    scope=model_full,
    direction="forward")
```

This procedure suggests including only the `Balance` and `Income`
variables.



Backward elimination example -- full dataset:


```{r step4}
step(lm(model_full, data=C), # full model
     scope=model_empty, # empty model
     direction="backward")
```


This procedure gives the same results as forward selection
(however, for other data sets this might not necessarily be the case).


### Variable Transformation


So far we have been fitting linear models of the form:
\[
Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p.
\]


What about some non-linear models such as polynomials etc.? For example:
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^3 + \beta_4 X_2.
\]


Solution: pre-process inputs by setting
$X_1' := X_1$, $X_2' := X_1^2$, $X_3' := X_1^3$, $X_4' := X_2$
and fit a linear model:

\[
Y = \beta_0 + \beta_1 X_1' + \beta_2 X_2' + \beta_3 X_3' + \beta_4 X_4'.
\]


This trick works for every model of the form
$Y=\beta_0+\sum_{i=1}^k \sum_{j=1}^p \beta_{i,j}\varphi_{i,j}(X_j)$ for any $k$
and any fixed univariate functions $\varphi_{i,j}$.
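
Here is a minimal sketch of the above pre-processing trick.
We assume the built-in `mtcars` data set, with `hp` playing the role
of $X_1$ and `wt` of $X_2$; the columns `Z1`, ..., `Z4`
(corresponding to $X_1'$, ..., $X_4'$) and the other names are ours.

```{r sketch-transformed-inputs}
# Illustrative data (an assumption): `mtcars`, hp as X1 and wt as X2
tr_data <- data.frame(Yt=mtcars$mpg,
    Z1=mtcars$hp, Z2=mtcars$hp^2, Z3=mtcars$hp^3, Z4=mtcars$wt)
tr_fit_a <- lm(Yt~Z1+Z2+Z3+Z4, data=tr_data) # linear in the new inputs
tr_fit_b <- lm(mpg~hp+I(hp^2)+I(hp^3)+wt, data=mtcars) # the same model
unname(coef(tr_fit_a))
unname(coef(tr_fit_b))
```

The two calls should yield identical coefficients, because they
describe exactly the same least squares problem.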


Also, with a little creativity (and maths), we might be able to transform
a few other models to a linear one, e.g.,

\[
Y = b e^{aX} \qquad \to \qquad \log Y = \log b + aX \qquad\to\qquad Y'=aX+b'
\]

This is an example of a model's **linearisation**
(here $Y'=\log Y$ and $b'=\log b$).
However, not every model can be linearised,
in particular, one that involves functions that are not invertible.
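
The exponential model mentioned above can be fitted as follows.
This is a sketch on synthetic data: we assume that the true parameters
are $a=1.5$ and $b=2$ and that the noise is multiplicative, so that
$\log Y$ is a noisy linear function of $X$ (the `ex_*` names are ours).

```{r sketch-linearisation}
set.seed(321) # to assure reproducibility
# Illustrative synthetic data (an assumption): Y = 2*exp(1.5*X) * noise
ex_x <- runif(50, min=0, max=2)
ex_y <- 2*exp(1.5*ex_x)*exp(rnorm(50, sd=0.1))
ex_fit <- lm(log(ex_y)~ex_x)         # fit log Y = b' + a X
ex_a <- unname(coef(ex_fit)[2])      # estimate of a
ex_b <- unname(exp(coef(ex_fit)[1])) # estimate of b = exp(b')
c(a=ex_a, b=ex_b)
```

Note that such a fit minimises the squared errors on the logarithmic
scale, which is not the same as minimising them on the original scale.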


For example, here's a series of simple ($p=1$) degree-$d$
polynomial regression models
of the form:
\[
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_d X^d.
\]

Such models can be fitted with the `lm()` function based on the formula
of the form `Y~poly(X, d, raw=TRUE)` or `Y~X+I(X^2)+I(X^3)+...`

```{r poly1}
f1_1  <- lm(Y~X1)
f1_3  <- lm(Y~X1+I(X1^2)+I(X1^3)) # also: Y~poly(X1, 3, raw=TRUE)
f1_10 <- lm(Y~poly(X1, 10, raw=TRUE))
```

Above we have fitted the polynomials of degrees 1, 3 and 10.
Note that a polynomial of degree 1 is just a line.

Let us depict the three models:

```{r poly2,fig.cap="Polynomials of different degrees fitted to the Credit dataset"}
plot(X1, Y, col="#000000aa", ylim=c(0, 1100))
x <- seq(min(X1), max(X1), length.out=101)
lines(x, predict(f1_1, data.frame(X1=x)), col="red", lwd=3)
lines(x, predict(f1_3, data.frame(X1=x)), col="blue", lwd=3)
lines(x, predict(f1_10, data.frame(X1=x)), col="darkgreen", lwd=3)
```

From Figure \@ref(fig:poly2) we see that there's clearly a problem
with the degree-10 polynomial.


### Predictive vs. Descriptive Power


The above high-degree polynomial model (`f1_10`) is a typical instance
of a phenomenon called **overfitting**.

Clearly (based on our expert knowledge), the `Rating` shouldn't
decrease as `Balance` increases.


In other words, `f1_10` gives a better fit to data actually observed,
but fails to produce good results for the points that are yet to come.

We say that it **generalises** poorly to unseen data.


Assume our true model is of the form:

```{r true_model1}
true_model <- function(x) 3*x^3+5
```

Let's generate the following random sample from this model (with $Y$ subject
to error), see Figure \@ref(fig:figBIASVARIANCE1):

```{r true_model2}
set.seed(1234) # to assure reproducibility
n <- 25
X <- runif(n, min=0, max=1)
Y <- true_model(X)+rnorm(n, sd=0.2) # add normally-distributed noise
```

```{r figBIASVARIANCE1,fig.cap="Synthetic data generated by means of the formula $Y=3x^3+5$ ($+$ noise)"}
plot(X, Y)
x <- seq(0, 1, length.out=101)
lines(x, true_model(x), col=2, lwd=3, lty=2)
```


Let's fit polynomials of different degrees, see Figure \@ref(fig:figBIASVARIANCE2).

```{r figBIASVARIANCE2,fig.cap="Polynomials fitted to our synthetic dataset"}
plot(X, Y)
lines(x, true_model(x), col=2, lwd=3, lty=2)

dmax <- 11 # maximal polynomial degree
MSE_train <- numeric(dmax)
MSE_test  <- numeric(dmax)
for (d in 1:dmax) { # for every polynomial degree
    f <- lm(Y~poly(X, d, raw=TRUE)) # fit a d-degree polynomial
    y <- predict(f, data.frame(X=x))
    lines(x, y, col=d)
    # MSE on given random X,Y:
    MSE_train[d] <- mean(f$residuals^2)
    # MSE on many more points:
    MSE_test[d]  <- mean((y-true_model(x))^2)
}
```

Some of the polynomials are fitted too well!

Remark.

: (\*) The oscillation of the high-degree polynomials at the domain
boundaries is known as the Runge phenomenon.


Compare the mean squared error (MSE) for the observed vs. future data points,
see Figure \@ref(fig:figBIASVARIANCE3).

```{r figBIASVARIANCE3,warning=FALSE,fig.cap="MSE on the dataset used to construct the model vs. MSE on a whole range of points as a function of the polynomial degree"}
matplot(1:dmax, cbind(MSE_train, MSE_test), type="b",
    ylim=c(1e-3, 2e3), log="y", pch=1:2,
    xlab="Model complexity (polynomial degree)",
    ylab="MSE")
legend("topleft", legend=c("MSE on original data", "MSE on the whole range"),
    lty=1:2, col=1:2, pch=1:2, bg="white")
```

Note the logarithmic scale on the $y$ axis.


This is a very typical behaviour!

- A model's fit to observed data improves as the model's complexity increases.

- A model's generalisation to unseen data initially improves, but then becomes worse.

- In the above example, the sweet spot is at a polynomial of degree 3, which is exactly
our true underlying model.


Hence, most often **we should be interested in the accuracy of the predictions
made in the case of unobserved data.**


If we have a data set of a considerable size,
we can divide it (randomly) into two parts:

- *training sample* (say, 60\% or 80\%) -- used to fit a model
- *test sample* (the remaining 40\% or 20\%) -- used to assess its quality
(e.g., using MSE)

More on this issue in the chapter on Classification.
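
A random train-test split described above can be sketched as follows.
We assume the built-in `mtcars` data set and an 80/20\% split purely
for illustration (this data set is in fact rather small for such
a purpose); the `tt_*` names are ours.

```{r sketch-train-test}
set.seed(42) # to assure reproducibility
# Illustrative data (an assumption): `mtcars`, 80% training, 20% test
tt_idx <- sample(nrow(mtcars), size=round(0.8*nrow(mtcars)))
tt_train <- mtcars[tt_idx, ]
tt_test  <- mtcars[-tt_idx, ]
tt_fit <- lm(mpg~wt+hp, data=tt_train) # fit on the training sample only
tt_pred <- predict(tt_fit, tt_test)    # predict for the unseen points
mean(tt_fit$residuals^2)               # MSE on the training sample
mean((tt_pred - tt_test$mpg)^2)        # MSE on the test sample
```

Only the latter value tells us how well the model is expected
to perform on points that it has not seen before.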

Remark.

:  (\*) We shall see that sometimes a train-test-validate
split will be necessary, e.g., 60-20-20\%.


## Exercises in R


<!--

TODO

wines dataset: alcohol~density ????

example of unsuccessful modelling

-->


### Anscombe's Quartet Revisited

Consider the `anscombe` database once again:

```{r anscombe1}
print(anscombe) # `anscombe` is a built-in object
```

Recall that in the previous chapter we
split the above data into four data frames
`ans1`, ..., `ans4` with columns `x` and `y`.


{ BEGIN exercise }
In `ans1`, fit a regression line to the data set as-is.
{ END exercise }

{ BEGIN solution }


We've done that already, see Figure \@ref(fig:anscombe3).
What a wonderful exercise, thank you -- effective
learning is often done by repeating stuff.

```{r anscombe3,fig.cap="Fitted regression line for `ans1`"}
ans1 <- data.frame(x=anscombe$x1, y=anscombe$y1)
f1 <- lm(y~x, data=ans1)
plot(ans1$x, ans1$y)
abline(f1, col="red")
```

{ END solution }


{ BEGIN exercise }
In `ans2`, fit a quadratic model ($y=a + bx + cx^2$).
{ END exercise }

{ BEGIN solution }


How to fit a polynomial model is explained above.

```{r anscombe4,fig.cap="Fitted quadratic model for `ans2`"}
ans2 <- data.frame(x=anscombe$x2, y=anscombe$y2)
f2 <- lm(y~x+I(x^2), data=ans2)
plot(ans2$x, ans2$y)
x_plot <- seq(4, 14, by=0.1)
y_plot <- predict(f2, data.frame(x=x_plot))
lines(x_plot, y_plot, col="red")
```

*Comment: From Figure \@ref(fig:anscombe4) we see that it's
an almost-perfect fit! Clearly,
the second Anscombe dataset isn't a case of linearly
dependent variables.*

{ END solution }


{ BEGIN exercise }
In `ans3`, remove the obvious outlier from the data
and fit a regression line.
{ END exercise }


{ BEGIN solution }


Let's plot the data set first, see Figure \@ref(fig:anscombe5).

```{r anscombe5, fig.cap="Scatter plot for `ans3`"}
ans3 <- data.frame(x=anscombe$x3, y=anscombe$y3)
plot(ans3$x, ans3$y)
```

Indeed, the observation at $x\simeq 13$ is an obvious outlier.
Perhaps the easiest way
to remove it is to call:

```{r anscombe6}
ans3b <- ans3[ans3$y<=12,] # the outlier is definitely at y>12
```

We could also use the condition `y < max(y)`, amongst others.

Now let's fit the linear model:

```{r anscombe7, fig.cap="Scatter plot for `ans3` with the outlier removed and the fitted linear model"}
f3b <- lm(y~x, data=ans3b)
plot(ans3b$x, ans3b$y)
abline(f3b, col="red")
```


*Comment: Now Figure \@ref(fig:anscombe7) is what we call linearly correlated data.
By the way, Pearson's coefficient now equals ```r cor(ans3b$x, ans3b$y)```.*

{ END solution }


### Countries of the World -- Simple models involving the GDP per capita


Let's consider the World Factbook 2020 dataset
(see this book's `datasets` folder).
It consists of country names, their population,
area, GDP, mortality rates etc. We scraped it from the CIA website
at https://www.cia.gov/library/publications/the-world-factbook/docs/rankorderguide.html
and compiled it into a single file on 3 April 2020.


```{r factbookA1}
factbook <- read.csv("datasets/world_factbook_2020.csv",
    comment.char="#")
```

Here is a preview of a few features for 3 selected countries (see `help("%in%")`):

```{r factbookA2}
factbook[factbook$country %in%
    c("Australia", "New Zealand", "United States"),
    c("country", "area", "population", "gdp_per_capita_ppp")]
```


<!--
The dataset consists of the following columns:
`r paste('"', names(factbook), '"', sep="")`
-->

<!--
> Please note that some of the columns contain characters
that cannot be used in R's variable names (e.g., spaces, brackets, plus, minus).
Therefore, accessing them using the `$` operator requires, e.g., the use of
double quotes.
We should use `factbook$gdp_per_capita_ppp`
or `factbook[,gdp_per_capita_ppp]` instead.
-->


<!--

> Convert `area` and `pop_density` from square miles to square kilometres.


We can easily find an appropriate conversion formula on the internet.
We will convert one column at a time using the well-known
vectorised vector division and replace the old column with new data.

```{r}
#countries$area <- countries$area*0.3861
#countries$pop_density <- countries$pop_density*0.3861
#countries[countries$country == "Australia", ] # preview
```

By the way, did you know that only few countries are still reluctant
to switch to the commonly-agreed-upon
[metric system](https://en.wikipedia.org/wiki/Metric_system)?
We are still (patiently...) waiting for the (inevitable)
[metrication](https://en.wikipedia.org/wiki/Metrication)
of North Korea, the UK, the US, Liberia and Myanmar.


-->


<!--
 Omitting the aforementioned  columns can be done in a few ways. Here is one:

```{r}
#wines <- wines[, is.na(match(names(wines), c("color", "response")))]
```

To recall, `names(data.frame)` gives the vector of column names.
On the other hand, the `match()` function, matches all the values
in the first argument against all the values in the second argument.
If there is no match (in our case, if a column name is not amongst
the two names we wish to remove), `NA` is generated.

-->

{ BEGIN exercise }
List the 10 countries with the highest GDP per capita.
{ END exercise }

{ BEGIN solution }

To recall, to generate a list of indexes that produce an ordered
version of a numeric vector, we need to call the `order()` function.

```{r factbookA3}
which_top <- tail(order(factbook$gdp_per_capita_ppp, na.last=FALSE), 10)
factbook[which_top, c("country", "gdp_per_capita_ppp")]
```

By the way, the reported values are in USD.

*Question: Which of these countries are tax havens?*

{ END solution }


{ BEGIN exercise }
Find the 5 most positively and the 5 most negatively
correlated variables with the `gdp_per_capita_ppp` feature
(of course, with respect to the Pearson coefficient).
{ END exercise }

{ BEGIN solution }

This can be solved via a call to `cor()`.
Note that we need to make sure that missing values are omitted
from computations.
A quick glimpse at the manual page
(`?cor`) reveals that computing the correlation between a column
and all the other ones (of course, except `country`, which
is non-numeric) can be performed as follows.

```{r factbookA4}
r <- cor(factbook$gdp_per_capita_ppp,
    factbook[,!(names(factbook) %in% c("country", "gdp_per_capita_ppp"))],
    use="complete.obs")[1,]
or <- order(r) # ordering permutation (indexes)
r[head(or, 5)] # first 5 ordered indexes
r[tail(or, 5)] # last 5 ordered indexes
```

*Comment: "Live long and prosper" just gained a new meaning.
Richer countries have lower infant and maternal mortality rates,
lower birth rates, but higher life expectancy and obesity prevalence.
Note, however, that correlation is not causation:
we are unlikely to increase the GDP by asking people to put on weight.*

{ END solution }


{ BEGIN exercise }
Fit simple regression models where the per capita GDP explains
its four most correlated variables (four individual models).
Draw them on a scatter plot. Compute the root mean squared errors (RMSE),
mean absolute errors (MAE) and the coefficients of determination ($R^2$).
{ END exercise }


{ BEGIN solution }

The four most correlated variables (we should look at the absolute
value of the correlation coefficient now -- recall that it
is the correlation of 0 that means no linear dependence; 1 and -1
show a strong association between a pair of variables) are:

```{r factbookA5}
(most_correlated <- names(r)[tail(order(abs(r)), 4)])
```

We could take the above column names and construct four
formulas manually, e.g., by writing
`gdp_per_capita_ppp~life_expectancy_at_birth`,
but we are lazy. Being lazy when it comes to computer
programming is often a virtue, not a flaw in one's character.

Instead, we will run a `for` loop that extracts the pairs of
interesting columns and constructs a formula based on two vectors
(`lm(Y~X)`), see Figure \@ref(fig:factbookA6).

```{r factbookA6,fig.height=6,fig.cap="A scatter plot matrix and regression lines for the 4 variables most correlated with the per capita GDP"}
par(mfrow=c(2, 2)) # 4 plots on a 2x2 grid
for (i in 1:4) {
    print(most_correlated[i])
    X <- factbook[,"gdp_per_capita_ppp"]
    Y <- factbook[,most_correlated[i]]
    f <- lm(Y~X)
    print(cbind(RMSE=sqrt(mean(f$residuals^2)),
                MAE=mean(abs(f$residuals)),
                R2=summary(f)$r.squared))
    plot(X, Y, xlab="gdp_per_capita_ppp",
               ylab=most_correlated[i])
    abline(f, col="red")
}
```


Recall that the root mean squared error is the square root of
the arithmetic mean of the squared residuals.
Mean absolute error is the average of the absolute values of the residuals.
The coefficient of determination
is given by: $R^2(f) = 1 - \frac{\sum_{i=1}^{n} \left(y_i-f(\mathbf{x}_{i,\cdot})\right)^2}{\sum_{i=1}^{n} \left(y_i-\bar{y}\right)^2}$.
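
The three measures can also be computed directly from their definitions.
Here is a small sketch on the built-in `cars` data frame (an assumption made
only to keep the example self-contained; the object names are illustrative):

```r
f_cars <- lm(dist~speed, data=cars)   # built-in dataset, for illustration only
y      <- cars$dist
y_pred <- f_cars$fitted.values
res    <- y - y_pred                  # residuals

rmse <- sqrt(mean(res^2))                  # root mean squared error
mae  <- mean(abs(res))                     # mean absolute error
r2   <- 1 - sum(res^2)/sum((y-mean(y))^2)  # coefficient of determination
c(RMSE=rmse, MAE=mae, R2=r2)
all.equal(r2, summary(f_cars)$r.squared)   # compare with the built-in value
```

RMSE and MAE are expressed in the units of the dependent variable,
whereas $R^2$ is unitless.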


*Comment: Unfortunately, we were misled by the high correlation coefficients
between the $X$s and $Y$s:
the low actual $R^2$ scores indicate that these models should not
be deemed trustworthy. Note that 3 of the plots are evidently L-shaped.*

*Fun fact: (\*) Interestingly, it can be shown that $R^2$
(in the case of the linear models fitted by minimising
the SSR) is the square of the correlation
between the true $Y$s and the predicted $Y$s:*

```{r factbookA}
X <- factbook[,"gdp_per_capita_ppp"]
Y <- factbook[,most_correlated[i]] # `i` still equals 4 after the loop above
f <- lm(Y~X, y=TRUE)
print(summary(f)$r.squared)
print(cor(f$fitted.values, f$y)^2)
```

*Side note: Do note that RMSE and MAE are interpretable: for instance,
the average error of the life expectancy prediction based on the GDP is
4-5 years. Recall that you can find the information on the variables' units
of measure at
https://www.cia.gov/library/publications/the-world-factbook/docs/rankorderguide.html.*

{ END solution }


### Countries of the World -- Most correlated variables (\*)


Let's get back to the World Factbook 2020 dataset (`world_factbook_2020.csv`).

```{r factbookB1}
factbook <- read.csv("datasets/world_factbook_2020.csv",
    comment.char="#")
```


{ BEGIN exercise }
Create a data frame `C` with three columns named `col1`, `col2`
and `r` and $p(p-1)/2$ rows,
where $p$ is the number of numeric features in `factbook`.
Every row should represent a unique pair of column names in `factbook`
(we do not distinguish between `a,b` and `b,a`)
together with the correlation coefficient between them.
{ END exercise }

{ BEGIN solution }

First we will solve this exercise considering only
4 numeric features in our dataset, so that we can keep
track of how the R expressions we evaluate actually work.

Let us compute the Pearson coefficients between chosen pairs of variables.

```{r factbookB2}
R <- cor(factbook[,c("area", "median_age", "birth_rate", "exports")],
    use="complete.obs") # 4 selected columns
print(R)
```

Note that the `R` matrix has `1.0` on the diagonal (where each entry
represents a correlation between a variable and itself).
Moreover, it is symmetric around the diagonal -- `R[i,j] == R[j,i]`,
because it is the correlation between the same pair of variables.
Hence, from now on we only need to consider the elements
below the diagonal. We can get access to them by using `lower.tri()`
("lower triangle").

```{r factbookB3}
R[lower.tri(R)]
```

This is already the 3rd column of the data frame we are asked to generate,
which should look like:

```{r factbookB4,echo=FALSE }
rrr <- matrix(dimnames(R)[[1]], nrow=nrow(R), ncol=ncol(R))
ccc <- matrix(dimnames(R)[[2]], byrow=TRUE, nrow=nrow(R), ncol=ncol(R))
data.frame(col1=rrr[lower.tri(rrr)], col2=ccc[lower.tri(ccc)],
    r=R[lower.tri(R)])
```

How to generate `col1` and `col2`?
One idea is to take the "lower triangles" of the following matrices:

```{r factbookB5,echo=FALSE}
print(rrr)
```

and:

```{r factbookB6,echo=FALSE}
print(ccc)
```


Here is a complete solution for all the features in `factbook`:

```{r factbookB7}
R <- cor(factbook[,-1], use="complete.obs") # skip the `country` column
rrr <- matrix(dimnames(R)[[1]], nrow=nrow(R), ncol=ncol(R))
ccc <- matrix(dimnames(R)[[2]], byrow=TRUE, nrow=nrow(R), ncol=ncol(R))
C <- data.frame(col1=rrr[lower.tri(rrr)],
                col2=ccc[lower.tri(ccc)],
                r=R[lower.tri(R)])
```

*Comment: In "classical" programming languages we would perhaps
have used a double (nested) `for` loop here (a less readable solution).*
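
For comparison, here is a sketch of what such a loop-based version could
look like. To keep it self-contained, it uses four arbitrarily chosen
columns of the built-in `mtcars` data frame (not the `factbook` dataset);
all names ending in `_toy`, `_loop` or `_vec` are illustrative.

```r
# 4 numeric columns of a built-in data frame, for illustration only
R_toy <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])
p <- ncol(R_toy)

# loop-based version: visit the lower triangle column by column
col1 <- character(0); col2 <- character(0); r <- numeric(0)
for (j in 1:(p-1)) {
    for (i in (j+1):p) {
        col1 <- c(col1, rownames(R_toy)[i])
        col2 <- c(col2, colnames(R_toy)[j])
        r    <- c(r, R_toy[i, j])
    }
}
C_loop <- data.frame(col1=col1, col2=col2, r=r)

# vectorised version, following the same idea as the solution above
rrr <- matrix(rownames(R_toy), nrow=p, ncol=p)
ccc <- matrix(colnames(R_toy), byrow=TRUE, nrow=p, ncol=p)
C_vec <- data.frame(col1=rrr[lower.tri(rrr)], col2=ccc[lower.tri(ccc)],
    r=R_toy[lower.tri(R_toy)])

all.equal(C_loop, C_vec)  # do both versions produce the same data frame?
```

The loops iterate over columns first because `lower.tri()`-based indexing
extracts the elements in a column-wise order.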


{ END solution }


{ BEGIN exercise }
Find the 5 most correlated pairs of variables.
{ END exercise }


{ BEGIN solution }

This can be done by ordering the rows of `C` in decreasing
order of absolute values of `C$r`, and then choosing the first 5 rows.


```{r factbookB8}
C_top <- head(C[order(abs(C$r), decreasing=TRUE),], 5)
knitr::kable(C_top)
```

*Comment: The most correlated pairs of features are not really
"mind-blowing"...*

{ END solution }


{ BEGIN exercise }
Fit simple regression models for the most correlated pair of variables.
{ END exercise }


{ BEGIN solution }

There is a degree of ambiguity here: should `col1` or rather `col2`
be treated as the dependent variable in our model?
Let's do it either way.

To learn something new, which is exactly why we are all here,
we will create the formulas programmatically, by first
concatenating (joining) appropriate strings
(note that in order to input a double quote character,
we need to precede it with a backslash), and then
calling the `formula()` function.

```{r factbookB9,fig.cap="Most correlated pair of variables and the invisible regression line"}
form <- formula(paste(C_top[1,2], "~", C_top[1,1]))
f <- lm(form, data=factbook)
print(f)
plot(factbook[,C_top[1,1]], factbook[,C_top[1,2]],
    xlab=C_top[1,1], ylab=C_top[1,2])
abline(f, col="red")
```

Figure \@ref(fig:factbookB9) depicts the fitted model.
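
The model with the roles of the two variables swapped can be constructed
in exactly the same way. Here is a sketch on the built-in `cars` data frame
(an assumption made to keep the example self-contained; `v1` and `v2`
are hypothetical stand-ins for `C_top[1,1]` and `C_top[1,2]`):

```r
v1 <- "speed"; v2 <- "dist"            # a pair of column names in `cars`
form12 <- formula(paste(v1, "~", v2))  # speed ~ dist
form21 <- formula(paste(v2, "~", v1))  # dist ~ speed
f12 <- lm(form12, data=cars)
f21 <- lm(form21, data=cars)
rbind(f12$coefficients, f21$coefficients)  # the fitted lines differ

# ... but the coefficients of determination are the same:
c(summary(f12)$r.squared, summary(f21)$r.squared,
    cor(cars[[v1]], cars[[v2]])^2)
```

In simple linear regression, both directions yield the same $R^2$,
which equals the squared Pearson coefficient,
even though the two fitted lines are not the same.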


{ END solution }


### Countries of the World -- A non-linear model based on the GDP per capita


Let's revisit the World Factbook 2020 dataset (`world_factbook_2020.csv`).

```{r factbookC1}
factbook <- read.csv("datasets/world_factbook_2020.csv",
    comment.char="#")
```


{ BEGIN exercise }
Draw a histogram of the empirical distribution
of the GDP per capita. Moreover, draw a histogram of the logarithm
of the GDP/person.
{ END exercise }

{ BEGIN solution }

```{r factbookC2,fig.cap="Histograms of the empirical distribution of the GDP per capita with linear (left) and log (right) scale on the X axis"}
par(mfrow=c(1,2))
hist(factbook$gdp_per_capita_ppp, col="white", main=NA)
hist(log(factbook$gdp_per_capita_ppp), col="white", main=NA)
```

*Comment: In Figure \@ref(fig:factbookC2) we see that
the distribution of the GDP per capita is right-skewed: most countries
have a small GDP per capita. However, a few of them
(those in the "right tail" of the distribution)
are very, very rich (hey, how about taxing the richest countries?!).
There is the famous observation made by V. Pareto
stating that most assets are in the hands of the "wealthy minority"
(compare: power law, rich-get-richer rule, preferential attachment in complex networks).
Interestingly, many real-world phenomena are distributed similarly
(e.g., the popularity of web pages, the number of followers of Instagram
profiles). It is frequently the case that the logarithm of such a
variable looks more "normal" (is bell-shaped).*

*Side note: "The" logarithm most often refers to the logarithm base
$e$, $\log x = \log_e x$,
where $e\simeq 2.72$ is Euler's number, see `exp(1)` in R.
Note that you can only compute logarithms of positive real numbers.*

A non-technical audience might be confused when asked to contemplate
the distribution of the logarithm of a variable. Let's make it
more user-friendly (on the other hand, we could've asked them
to harden up...)
by nicely re-labelling the X axis,
see Figure \@ref(fig:factbookC3).


```{r factbookC3,fig.cap="Histogram of the empirical distribution of the GDP per capita now with human-readable X axis labels (not the logarithmic scale)"}
hist(log(factbook$gdp_per_capita_ppp), axes=FALSE,
    xlab="GDP per capita (thousands USD)", main=NA, col="white")
box()
axis(2) # Y axis
at <- c(1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000)
axis(1, at=log(at), labels=at/1000)
```


*Comment: This is still a plot of the distribution of the logarithm of
the per capita GDP, but it's somehow "hidden" behind
the human-readable axis labels. Nice.*


{ END solution }


{ BEGIN exercise }
Fit a simple linear model of `life_expectancy_at_birth`
as a function of `gdp_per_capita_ppp`.
{ END exercise }


{ BEGIN solution }

Easy. We have already done that in one of the previous exercises.
Yet, to learn something new, let's note that the `plot()` function
accepts formulas as well.

```{r factbookC4,fig.cap="Linear model fitted for life expectancy vs. GDP/person"}
f <- lm(life_expectancy_at_birth~gdp_per_capita_ppp, data=factbook)
plot(life_expectancy_at_birth~gdp_per_capita_ppp, data=factbook)
abline(f, col="purple")
summary(f)$r.squared
```

*Comment: From Figure \@ref(fig:factbookC4) we see that
this is not a good model.*


{ END solution }


{ BEGIN exercise }
Draw a scatter plot of `life_expectancy_at_birth` as a function
of `gdp_per_capita_ppp`, with the X axis being logarithmic.
Compute the correlation coefficient between
`log(gdp_per_capita_ppp)` and `life_expectancy_at_birth`.
{ END exercise }


{ BEGIN solution }

We could apply the `log()`-transformation manually
and generate fancy X axis labels ourselves. However,
the `plot()` function has the `log` argument (see `?plot.default`)
which provides us with all we need, see Figure \@ref(fig:factbookC5).

```{r factbookC5,fig.cap="Scatter plot of life expectancy vs. GDP/person with log scale on the X axis"}
plot(factbook$gdp_per_capita_ppp,
    factbook$life_expectancy_at_birth,
    log="x")
```

Here is the *linear* correlation coefficient between the logarithm
of the GDP/person and the life expectancy.

```{r factbookC6}
cor(log(factbook$gdp_per_capita_ppp), factbook$life_expectancy_at_birth,
    use="complete.obs")
```

The correlation is quite high, hence the following task.

{ END solution }


{ BEGIN exercise }
Fit a model predicting `life_expectancy_at_birth`
by means of `log(gdp_per_capita_ppp)`.
{ END exercise }


{ BEGIN solution }

We would like to fit a model of the form $Y=a\log X+b$.
The formula `life_expectancy_at_birth~log(gdp_per_capita_ppp)`
should do the trick here.

```{r factbookC7,fig.cap="Linear model fitted for life expectancy vs. the logarithm of GDP/person"}
f <- lm(life_expectancy_at_birth~log(gdp_per_capita_ppp), data=factbook)
plot(life_expectancy_at_birth~log(gdp_per_capita_ppp), data=factbook)
abline(f, col="red", lty=3)
f$coefficients
summary(f)$r.squared
```

*Comment: That is an okay model (in terms of the coefficient of determination), see Figure \@ref(fig:factbookC7).*

{ END solution }


{ BEGIN exercise }
Draw the fitted logarithmic model on a scatter plot
with a standard, non-logarithmic X axis.
{ END exercise }


{ BEGIN solution }

The model fitted above is of the form
$Y\simeq`r round(f$coefficients[2],2 )` \log X+`r round(f$coefficients[1], 2)`$.
To depict it on a plot with linear (non-logarithmic) axes,
we can compute this formula on multiple points by hand,
see Figure \@ref(fig:factbookC8).

```{r factbookC8,fig.cap="Logarithmic model fitted for life expectancy vs. GDP/person"}
plot(factbook$gdp_per_capita_ppp, factbook$life_expectancy_at_birth)

# many points on the X axis:
xxx <- seq(min(factbook$gdp_per_capita_ppp, na.rm=TRUE),
            max(factbook$gdp_per_capita_ppp, na.rm=TRUE),
            length.out=101)
yyy <- f$coefficients[1] + f$coefficients[2]*log(xxx)
lines(xxx, yyy, col="red", lty=3)
```
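
Alternatively, `predict()` can evaluate the model for us:
transformations that appear in the model formula, such as `log()`,
are applied to the new data automatically.
Here is a sketch on the built-in `cars` data frame (used only
to keep the example self-contained; `f_log`, `x_new` etc. are illustrative names):

```r
f_log <- lm(dist~log(speed), data=cars)  # built-in dataset, illustration only
x_new <- seq(min(cars$speed), max(cars$speed), length.out=101)

# by hand, based on the fitted coefficients:
y_hand <- f_log$coefficients[1] + f_log$coefficients[2]*log(x_new)
# via predict(); note that we pass the untransformed inputs:
y_pred <- predict(f_log, data.frame(speed=x_new))

all.equal(unname(y_hand), unname(y_pred))  # do both approaches agree?
plot(cars$speed, cars$dist)
lines(x_new, y_pred, col="red", lty=3)
```

The column name in the data frame passed to `predict()` must match
the name of the variable used in the formula (here: `speed`).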

*Comment: Well, people are not immortal...
The original (linear) model didn't really take that into account.
Also, recall that correlation is not causation.
Moreover, there is a lot of variability at an individual level.
Being born in a less-wealthy country (e.g., not in a tax haven),
doesn't mean you don't have the whole life ahead of you.
Do the cool stuff, do something for the others. Life's not about money.*

{ END solution }


### Countries of the World -- A multiple regression model for the per capita GDP


Let's play with the World Factbook 2020 dataset
(`world_factbook_2020.csv`) once again.
The world is an interesting place, so we're far from being bored with this dataset.

```{r factbookD1}
factbook <- read.csv("datasets/world_factbook_2020.csv",
    comment.char="#")
```

Let's restrict ourselves to the following columns, mostly
related to imports and exports:

```{r factbookD2}
factbookn <- factbook[c("gdp_purchasing_power_parity",
    "imports", "exports", "electricity_exports",
    "electricity_imports", "military_expenditures",
    "crude_oil_exports", "crude_oil_imports",
    "natural_gas_exports", "natural_gas_imports",
    "reserves_of_foreign_exchange_and_gold")]
```

Let's compute the per capita versions of the above, by dividing
all values by each country's population:

```{r factbookD3}
for (i in 1:ncol(factbookn))
    factbookn[[i]] <- factbookn[[i]]/factbook$population
```

We are going to build a few multiple regression models using the
`step()` function, which is not too fond of missing values, therefore
they should be removed first:

```{r factbookD4}
factbookn <- na.omit(factbookn)
c(nrow(factbook), nrow(factbookn)) # how many countries were omitted?
```


{ BEGIN exercise }
Build a model for `gdp_purchasing_power_parity` as a function
of `imports` and `exports` (all per capita).
{ END exercise }


{ BEGIN solution }

Let's first take a look at how the aforementioned variables
are related to each other, see Figure \@ref(fig:factbookD5).

```{r factbookD5,fig.height=6,fig.cap="Scatter plot matrix for GDP, imports and exports"}
pairs(factbookn[c("gdp_purchasing_power_parity", "imports", "exports")])
cor(factbookn[c("gdp_purchasing_power_parity", "imports", "exports")])
```

They are nicely correlated. Moreover, they are on a similar scale
("tens of thousands of USD per capita").

Fitting the requested model yields:

```{r factbookD6}
options(scipen=10) # prefer "decimal" over "scientific" notation
f1 <- lm(gdp_purchasing_power_parity~imports+exports, data=factbookn)
f1$coefficients
summary(f1)$adj.r.squared
```


{ END solution }


{ BEGIN exercise }
Use forward selection to come up with
a model for `gdp_purchasing_power_parity` per capita.
{ END exercise }


{ BEGIN solution }

```{r factbookD7}
(model_empty <- gdp_purchasing_power_parity~1)
(model_full <- formula(model.frame(gdp_purchasing_power_parity~., data=factbookn)))
f2 <- step(lm(model_empty, data=factbookn),
    scope=model_full,
    direction="forward", trace=0)
f2
summary(f2)$adj.r.squared
```

*Comment: Interestingly, it's mostly the import-related variables
that contribute to the GDP per capita. However, the model
is not perfect, so we should refrain from building a brand new
economic theory around this "discovery". On the other hand,
you know what they say: all models are wrong, but some might be useful.
Note that we used the adjusted $R^2$ coefficient to correct
for the number of variables in the model
so as to make it more comparable with the coefficient corresponding
to the `f1` model.*
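
For a model with an intercept, $n$ observations and $p$ predictors,
the adjusted coefficient of determination is commonly defined as
$\bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-p-1}$.
The sketch below checks this against `summary()` on the built-in
`mtcars` data frame (an arbitrary example, chosen only for self-containedness;
it has nothing to do with the `factbookn` models above):

```r
f_mt <- lm(mpg~wt+hp, data=mtcars)   # built-in dataset, illustration only
n  <- nrow(mtcars)                   # number of observations
p  <- length(f_mt$coefficients) - 1  # number of predictors (intercept excluded)
r2 <- summary(f_mt)$r.squared
r2_adj <- 1 - (1-r2)*(n-1)/(n-p-1)   # penalise for the number of predictors
c(R2=r2, adjusted_R2=r2_adj)
all.equal(r2_adj, summary(f_mt)$adj.r.squared)  # compare with the built-in value
```

The adjusted version is never greater than the ordinary $R^2$
and the gap between them widens as more variables are added.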

{ END solution }


{ BEGIN exercise }
Use backward elimination to construct a model
for `gdp_purchasing_power_parity` per capita.
{ END exercise }

{ BEGIN solution }

```{r factbookD8}
f3 <- step(lm(model_full, data=factbookn),
    scope=model_empty,
    direction="backward", trace=0)
f3
summary(f3)$adj.r.squared
```

*Comment: This is the same model as the one
found by forward selection, i.e., `f2`.*

{ END solution }


<!--


a binary variable 0/1
fit two models and compare
fit a single model involving the 0/1 variable and compare


multiple regression

draw scatterplot matrix (pair plot)

compute the matrix of Pearson's correlations

fit the full model

fit some other models

draw the fitted vs residuals plot

provide an interpretation of each coefficient in the model

standardise the variables??? and give the interpretation

forward selection/backward elimination


linearisation example


highly correlated input variables


-->


## Outro

### Remarks


Multiple regression is simple, fast to apply and interpretable.


Linear models go beyond fitting of straight lines and other hyperplanes!


A complex model may overfit and hence generalise poorly to unobserved inputs.


Note that the SSR criterion makes the models sensitive to outliers.


**Remember:**

good models
\[=\]
better understanding of the modelled reality $+$ better predictions
\[=\]
more revenue, your boss' happiness, your startup's growth etc.


### Other Methods for Regression


Other example approaches to regression:

- ridge regression,
- lasso regression,
- least absolute deviations (LAD) regression,
- multivariate adaptive regression splines (MARS),
- K-nearest neighbour (K-NN) regression, see `FNN::knn.reg()` in R,
- regression trees,
- support-vector regression (SVR),
- neural networks (also deep) for regression.


###  Derivation of the Solution  (\*\*)


We would like to find an analytical solution
to the problem of minimising the sum of squared residuals:

\[
\min_{\beta_0, \beta_1,\dots, \beta_p\in\mathbb{R}} E(\beta_0, \beta_1, \dots, \beta_p)=
\sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)^2
\]

This requires computing the $p+1$ partial derivatives
${\partial E}/{\partial \beta_j}$ for $j=0,\dots,p$.


The partial derivatives are very similar to each other;
$\frac{\partial E}{\partial \beta_0}$ is given by:
\[
\frac{\partial E}{\partial \beta_0}(\beta_0,\beta_1,\dots,\beta_p)=
2 \sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)
\]
and $\frac{\partial E}{\partial \beta_j}$ for $j>0$ is equal to:
\[
\frac{\partial E}{\partial \beta_j}(\beta_0,\beta_1,\dots,\beta_p)=
2 \sum_{i=1}^n x_{i,j} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)
\]


Then all we need to do is to solve the system of linear equations:


\[
\left\{
\begin{array}{rcl}
\frac{\partial E}{\partial \beta_0}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\
\frac{\partial E}{\partial \beta_1}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\
\vdots\\
\frac{\partial E}{\partial \beta_p}(\beta_0,\beta_1,\dots,\beta_p)&=&0 \\
\end{array}
\right.
\]


The above system of $p+1$ linear equations, which we are supposed to solve
for $\beta_0,\beta_1,\dots,\beta_p$:
\[
\left\{
\begin{array}{rcl}
2 \sum_{i=1}^n \phantom{x_{i,0}}\left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\
2 \sum_{i=1}^n x_{i,1} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\
\vdots\\
2 \sum_{i=1}^n x_{i,p} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)&=&0 \\
\end{array}
\right.
\]
can be rewritten as:
\[
\left\{
\begin{array}{rcl}
\sum_{i=1}^n \phantom{x_{i,0}}\left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=& \sum_{i=1}^n \phantom{x_{i,0}} y_i \\
\sum_{i=1}^n x_{i,1} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=&\sum_{i=1}^n x_{i,1} y_i \\
\vdots\\
\sum_{i=1}^n x_{i,p} \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p}\right)&=&\sum_{i=1}^n x_{i,p} y_i \\
\end{array}
\right.
\]


and further as:
\[
\left\{
\begin{array}{rcl}
\beta_0\ n\phantom{\sum_{i=1}^n x} + \beta_1\sum_{i=1}^n \phantom{x_{i,0}}  x_{i,1}+\dots+\beta_p \sum_{i=1}^n \phantom{x_{i,0}}  x_{i,p} &=&\sum_{i=1}^n\phantom{x_{i,0}} y_i \\
\beta_0 \sum_{i=1}^n x_{i,1} + \beta_1\sum_{i=1}^n x_{i,1}  x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,1}  x_{i,p} &=&\sum_{i=1}^n x_{i,1} y_i \\
\vdots\\
\beta_0 \sum_{i=1}^n x_{i,p} + \beta_1\sum_{i=1}^n x_{i,p}  x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,p}  x_{i,p} &=&\sum_{i=1}^n x_{i,p} y_i \\
\end{array}
\right.
\]
Note that the terms involving $x_{i,j}$ and $y_i$ (the sums) are all constant
-- these are some fixed real numbers. We have learned how to solve such
problems in high school.
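
As a sanity check (not a derivation), we can verify numerically that
all the $p+1$ partial derivatives given above are (approximately) zero at
the least squares solution. The sketch below assumes the built-in `mtcars`
data frame with $p=2$ arbitrarily chosen inputs and takes the coefficients
from `lm()`; the names `X_toy`, `y_toy` etc. are illustrative.

```r
X_toy <- as.matrix(mtcars[, c("wt", "hp")])  # n x p input matrix, p=2
y_toy <- mtcars$mpg                          # reference outputs
beta  <- lm(y_toy~X_toy)$coefficients        # beta_0, beta_1, beta_2

# beta_0 + beta_1 x_{i,1} + beta_2 x_{i,2} - y_i, for every i:
res <- as.numeric(beta[1] + X_toy %*% beta[-1] - y_toy)

# dE/dbeta_0 and dE/dbeta_j for j=1,...,p:
grad <- c(beta0=2*sum(res), 2*colSums(X_toy*res))
grad   # should all be close to 0, up to round-off errors
```

Here `X_toy*res` multiplies each column of the input matrix by the vector
of residuals elementwise, so that `colSums()` yields
$\sum_{i=1}^n x_{i,j}(\dots)$ for every $j$.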

{ BEGIN exercise }
Try deriving the analytical solution and implementing it for $p=2$.
Recall that in the previous chapter we solved the special case of $p=1$.
{ END exercise }


###  Solution in Matrix Form (\*\*\*)


Assume that $\mathbf{X}\in\mathbb{R}^{n\times p}$ (a matrix with inputs),
$\mathbf{y}\in\mathbb{R}^{n\times 1}$ (a column vector of reference outputs)
and
$\boldsymbol{\beta}\in\mathbb{R}^{(p+1)\times 1}$ (a column vector of parameters).

Firstly, note that a linear model of the form:
\[
f_{\boldsymbol\beta}(\mathbf{x})=\beta_0+\beta_1 x_1+\dots+\beta_p x_p
\]
can be rewritten as:
\[
f_{\boldsymbol\beta}(\mathbf{x})=\beta_0 1+\beta_1 x_1+\dots+\beta_p x_p
=\mathbf{\dot{x}}\boldsymbol\beta,
\]
where $\mathbf{\dot{x}}=[1\ x_1\ x_2\ \cdots\ x_p]$.


Similarly, if we assume that $\mathbf{\dot{X}}=[\boldsymbol{1}\ \mathbf{X}]\in\mathbb{R}^{n\times (p+1)}$
is the input matrix with a prepended column of $1$s, i.e.,
$\boldsymbol{1}=[1\ 1\ \cdots\ 1]^T$ and $\dot{x}_{i,0}=1$ (for brevity of notation
the added column has index $0$),
$\dot{x}_{i,j}=x_{i,j}$ for all $j\ge 1$ and all $i$,
then:
\[
\mathbf{\hat{y}} = \mathbf{\dot{X}} \boldsymbol\beta
\]
gives the vector of predicted outputs for every input point.


This way, the sum of squared residuals
\[
E(\beta_0, \beta_1, \dots, \beta_p)=
\sum_{i=1}^n \left( \beta_0 + \beta_1 x_{i,1}+\dots+\beta_p x_{i,p} - y_{i} \right)^2
\]
can be rewritten as:
\[
E(\boldsymbol\beta)=\| \mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y} \|^2,
\]
where as usual $\|\cdot\|^2$ denotes the squared Euclidean norm.

Recall that this can be re-expressed as:
\[
E(\boldsymbol\beta)= (\mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y})^T (\mathbf{\dot{X}} \boldsymbol\beta - \mathbf{y}).
\]
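
To see that these expressions indeed describe the same quantity,
here is a short numerical sketch. It assumes the built-in `mtcars` data frame
and a parameter vector `beta` chosen by hand (any other values would do);
none of these objects is used elsewhere in this chapter.

```r
X_dot <- cbind(1, mtcars$wt, mtcars$hp)  # inputs with a prepended column of 1s
y     <- mtcars$mpg                      # reference outputs
beta  <- c(30, -3, -0.05)                # arbitrary parameters, chosen by hand

y_hat <- X_dot %*% beta                  # vector of predicted outputs

E_sum  <- sum((beta[1] + beta[2]*mtcars$wt + beta[3]*mtcars$hp - y)^2)
E_norm <- sum((y_hat - y)^2)             # squared Euclidean norm
E_mat  <- as.numeric(t(y_hat - y) %*% (y_hat - y))  # matrix product version
c(E_sum, E_norm, E_mat)                  # the same value, computed in 3 ways
```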


In order to find the minimum of $E$ w.r.t. $\boldsymbol\beta$,
we need to find the parameters that make the partial derivatives vanish, i.e.:

\[
\left\{
\begin{array}{rcl}
\frac{\partial E}{\partial \beta_0}(\boldsymbol\beta)&=&0 \\
\frac{\partial E}{\partial \beta_1}(\boldsymbol\beta)&=&0 \\
\vdots\\
\frac{\partial E}{\partial \beta_p}(\boldsymbol\beta)&=&0 \\
\end{array}
\right.
\]


Remark.

:   (\*\*\*) Interestingly, the above can also be expressed in matrix form,
    using the special notation:
    \[
    \nabla E(\boldsymbol\beta) = \boldsymbol{0}
    \]
    Here, $\nabla E$ (nabla symbol = differential operator)
    denotes the function gradient, i.e., the vector of all partial derivatives.
    This is nothing more than syntactic sugar for this quite commonly applied operator.


Anyway,  the system of linear equations we have derived above:
\[
\left\{
\begin{array}{rcl}
\beta_0\ n\phantom{\sum_{i=1}^n x} + \beta_1\sum_{i=1}^n \phantom{x_{i,0}}  x_{i,1}+\dots+\beta_p \sum_{i=1}^n \phantom{x_{i,0}}  x_{i,p} &=&\sum_{i=1}^n\phantom{x_{i,0}} y_i \\
\beta_0 \sum_{i=1}^n x_{i,1} + \beta_1\sum_{i=1}^n x_{i,1}  x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,1}  x_{i,p} &=&\sum_{i=1}^n x_{i,1} y_i \\
\vdots\\
\beta_0 \sum_{i=1}^n x_{i,p} + \beta_1\sum_{i=1}^n x_{i,p}  x_{i,1}+\dots+\beta_p \sum_{i=1}^n x_{i,p}  x_{i,p} &=&\sum_{i=1}^n x_{i,p} y_i \\
\end{array}
\right.
\]
can be  rewritten in matrix terms as:
\[
\left\{
\begin{array}{rcl}
\beta_0 \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{y} \\
\beta_0 \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{y} \\
\vdots\\
\beta_0 \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,0} + \beta_1 \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,1}+\dots+\beta_p \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{x}}_{\cdot,p} &=& \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{y}\\
\end{array}
\right.
\]


This can be restated as:
\[
\left\{
\begin{array}{rcl}
\left(\mathbf{\dot{x}}_{\cdot,0}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta &=& \mathbf{\dot{x}}_{\cdot,0}^T \mathbf{y} \\
\left(\mathbf{\dot{x}}_{\cdot,1}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta  &=& \mathbf{\dot{x}}_{\cdot,1}^T \mathbf{y} \\
\vdots\\
\left(\mathbf{\dot{x}}_{\cdot,p}^T \mathbf{\dot{X}}\right)\, \boldsymbol\beta  &=& \mathbf{\dot{x}}_{\cdot,p}^T \mathbf{y}\\
\end{array}
\right.
\]
which in turn is equivalent to:
\[
\left(\mathbf{\dot{X}}^T\mathbf{\dot{X}}\right)\,\boldsymbol\beta = \mathbf{\dot{X}}^T\mathbf{y}.
\]

Such a system of linear equations in matrix form can be solved numerically using,
amongst others, the `solve()` function.


Remark.

: (\*\*\*) In practice, we'd rather rely on the QR or singular value
decomposition (SVD) of matrices for efficiency and numerical accuracy reasons.


Numeric example -- solution via `lm()`:

```{r matrixlm1}
X1 <- as.numeric(Credit$Balance[Credit$Balance>0])
X2 <- as.numeric(Credit$Income[Credit$Balance>0])
Y  <- as.numeric(Credit$Rating[Credit$Balance>0])
lm(Y~X1+X2)$coefficients
```

Recalling that $\mathbf{A}^T \mathbf{B}$ can be computed
by calling `t(A) %*% B` or -- even faster -- by calling `crossprod(A, B)`,
we can also use `solve()` to obtain the same result:

```{r matrixlm2}
X_dot <- cbind(1, X1, X2)
solve( crossprod(X_dot, X_dot), crossprod(X_dot, Y) )
```
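
To illustrate the remark about matrix decompositions, here is a minimal sketch
that compares the normal-equations approach with a QR-based least squares
solver, `qr.solve()`. The data are synthetic and the variable names
(`x1`, `x2`, `y`, `beta_normal`, `beta_qr`, `beta_lm`) are made up for this example;
on well-conditioned inputs like these, all the methods should agree
up to round-off errors.

```{r qr-sketch}
set.seed(123)
n  <- 50
x1 <- rnorm(n)  # synthetic inputs (an assumption, not the Credit data)
x2 <- rnorm(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n, sd=0.1)
X_dot <- cbind(1, x1, x2)

# normal equations, as above
beta_normal <- solve(crossprod(X_dot), crossprod(X_dot, y))
# least squares fit via the QR decomposition of X_dot
beta_qr <- qr.solve(X_dot, y)
# reference solution
beta_lm <- lm(y~x1+x2)$coefficients

all.equal(as.numeric(beta_normal), as.numeric(beta_qr))
all.equal(as.numeric(beta_qr), as.numeric(beta_lm))
```

Note that the QR-based variant never forms
$\mathbf{\dot{X}}^T\mathbf{\dot{X}}$ explicitly, which tends to be more robust
when the columns of $\mathbf{\dot{X}}$ are nearly collinear.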


### Pearson's r in Matrix Form (\*\*)


Recall the Pearson linear correlation coefficient:
\[
r(\boldsymbol{x},\boldsymbol{y}) = \frac{
    \sum_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y})
}{
    \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\ \sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}
}
\]

Denote with $\boldsymbol{x}^\circ$ and $\boldsymbol{y}^\circ$ the centred versions
of $\boldsymbol{x}$ and $\boldsymbol{y}$, respectively,
i.e., $x_i^\circ=x_i-\bar{x}$ and $y_i^\circ=y_i-\bar{y}$.

Rewriting the above yields:
\[
r(\boldsymbol{x},\boldsymbol{y}) = \frac{
    \sum_{i=1}^n x_i^\circ y_i^\circ
}{
    \sqrt{\sum_{i=1}^n ({x_i^\circ})^2}\  \sqrt{\sum_{i=1}^n ({y_i^\circ})^2}
}
\]
which is exactly:
\[
r(\boldsymbol{x},\boldsymbol{y}) = \frac{
    \boldsymbol{x}^\circ\cdot \boldsymbol{y}^\circ
}{
    \| \boldsymbol{x}^\circ \|\    \| \boldsymbol{y}^\circ \|
}
\]
i.e., the normalised dot product of the centred versions of the two vectors.

This is the cosine of the angle between the two centred vectors
(in the $n$-dimensional space)!
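
The identity above is easy to check numerically. Below is a minimal sketch
on two synthetic vectors; the names `x_c`, `y_c` and `r_manual`
are made up for illustration.

```{r cor-cosine-sketch}
set.seed(123)
x <- rnorm(20)  # synthetic data, for illustration only
y <- 0.5*x + rnorm(20)

x_c <- x - mean(x)  # centred versions
y_c <- y - mean(y)

# normalised dot product of the centred vectors
r_manual <- sum(x_c*y_c) / (sqrt(sum(x_c^2)) * sqrt(sum(y_c^2)))
all.equal(r_manual, cor(x, y))

# the corresponding angle between the centred vectors, in degrees
acos(r_manual)*180/pi
```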


(\*\*) Recalling from the previous chapter that $\mathbf{A}^T \mathbf{A}$
gives the dot product between all the pairs of columns in a matrix $\mathbf{A}$,
we can implement an equivalent version of `cor(C)` as follows:

```{r cormanual}
C <- Credit[Credit$Balance>0,
    c("Rating", "Limit", "Income", "Age",
    "Education", "Balance")]
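# centre each column (subtract its mean)
C_centred <- apply(C, 2, function(v) v-mean(v))
# rescale each centred column to unit Euclidean length
C_normalised <- apply(C_centred, 2, function(v)
    v/sqrt(sum(v^2)))
# dot products between all pairs of columns give the correlation matrix
round(t(C_normalised) %*% C_normalised, 3)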
```


### Further Reading

Recommended further reading: [@islr: Chapters 1, 2 and 3]

Other: [@esl: Chapter 1, Sections 3.2 and 3.3]