<!--
kate: indent-width 4; word-wrap-column 74; default-dictionary en_AU
Copyright (C) 2020-2021, Marek Gagolewski <https://www.gagolewski.com>
This material is licensed under the Creative Commons BY-NC-ND 4.0 License.
-->
# Data Frame Wrangling in R
```{r chapter-header-motd,echo=FALSE,results="asis"}
cat(readLines("chapter-header-motd.md"), sep="\n")
```
R `data.frame`s are similar to matrices in the sense that
we use them to store tabular data.
However, in a data frame each column can be of a different type:
```{r df1}
head(iris)
head(rpart::car90, 2)
```
## Creating Data Frames
Most frequently, we will be creating data frames
from a series of numeric, logical, and character vectors of identical lengths.
```{r df2,echo=-1}
set.seed(1236)
x <- data.frame(
    u=runif(5),
    v=sample(c(TRUE, FALSE), 5, replace=TRUE),
    w=LETTERS[1:5]
)
print(x)
```
Some objects, such as matrices, can easily be coerced to data frames:
```{r df5}
(A <- matrix(1:12, byrow=TRUE, nrow=3,
    dimnames=list(
        NULL,                   # row labels
        c("x", "y", "z", "w")   # column labels
    )))
as.data.frame(A)
```
Named lists are amongst the other candidates for a meaningful conversion:
```{r df6}
(l <- lapply(split(iris$Sepal.Length, iris$Species),
    function(x) {
        c(min=min(x), median=median(x), mean=mean(x), max=max(x))
    }))
as.data.frame(l)
```
## Importing Data Frames
Many interesting data frames come from external sources, such as CSV files,
web APIs, SQL databases and so on.
In particular, `read.csv()` (see `?read.table` for a long list
of tunable parameters) imports data from plain text files
organised in a tabular manner (such as *c*omma-*s*eparated lists of *v*alues):
```{r csv}
f <- tempfile() # temporary file name
write.csv(x, f, row.names=FALSE) # save data frame to file
cat(readLines(f), sep="\n") # print file contents
read.csv(f)
```
Note that CSV is by far the most portable format for exchanging matrix-like
objects between different programs (statistical or numeric computing
environments, spreadsheets etc.).
<!-- TODO: relative vs absolute paths -->
## Data Frame Subsetting
### Each Data Frame is a List
First of all, we should note that
each data frame is in fact represented as an ordinary named list:
```{r dfclass}
class(x)
typeof(x)
```
Each column is stored as a separate list item.
Having said that, we shouldn't be surprised that we already know
how to perform quite a few operations on data frames:
```{r listdf}
length(x) # number of columns
names(x) # column labels
x$u # accessing column `u` (synonym: x[["u"]])
x[[2]] # 2nd column
x[c(1,3)] # a sub-data.frame
sapply(x, class) # apply class() on each column
```
### Each Data Frame is Matrix-like
Data frames can be considered as "generalised" matrices.
Therefore, operations such as subsetting will work in the same manner.
```{r matsubs1}
dim(x) # number of rows and columns
x[1:2,] # first two rows
x[,c(1,3)] # 1st and 3rd column, synonym: x[c(1,3)]
x[,1] # synonym: x[[1]]
x[,1,drop=FALSE] # synonym: x[1]
```
Take special note of how rows can be selected based on logical vectors.
For instance, let's extract all the rows from `x` where the values
in the column named `u` are between 0.3 and 0.6:
```{r matsubs2}
x[x$u>=0.3 & x$u<=0.6, ]
x[!(x[,"u"]<0.3 | x[,"u"]>0.6), ] # equivalent
```
Moreover, subsetting based on integer vectors can be used to
change the order of rows. Here is how we can sort the rows in `x`
with respect to the values in column `u`:
```{r matsubs3}
(x_sorted <- x[order(x$u),])
```
Let's stress that the programming style we emphasise here is very
transparent. If we don't understand how a complex operation is being executed,
we can always decompose it into smaller chunks that can be studied separately.
For instance, as far as the last example is concerned, we can
take a look at the manual of `?order` and then inspect the result
of calling `order(x$u)`.
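For instance, here is that decomposition spelled out step by step:
```{r matsubs3steps}
x$u              # the column we sort by
order(x$u)       # a permutation of row indices that puts x$u in increasing order
x[order(x$u), ]  # the rows of x rearranged according to that permutation
```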
On a side note, we can reset the row names as follows:
```{r rownamesreset}
row.names(x_sorted) <- NULL
x_sorted
```
## Common Operations
We already know how to filter rows based on logical conditions, e.g.:
```{r iris}
iris[iris$Petal.Width >= 1.2 & iris$Petal.Width <= 1.3,
c("Petal.Width", "Species")]
iris[iris$Sepal.Length > 6.5 & iris$Species == "versicolor", ]
```
and aggregate information in individual columns:
```{r summary}
sapply(iris[1:4], summary)
```
Quite frequently, we will be interested
in summarising data within subgroups generated by a list of factor-like variables.
```{r}
aggregate(iris[1:4], iris[5], mean) # not: iris[, 5] !
ToothGrowth[sample(nrow(ToothGrowth), 5), ] # 5 random rows
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], median)
```
According to the manual of the `aggregate()` function
(see `help("aggregate")`), the syntax is `aggregate(x, by, FUN)`, where:
* `x` is a data frame whose columns are to be aggregated;
* `by` is a list of grouping elements, each as long as the variables
  in the data frame `x`; recall that each data frame is a list of
  vectors of identical lengths, so ultimately `by` can be
  a data frame as well;
  this is why we have written `iris[5]`, and not `iris[[5]]` or
  `iris[, 5]` (although we could have used
  the `iris[, 5, drop=FALSE]` notation);
* `FUN` is an R function to be applied on each column in `x`
  within all groups of observations defined by `by`.
Further arguments can be passed to the function being called,
e.g., as in `aggregate(x, by, mean, na.rm=TRUE)`.
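For example, here is a minimal sketch using the built-in `airquality`
data frame, whose `Ozone` column contains missing values:
```{r aggregatenarm}
# months with any missing Ozone readings yield NA:
aggregate(airquality["Ozone"], airquality["Month"], mean)
# na.rm=TRUE is forwarded to mean(), so NAs are dropped first:
aggregate(airquality["Ozone"], airquality["Month"], mean, na.rm=TRUE)
```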
{ BEGIN exercise }
Below are examples using the built-in `ToothGrowth` data frame.
As an exercise, run all the following lines in R and study
the results carefully:
```{r aggregatehelp1,eval=FALSE}
ToothGrowth["supp"] # selects a specific column from a data frame; the result is still a data frame; remember that a data frame is also a list
ToothGrowth[["supp"]] # this is different - this is a vector; cannot be used in aggregate, at least not as `by` argument; extracts data from a column, is no longer a data frame
ToothGrowth[c("supp", "dose")] # selects 2 columns
aggregate(ToothGrowth["len"], ToothGrowth["supp"], mean) # computing average len in groups defined by supp
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean) # computing average len in groups defined by combinations of supp and dose
f <- function(x) c(MEAN=mean(x), MEDIAN=median(x)) # a custom function that returns two aggregates in form of a named (labelled) vector
aggregate(ToothGrowth["len"], ToothGrowth["supp"], f)
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], f)
```
{ END exercise }
{ BEGIN exercise }
We can introduce new grouping variables of any kind,
for example based on data in other columns.
Run the following line by line:
```{r aggregatehelp2,eval=FALSE}
ToothGrowth[["dose_1_or_2"]] <- ToothGrowth[["dose"]] >= 1.0 # note 2 square brackets now - dealing with individual columns
ToothGrowth # note the added column
aggregate(ToothGrowth["len"], ToothGrowth["dose_1_or_2"], function(x) c(MIN=min(x), MAX=max(x)))
```
{ END exercise }
Taking into account that `split()` accepts data frame inputs as well,
we can do the following:
```{r sapply}
sapply(
    # split iris into 3 sub-data.frames:
    split(iris, iris[5]),
    # on each sub-data.frame, apply the following function:
    function(df) {
        # compute the mean of the first four columns:
        sapply(df[1:4], mean)
    })
sapply(split(iris, iris[5]), function(df) {
    c(Sepal.Length=summary(df$Sepal.Length),
      Petal.Length=summary(df$Petal.Length)
    )
})
```
The above syntax relies solely on the building blocks that we have
already mastered! This should be very appealing to minimalists.
Note that the R packages `data.table` and `dplyr` provide us with
reimplementations of the most important functions for the processing
of data frames. We only cover the classic/base functions here,
because they have existed since the very beginning of R. These are known
to every R user and are likely to work this way forever.
Having said this, we should acknowledge that some users might find
`dplyr` or `data.table` more convenient in many contexts
(oftentimes they are faster or handle big data better).
The problem, though, is that when faced with less typical tasks
(as future AI/ML engineers/data scientists, we'll be particularly
exposed to them), we'll have to fall back on base R anyway,
because it is more powerful and gives us more transferable skills
(also with regard to objects of other types, e.g., matrices).
This is exactly why we prefer functions like `aggregate()`,
subsetting like `df[, ]`, etc.
## Metaprogramming and Formulas (\*)
R (together with a few other programming languages, such as Lisp and
Scheme, which heavily inspired R's semantics) allows its programmers to apply
some *metaprogramming* techniques, that is,
to write programs that manipulate unevaluated R expressions.
For instance, take a close look at the following plot:
```{r plotz-metaprogramming,fig.cap="Metaprogramming in action: Just take a look at the Y axis label"}
z <- seq(-2*pi, 2*pi, length.out=101)
plot(z, sin(z), type="l")
```
How did the `plot()` function know that we are plotting `sin` of `z`
(see Figure \@ref(fig:plotz-metaprogramming))?
It turns out that, at any time, we not only have access to the value
of an object (such as the result of evaluating `sin(z)`, which is
a vector of 101 reals) but also to the expression that was passed as
a function's argument itself.
```{r test_meta}
test_meta <- function(x) {
cat("x equals to ", x, "\n") # \n == newline
cat("x stemmed from ", deparse(substitute(x)), "\n")
}
test_meta(2+7)
```
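In fact, this is (roughly) how `plot()` comes up with its default axis
labels. Here is a simplified sketch; the wrapper name `my_plot` is made up
for illustration:
```{r myplot,eval=FALSE}
my_plot <- function(x, y) {
    # label the axes with the unevaluated expressions passed by the caller:
    plot(x, y, type="l",
        xlab=deparse(substitute(x)), ylab=deparse(substitute(y)))
}
my_plot(z, sin(z))  # the y-axis will be labelled "sin(z)"
```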
This is very powerful, yet potentially very confusing to users, because
we can write functions that don't evaluate their arguments
in the way we would normally expect (i.e., following the R language
specification). Each function can constitute a new micro-verse
with its own rules -- we should always refer to the documentation.
For instance, consider the `subset()` function:
```{r subset}
head(iris)
subset(iris, Sepal.Length>7.5, select=-(Sepal.Width:Petal.Width))
```
Neither `Sepal.Length>7.5` nor `-(Sepal.Width:Petal.Width)` makes sense
as a standalone R expression! However, according to the `subset()` function's
own rules, the former expression is considered a row selector
(here, `Sepal.Length` refers to a particular column *within* the `iris` data frame).
The latter plays the role of a column filter (select everything except the columns
between `Sepal.Width` and `Petal.Width`).
The `data.table` and `dplyr` packages (which are very popular)
rely on this language feature all the time,
so we shouldn't be surprised when we come across such constructs.
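For instance, a rough `dplyr` counterpart of the above `subset()` call
(assuming that package is installed) might read:
```{r subsetdplyr,eval=FALSE}
dplyr::select(dplyr::filter(iris, Sepal.Length > 7.5),
    -(Sepal.Width:Petal.Width))
```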
. . .
There is one more interesting language feature that is possible
thanks to metaprogramming.
*Formulas* are special R objects that consist of two unevaluated
R expressions separated by a tilde (`~`).
For example:
```{r formula}
len ~ supp+dose
```
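A formula is an ordinary R object whose parts we can inspect; here is
a quick illustrative sketch:
```{r formulainspect}
f <- len ~ supp+dose
class(f)     # an object of class "formula"
length(f)    # a two-sided formula consists of 3 parts
f[[1]]       # the tilde operator itself
f[[2]]       # the left-hand side (unevaluated)
f[[3]]       # the right-hand side (unevaluated)
all.vars(f)  # all variable names referred to
```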
A formula on its own has no meaning. However, many R functions
accept formulas as arguments and can interpret them in various ways.
For example, the `lm()` function, which fits a linear regression model,
uses formulas to specify the output and input variables:
```{r lm}
lm(Sepal.Length~Petal.Length+Sepal.Width, data=iris)
```
On the other hand, `boxplot()` (see Figure \@ref(fig:boxplot-metaprogramming))
allows for creating
separate box-and-whisker plots for each subgroup given by a combination
of factors.
```{r boxplot-metaprogramming,fig.cap="Example box plot created via the formula interface"}
boxplot(len~supp+dose, data=ToothGrowth,
horizontal=TRUE, col="white")
```
The `aggregate()` function supports formulas too:
```{r aggregate}
aggregate(cbind(Sepal.Length, Sepal.Width)~Species, data=iris, mean)
```
We should therefore make sure that we know how every function interacts
with a formula -- information on that can be found in `?lm`, `?boxplot`,
`?aggregate` and so forth.
<!--
TODO: Add OnTrack exercises
-->
## Further Reading
Recommended further reading: [@Rintro]
Other: [@rprogdatascience], [@r4ds]
R packages `dplyr` and `data.table` implement the most common
data frame wrangling procedures. You may find them very useful.
Moreover, they are very fast even for large data sets.
Additionally, the `magrittr` package provides a pipe operator, `%>%`,
that simplifies the writing of complex, nested function calls.
Do note that not everyone is a big fan of these, however.
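For illustration only -- a sketch of a nested call and its piped
counterpart, assuming `magrittr` is attached:
```{r magrittrpipe,eval=FALSE}
library("magrittr")
head(subset(iris, Sepal.Length > 7.5))          # nested function calls
iris %>% subset(Sepal.Length > 7.5) %>% head()  # the same, as a left-to-right pipeline
```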