<!--
kate: indent-width 4; word-wrap-column 74; default-dictionary en_AU
Copyright (C) 2020-2021, Marek Gagolewski <https://www.gagolewski.com>
This material is licensed under the Creative Commons BY-NC-ND 4.0 License.
-->
# Data Frame Wrangling in R
```{r chapter-header-motd,echo=FALSE,results="asis"}
cat(readLines("chapter-header-motd.md"), sep="\n")
```
R `data.frame`s are similar to matrices in the sense that
we use them to store tabular data.
However, in a data frame each column can be of a different type:
```{r df1}
head(iris)
head(rpart::car90, 2)
```
## Creating Data Frames
Most frequently, we will be creating data frames
from a series of numeric, logical, and character vectors of identical lengths.
```{r df2,echo=-1}
set.seed(1236)
x <- data.frame(
    u=runif(5),
    v=sample(c(TRUE, FALSE), 5, replace=TRUE),
    w=LETTERS[1:5]
)
print(x)
```
Some objects, such as matrices, can easily be coerced to data frames:
```{r df5}
(A <- matrix(1:12, byrow=TRUE, nrow=3,
    dimnames=list(
        NULL,                   # row labels
        c("x", "y", "z", "w")   # column labels
    )))
as.data.frame(A)
```
Named lists are amongst the other candidates for a meaningful conversion:
```{r df6}
(l <- lapply(split(iris$Sepal.Length, iris$Species),
    function(x) {
        c(min=min(x), median=median(x), mean=mean(x), max=max(x))
    }))
as.data.frame(l)
```
## Importing Data Frames
Many interesting data frames come from external sources, such as CSV files,
web APIs, SQL databases and so on.
In particular, `read.csv()` (see `?read.table` for a long list
of tunable parameters) imports data from plain text files
organised in a tabular manner (such as *c*omma-*s*eparated lists of *v*alues):
```{r csv}
f <- tempfile() # temporary file name
write.csv(x, f, row.names=FALSE) # save data frame to file
cat(readLines(f), sep="\n") # print file contents
read.csv(f)
```
Note that CSV is by far the most portable format for exchanging matrix-like
objects between different programs (statistical or numeric computing
environments, spreadsheets etc.).
<!-- TODO: relative vs absolute paths -->
## Data Frame Subsetting
### Each Data Frame is a List
First of all, we should note that
each data frame is in fact represented as an ordinary named list:
```{r dfclass}
class(x)
typeof(x)
```
Each column is stored as a separate list item.
Having said that, we shouldn't be surprised that we already know
how to perform quite a few operations on data frames:
```{r listdf}
length(x) # number of columns
names(x) # column labels
x$u # accessing column `u` (synonym: x[["u"]])
x[[2]] # 2nd column
x[c(1,3)] # a sub-data.frame
sapply(x, class) # apply class() on each column
```
### Each Data Frame is Matrix-like
Data frames can be considered as "generalised" matrices.
Therefore, operations such as subsetting will work in the same manner.
```{r matsubs1}
dim(x) # number of rows and columns
x[1:2,] # first two rows
x[,c(1,3)] # 1st and 3rd column, synonym: x[c(1,3)]
x[,1] # synonym: x[[1]]
x[,1,drop=FALSE] # synonym: x[1]
```
Take special note of how rows can be selected based on logical vectors.
For instance, let's extract all the rows from `x` where the values
in the column named `u` are between 0.3 and 0.6:
```{r matsubs2}
x[x$u>=0.3 & x$u<=0.6, ]
x[!(x[,"u"]<0.3 | x[,"u"]>0.6), ] # equivalent
```
Moreover, subsetting based on integer vectors can be used to
change the order of rows. Here is how we can sort the rows in `x`
with respect to the values in column `u`:
```{r matsubs3}
(x_sorted <- x[order(x$u),])
```
Let's stress that the programming style we emphasise here is very
transparent. If we don't understand how a complex operation is being executed,
we can always decompose it into smaller chunks that can be studied separately.
For instance, as far as the last example is concerned, we can
take a look at the manual of `?order` and then inspect the result
of calling `order(x$u)`.
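For instance, here is that decomposition spelled out step by step:
```{r matsubs3steps}
x$u              # the column we sort by
order(x$u)       # a permutation of row indices that puts x$u in increasing order
x[order(x$u), ]  # the rows of x rearranged according to that permutation
```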
On a side note, we can reset the row names as follows:
```{r rownamesreset}
row.names(x_sorted) <- NULL
x_sorted
```
## Common Operations
We already know how to filter rows based on logical conditions, e.g.:
```{r iris}
iris[iris$Petal.Width >= 1.2 & iris$Petal.Width <= 1.3,
c("Petal.Width", "Species")]
iris[iris$Sepal.Length > 6.5 & iris$Species == "versicolor", ]
```
and aggregate information in individual columns:
```{r summary}
sapply(iris[1:4], summary)
```
Quite frequently, we will be interested
in summarising data within subgroups generated by a list of factor-like variables.
```{r}
aggregate(iris[1:4], iris[5], mean) # not: iris[, 5] !
ToothGrowth[sample(nrow(ToothGrowth), 5), ] # 5 random rows
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], median)
```
According to the manual of the `aggregate()` function
(see `help("aggregate")`), the syntax is `aggregate(x, by, FUN)`, where:
* `x` is a data frame whose columns are to be aggregated;
* `by` is a list of grouping elements, each as long as the variables
  in the data frame `x`; recall that each data frame is a list of
  vectors of identical lengths, so ultimately `by` can be
  a data frame as well;
  this is why we have written `iris[5]`, and not `iris[[5]]` or
  `iris[, 5]` (although we could have used
  the `iris[, 5, drop=FALSE]` notation);
* `FUN` is an R function to be applied on each column in `x`
  within all groups of observations defined by `by`.
Further arguments can be passed to the function being called,
e.g., as in `aggregate(x, by, mean, na.rm=TRUE)`.
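For example, here is a minimal sketch using the built-in `airquality`
data frame, whose `Ozone` column contains missing values:
```{r aggregatenarm}
# months with any missing Ozone readings yield NA:
aggregate(airquality["Ozone"], airquality["Month"], mean)
# na.rm=TRUE is forwarded to mean(), so NAs are dropped first:
aggregate(airquality["Ozone"], airquality["Month"], mean, na.rm=TRUE)
```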
{ BEGIN exercise }
Below are examples using the built-in `ToothGrowth` data frame.
As an exercise, run all the following lines in R and study
the results carefully:
```{r aggregatehelp1,eval=FALSE}
ToothGrowth["supp"] # selects a specific column from a data frame; the result is still a data frame; remember that a data frame is also a list
ToothGrowth[["supp"]] # this is different - this is a vector; cannot be used in aggregate, at least not as `by` argument; extracts data from a column, is no longer a data frame
ToothGrowth[c("supp", "dose")] # selects 2 columns
aggregate(ToothGrowth["len"], ToothGrowth["supp"], mean) # computing average len in groups defined by supp
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean) # computing average len in groups defined by combinations of supp and dose
f <- function(x) c(MEAN=mean(x), MEDIAN=median(x)) # a custom function that returns two aggregates in form of a named (labelled) vector
aggregate(ToothGrowth["len"], ToothGrowth["supp"], f)
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], f)
```
{ END exercise }
{ BEGIN exercise }
We can introduce new grouping variables of any kind,
for example based on data in other columns.
Run the following line by line:
```{r aggregatehelp2,eval=FALSE}
ToothGrowth[["dose_1_or_2"]] <- ToothGrowth[["dose"]] >= 1.0 # note 2 square brackets now - dealing with individual columns
ToothGrowth # note the added column
aggregate(ToothGrowth["len"], ToothGrowth["dose_1_or_2"], function(x) c(MIN=min(x), MAX=max(x)))
```
{ END exercise }
Taking into account that `split()` accepts data frame inputs as well,
we can do the following:
```{r sapply}
sapply(
    # split iris into 3 sub-data.frames:
    split(iris, iris[5]),
    # on each sub-data.frame, apply the following function:
    function(df) {
        # compute the mean of the first four columns:
        sapply(df[1:4], mean)
    })
sapply(split(iris, iris[5]), function(df) {
    c(Sepal.Length=summary(df$Sepal.Length),
      Petal.Length=summary(df$Petal.Length)
    )
})
```
The above syntax relies solely on the building blocks that we have
already mastered! This should be very appealing to minimalists.
Note that the R packages `data.table` and `dplyr` provide us with
reimplementations of the most important functions for the processing
of data frames. We only cover the classic/base functions here,
because they have existed since the very beginning of R. These are known
to every R user and are likely to work this way forever.
Having said this, we should acknowledge that some users might find
`dplyr` or `data.table` more convenient in many contexts
(oftentimes they are faster or handle big data better).
The problem, though, is that when faced with less typical tasks
(as future AI/ML engineers/data scientists, we'll be particularly
exposed to them), we'll have to fall back on base R anyway,
because it is more powerful and gives us more transferable skills
(also with regard to objects of other types, e.g., matrices).
This is exactly why we prefer functions like `aggregate()`,
subsetting like `df[, ]`, etc.
## Metaprogramming and Formulas (\*)
R (together with a few other programming languages, such as Lisp and
Scheme, which heavily inspired R's semantics) allows its programmers to apply
some *metaprogramming* techniques, that is,
to write programs that manipulate unevaluated R expressions.
For instance, take a close look at the following plot:
```{r plotz-metaprogramming,fig.cap="Metaprogramming in action: Just take a look at the Y axis label"}
z <- seq(-2*pi, 2*pi, length.out=101)
plot(z, sin(z), type="l")
```
How did the `plot()` function know that we are plotting `sin` of `z`
(see Figure \@ref(fig:plotz-metaprogramming))?
It turns out that, at any time, we not only have access to the value
of an object (such as the result of evaluating `sin(z)`, which is
a vector of 101 reals) but also to the expression that was passed as
a function's argument itself.
```{r test_meta}
test_meta <- function(x) {
cat("x equals to ", x, "\n") # \n == newline
cat("x stemmed from ", deparse(substitute(x)), "\n")
}
test_meta(2+7)
```
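In fact, this is (roughly) how `plot()` comes up with its default axis
labels. Here is a simplified sketch; the wrapper name `my_plot` is made up
for illustration:
```{r myplot,eval=FALSE}
my_plot <- function(x, y) {
    # label the axes with the unevaluated expressions passed by the caller:
    plot(x, y, type="l",
        xlab=deparse(substitute(x)), ylab=deparse(substitute(y)))
}
my_plot(z, sin(z))  # the y-axis will be labelled "sin(z)"
```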
This is very powerful, yet potentially very confusing to users, because
we can write functions that don't evaluate their arguments
in the way we would normally expect (i.e., following the R language
specification). Each function can constitute a new micro-verse
with its own rules -- we should always refer to the documentation.
For instance, consider the `subset()` function:
```{r subset}
head(iris)
subset(iris, Sepal.Length>7.5, select=-(Sepal.Width:Petal.Width))
```
Neither `Sepal.Length>7.5` nor `-(Sepal.Width:Petal.Width)` makes sense
as a standalone R expression! However, according to the `subset()` function's
own rules, the former expression is considered a row selector
(here, `Sepal.Length` refers to a particular column *within* the `iris` data frame).
The latter plays the role of a column filter (select everything except the columns
between `Sepal.Width` and `Petal.Width`).
The `data.table` and `dplyr` packages (which are very popular)
rely on this language feature all the time,
so we shouldn't be surprised when we come across such constructs.
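For instance, a rough `dplyr` counterpart of the above `subset()` call
(assuming that package is installed) might read:
```{r subsetdplyr,eval=FALSE}
dplyr::select(dplyr::filter(iris, Sepal.Length > 7.5),
    -(Sepal.Width:Petal.Width))
```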
. . .
There is one more interesting language feature that is possible
thanks to metaprogramming.
*Formulas* are special R objects that consist of two unevaluated
R expressions separated by a tilde (`~`).
For example:
```{r formula}
len ~ supp+dose
```
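A formula is an ordinary R object whose parts we can inspect; here is
a quick illustrative sketch:
```{r formulainspect}
f <- len ~ supp+dose
class(f)     # an object of class "formula"
length(f)    # a two-sided formula consists of 3 parts
f[[1]]       # the tilde operator itself
f[[2]]       # the left-hand side (unevaluated)
f[[3]]       # the right-hand side (unevaluated)
all.vars(f)  # all variable names referred to
```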
A formula on its own has no meaning. However, many R functions
accept formulas as arguments and can interpret them in various ways.
For example, the `lm()` function, which fits a linear regression model,
uses formulas to specify the output and input variables:
```{r lm}
lm(Sepal.Length~Petal.Length+Sepal.Width, data=iris)
```
On the other hand, `boxplot()` (see Figure \@ref(fig:boxplot-metaprogramming))
allows for creating
separate box-and-whisker plots for each subgroup given by a combination
of factors.
```{r boxplot-metaprogramming,fig.cap="Example box plot created via the formula interface"}
boxplot(len~supp+dose, data=ToothGrowth,
horizontal=TRUE, col="white")
```
The `aggregate()` function supports formulas too:
```{r aggregate}
aggregate(cbind(Sepal.Length, Sepal.Width)~Species, data=iris, mean)
```
We should therefore make sure that we know how every function interacts
with a formula -- information on that can be found in `?lm`, `?boxplot`,
`?aggregate` and so forth.
<!--
TODO: Add OnTrack exercises
-->
## Further Reading
Recommended further reading: [@Rintro]
Other: [@rprogdatascience], [@r4ds]
R packages `dplyr` and `data.table` implement the most common
data frame wrangling procedures. You may find them very useful.
Moreover, they are very fast even for large data sets.
Additionally, the `magrittr` package provides a pipe operator, `%>%`,
that simplifies the writing of complex, nested function calls.
Do note that not everyone is a big fan of these, however.
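For illustration only -- a sketch of a nested call and its piped
counterpart, assuming `magrittr` is attached:
```{r magrittrpipe,eval=FALSE}
library("magrittr")
head(subset(iris, Sepal.Length > 7.5))          # nested function calls
iris %>% subset(Sepal.Length > 7.5) %>% head()  # the same, as a left-to-right pipeline
```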