forked from datacarpentry/R-ecology-lesson
---
layout: topic
title: The `data.frame` class
author: Data Carpentry contributors
minutes: 30
---
```{r, echo=FALSE, purl=FALSE, message = FALSE}
source("setup.R")
surveys <- read.csv("data/portal_data_joined.csv")
```
```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```
------------
> ## Learning Objectives
>
> * describe what a `data.frame` is
> * read data from a file into a `data.frame` and change how character strings are handled
> * summarize the size and data types of a `data.frame`
> * write a command to print a sequence of numbers
> * subset part of a `data.frame` e.g. particular rows or columns
------------
## What are data frames?
Data frames are the _de facto_ data structure for most tabular data, and what we
use for statistics and plotting.

A data frame is a collection of vectors of identical lengths. Each vector
represents a column, and each vector can be of a different data type (e.g.,
characters, integers, factors). The `str()` function is useful to inspect the
data types of the columns.
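
As a minimal sketch of this idea, the following builds a small data frame from three equal-length vectors and checks the type of each column. The object `pets` and its columns are made up for illustration and are not part of the lesson data.

```{r, purl=FALSE}
## Hypothetical toy data, not part of the surveys dataset
pets <- data.frame(name = c("Rex", "Tom", "Nemo"),
                   legs = c(4L, 4L, 0L),
                   weight = c(30.5, 4.2, 0.1),
                   stringsAsFactors = FALSE)
str(pets)               # one line per column, with its data type
sapply(pets, class)     # class of each column, as a named character vector
sapply(pets, length)    # every column has the same length (3)
```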

Data frames can be created by hand, but most commonly they are generated by the
functions `read.csv()` or `read.table()`; in other words, when importing
spreadsheets from your hard drive (or the web).

In versions of R before 4.0.0, when building or importing a data frame, the
columns that contain characters (i.e., text) are coerced (= converted) into the
`factor` data type by default; since R 4.0.0, they are kept as `character` by
default. Depending on what you want to do with the data, you may want one or the
other. To control this explicitly, `read.csv()` and `read.table()` have an
argument called `stringsAsFactors`, which can be set to `FALSE` (or `TRUE`):
```{r, eval=FALSE, purl=FALSE}
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
```
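
To see the effect of `stringsAsFactors` without needing a file on disk, here is a small sketch that passes made-up CSV content through the `text` argument (which `read.csv()` forwards to `read.table()`). The object names `csv_text`, `as_factor` and `as_char` are hypothetical.

```{r, purl=FALSE}
## Hypothetical CSV content supplied as text instead of a file
csv_text <- "animal,weight\ndog,45\ncat,8"
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_factor$animal)   # "factor"
class(as_char$animal)     # "character"
class(as_char$weight)     # "integer": numeric columns are not affected
```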
You can also create a data frame manually with the function `data.frame()`. This
function also takes the argument `stringsAsFactors`. Compare the output of the
following examples, and note the difference between text columns stored as
`character` and those stored as `factor`.
```{r, results='show', purl=TRUE}
## Compare the output of these examples, and note the difference between text
## columns stored as `factor` (first call) and as `character` (second call).
## `stringsAsFactors` is set explicitly in both calls because its default
## changed from TRUE to FALSE in R 4.0.0.
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=TRUE)
str(example_data)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
```
### Challenge
1. There are a few mistakes in this hand-crafted `data.frame`. Can you spot and
fix them? Don't hesitate to experiment!
```{r, eval=FALSE, purl=FALSE}
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Challenge
## There are a few mistakes in this hand-crafted `data.frame`.
## Can you spot and fix them? Don't hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
2. Can you predict the class for each of the columns in the following example?
Check your guesses using `str(country_climate)`:
* Are they what you expected? Why? Why not?
* What would have been different if we had added `stringsAsFactors = FALSE` to this call?
* What would you need to change to ensure that each column had the accurate data type?
```{r, eval=FALSE, purl=FALSE}
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Challenge:
## Can you predict the class for each of the columns in the following
## example?
## Check your guesses using `str(country_climate)`:
## * Are they what you expected? Why? Why not?
## * What would have been different if we had added `stringsAsFactors = FALSE`
## to this call?
## * What would you need to change to ensure that each column had the
## accurate data type?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
3. We introduced you to the `data.frame()` function and `read.csv()`, but what
if we are starting with some vectors? The best way to do this is to pass
those vectors to the `data.frame()` function, similar to the above.
```{r, eval=FALSE, purl=FALSE}
color <- c("red", "green", "blue", "yellow")
counts <- c(50, 60, 65, 82)
new_dataframe <- data.frame(colors = color, counts = counts)
```
Try making your own new data frame from some vectors. You can check the data
type of the new object using `class()`.
<!--- Answers
```{r, eval=FALSE, echo=FALSE, purl=FALSE}
## Answers
## * missing quotations around the last names of the authors
## * the year column is missing one value, 1859 (the year of publication of
## On the Origin of Species)
```
```{r, eval=FALSE, echo=FALSE, purl=FALSE}
## Answers
## * `country`, `climate`, `temperature`, and `northern_hemisphere` are
## factors (or character in R >= 4.0.0, where `stringsAsFactors` defaults to
## `FALSE`); `has_kangaroo` is numeric.
## * using `stringsAsFactors=FALSE` would have made them character instead of
## factors
## * removing the quotes in temperature, northern_hemisphere, and replacing 1
## by TRUE in the `has_kangaroo` column would probably be what was originally
## intended.
```
-->
The automatic conversion of data types is sometimes a blessing, sometimes an
annoyance. Be aware that it exists, learn the rules, and double-check that the
data you import into R are of the correct type within your data frame. If they
are not, use this to your advantage to detect mistakes that might have been
introduced during data entry (for instance, a letter in a column that should
only contain numbers).
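
One way to track down such a data-entry mistake is sketched below, assuming a column that was read as text because one entry contains a letter. The vector `raw_weights` is invented for illustration; `as.numeric()` turns anything it cannot parse into `NA`, which reveals the offending position.

```{r, purl=FALSE}
## Hypothetical hand-typed values; the third entry contains the letter "O"
raw_weights <- c("12", "7", "1O", "30")
class(raw_weights)                                    # "character"
weights <- suppressWarnings(as.numeric(raw_weights))  # unparseable entries become NA
weights
which(is.na(weights))          # position of the entry that failed to convert
raw_weights[is.na(weights)]    # the offending value itself
```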
## Inspecting `data.frame` Objects
We already saw how the functions `head()` and `str()` can be useful to check the
content and the structure of a `data.frame`. Here is a non-exhaustive list of
functions to get a sense of the content/structure of the data.
* Size:
    * `dim()` - returns a vector with the number of rows in the first element,
      and the number of columns as the second element (the **dim**ensions of
      the object)
    * `nrow()` - returns the number of rows
    * `ncol()` - returns the number of columns
* Content:
    * `head()` - shows the first 6 rows
    * `tail()` - shows the last 6 rows
* Names:
    * `names()` - returns the column names (synonym of `colnames()` for `data.frame`
      objects)
    * `rownames()` - returns the row names
* Summary:
    * `str()` - structure of the object and information about the class, length and
      content of each column
    * `summary()` - summary statistics for each column

Note: most of these functions are "generic"; they can be used on other types of
objects besides `data.frame`.
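
The functions listed above can be tried on a small example. This is only a sketch; the data frame `toy` is hypothetical and stands in for any `data.frame`.

```{r, purl=FALSE}
## Hypothetical toy data frame with 8 rows and 2 columns
toy <- data.frame(id = 1:8, value = seq(0.5, 4, by = 0.5))
dim(toy)              # 8 2
nrow(toy)             # 8
ncol(toy)             # 2
names(toy)            # "id" "value"
head(toy)             # first 6 rows
tail(toy, 2)          # last 2 rows; the second argument overrides the default of 6
summary(toy$value)    # summary statistics for one column
```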
## Indexing, Sequences, and Subsetting
```{r, echo=FALSE, purl=TRUE}
## Sequences and Subsetting data frames
```
`:` is a special function that creates numeric vectors of integers in increasing
or decreasing order; try `1:10` and `10:1`, for instance. The function `seq()`
(for **seq**uence) can be used to create more complex patterns:
```{r, results='show', purl=FALSE}
seq(1, 10, by=2)
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit
```
Our survey data frame has rows and columns (it has 2 dimensions). If we want to
extract some specific data from it, we need to specify the "coordinates" we
want. Row numbers come first, followed by column numbers. However, note
that different ways of specifying these coordinates lead to results with
different classes.
```{r, purl=FALSE}
surveys[1] # first column in the data frame (as a data.frame)
surveys[,1] # first column in the data frame (as a vector)
surveys[1, 1] # first element in the first column of the data frame (as a vector)
surveys[1, 6] # first element in the 6th column (as a vector)
surveys[1:3, 7] # first three elements in the 7th column (as a vector)
surveys[3, ] # the 3rd row, with all columns (as a data.frame)
head_surveys <- surveys[1:6, ] # equivalent to head(surveys)
```
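
The class of each result can be checked directly with `class()`. The sketch below uses a hypothetical data frame `toy` rather than `surveys`, and also shows the `drop = FALSE` argument, which keeps a single selected column as a `data.frame`.

```{r, purl=FALSE}
## Hypothetical toy data frame
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
class(toy[1])                  # "data.frame": a single index selects columns
class(toy[, 1])                # "integer": a single column is simplified to a vector
class(toy[, 1, drop = FALSE])  # "data.frame": drop = FALSE prevents the simplification
class(toy[2, ])                # "data.frame": a row can mix types, so it is not simplified
```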
You can also exclude certain parts of a data frame:
```{r, purl=FALSE}
surveys[, -1] # the whole data frame, except the first column
surveys[-c(7:34786), ] # equivalent to head(surveys)
```
As well as using numeric values to subset a `data.frame` (or `matrix`), columns
can be called by name, using one of the four following notations:
```{r, eval = FALSE, purl=FALSE}
surveys["species_id"] # Result is a data.frame
surveys[, "species_id"] # Result is a vector
surveys[["species_id"]] # Result is a vector
surveys$species_id # Result is a vector
```
For our purposes, the last three notations are equivalent: each returns the
column as a vector. However, the `$` notation does partial matching on the
name, so you could also select the column `"day"` by typing `surveys$d`. It's a
shortcut and, as with all shortcuts, it can have dangerous consequences and is
best avoided. Besides, with auto-completion in RStudio, you rarely have to type
more than a few characters to get the full and correct column name.
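
The following sketch illustrates why partial matching can be risky. It uses a hypothetical data frame `toy`, and the `date` column is added only to create an ambiguous prefix.

```{r, purl=FALSE}
## Hypothetical toy data frame
toy <- data.frame(day = c(1, 2), month = c(7, 8))
toy$d                     # partial match: returns the `day` column
toy[["d"]]                # NULL: `[[` matches names exactly by default
toy$date <- c("a", "b")   # add a second column whose name starts with "d"
toy$d                     # NULL: "d" is now ambiguous, and no error is raised
```

Code that relied on `toy$d` would silently change behavior once the second column appears, which is one reason to spell out full column names.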
### Challenge
1. The function `nrow()` on a `data.frame` returns the number of rows. Use it,
in conjunction with `seq()` to create a new `data.frame` called
`surveys_by_10` that includes every 10th row of the survey data frame
starting at row 10 (10, 20, 30, ...).
2. Create a `data.frame` containing only the observations from row 1999 of the
`surveys` dataset.
3. Notice how `nrow()` gave you the number of rows in a `data.frame`? Use `nrow()`
instead of a row number to make a `data.frame` with observations from only the last
row of the `surveys` dataset.
4. Now that you've seen how `nrow()` can be used to stand in for a row index, let's combine
that behavior with the `-` notation above to reproduce the behavior of `head(surveys)`
by excluding the 7th through final rows of the `surveys` dataset.
```{r, echo=FALSE, purl=TRUE}
### 1. The function `nrow()` on a `data.frame` returns the number of
### rows. Use it, in conjunction with `seq()` to create a new
### `data.frame` called `surveys_by_10` that includes every 10th row
### of the survey data frame starting at row 10 (10, 20, 30, ...)
###
### 2. Create a data.frame containing only the observations from row 1999 of the
### surveys dataset.
###
### 3. Notice how `nrow()` gave you the number of rows in a `data.frame`? Use `nrow()`
### instead of a row number to make a `data.frame` with observations from only the last
### row of the `surveys` dataset.
###
### 4. Now that you've seen how `nrow()` can be used to stand in for a row index, let's combine
### that behavior with the `-` notation above to reproduce the behavior of `head(surveys)`
### by excluding the 7th through final rows of the `surveys` dataset.
```
<!---
```{r, purl=FALSE}
## Answers
surveys_by_10 <- surveys[seq(10, nrow(surveys), by=10), ]
surveys_1999 <- surveys[1999, ]
surveys_last <- surveys[nrow(surveys),]
surveys_head <- surveys[-c(7:nrow(surveys)),]
```
--->