<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Data Carpentry contributors" />
<title>The data.frame class</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; background-color: #f8f8f8; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
pre, code { background-color: #f8f8f8; }
code > span.kw { color: #204a87; font-weight: bold; } /* Keyword */
code > span.dt { color: #204a87; } /* DataType */
code > span.dv { color: #0000cf; } /* DecVal */
code > span.bn { color: #0000cf; } /* BaseN */
code > span.fl { color: #0000cf; } /* Float */
code > span.ch { color: #4e9a06; } /* Char */
code > span.st { color: #4e9a06; } /* String */
code > span.co { color: #8f5902; font-style: italic; } /* Comment */
code > span.ot { color: #8f5902; } /* Other */
code > span.al { color: #ef2929; } /* Alert */
code > span.fu { color: #000000; } /* Function */
code > span.er { color: #a40000; font-weight: bold; } /* Error */
code > span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #000000; } /* Constant */
code > span.sc { color: #000000; } /* SpecialChar */
code > span.vs { color: #4e9a06; } /* VerbatimString */
code > span.ss { color: #4e9a06; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #000000; } /* Variable */
code > span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */
code > span.op { color: #ce5c00; font-weight: bold; } /* Operator */
code > span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */
code > span.ex { } /* Extension */
code > span.at { color: #c4a000; } /* Attribute */
code > span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */
code > span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */
</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script src="libs/navigation-1.1/tabsets.js"></script>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">The <code>data.frame</code> class</h1>
<h4 class="author"><em>Data Carpentry contributors</em></h4>
</div>
<div id="TOC">
<ul>
<li><a href="#what-are-data-frames">What are data frames?</a><ul>
<li><a href="#challenge">Challenge</a></li>
</ul></li>
<li><a href="#inspecting-data.frame-objects">Inspecting <code>data.frame</code> Objects</a></li>
<li><a href="#indexing-sequences-and-subsetting">Indexing, Sequences, and Subsetting</a><ul>
<li><a href="#challenge-1">Challenge</a></li>
</ul></li>
</ul>
</div>
<hr />
<blockquote>
<h2 id="learning-objectives">Learning Objectives</h2>
<ul>
<li>understand the concept of a <code>data.frame</code></li>
<li>use sequences</li>
<li>know how to access any element of a <code>data.frame</code></li>
</ul>
</blockquote>
<hr />
<div id="what-are-data-frames" class="section level2">
<h2>What are data frames?</h2>
<p>Data frames are the <em>de facto</em> data structure for most tabular data, and what we use for statistics and plotting.</p>
<p>A data frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The <code>str()</code> function is useful to inspect the data types of the columns.</p>
<p>Data frames can be created by hand, but most commonly they are generated by the functions <code>read.csv()</code> or <code>read.table()</code>; in other words, when importing spreadsheets from your hard drive (or the web).</p>
<p>In versions of R before 4.0.0, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (converted) into the <code>factor</code> data type by default; the output shown in this lesson reflects that behavior. Since R 4.0.0, the default is to keep these columns as <code>character</code>. Depending on what you want to do with the data, you may prefer one or the other. To control this explicitly, <code>read.csv()</code> and <code>read.table()</code> have an argument called <code>stringsAsFactors</code>, which can be set to <code>FALSE</code> to keep text as <code>character</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">some_data <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/some_file.csv"</span>, <span class="dt">stringsAsFactors=</span><span class="ot">FALSE</span>)</code></pre></div>
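<p>To try this without a file on disk, the sketch below feeds a small inline CSV to <code>read.csv()</code> through its <code>text</code> argument. The data and the object names (<code>csv_text</code>, <code>as_factor</code>, <code>as_character</code>) are made up for illustration, and <code>stringsAsFactors</code> is set explicitly in both calls so that the result should not depend on your version of R.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical inline data standing in for a CSV file
csv_text <- "species,weight
DM,40
PF,8
DM,43"

## Same data, read with and without conversion of text to factors
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$species)     # "factor"
class(as_character$species)  # "character"
class(as_factor$weight)      # "integer": numbers are not affected</code></pre></div>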
<p>You can also create a data frame manually with the function <code>data.frame()</code>, which also accepts the <code>stringsAsFactors</code> argument. Compare the output of the two examples below to see the difference between text columns stored as <code>factor</code> and as <code>character</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## First example: stringsAsFactors is left at its default.
## Second example: stringsAsFactors=FALSE keeps text as `character`.
example_data <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">animal=</span><span class="kw">c</span>(<span class="st">"dog"</span>, <span class="st">"cat"</span>, <span class="st">"sea cucumber"</span>, <span class="st">"sea urchin"</span>),
<span class="dt">feel=</span><span class="kw">c</span>(<span class="st">"furry"</span>, <span class="st">"furry"</span>, <span class="st">"squishy"</span>, <span class="st">"spiny"</span>),
<span class="dt">weight=</span><span class="kw">c</span>(<span class="dv">45</span>, <span class="dv">8</span>, <span class="fl">1.1</span>, <span class="fl">0.8</span>))
<span class="kw">str</span>(example_data)</code></pre></div>
<pre><code>#> 'data.frame': 4 obs. of 3 variables:
#> $ animal: Factor w/ 4 levels "cat","dog","sea cucumber",..: 2 1 3 4
#> $ feel : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
#> $ weight: num 45 8 1.1 0.8</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">example_data <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">animal=</span><span class="kw">c</span>(<span class="st">"dog"</span>, <span class="st">"cat"</span>, <span class="st">"sea cucumber"</span>, <span class="st">"sea urchin"</span>),
<span class="dt">feel=</span><span class="kw">c</span>(<span class="st">"furry"</span>, <span class="st">"furry"</span>, <span class="st">"squishy"</span>, <span class="st">"spiny"</span>),
<span class="dt">weight=</span><span class="kw">c</span>(<span class="dv">45</span>, <span class="dv">8</span>, <span class="fl">1.1</span>, <span class="fl">0.8</span>), <span class="dt">stringsAsFactors=</span><span class="ot">FALSE</span>)
<span class="kw">str</span>(example_data)</code></pre></div>
<pre><code>#> 'data.frame': 4 obs. of 3 variables:
#> $ animal: chr "dog" "cat" "sea cucumber" "sea urchin"
#> $ feel : chr "furry" "furry" "squishy" "spiny"
#> $ weight: num 45 8 1.1 0.8</code></pre>
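<p>What the difference means in practice can be sketched with a single column. The example below reuses the <code>feel</code> values from above; the object names are only illustrative. A factor stores integer codes that point to a sorted set of levels, while a character vector stores the text itself.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## The same text values, as a character vector and as a factor
feel <- c("furry", "furry", "squishy", "spiny")
feel_factor <- factor(feel)

levels(feel_factor)      # "furry" "spiny" "squishy" (sorted alphabetically)
as.integer(feel_factor)  # 1 1 3 2: the integer codes stored behind the labels
nchar(feel)              # 5 5 7 5: string functions work on character vectors</code></pre></div>
<p>The integer codes <code>1 1 3 2</code> are the same ones that <code>str()</code> printed for the <code>feel</code> column in the first example.</p>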
<div id="challenge" class="section level3">
<h3>Challenge</h3>
<ol style="list-style-type: decimal">
<li><p>There are a few mistakes in this hand-crafted <code>data.frame</code>. Can you spot and fix them? Don’t hesitate to experiment!</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">author_book <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">author_first=</span><span class="kw">c</span>(<span class="st">"Charles"</span>, <span class="st">"Ernst"</span>, <span class="st">"Theodosius"</span>),
<span class="dt">author_last=</span><span class="kw">c</span>(Darwin, Mayr, Dobzhansky),
<span class="dt">year=</span><span class="kw">c</span>(<span class="dv">1942</span>, <span class="dv">1970</span>))</code></pre></div></li>
<li>Can you predict the class for each of the columns in the following example? Check your guesses using <code>str(country_climate)</code>:
<ul>
<li>Are they what you expected? Why? Why not?</li>
<li>What would have been different if we had added <code>stringsAsFactors = FALSE</code> to this call?</li>
<li>What would you need to change to ensure that each column had the correct data type?</li>
</ul>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">country_climate <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">country=</span><span class="kw">c</span>(<span class="st">"Canada"</span>, <span class="st">"Panama"</span>, <span class="st">"South Africa"</span>, <span class="st">"Australia"</span>),
<span class="dt">climate=</span><span class="kw">c</span>(<span class="st">"cold"</span>, <span class="st">"hot"</span>, <span class="st">"temperate"</span>, <span class="st">"hot/temperate"</span>),
<span class="dt">temperature=</span><span class="kw">c</span>(<span class="dv">10</span>, <span class="dv">30</span>, <span class="dv">18</span>, <span class="st">"15"</span>),
<span class="dt">northern_hemisphere=</span><span class="kw">c</span>(<span class="ot">TRUE</span>, <span class="ot">TRUE</span>, <span class="ot">FALSE</span>, <span class="st">"FALSE"</span>),
<span class="dt">has_kangaroo=</span><span class="kw">c</span>(<span class="ot">FALSE</span>, <span class="ot">FALSE</span>, <span class="ot">FALSE</span>, <span class="dv">1</span>))</code></pre></div></li>
<li><p>We introduced you to the <code>data.frame()</code> function and <code>read.csv()</code>, but what if we are starting with some vectors? The best way to combine them into a data frame is to pass those vectors to the <code>data.frame()</code> function, as in the examples above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">color <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"red"</span>, <span class="st">"green"</span>, <span class="st">"blue"</span>, <span class="st">"yellow"</span>)
counts <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">50</span>, <span class="dv">60</span>, <span class="dv">65</span>, <span class="dv">82</span>)
new_dataframe <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">colors =</span> color, <span class="dt">counts =</span> counts)</code></pre></div></li>
</ol>
<p>Try making your own new data frame from some vectors. You can check the data type of the new object using <code>class()</code>.</p>
<p><!--- Answers
--></p>
<p>The automatic conversion of data types is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double-check that the data you import into R are of the correct type within your data frame. If they are not, use the conversion to your advantage to detect mistakes that might have been introduced during data entry (for instance, a letter in a column that should only contain numbers).</p>
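<p>As a rough illustration of how a data-entry mistake can show up, the sketch below reads a small made-up table in which one weight was typed with a letter instead of a digit. The data and the names <code>csv_text</code>, <code>plots</code>, and <code>as_number</code> are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical data: the third weight was mistyped as "4o" (letter o)
csv_text <- "plot,weight
1,40
2,48
3,4o
4,37"
plots <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(plots$weight)  # "character", not numeric: a hint that something is off

## Entries that cannot be converted to numbers become NA
as_number <- suppressWarnings(as.numeric(plots$weight))
plots[is.na(as_number), ]  # shows the row holding the mistyped value</code></pre></div>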
</div>
</div>
<div id="inspecting-data.frame-objects" class="section level2">
<h2>Inspecting <code>data.frame</code> Objects</h2>
<p>We already saw how the functions <code>head()</code> and <code>str()</code> can be useful to check the content and the structure of a <code>data.frame</code>. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.</p>
<ul>
<li>Size:
<ul>
<li><code>dim()</code> - returns a vector with the number of rows in the first element, and the number of columns as the second element (the <strong>dim</strong>ensions of the object)</li>
<li><code>nrow()</code> - returns the number of rows</li>
<li><code>ncol()</code> - returns the number of columns</li>
</ul></li>
<li>Content:
<ul>
<li><code>head()</code> - shows the first 6 rows (by default)</li>
<li><code>tail()</code> - shows the last 6 rows (by default)</li>
</ul></li>
<li>Names:
<ul>
<li><code>names()</code> - returns the column names (synonym of <code>colnames()</code> for <code>data.frame</code> objects)</li>
<li><code>rownames()</code> - returns the row names</li>
</ul></li>
<li>Summary:
<ul>
<li><code>str()</code> - structure of the object and information about the class, length and content of each column</li>
<li><code>summary()</code> - summary statistics for each column</li>
</ul></li>
</ul>
<p>Note: most of these functions are “generic”: they can be used on other types of objects besides <code>data.frame</code>.</p>
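<p>A few of these functions can be tried on a small data frame. The <code>animals</code> object below is a made-up example along the lines of the one used earlier, not part of the survey data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical data frame for illustration
animals <- data.frame(animal = c("dog", "cat", "sea cucumber", "sea urchin"),
                      weight = c(45, 8, 1.1, 0.8),
                      stringsAsFactors = FALSE)

dim(animals)       # 4 2: rows first, then columns
nrow(animals)      # 4
ncol(animals)      # 2
names(animals)     # "animal" "weight"
head(animals, 2)   # the first 2 rows instead of the default 6
summary(animals$weight)  # summary() also works on a single vector</code></pre></div>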
</div>
<div id="indexing-sequences-and-subsetting" class="section level2">
<h2>Indexing, Sequences, and Subsetting</h2>
<p><code>:</code> is a special function that creates numeric vectors of integers in increasing or decreasing order; try <code>1:10</code> and <code>10:1</code>, for instance. The function <code>seq()</code> (for <strong>seq</strong>uence) can be used to create more complex patterns:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">seq</span>(<span class="dv">1</span>, <span class="dv">10</span>, <span class="dt">by=</span><span class="dv">2</span>)</code></pre></div>
<pre><code>#> [1] 1 3 5 7 9</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">seq</span>(<span class="dv">5</span>, <span class="dv">10</span>, <span class="dt">length.out=</span><span class="dv">3</span>)</code></pre></div>
<pre><code>#> [1] 5.0 7.5 10.0</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">seq</span>(<span class="dv">50</span>, <span class="dt">by=</span><span class="dv">5</span>, <span class="dt">length.out=</span><span class="dv">10</span>)</code></pre></div>
<pre><code>#> [1] 50 55 60 65 70 75 80 85 90 95</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">seq</span>(<span class="dv">1</span>, <span class="dv">8</span>, <span class="dt">by=</span><span class="dv">3</span>) <span class="co"># sequence stops to stay below upper limit</span></code></pre></div>
<pre><code>#> [1] 1 4 7</code></pre>
<p>Our survey data frame has rows and columns (it has 2 dimensions). To extract specific data from it, we need to specify the “coordinates” we want. Row numbers come first, followed by column numbers. Note, however, that different ways of specifying these coordinates lead to results with different classes.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">surveys[<span class="dv">1</span>] <span class="co"># first column in the data frame (as a data.frame)</span>
surveys[,<span class="dv">1</span>] <span class="co"># first column in the data frame (as a vector)</span>
surveys[<span class="dv">1</span>, <span class="dv">1</span>] <span class="co"># first element in the first column of the data frame (as a vector)</span>
surveys[<span class="dv">1</span>, <span class="dv">6</span>] <span class="co"># first element in the 6th column (as a vector)</span>
surveys[<span class="dv">1</span>:<span class="dv">3</span>, <span class="dv">7</span>] <span class="co"># first three elements in the 7th column (as a vector)</span>
surveys[<span class="dv">3</span>, ] <span class="co"># the 3rd row, with all columns (as a data.frame)</span>
head_surveys <-<span class="st"> </span>surveys[<span class="dv">1</span>:<span class="dv">6</span>, ] <span class="co"># equivalent to head(surveys)</span></code></pre></div>
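<p>To check the class of each result yourself, you can wrap the expression in <code>class()</code>. The sketch below uses a small, made-up data frame called <code>small</code> as a stand-in for <code>surveys</code>, so the column names and values are assumptions.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical stand-in for the surveys data
small <- data.frame(record_id = 1:4,
                    species_id = c("NL", "NL", "DM", "DM"),
                    weight = c(NA, 160, 40, 48),
                    stringsAsFactors = FALSE)

class(small[1])      # "data.frame": one column, still a data frame
class(small[, 1])    # "integer": the column as a vector
class(small[2, 3])   # "numeric": a single value
class(small[2, ])    # "data.frame": one row, all columns
dim(small[1:2, ])    # 2 3: the first two rows, all columns</code></pre></div>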
<p>You can also exclude certain parts of a data frame by putting a minus sign in front of the index:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">surveys[,-<span class="dv">1</span>] <span class="co">#The whole data frame, except the first column</span>
surveys[-<span class="kw">c</span>(<span class="dv">7</span>:<span class="dv">34786</span>),] <span class="co">#equivalent to head(surveys)</span></code></pre></div>
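<p>The same idea on a data frame small enough to check by eye. As before, <code>small</code> is a made-up example and not the real survey data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical stand-in for the surveys data
small <- data.frame(record_id = 1:4,
                    species_id = c("NL", "NL", "DM", "DM"),
                    weight = c(NA, 160, 40, 48),
                    stringsAsFactors = FALSE)

names(small[, -1])   # "species_id" "weight": the first column is dropped
small[-c(3:4), ]     # rows 3 and 4 are dropped, so rows 1 and 2 remain
nrow(small[-1, ])    # 3: every row except the first</code></pre></div>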
<p>As well as using numeric values to subset a <code>data.frame</code> (or <code>matrix</code>), columns can be called by name, using one of the four following notations:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">surveys[<span class="st">"species_id"</span>] <span class="co"># Result is a data.frame</span>
surveys[, <span class="st">"species_id"</span>] <span class="co"># Result is a vector</span>
surveys[[<span class="st">"species_id"</span>]] <span class="co"># Result is a vector</span>
surveys$species_id <span class="co"># Result is a vector</span></code></pre></div>
<p>For our purposes, the last three notations are equivalent: each returns the column as a vector, whereas the first returns a <code>data.frame</code>. However, the last one with the <code>$</code> does partial matching on the name, so you could also select the column <code>"day"</code> by typing <code>surveys$d</code>. It’s a shortcut and, as with all shortcuts, it can have dangerous consequences and is best avoided. Besides, with auto-completion in RStudio, you rarely have to type more than a few characters to get the full and correct column name.</p>
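<p>One way partial matching can go wrong is sketched below with a made-up data frame, <code>dates</code>. The column names are hypothetical and chosen only to show the effect.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical data frame with a "day" column
dates <- data.frame(month = c(7, 7, 8), day = c(16, 17, 1))

dates$d         # 16 17 1: "d" partially matches "day"
dates[["d"]]    # NULL: [[ ]] requires the exact name by default
dates[["day"]]  # 16 17 1

## Adding a second column that starts with "d" makes the shortcut ambiguous
dates$date_label <- c("a", "b", "c")
dates$d         # NULL, with no error: "d" now matches two columns</code></pre></div>
<p>Code that relied on <code>dates$d</code> would quietly start returning <code>NULL</code> once the second column was added, which is why spelling out the full column name is the safer habit.</p>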
<div id="challenge-1" class="section level3">
<h3>Challenge</h3>
<ol style="list-style-type: decimal">
<li><p>The function <code>nrow()</code> on a <code>data.frame</code> returns the number of rows. Use it, in conjunction with <code>seq()</code> to create a new <code>data.frame</code> called <code>surveys_by_10</code> that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …)</p></li>
<li><p>Create a <code>data.frame</code> containing only the observations from row 1999 of the <code>surveys</code> dataset.</p></li>
<li><p>Notice how <code>nrow()</code> gave you the number of rows in a <code>data.frame</code>? Use <code>nrow()</code> instead of a row number to make a <code>data.frame</code> with observations from only the last row of the <code>surveys</code> dataset.</p></li>
<li><p>Now that you’ve seen how <code>nrow()</code> can be used to stand in for a row index, let’s combine that behavior with the <code>-</code> notation above to reproduce the behavior of <code>head(surveys)</code> by excluding the 7th through final row of the <code>surveys</code> dataset.</p></li>
</ol>
<!---
```r
## Answers
surveys_by_10 <- surveys[seq(10, nrow(surveys), by=10), ]
surveys_1999 <- surveys[1999, ]
surveys_last <- surveys[nrow(surveys),]
surveys_head <- surveys[-c(7:nrow(surveys)),]
```
--->
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>