<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Data Carpentry contributors" />
<title>Starting with data</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; background-color: #f8f8f8; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
pre, code { background-color: #f8f8f8; }
code > span.kw { color: #204a87; font-weight: bold; } /* Keyword */
code > span.dt { color: #204a87; } /* DataType */
code > span.dv { color: #0000cf; } /* DecVal */
code > span.bn { color: #0000cf; } /* BaseN */
code > span.fl { color: #0000cf; } /* Float */
code > span.ch { color: #4e9a06; } /* Char */
code > span.st { color: #4e9a06; } /* String */
code > span.co { color: #8f5902; font-style: italic; } /* Comment */
code > span.ot { color: #8f5902; } /* Other */
code > span.al { color: #ef2929; } /* Alert */
code > span.fu { color: #000000; } /* Function */
code > span.er { color: #a40000; font-weight: bold; } /* Error */
code > span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #000000; } /* Constant */
code > span.sc { color: #000000; } /* SpecialChar */
code > span.vs { color: #4e9a06; } /* VerbatimString */
code > span.ss { color: #4e9a06; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #000000; } /* Variable */
code > span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */
code > span.op { color: #ce5c00; font-weight: bold; } /* Operator */
code > span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */
code > span.ex { } /* Extension */
code > span.at { color: #c4a000; } /* Attribute */
code > span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */
code > span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */
</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script src="libs/navigation-1.1/tabsets.js"></script>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">Starting with data</h1>
<h4 class="author"><em>Data Carpentry contributors</em></h4>
</div>
<div id="TOC">
<ul>
<li><a href="#presentation-of-the-survey-data">Presentation of the Survey Data</a><ul>
<li><a href="#challenge">Challenge</a></li>
</ul></li>
<li><a href="#factors">Factors</a><ul>
<li><a href="#converting-factors">Converting factors</a></li>
<li><a href="#challenge-1">Challenge</a></li>
</ul></li>
</ul>
</div>
<hr />
<blockquote>
<h2 id="learning-objectives">Learning Objectives</h2>
<ul>
<li>load external data (CSV files) in memory using the survey table (<code>surveys.csv</code>) as an example</li>
<li>explore the structure and the content of a data frame in R</li>
<li>understand what factors are and how to manipulate them</li>
</ul>
</blockquote>
<hr />
<div id="presentation-of-the-survey-data" class="section level2">
<h2>Presentation of the Survey Data</h2>
<p>We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a CSV file: each row holds information for a single animal, and the columns represent:</p>
<table>
<thead>
<tr class="header">
<th align="left">Column</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">record_id</td>
<td align="left">Unique id for the observation</td>
</tr>
<tr class="even">
<td align="left">month</td>
<td align="left">month of observation</td>
</tr>
<tr class="odd">
<td align="left">day</td>
<td align="left">day of observation</td>
</tr>
<tr class="even">
<td align="left">year</td>
<td align="left">year of observation</td>
</tr>
<tr class="odd">
<td align="left">plot_id</td>
<td align="left">ID of a particular plot</td>
</tr>
<tr class="even">
<td align="left">species_id</td>
<td align="left">2-letter code</td>
</tr>
<tr class="odd">
<td align="left">sex</td>
<td align="left">sex of animal (“M”, “F”)</td>
</tr>
<tr class="even">
<td align="left">hindfoot_length</td>
<td align="left">length of the hindfoot in mm</td>
</tr>
<tr class="odd">
<td align="left">weight</td>
<td align="left">weight of the animal in grams</td>
</tr>
<tr class="even">
<td align="left">genus</td>
<td align="left">genus of animal</td>
</tr>
<tr class="odd">
<td align="left">species</td>
<td align="left">species of animal</td>
</tr>
<tr class="even">
<td align="left">taxa</td>
<td align="left">e.g. Rodent, Reptile, Bird, Rabbit</td>
</tr>
<tr class="odd">
<td align="left">plot_type</td>
<td align="left">type of plot</td>
</tr>
</tbody>
</table>
<p>We are going to use the R function <code>download.file()</code> to download the CSV file that contains the survey data from figshare, and we will use <code>read.csv()</code> to load the content of the CSV file into memory as a <code>data.frame</code>.</p>
<p>To download the data into the <code>data/</code> subdirectory, do:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">download.file</span>(<span class="st">"https://ndownloader.figshare.com/files/2292169"</span>,
<span class="st">"data/portal_data_joined.csv"</span>)</code></pre></div>
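<p>Note that <code>download.file()</code> writes to the destination path as given, so the <code>data/</code> subdirectory has to exist before the call. A minimal sketch, assuming a hypothetical helper named <code>download_if_missing()</code> (an illustration, not part of the lesson or of base R), that creates the directory when needed and skips the download when the file is already present:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Hypothetical helper (illustration only): download a file only if it is missing.
download_if_missing <- function(url, destfile) {
  ## Create the destination directory (e.g. "data/") if it does not exist yet
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # file already there, nothing downloaded
  }
  download.file(url, destfile)
  invisible(TRUE)             # a download was attempted
}</code></pre></div>
<p>With this helper, the call above could be written as <code>download_if_missing("https://ndownloader.figshare.com/files/2292169", "data/portal_data_joined.csv")</code>.</p>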
<p>You are now ready to load the data:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">surveys <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">'data/portal_data_joined.csv'</span>)</code></pre></div>
<p>This statement doesn’t produce any output because, as you might recall, assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: <code>surveys</code>.</p>
<p>Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this <code>data.frame</code> using the function <code>head()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(surveys)</code></pre></div>
<pre><code>#>   record_id month day year plot_id species_id sex hindfoot_length weight
#> 1         1     7  16 1977       2         NL   M              32     NA
#> 2        72     8  19 1977       2         NL   M              31     NA
#> 3       224     9  13 1977       2         NL                  NA     NA
#> 4       266    10  16 1977       2         NL                  NA     NA
#> 5       349    11  12 1977       2         NL                  NA     NA
#> 6       363    11  12 1977       2         NL                  NA     NA
#>     genus  species   taxa plot_type
#> 1 Neotoma albigula Rodent   Control
#> 2 Neotoma albigula Rodent   Control
#> 3 Neotoma albigula Rodent   Control
#> 4 Neotoma albigula Rodent   Control
#> 5 Neotoma albigula Rodent   Control
#> 6 Neotoma albigula Rodent   Control</code></pre>
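<p><code>head()</code> also accepts an optional argument <code>n</code> to change the number of rows shown, and <code>tail()</code> does the same for the end of the table. A small sketch on a made-up data frame (the name <code>toy</code> and its values are invented for illustration and are not part of the survey data):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Made-up data frame with 10 rows, for illustration only
toy <- data.frame(id = 1:10, value = (1:10)^2)
nrow(head(toy))    # 6: the default number of rows returned by head()
head(toy, n = 3)   # the first 3 rows
tail(toy, n = 2)   # the last 2 rows (ids 9 and 10)</code></pre></div>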
<p>A <code>data.frame</code> represents data as a table whose columns are vectors that all have the same length. Because each column is a vector, all the values within a column have the same data type, although different columns can hold different types. We can see this when inspecting the <strong>str</strong>ucture of a <code>data.frame</code> with the function <code>str()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">str</span>(surveys)</code></pre></div>
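<p>To make the idea of "columns are vectors of the same length" concrete, here is a minimal sketch that builds a small data frame by hand. The name <code>toy_surveys</code> and all of its values are invented for illustration; only the column names echo the survey table.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Invented values, for illustration only
toy_surveys <- data.frame(
  record_id  = c(1L, 72L, 224L),     # integer vector
  species_id = c("NL", "NL", "DM"),  # character vector
  weight     = c(NA, 40.5, 38),      # numeric vector with a missing value
  stringsAsFactors = FALSE           # keep text as character in every R version
)
sapply(toy_surveys, class)  # one class per column
dim(toy_surveys)            # number of rows, then number of columns
str(toy_surveys)            # compact summary of the structure</code></pre></div>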
<div id="challenge" class="section level3">
<h3>Challenge</h3>
<p>Based on the output of <code>str(surveys)</code>, can you answer the following questions?</p>
<ul>
<li>What is the class of the object <code>surveys</code>?</li>
<li>How many rows and how many columns are in this object?</li>
<li>How many species have been recorded during these surveys?</li>
</ul>
<!---
--->
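<p>As you can see, many of the columns consist of integers; however, the columns <code>species</code> and <code>sex</code> are of a special class called <code>factor</code>. (This is what <code>read.csv()</code> produces by default in R versions before 4.0.0; from R 4.0.0 onwards, text columns are read as character vectors unless you pass <code>stringsAsFactors = TRUE</code>.) Before we learn more about the <code>data.frame</code> class, let’s talk about factors. They are very useful but not necessarily intuitive, and therefore require some attention.</p>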
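<p>The sketch below shows how the <code>stringsAsFactors</code> argument of <code>read.csv()</code> controls whether text columns become factors. It reads two made-up rows from an inline string through the <code>text</code> argument instead of a file; the variable names are invented for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Two made-up rows supplied inline instead of a file on disk
csv_text <- "species,sex\nalbigula,M\nmerriami,F"
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_chars   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_factors$species)  # "factor"
class(as_chars$species)    # "character"</code></pre></div>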
</div>
</div>
<div id="factors" class="section level2">
<h2>Factors</h2>
<p>Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.</p>
<p>Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.</p>
<p>Once created, factors can only contain a pre-defined set of values, known as <em>levels</em>. By default, R always sorts <em>levels</em> in alphabetical order. For instance, if you have a factor with 2 levels:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sex <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="st">"male"</span>, <span class="st">"female"</span>, <span class="st">"female"</span>, <span class="st">"male"</span>))</code></pre></div>
<p>R will assign <code>1</code> to the level <code>"female"</code> and <code>2</code> to the level <code>"male"</code> (because <code>f</code> comes before <code>m</code>, even though the first element in this vector is <code>"male"</code>). You can check this by using the function <code>levels()</code>, and check the number of levels using <code>nlevels()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">levels</span>(sex)
<span class="kw">nlevels</span>(sex)</code></pre></div>
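<p>To see the integers behind the labels, you can convert the factor with <code>as.integer()</code>. A short sketch that repeats the definition of <code>sex</code> so it runs on its own:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sex <- factor(c("male", "female", "female", "male"))
levels(sex)      # "female" "male"
nlevels(sex)     # 2
as.integer(sex)  # 2 1 1 2: the integer code stored for each element</code></pre></div>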
<p>Sometimes the order of the levels does not matter; at other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, declaring the factor as ordered (<code>ordered=TRUE</code>) allows its levels to be compared:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">food <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"high"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>, <span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>))
<span class="kw">levels</span>(food)
food <-<span class="st"> </span><span class="kw">factor</span>(food, <span class="dt">levels=</span><span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>))
<span class="kw">levels</span>(food)
<span class="kw">min</span>(food) ## doesn't work</code></pre></div>
<pre><code>#> Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", : 'min' not meaningful for factors</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">food <-<span class="st"> </span><span class="kw">factor</span>(food, <span class="dt">levels=</span><span class="kw">c</span>(<span class="st">"low"</span>, <span class="st">"medium"</span>, <span class="st">"high"</span>), <span class="dt">ordered=</span><span class="ot">TRUE</span>)
<span class="kw">levels</span>(food)
<span class="kw">min</span>(food) ## works!</code></pre></div>
<p>In R’s memory, these factors are represented by integers (1, 2, 3), but they are more informative than integers because factors are self-describing: <code>"low"</code>, <code>"medium"</code>, <code>"high"</code> is more descriptive than <code>1</code>, <code>2</code>, <code>3</code>. Which is low? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species in our example data set).</p>
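<p>Once a factor is ordered, comparison operators work on it too. A minimal sketch that rebuilds the ordered <code>food</code> factor in a single step so it runs on its own:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"), ordered = TRUE)
food < "high"            # TRUE for every element whose level is below "high"
sum(food < "high")       # 4
as.character(max(food))  # "high"</code></pre></div>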
<div id="converting-factors" class="section level3">
<h3>Converting factors</h3>
<p>If you need to convert a factor to a character vector, you use <code>as.character(x)</code>.</p>
<p>Converting factors where the levels appear as numbers (such as concentration levels) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the <code>levels()</code> function. Compare:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">f <-<span class="st"> </span><span class="kw">factor</span>(<span class="kw">c</span>(<span class="dv">1</span>, <span class="dv">5</span>, <span class="dv">10</span>, <span class="dv">2</span>))
<span class="kw">as.numeric</span>(f) ## wrong! and there is no warning...
<span class="kw">as.numeric</span>(<span class="kw">as.character</span>(f)) ## works...
<span class="kw">as.numeric</span>(<span class="kw">levels</span>(f))[f] ## The recommended way.</code></pre></div>
<p>Notice that in the <code>levels()</code> approach, three important steps occur:</p>
<ul>
<li>We obtain all the factor levels using <code>levels(f)</code></li>
<li>We convert these levels to numeric values using <code>as.numeric(levels(f))</code></li>
<li>We then access these numeric values using the underlying integers of the vector <code>f</code> inside the square brackets</li>
</ul>
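<p>The three steps can be run one at a time to see each intermediate result. This sketch reuses the factor <code>f</code> defined above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">f <- factor(c(1, 5, 10, 2))
levels(f)                 # "1" "2" "5" "10": the levels, stored as text
as.numeric(levels(f))     # 1 2 5 10: the levels as numbers
as.integer(f)             # 1 3 4 2: the position of each element's level
as.numeric(levels(f))[f]  # 1 5 10 2: the original values</code></pre></div>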
</div>
<div id="challenge-1" class="section level3">
<h3>Challenge</h3>
<p>The function <code>plot()</code> can be used to quickly create a bar plot of a factor. For instance, for the factor <code>exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control", "treat1", "treat2", "treat3"))</code>, the code <code>plot(exprmt)</code> gives you a bar plot of the number of observations at each level, as shown below.</p>
<ul>
<li>What determines the order in which the treatments are listed in the plot? (Hint: use <code>str</code> to inspect the factor.)</li>
<li>How can you recreate this plot with “control” listed last instead of first?</li>
</ul>
<p><img src="img/R-ecology-wrong-order-1.png" width="672" /></p>
<!---
```r
## Answers
##
## * The treatments are listed in alphabetical order because they are factors.
## * By redefining the order of the levels
exprmt <- factor(exprmt, levels=c("treat1", "treat2", "treat3", "control"))
plot(exprmt)
```
<img src="img/R-ecology-correct-order-1.png" width="672" />
--->
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>