forked from datacarpentry/R-ecology-lesson
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path06-r-and-sql.html
304 lines (262 loc) · 16.6 KB
/
06-r-and-sql.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Data Carpentry contributors" />
<title>SQL databases and R</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; background-color: #f8f8f8; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
pre, code { background-color: #f8f8f8; }
code > span.kw { color: #204a87; font-weight: bold; } /* Keyword */
code > span.dt { color: #204a87; } /* DataType */
code > span.dv { color: #0000cf; } /* DecVal */
code > span.bn { color: #0000cf; } /* BaseN */
code > span.fl { color: #0000cf; } /* Float */
code > span.ch { color: #4e9a06; } /* Char */
code > span.st { color: #4e9a06; } /* String */
code > span.co { color: #8f5902; font-style: italic; } /* Comment */
code > span.ot { color: #8f5902; } /* Other */
code > span.al { color: #ef2929; } /* Alert */
code > span.fu { color: #000000; } /* Function */
code > span.er { color: #a40000; font-weight: bold; } /* Error */
code > span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #000000; } /* Constant */
code > span.sc { color: #000000; } /* SpecialChar */
code > span.vs { color: #4e9a06; } /* VerbatimString */
code > span.ss { color: #4e9a06; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #000000; } /* Variable */
code > span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */
code > span.op { color: #ce5c00; font-weight: bold; } /* Operator */
code > span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */
code > span.ex { } /* Extension */
code > span.at { color: #c4a000; } /* Attribute */
code > span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */
code > span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */
</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script src="libs/navigation-1.1/tabsets.js"></script>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">SQL databases and R</h1>
<h4 class="author"><em>Data Carpentry contributors</em></h4>
</div>
<div id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#connecting-r-to-sqlite-databases">Connecting R to sqlite databases</a></li>
<li><a href="#running-sql-queries-from-r">Running SQL queries from R</a><ul>
<li><a href="#creating-a-sqlite-db-using-r">Creating a SQLite DB using R</a></li>
</ul></li>
</ul>
</div>
<blockquote>
<h2 id="learning-objectives">Learning Objectives</h2>
<p>By the end of this lesson the learner will:</p>
<ul>
<li>Know how to connect R to an existing database</li>
<li>Be able to use SQL syntax inside R to to query a database and return a subset of data</li>
<li>Be able to use R and SQL to create a new database from existing csv files.</li>
<li>Understand that scripted database interactions enhance the reproducibility of their analysis</li>
</ul>
</blockquote>
<hr />
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<p>Interacting with databases through scripted languages can offer advantages over querying databases via a GUI interface. A GUI interface for your database is easier to use and allows the quick viewing of adhoc queries. Querying a database with a programmatic interface (in this case R, but it could be any language) are slightly more complicated. However the trade-off is that data manipulations are preserved in the code. Aggregations, summaries and other database operations are preserved. Therefore those pre-analysis data manipulation steps are not lost and can be reproduced later by yourself or others.</p>
</div>
<div id="connecting-r-to-sqlite-databases" class="section level1">
<h1>Connecting R to sqlite databases</h1>
<p>R can connect to databases through a number of packages. In our case we will use <a href="http://cran.r-project.org/web/packages/RSQLite/index.html">RSQLite</a> to connect to existing SQLite3 databases. However you should be able to connect to almost any database in R via <a href="http://cran.r-project.org/web/packages/RJDBC/index.html">JDBC</a> or <a href="http://cran.r-project.org/web/packages/RODBC/index.html">ODBC</a>, or specific database packages (such as we are doing, or <a href="http://cran.r-project.org/web/packages/RMySQL/index.html">MySQL</a> ).</p>
<p>To begin these exercises we’ll connect to the mammal database you’ve just created. We’ll need the RSQLite package so be sure to install it first. <code>install.packages('RSQLite')</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(RSQLite)</code></pre></div>
<pre><code>#> Loading required package: DBI</code></pre>
<pre><code>#> Loading required package: methods</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Set dbname and driver out of convenience
myDB <-<span class="st"> "data/portal_mammals.sqlite"</span>
conn <-<span class="st"> </span><span class="kw">dbConnect</span>(<span class="dt">drv =</span> <span class="kw">SQLite</span>(), <span class="dt">dbname=</span> myDB)</code></pre></div>
<p>Now we’re connected to our database, let’s explore the table structure in the database. Remember we could see the list of tables in the SQLite Firefox gui. In R, we’ll need to use SQL commands.</p>
</div>
<div id="running-sql-queries-from-r" class="section level1">
<h1>Running SQL queries from R</h1>
<p>We can view information about the database structure which includes a list of all tables like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dbGetQuery</span>(conn, <span class="st">"SELECT type, tbl_name FROM sqlite_master"</span>)</code></pre></div>
<p>The <code>RSQLite</code> package also has functions that can be used to list both tables and fields within a table. Here you can see the types and names of fields and get a count of records.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dbListTables</span>(conn)
<span class="co">#get list of fields in the surveys table</span>
<span class="kw">dbListFields</span>(conn,<span class="st">"surveys"</span>)
<span class="kw">dbGetQuery</span>(conn,<span class="st">"SELECT count(*) FROM surveys"</span>)</code></pre></div>
<p>It’s good practice to always close a connection that you open in R. Let’s do that next. Note that once you’ve closed a connection, you will have to open a new connection to query and import the data again.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dbDisconnect</span>(conn)</code></pre></div>
<p>We’ve now covered the basics of connecting to a database and exploring its basic structure. From here we can write queries to access subsets of the data within our database, using the same methods that we used in SQLite.</p>
<p>Let’s try some basic queries from the previous lesson. Querying simply takes a connection to a database and query as inputs and returns a dataframe with the results.</p>
<p>Before we can do this, we need to re-establish a database connection.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Set dbname and driver out of convenience
myDB <-<span class="st"> "data/portal_mammals.sqlite"</span>
conn <-<span class="st"> </span><span class="kw">dbConnect</span>(<span class="dt">drv =</span> <span class="kw">SQLite</span>(), <span class="dt">dbname=</span> myDB)</code></pre></div>
<p>Next, we build our query. We can use the <code>dbGetQuery</code> function to run the query and access data returned in a <code>data.frame</code> object.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">q <-<span class="st"> 'SELECT DISTINCT year, species_id FROM surveys'</span>
result <-<span class="st"> </span><span class="kw">dbGetQuery</span>(conn, q)
<span class="kw">head</span>(result)</code></pre></div>
<blockquote>
<h2 id="challenge">Challenge</h2>
<ol style="list-style-type: decimal">
<li>Write a query that returns counts of genus by <code>plot_id</code></li>
<li><p>You can join multiple tables together in SQL using the following syntax where foreign key refers to your unique id (e.g., <code>species_id</code>):</p>
<p>SELECT table.col, table.col FROM table1 JOIN table2<br />
ON table1.key = table2.key<br />
JOIN table3 ON table2.key = table3.key</p></li>
</ol>
<p>Write a query that joins the species, plot, and survey tables together. The query should return counts of genus by plot type. Then create a bar plot of your results in R.</p>
</blockquote>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">q <-<span class="st"> "SELECT d.plot_type , c.genus, count(*)</span>
<span class="st">FROM</span>
<span class="st">(SELECT a.genus, b.plot_id</span>
<span class="st">FROM species a</span>
<span class="st">JOIN surveys b</span>
<span class="st">ON a.species_id = b.species_id) c</span>
<span class="st">JOIN plots d</span>
<span class="st">ON c.plot_id = d.plot_id</span>
<span class="st">GROUP BY d.plot_type,c.genus"</span>
result <-<span class="st"> </span><span class="kw">dbGetQuery</span>(conn,q)
<span class="kw">head</span>(result)</code></pre></div>
<p>All we’ve done so far is execute the same sorts of queries that can easily be made with a GUI. Now let’s try leveraging the power of scripted queries. Imagine you want to know how many rodents were found every other year. To get this we’ll get the range of dates from the database, sequence them by two and make new queries.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">yearRange <-<span class="st"> </span><span class="kw">dbGetQuery</span>(conn,<span class="st">"SELECT min(year),max(year) FROM surveys"</span>)
years <-<span class="st"> </span><span class="kw">seq</span>(yearRange[,<span class="dv">1</span>],yearRange[,<span class="dv">2</span>],<span class="dt">by=</span><span class="dv">2</span>)</code></pre></div>
<p>Next we’ll build our query string using the <code>paste()</code> function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">q <-<span class="st"> </span><span class="kw">paste</span>(<span class="st">"</span>
<span class="st">SELECT a.year,b.taxa,count(*) as count</span>
<span class="st">FROM surveys a</span>
<span class="st">JOIN species b</span>
<span class="st">ON a.species_id = b.species_id</span>
<span class="st">AND b.taxa = 'Rodent'</span>
<span class="st">AND a.year in ("</span>,
<span class="kw">paste</span>(years,<span class="dt">collapse=</span><span class="st">","</span>)
,<span class="st">")</span>
<span class="st">GROUP BY a.year, b.taxa"</span>,
<span class="dt">sep =</span> <span class="st">""</span> )
rCount <-<span class="st"> </span><span class="kw">dbGetQuery</span>(conn,q)
<span class="kw">head</span>(rCount)</code></pre></div>
<p>With the nested paste commands we were able to construct a query programmatically, without having to type out all the years. This could also be done with a for loop, especially if the query to be constructed is more complicated.</p>
<div id="creating-a-sqlite-db-using-r" class="section level2">
<h2>Creating a SQLite DB using R</h2>
<p>We can use R to create a new database and associated structure (also known as the schema) databases from existing <code>csv</code> files. Let’s recreate the mammals database that we’ve been working with, in R. First let’s read in the <code>csv</code> files.</p>
<p>Up until now we been working with an existing data. However we can use R as a way to build up databases from existing flat files. We’ll use the flat files that make up the mammals database to recreate it. First let’s read in the files.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">species <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/species.csv"</span>)
surveys <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/surveys.csv"</span>)
plots <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"data/plots.csv"</span>)</code></pre></div>
<p>Next we’ll need to create a database and add our tables.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">myDB <-<span class="st"> "portalR.db"</span>
myConn <-<span class="st"> </span><span class="kw">dbConnect</span>(<span class="dt">drv =</span> <span class="kw">SQLite</span>(), <span class="dt">dbname=</span> myDB)
<span class="kw">dbListTables</span>(myConn)</code></pre></div>
<p>By opening up a connection, you’ve created a new database. If you list the tables you’ll see that the database is empty. Next we’ll add the dataframes we just read in to the database.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dbWriteTable</span>(myConn,<span class="st">"species"</span>,species)
<span class="kw">dbListTables</span>(myConn)
<span class="kw">dbGetQuery</span>(myConn,<span class="st">"SELECT * from species limit 10"</span>)</code></pre></div>
<p>If you check the location of our database you’ll see that data is automatically being written to disk. Not only does R and RSQLite provide easy ways to query existing databases, it also allows you to easily create your own databases from flat files.</p>
<p>If you check the location of your database you’ll see that data is automatically being written to disk. Not only does R and RSQLite provide easy ways to query existing databases, it also allows you to easily create your own databases from flat files.</p>
<blockquote>
<h2 id="challenge-1">Challenge</h2>
<p>Add the remaining tables to the existing database. Open the new database using the Firefox SQLite manager to verify that the database was built successfully.</p>
</blockquote>
<blockquote>
<h2 id="extra-challenge">Extra Challenge</h2>
<p>Run some of your queries from earlier in the lesson to verify that you have faithfully recreated the mammals database.</p>
</blockquote>
<p>Don’t forget to close your database connection when you are done with it!</p>
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>