---
title: "Advanced operations"
output: html_notebook
---
## Class catchup
```{r, catchup06, include = FALSE}
options(lifecycle_disable_verbose_retirement = TRUE)
library(dplyr)
library(dbplyr)
library(DBI)
library(purrr)
suppressWarnings(library(rlang, warn.conflicts = FALSE))
# Class catchup
con <- DBI::dbConnect(odbc::odbc(), "Postgres Dev")
airports <- tbl(con, in_schema("datawarehouse", "airport"))
flights <- tbl(con, in_schema("datawarehouse", "vflight"))
carriers <- tbl(con, in_schema("datawarehouse", "carrier"))
```
## 6.1 - Simple wrapper function
*Create a function that accepts a value that is passed to a specific dplyr operation*
1. The following `dplyr` operation is hard-coded to return only the mean of *arrtime*. The goal is to create a function that returns the mean of any variable passed to it.
```{r}
flights %>%
  summarise(mean = mean(arrtime, na.rm = TRUE))
```
2. Load the `rlang` library, and create a function with one argument. The function will simply return the result of `enquo()`.
```{r}
library(rlang)
my_mean <- function(x) {
  x <- enquo(x)
  x
}
my_mean(mpg)
```
3. Add the `summarise()` operation, and replace *arrtime* with *!! x*
```{r}
```
4. Test the function with *deptime*
```{r}
my_mean(deptime)
```
5. Make the function use what is passed to the *x* argument as the name of the calculation. Replace `mean =` with `!! quo_name(x) :=`.
```{r}
```
6. Test the function again with *arrtime*. The name of the variable should now be *arrtime*.
```{r}
my_mean(arrtime)
```
7. Test the function with an expression: *arrtime + deptime*.
```{r}
my_mean(arrtime+deptime)
```
8. Make the function generic by adding a *.data* argument and replacing *flights* with *.data*
```{r}
```
9. The function now behaves more like a `dplyr` verb. Start with *flights* and pipe into the function.
```{r}
```
10. Test the function with a different data set. Use `mtcars` and *mpg* as the *x* argument.
```{r}
```
11. Clean up the function by removing the pipe
```{r}
```
12. Test again. The results should not change.
```{r}
```
13. Because the function only uses `dplyr` operations, `show_query()` should work
```{r}
```
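The steps in this section can be collected into a single function. The chunk below is a minimal sketch, not the workbook's official solution. It assumes the local `mtcars` data frame in place of the *flights* table so that it runs without a database connection, and `result` is a name introduced only for illustration.

```{r}
library(dplyr)
library(rlang)

# Sketch of a generic wrapper: `.data` is any data frame or remote table,
# `x` is an unquoted column name or expression.
my_mean <- function(.data, x) {
  x <- enquo(x)  # capture the expression the caller typed
  summarise(
    .data,
    !!quo_name(x) := mean(!!x, na.rm = TRUE)  # name the result after the input
  )
}

# Assumption: mtcars stands in for the `flights` table
result <- mtcars %>% my_mean(mpg)
result
```

Because `.data` is the first argument, the function can sit in a pipeline like any other `dplyr` verb.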
## 6.2 - Multiple variables
*Create functions that handle a variable number of arguments. The goal of the exercise is to create an "anti-select()" function.*
1. Use *...* as the second argument of a function called `de_select()`. Inside the function use `enquos()` to parse it.
```{r}
de_select <- function(.data, ...) {
  vars <- enquos(...)
  vars
}
```
2. Test the function using *airports*
```{r}
airports %>%
  de_select(airport, airportname)
```
3. Add a step to the function that iterates through each quosure and prefixes a minus sign to tell `select()` to drop that specific field. Use `map()` for the iteration, and `expr()` to create the prefixed expression.
```{r}
de_select <- function(.data, ...) {
  vars <- enquos(...)
  vars
}
```
4. Run the same test to view the new results
```{r}
airports %>%
  de_select(airport, airportname)
```
5. Add the `select()` step. Use *!!!* to parse the *vars* variable inside `select()`
```{r}
de_select <- function(.data, ...) {
  vars <- enquos(...)
}
```
6. Run the test again, this time the operation will take place.
```{r}
airports %>%
  de_select(airport, airportname)
```
7. Add a `show_query()` step to see the resulting SQL
```{r}
airports %>%
  de_select(airport, airportname) %>%
  show_query()
```
8. Test the function with a different data set, such as `mtcars`
```{r}
```
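One way the pieces of this section might fit together is sketched below. It is an illustration under stated assumptions rather than the official solution: `mtcars` stands in for *airports* so that no database is required, and `result` is a name introduced only for this example.

```{r}
library(dplyr)
library(purrr)
library(rlang)

de_select <- function(.data, ...) {
  vars <- enquos(...)               # capture every column passed through ...
  vars <- map(vars, ~ expr(-!!.x))  # turn each column into `-column`
  select(.data, !!!vars)            # splice the negated columns into select()
}

# Assumption: mtcars stands in for the `airports` table
result <- mtcars %>% de_select(mpg, cyl)
names(result)
```

The `!!!` operator splices the list so that `select()` receives each negated column as a separate argument.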
## 6.3 - Multiple queries
*A suggested approach to avoid sending multiple similar queries to the database*
1. Create a simple `dplyr` piped operation that returns the mean of *arrdelay* for the months of January, February and March as a group.
```{r}
flights %>%
  filter(month %in% c(1, 2, 3)) %>%
  summarise(mean = mean(arrdelay, na.rm = TRUE))
```
2. Assign the first operation to a variable called *a*, then create a copy of the operation that changes the selected months to January, March and April. Assign the copy to a variable called *b*.
```{r}
```
3. Use `union()` to pass *a* and *b* at the same time to the database.
```{r}
```
4. Assign overlapping sets of months to a new variable called *months*.
```{r}
months <- list(
  c(1, 2, 3),
  c(1, 3, 4),
  c(2, 4, 6)
)
```
5. Use `map()` to cycle through each set of overlapping months. Notice that it returns three separate results, meaning that it went to the database three times.
```{r}
months %>%
  map(~ .x) # Replace this line with your code
```
6. Add a `reduce()` operation and use `union()` to combine the results into a single query.
```{r}
months %>%
  map(~ .x) %>% # Replace this line with your code
  reduce(function(x, y) c(x, y)) # Replace this line with your code
```
7. Use `show_query()` to see the resulting single query sent to the database.
```{r}
```
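The `map()` plus `reduce()` pattern can be tried without a live connection. The chunk below is a sketch under assumptions: `flights_sim` is a hypothetical stand-in for *flights* built with `dbplyr::lazy_frame()` and a simulated Postgres backend, so the query is only rendered, never executed. `single_query` is likewise a name introduced for illustration.

```{r}
library(dplyr)
library(dbplyr)
library(purrr)

# Assumption: a simulated table replaces `flights`, so no connection is needed
flights_sim <- lazy_frame(month = 1L, arrdelay = 0, con = simulate_postgres())

months <- list(c(1, 2, 3), c(1, 3, 4), c(2, 4, 6))

single_query <- months %>%
  map(~ flights_sim %>%
        filter(month %in% !!.x) %>%  # !! inlines the local vector into the SQL
        summarise(mean = mean(arrdelay, na.rm = TRUE))) %>%
  reduce(union)                      # combine the lazy queries into one

show_query(single_query)
```

Each element produced by `map()` is still a lazy query, so `reduce()` can fold them together with `union()` before anything is sent to a database.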
## 6.4 - Multiple queries with an overlapping range
1. Create a table with *from* and *to* ranges.
```{r}
ranges <- tribble(
  ~from, ~to,
  1, 4,
  2, 5,
  3, 7
)
ranges
```
2. See how `map2()` works by passing the two variables as the *x* and *y* arguments, and using a function that adds them.
```{r}
map2(ranges$from, ranges$to, ~.x + .y)
```
3. Replace *x + y* with the `dplyr` operation from the previous exercise. In it, re-write the filter to use *x* and *y* as the month ranges
```{r}
map2(
  ranges$from,
  ranges$to,
  ~ c(.x, .y) # Replace this line with your code
)
```
4. Add the reduce operation
```{r}
```
5. Add a `show_query()` step to see how the final query was constructed.
```{r}
```
## 6.5 - Multiple queries with an overlapping range
1. Create a table with *from* and *to* ranges.
```{r}
ranges <- tribble(
  ~from, ~to,
  1, 4,
  2, 5,
  3, 7
)
```
2. See how `map2()` works by passing the two variables as the *x* and *y* arguments, and using a function that adds them.
```{r}
map2(ranges$from, ranges$to, ~.x + .y)
```
3. Replace *x + y* with the `dplyr` operation from the previous exercise. In it, re-write the filter to use *x* and *y* as the month ranges
```{r}
map2(
  ranges$from,
  ranges$to,
  ~ flights %>%
    filter(month >= .x & month <= .y) %>%
    summarise(mean = mean(arrdelay, na.rm = TRUE))
)
```
4. Add the reduce operation
```{r}
map2(
  ranges$from,
  ranges$to,
  ~ flights %>%
    filter(month >= .x & month <= .y) %>%
    summarise(mean = mean(arrdelay, na.rm = TRUE))
) %>%
  reduce(function(x, y) union(x, y))
```
5. Add a `show_query()` step to see how the final query was constructed.
```{r}
map2(
  ranges$from,
  ranges$to,
  ~ flights %>%
    filter(month >= .x & month <= .y) %>%
    summarise(mean = mean(arrdelay, na.rm = TRUE))
) %>%
  reduce(function(x, y) union(x, y)) %>%
  show_query()
```
## 6.6 - Characters to field names
1. Create two character variables. One with the name of a field in *flights* and another with a new name to be given to the field.
```{r}
my_field <- "new"
flights_field <- "arrdelay"
```
2. Add a `mutate()` step that adds the new field, then a `select()` step that keeps only the new field.
```{r}
flights %>%
  mutate(!! my_field := !! flights_field) %>%
  select(my_field)
```
3. Add `show_query()` to take a look at what was sent to the database
```{r}
```
4. Encase *flights_field* inside `expr()` to see what changes. Remove `show_query()`
```{r}
```
5. Replace `expr()` with `sym()`
```{r}
```
6. Re-add `show_query()`
```{r}
```
7. Encase *my_field* inside `sym()`
```{r}
```
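The effect of `sym()` can also be checked on local data. The chunk below is a sketch, not the official solution: it assumes `mtcars` in place of *flights*, and `cars_field` and `result` are hypothetical names standing in for *flights_field* and the query result.

```{r}
library(dplyr)
library(rlang)

my_field <- "new"
cars_field <- "mpg"  # hypothetical stand-in for `flights_field`

result <- mtcars %>%
  mutate(!!my_field := !!sym(cars_field)) %>%  # sym() turns "mpg" into the column mpg
  select(!!sym(my_field))                      # keep only the column named "new"

head(result)
```

Without `sym()`, the string would be unquoted as a literal value, so every row of the new field would contain the text of the field name instead of the field's data.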
```{r, include = FALSE}
dbDisconnect(con)
```