---
title: "Team Troubleshooting Deliverable 2"
output: github_document
---
```{r include = FALSE}
knitr::opts_chunk$set(error = TRUE)
```
There are **11 code chunks with errors** in this Rmd. Your objective is to fix all of the errors in this worksheet. For the purpose of grading, each erroneous code chunk is equally weighted.
Note that not all errors are syntactic (i.e., broken code)! Some are logical errors as well (i.e., code that does not do what it was intended to do).
## Exercise 1: Exploring with `select()` and `filter()`
[MovieLens](https://dl.acm.org/doi/10.1145/2827872) is a series of datasets, widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/). There are several MovieLens datasets, collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load other required packages.
```{r}
### ERROR HERE ###
library(dslabs) ## fixed: load the package with library()
library(tidyverse)
library(stringr)
library(gapminder)
```
Let's have a look at the dataset! My goal is to:
- Find out the "class" of the dataset.
- If it isn't a tibble already, coerce it into a tibble and store it in the variable "movieLens".
- Have a quick look at the tibble, using a *dplyr function*.
```{r}
### ERROR HERE ###
class(dslabs::movielens)
movieLens <- as_tibble(dslabs::movielens)
dim(movieLens)
glimpse(movieLens) ## added: dplyr function that shows each column with its type and first few values
```
Now that we've had a quick look at the dataset, it would be interesting to explore the rows (observations) in some more detail. I'd like to consider the movie entries that...
- belong *exclusively* to the genre *"Drama"*;
- don't belong *exclusively* to the genre *"Drama"*;
- were filmed *after* the year 2000;
- were filmed in 1999 *or* 2000;
- have *more than* 4.5 stars, and were filmed *before* 1995.
```{r}
### ERROR HERE ###
filter(movieLens, genres == "Drama")
filter(movieLens, !genres == "Drama")
filter(movieLens, year > 2000) ## change to > 2000 as we need "after" year 2000
filter(movieLens, year == 1999 | year == 2000) ## change to year == 2000
filter(movieLens, rating > 4.5, year < 1995)
```
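The negated filter above relies on R's operator precedence: `!` binds more loosely than `==`, so `!genres == "Drama"` is read as `!(genres == "Drama")`. A minimal sketch on a made-up vector (`toy_genres` is a hypothetical name, not part of the dataset) suggests it behaves the same as the more common `!=` spelling, including for missing values:

```{r}
# Hypothetical toy vector, used only to illustrate operator precedence.
toy_genres <- c("Drama", "Comedy", "Drama|Romance", NA)

negated   <- !toy_genres == "Drama"   # parsed as !(toy_genres == "Drama")
not_equal <- toy_genres != "Drama"    # the more common spelling

# Both give FALSE for "Drama", TRUE for anything else, and NA for NA.
identical(negated, not_equal)
```

Either way, `filter()` drops the rows where the condition is `NA`.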
While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*. But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or fewer columns some time in the future, the code will still work. Hint: there is a function to select "everything else"...
```{r}
### ERROR HERE ###
movieLens %>%
  filter(!genres == "Drama") %>%
  select(title, genres, everything()) ## everything() selects all remaining columns
```
## Exercise 2: Calculating with `mutate()`-like functions
Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* itself is in camelCase). Let's rename the two camelCase variables, `userId` and `movieId`, to use *snake_case* instead, and assign our post-rename object back to "movieLens".
```{r}
### ERROR HERE ###
movieLens <- movieLens %>%
  rename(user_id = userId,
         movie_id = movieId) ## fixed: rename() takes new_name = old_name, not ==
head(movieLens)
```
As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e., drop all others!) but I forgot what that mystery function is. Can you remember?
```{r}
### ERROR HERE ###
transmute(movieLens,
          average_rating = mean(rating)) ## transmute() keeps only the new column
```
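Because `mean(rating)` is a single number, `transmute()` repeats it on every row rather than collapsing the table. The sketch below, on a made-up tibble (`toy_ratings` is hypothetical), contrasts that with `summarise()`. It also shows `mutate(.keep = "none")`, which assumes dplyr 1.0.0 or later and should give the same result in this simple ungrouped case:

```{r}
library(dplyr)

# Hypothetical toy data, not part of movieLens.
toy_ratings <- tibble(title = c("A", "B", "C"), rating = c(3, 4, 5))

per_row   <- transmute(toy_ratings, average_rating = mean(rating))
collapsed <- summarise(toy_ratings, average_rating = mean(rating))
kept_none <- mutate(toy_ratings, average_rating = mean(rating), .keep = "none")

nrow(per_row)     # one row per original observation
nrow(collapsed)   # a single summary row
all.equal(per_row, kept_none)
```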
## Exercise 3: Calculating with `summarise()`-like functions
Alone, `tally()` is a short form of `summarise()`. `count()` is short-hand for `group_by()` and `tally()`.
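These equivalences can be checked on a small made-up tibble (`toy_reviews` is hypothetical, not part of `movieLens`). This is only a sketch: all three pipelines should yield one row per title with a count column `n`.

```{r}
library(dplyr)

# Hypothetical toy data: two reviews of "A", one of "B".
toy_reviews <- tibble(title = c("A", "A", "B"), rating = c(4, 5, 3))

via_count     <- toy_reviews %>% count(title)
via_tally     <- toy_reviews %>% group_by(title) %>% tally()
via_summarise <- toy_reviews %>% group_by(title) %>% summarise(n = n())

all.equal(via_count, via_tally)
all.equal(via_count, via_summarise)
```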
Each entry of the movieLens table corresponds to a movie rating by a user. Therefore, if more than one user rated the same movie, there will be several entries for the same movie. I want to find out how many times each movie has been reviewed, or in other words, how many times each movie title appears in the dataset.
```{r}
movieLens %>%
  group_by(title) %>%
  tally()
```
Without using `group_by()`, I want to find out how many movie reviews there have been for each year.
```{r}
### ERROR HERE ###
movieLens %>%
  count(year) ## fixed: count() groups by year itself, unlike tally()
```
Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.
```{r}
### ERROR HERE ###
movieLens %>%
  count(title, rating, sort = TRUE) ## fixed: pass columns directly, not wrapped in c()
```
Not only do `count()` and `tally()` let you quickly count items within your dataset, but `add_tally()` and `add_count()` are also handy shortcuts that add a count column to your tibble, rather than collapsing each group.
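As a rough illustration on a made-up tibble (`toy_reviews` is hypothetical, not part of `movieLens`), `count()` returns one row per group, whereas `add_count()` keeps every row and appends the group size as a column `n`:

```{r}
library(dplyr)

# Hypothetical toy data: two reviews of "A", one of "B".
toy_reviews <- tibble(title = c("A", "A", "B"), rating = c(4, 5, 3))

collapsed <- toy_reviews %>% count(title)       # one row per title
annotated <- toy_reviews %>% add_count(title)   # all rows kept, plus column n

# add_tally() does the same on data that is already grouped.
via_add_tally <- toy_reviews %>%
  group_by(title) %>%
  add_tally() %>%
  ungroup()

annotated
```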
## Exercise 4: Calculating with `group_by()`
We can calculate the mean rating by year, and store it in a new column called `avg_rating`:
```{r}
movieLens %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating))
```
Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating` and `max_rating`, respectively.
```{r}
### ERROR HERE ###
movieLens %>%
  group_by(title) %>%                     ## group by the title
  summarise(min_rating = min(rating),     ## summarise the min and max of rating
            max_rating = max(rating))
```
## Exercise 5: Scoped variants with `across()`
`across()` is a newer dplyr function (introduced in `dplyr` 1.0.0) that allows you to apply a transformation to multiple variables, selected with the same syntax as `select()` and `rename()`. For this section, we will use the `starwars` dataset, which ships with dplyr. First, let's make sure it is a tibble and store it under the variable `starWars`.
```{r}
starWars <- as_tibble(starwars)
```
We can find the mean for all columns that are numeric, ignoring the missing values:
```{r}
starWars %>%
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE)))
```
We can find the minimum height and mass within each species, ignoring the missing values:
```{r}
### ERROR HERE ###
starWars %>%
  group_by(species) %>%
  ## fixed: wrap height and mass in c() so across() receives a single column selection
  summarise(across(c(height, mass), function(x) min(x, na.rm = TRUE)))
```
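Note the `Inf` values in the output. For any species whose values are all `NA`, `na.rm = TRUE` removes every value and leaves `min()` with an empty vector. R's convention is that the minimum of an empty set is `Inf`, and it emits a warning when this happens.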
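The `Inf` behaviour can be reproduced in isolation with base R. The sketch below also includes `safe_min()`, a hypothetical helper (not part of dplyr or this worksheet) that returns `NA` instead when every value is missing:

```{r}
all_missing <- c(NA_real_, NA_real_)

# na.rm = TRUE drops every value, so min() sees an empty vector and
# returns Inf (with a warning, suppressed here to keep the output clean).
empty_min <- suppressWarnings(min(all_missing, na.rm = TRUE))
empty_min

# Hypothetical helper: return NA rather than Inf when nothing is left.
safe_min <- function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
}

safe_min(all_missing)
safe_min(c(3, NA, 1))
```

A helper like this could be passed to `across()` in place of the anonymous function, if `NA` is a more useful marker than `Inf` for groups with no data.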
## Exercise 6: Making tibbles
Manually create a tibble with 4 columns:
- `name` should contain one character name per row;
- `birth_year` should contain years 1998 to 2005 (inclusive);
- `birth_weight` should take the `birth_year` column, subtract 1995, and multiply by 0.45;
- `birth_location` should contain three locations (Liverpool, Seattle, and New York).
```{r}
### ERROR HERE ###
## fixed: added the missing comma after ~birth_location and quoted the birth locations
fakeStarWars <- tribble(
  ~name,                ~birth_weight, ~birth_year, ~birth_location,
  "Luke Skywalker",     1.35,          1998,        "Liverpool, England",
  "C-3PO",              1.80,          1999,        "Liverpool, England",
  "R2-D2",              2.25,          2000,        "Seattle, WA",
  "Darth Vader",        2.70,          2001,        "Liverpool, England",
  "Leia Organa",        3.15,          2002,        "New York, NY",
  "Owen Lars",          3.60,          2003,        "Seattle, WA",
  "Beru Whitesun Lars", 4.05,          2004,        "Liverpool, England",
  "R5-D4",              4.50,          2005,        "New York, NY"
)
## optional check that birth_weight follows the formula:
# fakeStarWars %>% mutate(birth_weight_check = (birth_year - 1995) * 0.45)
fakeStarWars
```
## Attributions
Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.