There are 11 code chunks with errors in this Rmd. Your objective is to fix all of the errors in this worksheet. For the purpose of grading, each erroneous code chunk is equally weighted.
Note that errors are not all syntactic (i.e., broken code)! Some are logical errors as well (i.e. code that does not do what it was intended to do).
MovieLens are a series of datasets widely used in education, that describe movie ratings from the MovieLens website. There are several MovieLens datasets, collected by the GroupLens Research Project at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill’s R package, dslabs, which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We’ll also load other required packages.
library("dslabs")
library("tidyverse")
library("stringr")
library("gapminder")
Let’s have a look at the dataset! My goal is to:
- Find out the “class” of the dataset.
- If it isn’t a tibble already, coerce it into a tibble and store it in the variable “movieLens”.
- Have a quick look at the tibble, using a dplyr function.
class(dslabs::movielens)
## [1] "data.frame"
movieLens <- as_tibble(dslabs::movielens)
dim(movieLens)
## [1] 100004 7
# In addition to dim() (which is a part of base R), I used the dplyr function glipmse()
# to have a brief description of what the dataset looks like
glimpse(movieLens)
## Rows: 100,004
## Columns: 7
## $ movieId <int> 31, 1029, 1061, 1129, 1172, 1263, 1287, 1293, 1339, 1343, 13…
## $ title <chr> "Dangerous Minds", "Dumbo", "Sleepers", "Escape from New Yor…
## $ year <int> 1995, 1941, 1996, 1981, 1989, 1978, 1959, 1982, 1992, 1991, …
## $ genres <fct> Drama, Animation|Children|Drama|Musical, Thriller, Action|Ad…
## $ userId <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ rating <dbl> 2.5, 3.0, 3.0, 2.0, 4.0, 2.0, 2.0, 2.0, 3.5, 2.0, 2.5, 1.0, …
## $ timestamp <int> 1260759144, 1260759179, 1260759182, 1260759185, 1260759205, …
Now that we’ve had a quick look at the dataset, it would be interesting to explore the rows (observations) in some more detail. I’d like to consider the movie entries that…
- belong exclusively to the genre “Drama”;
- don’t belong exclusively to the genre “Drama”;
- were filmed after the year 2000;
- were filmed in 1999 or 2000;
- have more than 4.5 stars, and were filmed before 1995.
filter(movieLens, genres == "Drama")
## # A tibble: 7,757 × 7
## movieId title year genres userId rating timestamp
## <int> <chr> <int> <fct> <int> <dbl> <int>
## 1 31 Dangerous Minds 1995 Drama 1 2.5 1.26e9
## 2 1172 Cinema Paradiso (Nuovo cinema P… 1989 Drama 1 4 1.26e9
## 3 1293 Gandhi 1982 Drama 1 2 1.26e9
## 4 62 Mr. Holland's Opus 1995 Drama 2 3 8.35e8
## 5 261 Little Women 1994 Drama 2 4 8.35e8
## 6 300 Quiz Show 1994 Drama 2 3 8.35e8
## 7 508 Philadelphia 1993 Drama 2 4 8.35e8
## 8 537 Sirens 1994 Drama 2 4 8.35e8
## 9 2702 Summer of Sam 1999 Drama 3 3.5 1.30e9
## 10 3949 Requiem for a Dream 2000 Drama 3 5 1.30e9
## # ℹ 7,747 more rows
# Changed this
#filter(movieLens, !genres == "Drama")
# To this:
filter(movieLens, genres != "Drama")
## # A tibble: 92,247 × 7
## movieId title year genres userId rating timestamp
## <int> <chr> <int> <fct> <int> <dbl> <int>
## 1 1029 Dumbo 1941 Animat… 1 3 1.26e9
## 2 1061 Sleepers 1996 Thrill… 1 3 1.26e9
## 3 1129 Escape from New York 1981 Action… 1 2 1.26e9
## 4 1263 Deer Hunter, The 1978 Drama|… 1 2 1.26e9
## 5 1287 Ben-Hur 1959 Action… 1 2 1.26e9
## 6 1339 Dracula (Bram Stoker's Dracula) 1992 Fantas… 1 3.5 1.26e9
## 7 1343 Cape Fear 1991 Thrill… 1 2 1.26e9
## 8 1371 Star Trek: The Motion Picture 1979 Advent… 1 2.5 1.26e9
## 9 1405 Beavis and Butt-Head Do America 1996 Advent… 1 1 1.26e9
## 10 1953 French Connection, The 1971 Action… 1 4 1.26e9
## # ℹ 92,237 more rows
# Changed this
#filter(movieLens, year >= 2000)
# To this:
filter(movieLens, year > 2000)
## # A tibble: 25,481 × 7
## movieId title year genres userId rating timestamp
## <int> <chr> <int> <fct> <int> <dbl> <int>
## 1 5349 Spider-Man 2002 Actio… 3 3 1.30e9
## 2 5669 Bowling for Columbine 2002 Docum… 3 3.5 1.30e9
## 3 6377 Finding Nemo 2003 Adven… 3 3 1.30e9
## 4 7153 Lord of the Rings: The Return o… 2003 Actio… 3 2.5 1.30e9
## 5 7361 Eternal Sunshine of the Spotles… 2004 Drama… 3 3 1.30e9
## 6 8622 Fahrenheit 9/11 2004 Docum… 3 3.5 1.30e9
## 7 8636 Spider-Man 2 2004 Actio… 3 3 1.30e9
## 8 44191 V for Vendetta 2006 Actio… 3 3.5 1.30e9
## 9 48783 Flags of Our Fathers 2006 Drama… 3 4.5 1.30e9
## 10 50068 Letters from Iwo Jima 2006 Drama… 3 4.5 1.30e9
## # ℹ 25,471 more rows
# Changed this
#filter(movieLens, year == 1999 | month == 2000)
# To this:
filter(movieLens, year == 1999 | year == 2000)
## # A tibble: 9,088 × 7
## movieId title year genres userId rating timestamp
## <int> <chr> <int> <fct> <int> <dbl> <int>
## 1 2694 Big Daddy 1999 Comedy 3 3 1.30e9
## 2 2702 Summer of Sam 1999 Drama 3 3.5 1.30e9
## 3 2762 Sixth Sense, The 1999 Drama… 3 3.5 1.30e9
## 4 2841 Stir of Echoes 1999 Horro… 3 4 1.30e9
## 5 2858 American Beauty 1999 Drama… 3 4 1.30e9
## 6 2959 Fight Club 1999 Actio… 3 5 1.30e9
## 7 3510 Frequency 2000 Drama… 3 4 1.30e9
## 8 3949 Requiem for a Dream 2000 Drama 3 5 1.30e9
## 9 27369 Daria: Is It Fall Yet? 2000 Anima… 3 3.5 1.30e9
## 10 2628 Star Wars: Episode I - The Phan… 1999 Actio… 4 5 9.50e8
## # ℹ 9,078 more rows
filter(movieLens, rating > 4.5, year < 1995)
## # A tibble: 8,386 × 7
## movieId title year genres userId rating timestamp
## <int> <chr> <int> <fct> <int> <dbl> <int>
## 1 265 Like Water for Chocolate (Como … 1992 Drama… 2 5 8.35e8
## 2 266 Legends of the Fall 1994 Drama… 2 5 8.35e8
## 3 551 Nightmare Before Christmas, The 1993 Anima… 2 5 8.35e8
## 4 589 Terminator 2: Judgment Day 1991 Actio… 2 5 8.35e8
## 5 590 Dances with Wolves 1990 Adven… 2 5 8.35e8
## 6 592 Batman 1989 Actio… 2 5 8.35e8
## 7 318 Shawshank Redemption, The 1994 Crime… 3 5 1.30e9
## 8 356 Forrest Gump 1994 Comed… 3 5 1.30e9
## 9 1197 Princess Bride, The 1987 Actio… 3 5 1.30e9
## 10 260 Star Wars: Episode IV - A New H… 1977 Actio… 4 5 9.50e8
## # ℹ 8,376 more rows
While filtering for all movies that do not belong to the genre drama
above, I noticed something interesting. I want to filter for the same
thing again, this time selecting variables title and genres first,
and then everything else. But I want to do this in a robust way, so
that (for example) if I end up changing movieLens
to contain more or
less columns some time in the future, the code will still work. Hint:
there is a function to select “everything else”…
movieLens %>%
# Changed this
#filter(!genres == "Drama") %>%
# To this
filter(genres != "Drama") %>%
# Changed this
#select(title, genres, year, rating, timestamp)
# To this
select(title, genres, everything())
## # A tibble: 92,247 × 7
## title genres movieId year userId rating timestamp
## <chr> <fct> <int> <int> <int> <dbl> <int>
## 1 Dumbo Animat… 1029 1941 1 3 1.26e9
## 2 Sleepers Thrill… 1061 1996 1 3 1.26e9
## 3 Escape from New York Action… 1129 1981 1 2 1.26e9
## 4 Deer Hunter, The Drama|… 1263 1978 1 2 1.26e9
## 5 Ben-Hur Action… 1287 1959 1 2 1.26e9
## 6 Dracula (Bram Stoker's Dracula) Fantas… 1339 1992 1 3.5 1.26e9
## 7 Cape Fear Thrill… 1343 1991 1 2 1.26e9
## 8 Star Trek: The Motion Picture Advent… 1371 1979 1 2.5 1.26e9
## 9 Beavis and Butt-Head Do America Advent… 1405 1996 1 1 1.26e9
## 10 French Connection, The Action… 1953 1971 1 4 1.26e9
## # ℹ 92,237 more rows
Some of the variables in the movieLens
dataset are in camelCase (in
fact, movieLens is in camelCase). Let’s clean these two variables to
use snake_case instead, and assign our post-rename object back to
“movieLens”.
movieLens <- movieLens %>%
# Changed this
#rename(user_id == userId,
# movie_id == movieId)
# To this
rename(user_id = userId,
movie_id = movieId)
head(movielens)
## movieId title year
## 1 31 Dangerous Minds 1995
## 2 1029 Dumbo 1941
## 3 1061 Sleepers 1996
## 4 1129 Escape from New York 1981
## 5 1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
## 6 1263 Deer Hunter, The 1978
## genres userId rating timestamp
## 1 Drama 1 2.5 1260759144
## 2 Animation|Children|Drama|Musical 1 3.0 1260759179
## 3 Thriller 1 3.0 1260759182
## 4 Action|Adventure|Sci-Fi|Thriller 1 2.0 1260759185
## 5 Drama 1 4.0 1260759205
## 6 Drama|War 1 2.0 1260759151
As you already know, mutate()
defines and inserts new variables into a
tibble. There is another mystery function similar to mutate()
that
adds the new variable, but also drops existing ones. I wanted to create
an average_rating
column that takes the mean(rating)
across all
entries, and I only want to see that variable (i.e drop all others!) but
I forgot what that mystery function is. Can you remember?
# Most likely, the prompt of the question refers to transmute, which "creates a new data frame containing only the specified computations"
# (see: https://dplyr.tidyverse.org/reference/transmute.html)
# Changed this
#mutate(movieLens,
# average_rating = mean(rating))
# To this
transmute(movieLens,
average_rating = mean(rating))
## # A tibble: 100,004 × 1
## average_rating
## <dbl>
## 1 3.54
## 2 3.54
## 3 3.54
## 4 3.54
## 5 3.54
## 6 3.54
## 7 3.54
## 8 3.54
## 9 3.54
## 10 3.54
## # ℹ 99,994 more rows
Alone, tally()
is a short form of summarise()
. count()
is
short-hand for group_by()
and tally()
.
Each entry of the movieLens table corresponds to a movie rating by a user. Therefore, if more than one user rated the same movie, there will be several entries for the same movie. I want to find out how many times each movie has been reviewed, or in other words, how many times each movie title appears in the dataset.
movieLens %>%
group_by(title) %>%
tally()
## # A tibble: 8,832 × 2
## title n
## <chr> <int>
## 1 "\"Great Performances\" Cats" 2
## 2 "$9.99" 3
## 3 "'Hellboy': The Seeds of Creation" 1
## 4 "'Neath the Arizona Skies" 1
## 5 "'Round Midnight" 2
## 6 "'Salem's Lot" 1
## 7 "'Til There Was You" 4
## 8 "'burbs, The" 19
## 9 "'night Mother" 3
## 10 "(500) Days of Summer" 45
## # ℹ 8,822 more rows
Without using group_by()
, I want to find out how many movie reviews
there have been for each year.
#movieLens %>%
# tally(year)
#Changed to
movieLens %>% # Tally is used for grouped data, count is a short-hand for group_by()
count(year)
## # A tibble: 104 × 2
## year n
## <int> <int>
## 1 1902 6
## 2 1915 2
## 3 1916 1
## 4 1917 2
## 5 1918 2
## 6 1919 1
## 7 1920 15
## 8 1921 12
## 9 1922 28
## 10 1923 3
## # ℹ 94 more rows
Both count()
and tally()
can be grouped by multiple columns. Below,
I want to count the number of movie reviews by title and rating, and
sort the results.
#movieLens %>%
# count(c(title, rating), sort = TRUE)
# changed to:
movieLens %>%
count(title, rating, sort = TRUE) # c() function call should not be passed into count()
## # A tibble: 28,297 × 3
## title rating n
## <chr> <dbl> <int>
## 1 Shawshank Redemption, The 5 170
## 2 Pulp Fiction 5 138
## 3 Star Wars: Episode IV - A New Hope 5 122
## 4 Forrest Gump 4 113
## 5 Schindler's List 5 109
## 6 Godfather, The 5 107
## 7 Forrest Gump 5 102
## 8 Silence of the Lambs, The 4 102
## 9 Fargo 5 100
## 10 Silence of the Lambs, The 5 100
## # ℹ 28,287 more rows
Not only do count()
and tally()
quickly allow you to count items
within your dataset, add_tally()
and add_count()
are handy shortcuts
that add an additional columns to your tibble, rather than collapsing
each group.
We can calculate the mean rating by year, and store it in a new column
called avg_rating
:
movieLens %>%
group_by(year) %>%
summarize(avg_rating = mean(rating))
## # A tibble: 104 × 2
## year avg_rating
## <int> <dbl>
## 1 1902 4.33
## 2 1915 3
## 3 1916 3.5
## 4 1917 4.25
## 5 1918 4.25
## 6 1919 3
## 7 1920 3.7
## 8 1921 4.42
## 9 1922 3.80
## 10 1923 4.17
## # ℹ 94 more rows
Using summarize()
, we can find the minimum and the maximum rating by
title, stored under columns named min_rating
, and max_rating
,
respectively.
#movieLens %>%
# mutate(min_rating = min(rating),
# max_rating = max(rating))
# Changed to: (the pipeline did not group data by title)
movieLens %>%
group_by(title) %>%
summarize(min_rating = min(rating, na.rm = TRUE), max_rating = max(rating, na.rm = TRUE))
## # A tibble: 8,832 × 3
## title min_rating max_rating
## <chr> <dbl> <dbl>
## 1 "\"Great Performances\" Cats" 0.5 3
## 2 "$9.99" 2.5 4.5
## 3 "'Hellboy': The Seeds of Creation" 2 2
## 4 "'Neath the Arizona Skies" 0.5 0.5
## 5 "'Round Midnight" 0.5 4
## 6 "'Salem's Lot" 3.5 3.5
## 7 "'Til There Was You" 0.5 4
## 8 "'burbs, The" 1.5 4.5
## 9 "'night Mother" 5 5
## 10 "(500) Days of Summer" 0.5 5
## # ℹ 8,822 more rows
across()
is a newer dplyr function (dplyr
1.0.0) that allows you to
apply a transformation to multiple variables selected with the
select()
and rename()
syntax. For this section, we will use the
starwars
dataset, which is built into R. First, let’s transform it
into a tibble and store it under the variable starWars
.
starWars <- as_tibble(starwars)
We can find the mean for all columns that are numeric, ignoring the missing values:
starWars %>%
summarise(across(where(is.numeric), function(x) mean(x, na.rm=TRUE)))
## # A tibble: 1 × 3
## height mass birth_year
## <dbl> <dbl> <dbl>
## 1 174. 97.3 87.6
We can find the minimum height and mass within each species, ignoring the missing values:
### ERROR HERE (fixed) ###
starWars %>%
group_by(species) %>%
summarise(across(c("height", "mass"), function(x) min(x, na.rm=TRUE)))
## Warning: There were 6 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(c("height", "mass"), function(x) min(x, na.rm = TRUE))`.
## ℹ In group 4: `species = "Chagrian"`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
## # A tibble: 38 × 3
## species height mass
## <chr> <int> <dbl>
## 1 Aleena 79 15
## 2 Besalisk 198 102
## 3 Cerean 198 82
## 4 Chagrian 196 Inf
## 5 Clawdite 168 55
## 6 Droid 96 32
## 7 Dug 112 40
## 8 Ewok 88 20
## 9 Geonosian 183 80
## 10 Gungan 196 66
## # ℹ 28 more rows
Note that here R has taken the convention that the minimum value of a
set of NA
s is Inf
.
Manually create a tibble with 4 columns:
birth_year
should contain years 1998 to 2005 (inclusive);birth_weight
should take thebirth_year
column, subtract 1995, and multiply by 0.45;birth_location
should contain three locations (Liverpool, Seattle, and New York).- Modification: add , after
birth_location
, add ““ for birth_location value.
fakeStarWars <- tribble(
~name, ~birth_weight, ~birth_year, ~birth_location,
"Luke Skywalker", 1.35 , 1998 , "Liverpool, England",
"C-3PO" , 1.80 , 1999 , "Liverpool, England",
"R2-D2" , 2.25 , 2000 , "Seattle, WA",
"Darth Vader" , 2.70 , 2001 , "Liverpool, England",
"Leia Organa" , 3.15 , 2002 , "New York, NY",
"Owen Lars" , 3.60 , 2003 , "Seattle, WA",
"Beru Whitesun Iars", 4.05 , 2004 , "Liverpool, England",
"R5-D4" , 4.50 , 2005 , "New York, NY"
)
fakeStarWars
## # A tibble: 8 × 4
## name birth_weight birth_year birth_location
## <chr> <dbl> <dbl> <chr>
## 1 Luke Skywalker 1.35 1998 Liverpool, England
## 2 C-3PO 1.8 1999 Liverpool, England
## 3 R2-D2 2.25 2000 Seattle, WA
## 4 Darth Vader 2.7 2001 Liverpool, England
## 5 Leia Organa 3.15 2002 New York, NY
## 6 Owen Lars 3.6 2003 Seattle, WA
## 7 Beru Whitesun Iars 4.05 2004 Liverpool, England
## 8 R5-D4 4.5 2005 New York, NY
Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.