From be663866c52143cecbbdaac559fb86dfebe7c855 Mon Sep 17 00:00:00 2001
From: wangyubo <1179041704@qq.com>
Date: Tue, 26 Sep 2023 10:36:39 -0700
Subject: [PATCH 1/2] fix exercise6 bugs

---
 collaborative-group24.Rproj |  13 ---
 troubleshooting-2.Rmd       | 186 ++++++++++++++++++++++++++++++++++++
 2 files changed, 186 insertions(+), 13 deletions(-)
 delete mode 100644 collaborative-group24.Rproj
 create mode 100644 troubleshooting-2.Rmd

diff --git a/collaborative-group24.Rproj b/collaborative-group24.Rproj
deleted file mode 100644
index 8e3c2eb..0000000
--- a/collaborative-group24.Rproj
+++ /dev/null
@@ -1,13 +0,0 @@
-Version: 1.0
-
-RestoreWorkspace: Default
-SaveWorkspace: Default
-AlwaysSaveHistory: Default
-
-EnableCodeIndexing: Yes
-UseSpacesForTab: Yes
-NumSpacesForTab: 2
-Encoding: UTF-8
-
-RnwWeave: Sweave
-LaTeX: pdfLaTeX
diff --git a/troubleshooting-2.Rmd b/troubleshooting-2.Rmd
new file mode 100644
index 0000000..c14d323
--- /dev/null
+++ b/troubleshooting-2.Rmd
@@ -0,0 +1,186 @@
+---
+title: "Team Troubleshooting Deliverable 2"
+output: github_document
+---
+
+```{r include = FALSE}
+knitr::opts_chunk$set(error = TRUE)
+```
+
+There are **11 code chunks with errors** in this Rmd. Your objective is to fix all of the errors in this worksheet. For the purpose of grading, each erroneous code chunk is equally weighted.
+
+Note that errors are not all syntactic (i.e., broken code)! Some are logical errors as well (i.e. code that does not do what it was intended to do).
+
+## Exercise 1: Exploring with `select()` and `filter()`
+
+[MovieLens](https://dl.acm.org/doi/10.1145/2827872) are a series of datasets widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/). There are several MovieLens datasets, collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota.
Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load other required packages.
+
+```{r}
+### ERROR HERE ###
+load.packages(dslabs)
+load.packages(tidyverse)
+load.packages(stringr)
+install.packages("devtools") # Do not run this if you already have this package installed!
+devtools::install_github("JoeyBernhardt/singer")
+load.packages(gapminder)
+```
+
+Let's have a look at the dataset! My goal is to:
+
+- Find out the "class" of the dataset.
+- If it isn't a tibble already, coerce it into a tibble and store it in the variable "movieLens".
+- Have a quick look at the tibble, using a *dplyr function*.
+
+```{r}
+### ERROR HERE ###
+class(dslabs::movielens)
+movieLens <- as_tibble(dslabs::movielens)
+dim(movieLens)
+```
+
+Now that we've had a quick look at the dataset, it would be interesting to explore the rows (observations) in some more detail. I'd like to consider the movie entries that...
+
+- belong *exclusively* to the genre *"Drama"*;
+- don't belong *exclusively* to the genre *"Drama"*;
+- were filmed *after* the year 2000;
+- were filmed in 1999 *or* 2000;
+- have *more than* 4.5 stars, and were filmed *before* 1995.
+
+```{r}
+### ERROR HERE ###
+filter(movieLens, genres == "Drama")
+filter(movieLens, !genres == "Drama")
+filter(movieLens, year >= 2000)
+filter(movieLens, year == 1999 | month == 2000)
+filter(movieLens, rating > 4.5, year < 1995)
+```
+
+While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*.
But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or less columns some time in the future, the code will still work. Hint: there is a function to select "everything else"...
+
+```{r}
+### ERROR HERE ###
+movieLens %>%
+  filter(!genres == "Drama") %>%
+  select(title, genres, year, rating, timestamp)
+```
+
+## Exercise 2: Calculating with `mutate()`-like functions
+
+Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* is in camelCase). Let's clean these two variables to use *snake_case* instead, and assign our post-rename object back to "movieLens".
+
+```{r}
+### ERROR HERE ###
+movieLens <- movieLens %>%
+  rename(user_id == userId,
+         movie_id == movieId)
+```
+
+As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e drop all others!) but I forgot what that mystery function is. Can you remember?
+
+```{r}
+### ERROR HERE ###
+mutate(movieLens,
+       average_rating = mean(rating))
+```
+
+## Exercise 3: Calculating with `summarise()`-like functions
+
+Alone, `tally()` is a short form of `summarise()`. `count()` is short-hand for `group_by()` and `tally()`.
+
+Each entry of the movieLens table corresponds to a movie rating by a user. Therefore, if more than one user rated the same movie, there will be several entries for the same movie. I want to find out how many times each movie has been reviewed, or in other words, how many times each movie title appears in the dataset.
+
+```{r}
+movieLens %>%
+  group_by(title) %>%
+  tally()
+```
+
+Without using `group_by()`, I want to find out how many movie reviews there have been for each year.
+
+```{r}
+### ERROR HERE ###
+movieLens %>%
+  tally(year)
+```
+
+Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.
+
+```{r}
+### ERROR HERE ###
+movieLens %>%
+  count(c(title, rating), sort = TRUE)
+```
+
+Not only do `count()` and `tally()` quickly allow you to count items within your dataset, `add_tally()` and `add_count()` are handy shortcuts that add an additional columns to your tibble, rather than collapsing each group.
+
+## Exercise 4: Calculating with `group_by()`
+
+We can calculate the mean rating by year, and store it in a new column called `avg_rating`:
+
+```{r}
+movieLens %>%
+  group_by(year) %>%
+  summarize(avg_rating = mean(rating))
+```
+
+Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating`, and `max_rating`, respectively.
+
+```{r}
+### ERROR HERE ###
+movieLens %>%
+  mutate(min_rating = min(rating),
+         max_rating = max(rating))
+```
+
+## Exercise 5: Scoped variants with `across()`
+
+`across()` is a newer dplyr function (`dplyr` 1.0.0) that allows you to apply a transformation to multiple variables selected with the `select()` and `rename()` syntax. For this section, we will use the `starwars` dataset, which is built into R. First, let's transform it into a tibble and store it under the variable `starWars`.
+
+```{r}
+starWars <- as_tibble(starwars)
+```
+
+We can find the mean for all columns that are numeric, ignoring the missing values:
+
+```{r}
+starWars %>%
+  summarise(across(where(is.numeric), function(x) mean(x, na.rm=TRUE)))
+```
+
+We can find the minimum height and mass within each species, ignoring the missing values:
+
+```{r}
+### ERROR HERE ###
+starWars %>%
+  group_by(species) %>%
+  summarise(across("height", "mass", function(x) min(x, na.rm=TRUE)))
+```
+
+Note that here R has taken the convention that the minimum value of a set of `NA`s is `Inf`.
+
+## Exercise 6: Making tibbles
+
+Manually create a tibble with 4 columns:
+
+- `birth_year` should contain years 1998 to 2005 (inclusive);
+- `birth_weight` should take the `birth_year` column, subtract 1995, and multiply by 0.45;
+- `birth_location` should contain three locations (Liverpool, Seattle, and New York).
+
+```{r}
+### ERROR HERE ###
+fakeStarWars <- tribble(
+  ~name, ~birth_weight, ~birth_year, ~birth_location
+  "Luke Skywalker", 1.35 , 1998 , Liverpool, England,
+  "C-3PO" , 1.80 , 1999 , Liverpool, England,
+  "R2-D2" , 2.25 , 2000 , Seattle, WA,
+  "Darth Vader" , 2.70 , 2001 , Liverpool, England,
+  "Leia Organa" , 3.15 , 2002 , New York, NY,
+  "Owen Lars" , 3.60 , 2003 , Seattle, WA,
+  "Beru Whitesun Iars", 4.05 , 2004 , Liverpool, England,
+  "R5-D4" , 4.50 , 2005 , New York, NY,
+)
+```
+
+## Attributions
+
+Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.
\ No newline at end of file

From 284d2e05da46d1b2b7a31359d035f629fffed42d Mon Sep 17 00:00:00 2001
From: wangyubo <1179041704@qq.com>
Date: Wed, 27 Sep 2023 17:02:12 -0700
Subject: [PATCH 2/2] do exercise 6 with kinit, add md file and remove ###ERROR HERE### headers

---
 troubleshooting-2.Rmd |  26 ++--
 troubleshooting-2.md  | 327 ++++++++++++++++++++++++++++++------------
 2 files changed, 250 insertions(+), 103 deletions(-)

diff --git a/troubleshooting-2.Rmd b/troubleshooting-2.Rmd
index 5e33f64..37d3f14 100644
--- a/troubleshooting-2.Rmd
+++ b/troubleshooting-2.Rmd
@@ -15,7 +15,7 @@ Note that errors are not all syntactic (i.e., broken code)! Some are logical err

 [MovieLens](https://dl.acm.org/doi/10.1145/2827872) are a series of datasets widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/).
There are several MovieLens datasets, collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load other required packages.

-```{r}
+```{r eval=FALSE, include=FALSE}
 # Changed previous install.packages to have ""
 install.packages("dslabs")
 install.packages("tidyverse")
@@ -24,6 +24,12 @@ install.packages("devtools") # Do not run this if you already have this package
 devtools::install_github("JoeyBernhardt/singer")
 install.packages("gapminder")
 ```
+```{r message=FALSE, warning=FALSE}
+library("dslabs")
+library("tidyverse")
+library("stringr")
+library("gapminder")
+```

 Let's have a look at the dataset! My goal is to:

@@ -32,7 +38,7 @@ Let's have a look at the dataset! My goal is to:
 - Have a quick look at the tibble, using a *dplyr function*.

 ```{r}
-### ERROR HERE ###
+
 class(dslabs::movielens)
 movieLens <- as_tibble(dslabs::movielens)
 dim(movieLens)
@@ -50,7 +56,7 @@ Now that we've had a quick look at the dataset, it would be interesting to explo
 - have *more than* 4.5 stars, and were filmed *before* 1995.

 ```{r}
-### ERROR HERE ###
+
 filter(movieLens, genres == "Drama")
 # Changed this
 #filter(movieLens, !genres == "Drama")
@@ -73,7 +79,7 @@ filter(movieLens, rating > 4.5, year < 1995)
 While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*. But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or less columns some time in the future, the code will still work.
Hint: there is a function to select "everything else"...

 ```{r}
-### ERROR HERE ###
+
 movieLens %>%
   # Changed this
   #filter(!genres == "Drama") %>%
@@ -90,7 +96,7 @@ movieLens %>%
 Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* is in camelCase). Let's clean these two variables to use *snake_case* instead, and assign our post-rename object back to "movieLens".

 ```{r}
-### ERROR HERE ###
+
 movieLens <- movieLens %>%
   # Changed this
   #rename(user_id == userId,
@@ -106,7 +112,7 @@ head(movielens)
 As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e drop all others!) but I forgot what that mystery function is. Can you remember?

 ```{r}
-### ERROR HERE ###
+
 # Most likely, the prompt of the question refers to transmute, which "creates a new data frame containing only the specified computations"
 # (see: https://dplyr.tidyverse.org/reference/transmute.html)
 # Changed this
@@ -132,7 +138,7 @@ movieLens %>%
 Without using `group_by()`, I want to find out how many movie reviews there have been for each year.

 ```{r}
-### ERROR HERE ###
+
 #movieLens %>%
 # tally(year)
 #Changed to
@@ -143,7 +149,7 @@ movieLens %>% # Tally is used for grouped data, count is a short-hand for grou
 Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.

 ```{r}
-### ERROR HERE ###
+
 #movieLens %>%
 # count(c(title, rating), sort = TRUE)
 # changed to:
@@ -166,7 +172,7 @@ movieLens %>%
 Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating`, and `max_rating`, respectively.

 ```{r}
-### ERROR HERE ###
+
 #movieLens %>%
 # mutate(min_rating = min(rating),
 # max_rating = max(rating))
@@ -214,7 +220,7 @@ Manually create a tibble with 4 columns:
 - Modification: add *,* after `birth_location`, add *""* for birth_location value.

 ```{r}
-### ERROR HERE ###
+
 fakeStarWars <- tribble(
   ~name, ~birth_weight, ~birth_year, ~birth_location,
   "Luke Skywalker", 1.35 , 1998 , "Liverpool, England",
diff --git a/troubleshooting-2.md b/troubleshooting-2.md
index 73f5a44..f1ed73e 100644
--- a/troubleshooting-2.md
+++ b/troubleshooting-2.md
@@ -24,57 +24,12 @@ projects in data science courses and workshops. We’ll also load other
 required packages.

 ``` r
-# Changed previous install.packages to have ""
-install.packages("dslabs")
+library("dslabs")
+library("tidyverse")
+library("stringr")
+library("gapminder")
 ```

-    ## Installing package into '/opt/homebrew/lib/R/4.2/site-library'
-    ## (as 'lib' is unspecified)
-
-    ## Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
-
-``` r
-install.packages("tidyverse")
-```
-
-    ## Installing package into '/opt/homebrew/lib/R/4.2/site-library'
-    ## (as 'lib' is unspecified)
-
-    ## Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
-
-``` r
-install.packages("stringr")
-```
-
-    ## Installing package into '/opt/homebrew/lib/R/4.2/site-library'
-    ## (as 'lib' is unspecified)
-
-    ## Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
-
-``` r
-install.packages("devtools") # Do not run this if you already have this package installed!
-```
-
-    ## Installing package into '/opt/homebrew/lib/R/4.2/site-library'
-    ## (as 'lib' is unspecified)
-
-    ## Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
-
-``` r
-devtools::install_github("JoeyBernhardt/singer")
-```
-
-    ## Error in loadNamespace(x): there is no package called 'devtools'
-
-``` r
-install.packages("gapminder")
-```
-
-    ## Installing package into '/opt/homebrew/lib/R/4.2/site-library'
-    ## (as 'lib' is unspecified)
-
-    ## Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
-
 Let’s have a look at the dataset! My goal is to:

 - Find out the “class” of the dataset.
@@ -83,23 +38,17 @@ Let’s have a look at the dataset! My goal is to:
 - Have a quick look at the tibble, using a *dplyr function*.

 ``` r
-### ERROR HERE ###
 class(dslabs::movielens)
 ```

-    ## Error in loadNamespace(x): there is no package called 'dslabs'
+    ## [1] "data.frame"

 ``` r
 movieLens <- as_tibble(dslabs::movielens)
-```
-
-    ## Error in as_tibble(dslabs::movielens): could not find function "as_tibble"
-
-``` r
 dim(movieLens)
 ```

-    ## Error in eval(expr, envir, enclos): object 'movieLens' not found
+    ## [1] 100004 7

 ``` r
 # In addition to dim() (which is a part of base R), I used the dplyr function glipmse()
@@ -107,7 +56,15 @@ dim(movieLens)
 glimpse(movieLens)
 ```

-    ## Error in glimpse(movieLens): could not find function "glimpse"
+    ## Rows: 100,004
+    ## Columns: 7
+    ## $ movieId 31, 1029, 1061, 1129, 1172, 1263, 1287, 1293, 1339, 1343, 13…
+    ## $ title "Dangerous Minds", "Dumbo", "Sleepers", "Escape from New Yor…
+    ## $ year 1995, 1941, 1996, 1981, 1989, 1978, 1959, 1982, 1992, 1991, …
+    ## $ genres Drama, Animation|Children|Drama|Musical, Thriller, Action|Ad…
+    ## $ userId 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
+    ## $ rating 2.5, 3.0, 3.0, 2.0, 4.0, 2.0, 2.0, 2.0, 3.5, 2.0, 2.5, 1.0, …
+    ## $ timestamp 1260759144, 1260759179, 1260759182, 1260759185, 1260759205, …

 Now that we’ve had a quick look at
the dataset, it would be interesting
 to explore the rows (observations) in some more detail. I’d like to
@@ -120,11 +77,23 @@ consider the movie entries that…
 - have *more than* 4.5 stars, and were filmed *before* 1995.

 ``` r
-### ERROR HERE ###
 filter(movieLens, genres == "Drama")
 ```

-    ## Error in as.ts(x): object 'movieLens' not found
+    ## # A tibble: 7,757 × 7
+    ## movieId title year genres userId rating timestamp
+    ##
+    ## 1 31 Dangerous Minds 1995 Drama 1 2.5 1.26e9
+    ## 2 1172 Cinema Paradiso (Nuovo cinema P… 1989 Drama 1 4 1.26e9
+    ## 3 1293 Gandhi 1982 Drama 1 2 1.26e9
+    ## 4 62 Mr. Holland's Opus 1995 Drama 2 3 8.35e8
+    ## 5 261 Little Women 1994 Drama 2 4 8.35e8
+    ## 6 300 Quiz Show 1994 Drama 2 3 8.35e8
+    ## 7 508 Philadelphia 1993 Drama 2 4 8.35e8
+    ## 8 537 Sirens 1994 Drama 2 4 8.35e8
+    ## 9 2702 Summer of Sam 1999 Drama 3 3.5 1.30e9
+    ## 10 3949 Requiem for a Dream 2000 Drama 3 5 1.30e9
+    ## # ℹ 7,747 more rows

 ``` r
 # Changed this
@@ -133,7 +102,20 @@ filter(movieLens, genres == "Drama")
 filter(movieLens, genres != "Drama")
 ```

-    ## Error in as.ts(x): object 'movieLens' not found
+    ## # A tibble: 92,247 × 7
+    ## movieId title year genres userId rating timestamp
+    ##
+    ## 1 1029 Dumbo 1941 Animat… 1 3 1.26e9
+    ## 2 1061 Sleepers 1996 Thrill… 1 3 1.26e9
+    ## 3 1129 Escape from New York 1981 Action… 1 2 1.26e9
+    ## 4 1263 Deer Hunter, The 1978 Drama|… 1 2 1.26e9
+    ## 5 1287 Ben-Hur 1959 Action… 1 2 1.26e9
+    ## 6 1339 Dracula (Bram Stoker's Dracula) 1992 Fantas… 1 3.5 1.26e9
+    ## 7 1343 Cape Fear 1991 Thrill… 1 2 1.26e9
+    ## 8 1371 Star Trek: The Motion Picture 1979 Advent… 1 2.5 1.26e9
+    ## 9 1405 Beavis and Butt-Head Do America 1996 Advent… 1 1 1.26e9
+    ## 10 1953 French Connection, The 1971 Action… 1 4 1.26e9
+    ## # ℹ 92,237 more rows

 ``` r
 # Changed this
@@ -142,7 +124,20 @@ filter(movieLens, genres != "Drama")
 filter(movieLens, year > 2000)
 ```

-    ## Error in as.ts(x): object 'movieLens' not found
+    ## # A tibble: 25,481 × 7
+    ## movieId title year
genres userId rating timestamp
+    ##
+    ## 1 5349 Spider-Man 2002 Actio… 3 3 1.30e9
+    ## 2 5669 Bowling for Columbine 2002 Docum… 3 3.5 1.30e9
+    ## 3 6377 Finding Nemo 2003 Adven… 3 3 1.30e9
+    ## 4 7153 Lord of the Rings: The Return o… 2003 Actio… 3 2.5 1.30e9
+    ## 5 7361 Eternal Sunshine of the Spotles… 2004 Drama… 3 3 1.30e9
+    ## 6 8622 Fahrenheit 9/11 2004 Docum… 3 3.5 1.30e9
+    ## 7 8636 Spider-Man 2 2004 Actio… 3 3 1.30e9
+    ## 8 44191 V for Vendetta 2006 Actio… 3 3.5 1.30e9
+    ## 9 48783 Flags of Our Fathers 2006 Drama… 3 4.5 1.30e9
+    ## 10 50068 Letters from Iwo Jima 2006 Drama… 3 4.5 1.30e9
+    ## # ℹ 25,471 more rows

 ``` r
 # Changed this
@@ -151,13 +146,39 @@ filter(movieLens, year > 2000)
 filter(movieLens, year == 1999 | year == 2000)
 ```

-    ## Error in as.ts(x): object 'movieLens' not found
+    ## # A tibble: 9,088 × 7
+    ## movieId title year genres userId rating timestamp
+    ##
+    ## 1 2694 Big Daddy 1999 Comedy 3 3 1.30e9
+    ## 2 2702 Summer of Sam 1999 Drama 3 3.5 1.30e9
+    ## 3 2762 Sixth Sense, The 1999 Drama… 3 3.5 1.30e9
+    ## 4 2841 Stir of Echoes 1999 Horro… 3 4 1.30e9
+    ## 5 2858 American Beauty 1999 Drama… 3 4 1.30e9
+    ## 6 2959 Fight Club 1999 Actio… 3 5 1.30e9
+    ## 7 3510 Frequency 2000 Drama… 3 4 1.30e9
+    ## 8 3949 Requiem for a Dream 2000 Drama 3 5 1.30e9
+    ## 9 27369 Daria: Is It Fall Yet?
2000 Anima… 3 3.5 1.30e9
+    ## 10 2628 Star Wars: Episode I - The Phan… 1999 Actio… 4 5 9.50e8
+    ## # ℹ 9,078 more rows

 ``` r
 filter(movieLens, rating > 4.5, year < 1995)
 ```

-    ## Error in match.arg(method): object 'year' not found
+    ## # A tibble: 8,386 × 7
+    ## movieId title year genres userId rating timestamp
+    ##
+    ## 1 265 Like Water for Chocolate (Como … 1992 Drama… 2 5 8.35e8
+    ## 2 266 Legends of the Fall 1994 Drama… 2 5 8.35e8
+    ## 3 551 Nightmare Before Christmas, The 1993 Anima… 2 5 8.35e8
+    ## 4 589 Terminator 2: Judgment Day 1991 Actio… 2 5 8.35e8
+    ## 5 590 Dances with Wolves 1990 Adven… 2 5 8.35e8
+    ## 6 592 Batman 1989 Actio… 2 5 8.35e8
+    ## 7 318 Shawshank Redemption, The 1994 Crime… 3 5 1.30e9
+    ## 8 356 Forrest Gump 1994 Comed… 3 5 1.30e9
+    ## 9 1197 Princess Bride, The 1987 Actio… 3 5 1.30e9
+    ## 10 260 Star Wars: Episode IV - A New H… 1977 Actio… 4 5 9.50e8
+    ## # ℹ 8,376 more rows

 While filtering for *all movies that do not belong to the genre drama*
 above, I noticed something interesting. I want to filter for the same
@@ -168,7 +189,6 @@ less columns some time in the future, the code will still work.
Hint:
 there is a function to select “everything else”…

 ``` r
-### ERROR HERE ###
 movieLens %>%
   # Changed this
   #filter(!genres == "Drama") %>%
@@ -180,7 +200,20 @@ movieLens %>%
   select(title, genres, everything())
 ```

-    ## Error in movieLens %>% filter(genres != "Drama") %>% select(title, genres, : could not find function "%>%"
+    ## # A tibble: 92,247 × 7
+    ## title genres movieId year userId rating timestamp
+    ##
+    ## 1 Dumbo Animat… 1029 1941 1 3 1.26e9
+    ## 2 Sleepers Thrill… 1061 1996 1 3 1.26e9
+    ## 3 Escape from New York Action… 1129 1981 1 2 1.26e9
+    ## 4 Deer Hunter, The Drama|… 1263 1978 1 2 1.26e9
+    ## 5 Ben-Hur Action… 1287 1959 1 2 1.26e9
+    ## 6 Dracula (Bram Stoker's Dracula) Fantas… 1339 1992 1 3.5 1.26e9
+    ## 7 Cape Fear Thrill… 1343 1991 1 2 1.26e9
+    ## 8 Star Trek: The Motion Picture Advent… 1371 1979 1 2.5 1.26e9
+    ## 9 Beavis and Butt-Head Do America Advent… 1405 1996 1 1 1.26e9
+    ## 10 French Connection, The Action… 1953 1971 1 4 1.26e9
+    ## # ℹ 92,237 more rows

 ## Exercise 2: Calculating with `mutate()`-like functions

@@ -190,7 +223,6 @@ use *snake_case* instead, and assign our post-rename object back to
 “movieLens”.

 ``` r
-### ERROR HERE ###
 movieLens <- movieLens %>%
   # Changed this
   #rename(user_id == userId,
@@ -198,15 +230,24 @@ movieLens <- movieLens %>%
   # To this
   rename(user_id = userId,
          movie_id = movieId)
-```
-
-    ## Error in movieLens %>% rename(user_id = userId, movie_id = movieId): could not find function "%>%"

-``` r
 head(movielens)
 ```

-    ## Error in head(movielens): object 'movielens' not found
+    ## movieId title year
+    ## 1 31 Dangerous Minds 1995
+    ## 2 1029 Dumbo 1941
+    ## 3 1061 Sleepers 1996
+    ## 4 1129 Escape from New York 1981
+    ## 5 1172 Cinema Paradiso (Nuovo cinema Paradiso) 1989
+    ## 6 1263 Deer Hunter, The 1978
+    ## genres userId rating timestamp
+    ## 1 Drama 1 2.5 1260759144
+    ## 2 Animation|Children|Drama|Musical 1 3.0 1260759179
+    ## 3 Thriller 1 3.0 1260759182
+    ## 4 Action|Adventure|Sci-Fi|Thriller 1 2.0 1260759185
+    ## 5 Drama 1 4.0 1260759205
+    ## 6 Drama|War 1 2.0 1260759151

 As you already know, `mutate()` defines and inserts new variables into a
 tibble. There is *another mystery function similar to `mutate()`* that
@@ -216,7 +257,6 @@ entries, and I only want to see that variable (i.e drop all others!) but
 I forgot what that mystery function is. Can you remember?

 ``` r
-### ERROR HERE ###
 # Most likely, the prompt of the question refers to transmute, which "creates a new data frame containing only the specified computations"
 # (see: https://dplyr.tidyverse.org/reference/transmute.html)
 # Changed this
@@ -227,7 +267,20 @@ transmute(movieLens,
           average_rating = mean(rating))
 ```

-    ## Error in transmute(movieLens, average_rating = mean(rating)): could not find function "transmute"
+    ## # A tibble: 100,004 × 1
+    ## average_rating
+    ##
+    ## 1 3.54
+    ## 2 3.54
+    ## 3 3.54
+    ## 4 3.54
+    ## 5 3.54
+    ## 6 3.54
+    ## 7 3.54
+    ## 8 3.54
+    ## 9 3.54
+    ## 10 3.54
+    ## # ℹ 99,994 more rows

 ## Exercise 3: Calculating with `summarise()`-like functions

@@ -246,13 +299,25 @@ movieLens %>%
   tally()
 ```

-    ## Error in movieLens %>% group_by(title) %>% tally(): could not find function "%>%"
+    ## # A tibble: 8,832 × 2
+    ## title n
+    ##
+    ## 1 "\"Great Performances\" Cats" 2
+    ## 2 "$9.99" 3
+    ## 3 "'Hellboy': The Seeds of Creation" 1
+    ## 4 "'Neath the Arizona Skies" 1
+    ## 5 "'Round Midnight" 2
+    ## 6 "'Salem's Lot" 1
+    ## 7 "'Til There Was You" 4
+    ## 8 "'burbs, The" 19
+    ## 9 "'night Mother" 3
+    ## 10 "(500) Days of Summer" 45
+    ## # ℹ 8,822 more rows

 Without using `group_by()`, I want to find out how many movie reviews
 there have been for each year.

 ``` r
-### ERROR HERE ###
 #movieLens %>%
 # tally(year)
 #Changed to
@@ -260,14 +325,26 @@ movieLens %>% # Tally is used for grouped data, count is a short-hand for grou
   count(year)
 ```

-    ## Error in movieLens %>% count(year): could not find function "%>%"
+    ## # A tibble: 104 × 2
+    ## year n
+    ##
+    ## 1 1902 6
+    ## 2 1915 2
+    ## 3 1916 1
+    ## 4 1917 2
+    ## 5 1918 2
+    ## 6 1919 1
+    ## 7 1920 15
+    ## 8 1921 12
+    ## 9 1922 28
+    ## 10 1923 3
+    ## # ℹ 94 more rows

 Both `count()` and `tally()` can be grouped by multiple columns. Below,
 I want to count the number of movie reviews by title and rating, and
 sort the results.

 ``` r
-### ERROR HERE ###
 #movieLens %>%
 # count(c(title, rating), sort = TRUE)
 # changed to:
@@ -275,7 +352,20 @@ movieLens %>%
   count(title, rating, sort = TRUE) # c() function call should not be passed into count()
 ```

-    ## Error in movieLens %>% count(title, rating, sort = TRUE): could not find function "%>%"
+    ## # A tibble: 28,297 × 3
+    ## title rating n
+    ##
+    ## 1 Shawshank Redemption, The 5 170
+    ## 2 Pulp Fiction 5 138
+    ## 3 Star Wars: Episode IV - A New Hope 5 122
+    ## 4 Forrest Gump 4 113
+    ## 5 Schindler's List 5 109
+    ## 6 Godfather, The 5 107
+    ## 7 Forrest Gump 5 102
+    ## 8 Silence of the Lambs, The 4 102
+    ## 9 Fargo 5 100
+    ## 10 Silence of the Lambs, The 5 100
+    ## # ℹ 28,287 more rows

 Not only do `count()` and `tally()` quickly allow you to count items
 within your dataset, `add_tally()` and `add_count()` are handy shortcuts
@@ -293,14 +383,26 @@ movieLens %>%
   summarize(avg_rating = mean(rating))
 ```

-    ## Error in movieLens %>% group_by(year) %>% summarize(avg_rating = mean(rating)): could not find function "%>%"
+    ## # A tibble: 104 × 2
+    ## year avg_rating
+    ##
+    ## 1 1902 4.33
+    ## 2 1915 3
+    ## 3 1916 3.5
+    ## 4 1917 4.25
+    ## 5 1918 4.25
+    ## 6 1919 3
+    ## 7 1920 3.7
+    ## 8 1921 4.42
+    ## 9 1922 3.80
+    ## 10 1923 4.17
+    ## # ℹ 94 more rows

 Using `summarize()`, we can find the minimum and the maximum rating by
 title, stored under columns named `min_rating`, and `max_rating`,
 respectively.

 ``` r
-### ERROR HERE ###
 #movieLens %>%
 # mutate(min_rating = min(rating),
 # max_rating = max(rating))
@@ -310,7 +412,20 @@ movieLens %>%
   summarize(min_rating = min(rating, na.rm = TRUE), max_rating = max(rating, na.rm = TRUE))
 ```

-    ## Error in movieLens %>% group_by(title) %>% summarize(min_rating = min(rating, : could not find function "%>%"
+    ## # A tibble: 8,832 × 3
+    ## title min_rating max_rating
+    ##
+    ## 1 "\"Great Performances\" Cats" 0.5 3
+    ## 2 "$9.99" 2.5 4.5
+    ## 3 "'Hellboy': The Seeds of Creation" 2 2
+    ## 4 "'Neath the Arizona Skies" 0.5 0.5
+    ## 5 "'Round Midnight" 0.5 4
+    ## 6 "'Salem's Lot" 3.5 3.5
+    ## 7 "'Til There Was You" 0.5 4
+    ## 8 "'burbs, The" 1.5 4.5
+    ## 9 "'night Mother" 5 5
+    ## 10 "(500) Days of Summer" 0.5 5
+    ## # ℹ 8,822 more rows

 ## Exercise 5: Scoped variants with `across()`

@@ -324,8 +439,6 @@ into a tibble and store it under the variable `starWars`.
 starWars <- as_tibble(starwars)
 ```

-    ## Error in as_tibble(starwars): could not find function "as_tibble"
-
 We can find the mean for all columns that are numeric, ignoring the
 missing values:

@@ -334,7 +447,10 @@ starWars %>%
   summarise(across(where(is.numeric), function(x) mean(x, na.rm=TRUE)))
 ```

-    ## Error in starWars %>% summarise(across(where(is.numeric), function(x) mean(x, : could not find function "%>%"
+    ## # A tibble: 1 × 3
+    ## height mass birth_year
+    ##
+    ## 1 174. 97.3 87.6

 We can find the minimum height and mass within each species, ignoring
 the missing values:
@@ -346,7 +462,28 @@ starWars %>%
   summarise(across(c("height", "mass"), function(x) min(x, na.rm=TRUE)))
 ```

-    ## Error in starWars %>% group_by(species) %>% summarise(across(c("height", : could not find function "%>%"
+    ## Warning: There were 6 warnings in `summarise()`.
+    ## The first warning was:
+    ## ℹ In argument: `across(c("height", "mass"), function(x) min(x, na.rm = TRUE))`.
+    ## ℹ In group 4: `species = "Chagrian"`.
+    ## Caused by warning in `min()`:
+    ## !
no non-missing arguments to min; returning Inf
+    ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
+
+    ## # A tibble: 38 × 3
+    ## species height mass
+    ##
+    ## 1 Aleena 79 15
+    ## 2 Besalisk 198 102
+    ## 3 Cerean 198 82
+    ## 4 Chagrian 196 Inf
+    ## 5 Clawdite 168 55
+    ## 6 Droid 96 32
+    ## 7 Dug 112 40
+    ## 8 Ewok 88 20
+    ## 9 Geonosian 183 80
+    ## 10 Gungan 196 66
+    ## # ℹ 28 more rows

 Note that here R has taken the convention that the minimum value of a
 set of `NA`s is `Inf`.
@@ -364,7 +501,6 @@ Manually create a tibble with 4 columns:
 birth_location value.

 ``` r
-### ERROR HERE ###
 fakeStarWars <- tribble(
   ~name, ~birth_weight, ~birth_year, ~birth_location,
   "Luke Skywalker", 1.35 , 1998 , "Liverpool, England",
@@ -376,15 +512,20 @@ fakeStarWars <- tribble(
   "Beru Whitesun Iars", 4.05 , 2004 , "Liverpool, England",
   "R5-D4" , 4.50 , 2005 , "New York, NY"
 )
-```
-
-    ## Error in tribble(~name, ~birth_weight, ~birth_year, ~birth_location, "Luke Skywalker", : could not find function "tribble"
-
-``` r
 fakeStarWars
 ```

-    ## Error in eval(expr, envir, enclos): object 'fakeStarWars' not found
+    ## # A tibble: 8 × 4
+    ## name birth_weight birth_year birth_location
+    ##
+    ## 1 Luke Skywalker 1.35 1998 Liverpool, England
+    ## 2 C-3PO 1.8 1999 Liverpool, England
+    ## 3 R2-D2 2.25 2000 Seattle, WA
+    ## 4 Darth Vader 2.7 2001 Liverpool, England
+    ## 5 Leia Organa 3.15 2002 New York, NY
+    ## 6 Owen Lars 3.6 2003 Seattle, WA
+    ## 7 Beru Whitesun Iars 4.05 2004 Liverpool, England
+    ## 8 R5-D4 4.5 2005 New York, NY

 ## Attributions