Merge pull request #25 from stat545ubc-2023/mile2ex6

Milestone 2 Exercise 6 (Graded)
stat545ubc-2023 · Sep 28, 2023 · 0e1ea27
2 parents 3a79227 + 284d2e0
commit 0e1ea27
Showing 2 changed files with 564 additions and 22 deletions.
diff --git a/troubleshooting-2.Rmd b/troubleshooting-2.Rmd
@@ -15,8 +15,7 @@ Note that errors are not all syntactic (i.e., broken code)! Some are logical err
 
 [MovieLens](https://dl.acm.org/doi/10.1145/2827872) is a series of datasets, widely used in education, that describe movie ratings from the MovieLens [website](https://movielens.org/). There are several MovieLens datasets, collected by the [GroupLens Research Project](https://grouplens.org/datasets/movielens/) at the University of Minnesota. Here, we load the MovieLens 100K dataset from Rafael Irizarry and Amy Gill's R package, [dslabs](https://cran.r-project.org/web/packages/dslabs/dslabs.pdf), which contains datasets useful for data analysis practice, homework, and projects in data science courses and workshops. We'll also load other required packages.
 
-```{r}
-### ERROR HERE ###
+```{r eval=FALSE, include=FALSE}
 # Changed previous install.packages to have ""
 install.packages("dslabs")
 install.packages("tidyverse")
@@ -25,6 +24,12 @@ install.packages("devtools") # Do not run this if you already have this package
 devtools::install_github("JoeyBernhardt/singer")
 install.packages("gapminder")
 ```
+```{r message=FALSE, warning=FALSE}
+library("dslabs")
+library("tidyverse")
+library("stringr")
+library("gapminder")
+```
 
 Let's have a look at the dataset! My goal is to:
 
@@ -33,7 +38,7 @@ Let's have a look at the dataset! My goal is to:
 -   Have a quick look at the tibble, using a *dplyr function*.
 
 ```{r}
-### ERROR HERE ###
+
 class(dslabs::movielens)
 movieLens <- as_tibble(dslabs::movielens)
 dim(movieLens)
@@ -51,9 +56,8 @@ Now that we've had a quick look at the dataset, it would be interesting to explo
 -   have *more than* 4.5 stars, and were filmed *before* 1995.
 
 ```{r}
-### ERROR HERE ###
-filter(movieLens, genres == "Drama")
 
+filter(movieLens, genres == "Drama")
 # Changed this 
 #filter(movieLens, !genres == "Drama")
 # To this:
@@ -69,14 +73,13 @@ filter(movieLens, year > 2000)
 # To this:
 filter(movieLens, year == 1999 | year == 2000)
 
-
 filter(movieLens, rating > 4.5, year < 1995)
 ```
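The filtering fixes above can be sketched on a small made-up tibble. The `toy` data and its values are hypothetical and are not taken from MovieLens; the sketch only assumes that dplyr is installed. It shows `!=` for "does not belong to", `%in%` as an alternative to chaining `==` with `|`, and a comma between conditions acting as AND.

```r
library(dplyr)

# Hypothetical toy data standing in for movieLens
toy <- tibble(
  title  = c("A", "B", "C", "D"),
  genres = c("Drama", "Comedy", "Drama", "Action"),
  year   = c(1999, 2000, 1994, 2005),
  rating = c(4.0, 3.5, 5.0, 2.0)
)

# Everything that is not a drama
not_drama <- filter(toy, genres != "Drama")

# Filmed in 1999 or 2000; equivalent to year == 1999 | year == 2000
y99_00 <- filter(toy, year %in% c(1999, 2000))

# More than 4.5 stars AND filmed before 1995
old_top <- filter(toy, rating > 4.5, year < 1995)
```

Writing `genres != "Drama"` and `!genres == "Drama"` gives the same rows here, but the first form is usually easier to read.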
 
 While filtering for *all movies that do not belong to the genre drama* above, I noticed something interesting. I want to filter for the same thing again, this time selecting variables **title and genres first,** and then *everything else*. But I want to do this in a robust way, so that (for example) if I end up changing `movieLens` to contain more or fewer columns some time in the future, the code will still work. Hint: there is a function to select "everything else"...
 
 ```{r}
-### ERROR HERE ###
+
 movieLens %>%
   # Changed this
   #filter(!genres == "Drama") %>%
@@ -93,7 +96,7 @@ movieLens %>%
 Some of the variables in the `movieLens` dataset are in *camelCase* (in fact, *movieLens* is in camelCase). Let's clean these two variables to use *snake_case* instead, and assign our post-rename object back to "movieLens".
 
 ```{r}
-### ERROR HERE ###
+
 movieLens <- movieLens %>%
   # Changed this
   #rename(user_id == userId,
@@ -103,12 +106,13 @@ movieLens <- movieLens %>%
          movie_id = movieId)
 
 head(movieLens)
+
 ```
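The renaming fix can be illustrated in isolation. This is a minimal sketch on a hypothetical two-column tibble named `ids` (not part of the worksheet), assuming dplyr is installed. `rename()` takes `new_name = old_name` pairs, so the comparison operator `==` is not valid there.

```r
library(dplyr)

# Hypothetical tibble with camelCase column names
ids <- tibble(userId = 1:3, movieId = c(10L, 20L, 30L))

# rename() expects new_name = old_name, using `=` rather than `==`
ids_snake <- ids %>%
  rename(user_id = userId,
         movie_id = movieId)

names(ids_snake)
```

Because R is case-sensitive, the renamed object has to be referenced by the exact name it was assigned to.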
 
 As you already know, `mutate()` defines and inserts new variables into a tibble. There is *another mystery function similar to `mutate()`* that adds the new variable, but also drops existing ones. I wanted to create an `average_rating` column that takes the `mean(rating)` across all entries, and I only want to see that variable (i.e., drop all others!) but I forgot what that mystery function is. Can you remember?
 
 ```{r}
-### ERROR HERE ### 
+ 
 # Most likely, the prompt of the question refers to transmute, which "creates a new data frame containing only the specified computations"
 # (see: https://dplyr.tidyverse.org/reference/transmute.html)
 # Changed this
@@ -134,7 +138,7 @@ movieLens %>%
 Without using `group_by()`, I want to find out how many movie reviews there have been for each year.
 
 ```{r}
-### ERROR HERE ###
+
 #movieLens %>%
 #  tally(year)
 #Changed to 
@@ -145,7 +149,7 @@ movieLens %>%   # Tally is used for grouped data, count is a short-hand for grou
 Both `count()` and `tally()` can be grouped by multiple columns. Below, I want to count the number of movie reviews by title and rating, and sort the results.
 
 ```{r}
-### ERROR HERE ###
+
 #movieLens %>%
 #  count(c(title, rating), sort = TRUE)
 # changed to: 
@@ -168,14 +172,15 @@ movieLens %>%
 Using `summarize()`, we can find the minimum and the maximum rating by title, stored under columns named `min_rating`, and `max_rating`, respectively.
 
 ```{r}
-### ERROR HERE ###
+
 #movieLens %>%
 #  mutate(min_rating = min(rating), 
 #         max_rating = max(rating))
 # Changed to: (the pipeline did not group data by title)
 movieLens %>%
   group_by(title) %>%
   summarize(min_rating = min(rating, na.rm = TRUE), max_rating = max(rating, na.rm = TRUE)) 
+
 ```
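The counting and summarizing steps above can be compared side by side. The `reviews` tibble below is hypothetical and the sketch assumes dplyr is installed. It shows that `count()` on several columns is roughly a shorthand for `group_by()` followed by `tally()`, and that `summarize()` returns one row per group only after grouping.

```r
library(dplyr)

# Hypothetical ratings data
reviews <- tibble(
  title  = c("A", "A", "A", "B", "B"),
  rating = c(4, 4, 5, 3, 3)
)

# count() takes bare column names, not c(title, rating)
by_count <- reviews %>%
  count(title, rating, sort = TRUE)

# Roughly the same counts via an explicit group_by() + tally()
by_tally <- reviews %>%
  group_by(title, rating) %>%
  tally(sort = TRUE) %>%
  ungroup()

# Minimum and maximum rating per title
ranges <- reviews %>%
  group_by(title) %>%
  summarize(min_rating = min(rating), max_rating = max(rating))
```

Without the `group_by(title)` step, `summarize()` would collapse the whole table to a single row.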
 
 ## Exercise 5: Scoped variants with `across()`
@@ -212,22 +217,25 @@ Manually create a tibble with 4 columns:
 -   `birth_year` should contain years 1998 to 2005 (inclusive);
 -   `birth_weight` should take the `birth_year` column, subtract 1995, and multiply by 0.45;
 -   `birth_location` should contain three locations (Liverpool, Seattle, and New York).
+-   Modification: added a trailing comma after `~birth_location` in the header row and wrapped each `birth_location` value in double quotes.
 
 ```{r}
-### ERROR HERE ###
+
 fakeStarWars <- tribble(
-  ~name,            ~birth_weight,  ~birth_year, ~birth_location
-  "Luke Skywalker",  1.35      ,   1998        ,  Liverpool, England,
-  "C-3PO"         ,  1.80      ,   1999        ,  Liverpool, England,
-  "R2-D2"         ,  2.25      ,   2000        ,  Seattle, WA,
-  "Darth Vader"   ,  2.70      ,   2001        ,  Liverpool, England,
-  "Leia Organa"   ,  3.15      ,   2002        ,  New York, NY,
-  "Owen Lars"     ,  3.60      ,   2003        ,  Seattle, WA,
-  "Beru Whitesun Iars", 4.05   ,   2004        ,  Liverpool, England,
-  "R5-D4"         ,  4.50      ,   2005        ,  New York, NY,
+  ~name,            ~birth_weight,  ~birth_year, ~birth_location,
+  "Luke Skywalker",  1.35      ,   1998        ,  "Liverpool, England",
+  "C-3PO"         ,  1.80      ,   1999        ,  "Liverpool, England",
+  "R2-D2"         ,  2.25      ,   2000        ,  "Seattle, WA",
+  "Darth Vader"   ,  2.70      ,   2001        ,  "Liverpool, England",
+  "Leia Organa"   ,  3.15      ,   2002        ,  "New York, NY",
+  "Owen Lars"     ,  3.60      ,   2003        ,  "Seattle, WA",
+  "Beru Whitesun Lars", 4.05   ,   2004        ,  "Liverpool, England",
+  "R5-D4"         ,  4.50      ,   2005        ,  "New York, NY"
 )
+fakeStarWars
 ```
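As a cross-check on the table above, the `birth_weight` values can be derived from `birth_year` instead of being typed by hand. This is only a sketch under the rule stated in the exercise (subtract 1995, multiply by 0.45); the object names `computed` and `locations` are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Derive birth_weight from birth_year using the rule in the exercise
computed <- tibble(birth_year = 1998:2005) %>%
  mutate(birth_weight = (birth_year - 1995) * 0.45)

# In tribble(), a value that contains a comma must be quoted,
# otherwise the comma is read as a cell separator
locations <- tribble(
  ~name,   ~birth_location,
  "R2-D2", "Seattle, WA"
)
```

Computing the column this way avoids transcription slips if the range of years changes later.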
 
 ## Attributions
 
 Thanks to Icíar Fernández-Boyano for writing most of this document, and Albina Gibadullina, Diana Lin, Yulia Egorova, and Vincenzo Coia for their edits.
+