core-methods-in-edm · vivianwang0213 · Oct 6, 2020
diff --git a/Assignment 2-2020.Rmd b/Assignment 2-2020.Rmd
@@ -1,7 +1,7 @@
 ---
 title: "Assignment 2"
-author: "Charles Lang"
-date: "September 24, 2020"
+author: "Yurui Wang"
+date: "October 5, 2020"
 output: html_document
 ---
 #Part I
@@ -16,14 +16,16 @@ watch.time = how long the student watched the video for
 confusion.points = how many times a student rewatched a section of a video
 key.points = how many times a student skipped or increased the speed of a video
 
-```{r}
+```{r, message=FALSE}
 #Install the 'tidyverse' package or if that does not work, install the 'dplyr' and 'tidyr' packages.
 
 #Load the package(s) you just installed
 
 library(tidyverse)
 library(tidyr)
 library(dplyr)
+library(car)
+library(janitor)
 
 D1 <- read.csv("video-data.csv", header = TRUE)
 
@@ -91,26 +93,40 @@ pairs(D5)
 1. Create a simulated data set containing 100 students, each with a score from 1-100 representing performance in an educational game. The scores should tend to cluster around 75. Also, each student should be given a classification that reflects one of four interest groups: sport, music, nature, literature.
 
 ```{r}
-#rnorm(100, 75, 15) creates a random sample with a mean of 75 and standard deviation of 15
-#pmin(x, 100) caps values at a maximum of 100, pmax(x, 1) raises values to a minimum of 1
-#round rounds numbers to whole number values
-#sample draws a random sample from the groups vector according to a uniform distribution
-
-
+#score <- rnorm(100, 75, 15)
+#S1 <- data.frame(score)
+#S1 <- filter(S1, score <= 100)
+#S2 <- data.frame(rep(100, 100-NROW(S1)))
+#names(S2) <- "score"
+#S3<-bind_rows(S1,S2)
+#interest<-c("sport", "music", "nature", "literature")
+#S3$interest<-sample(interest, 100, replace=TRUE)
+#S3$stid <- seq(1,100,1)
+score <- rnorm(100, 75, 15)
+S1 <- data.frame(score)
+S1 <- filter(S1, score <= 100)
+S2 <- data.frame(rep(100, 100-NROW(S1)))
+names(S2) <- "score"
+S3 <- bind_rows(S1,S2)
+interest <- c("sport", "music", "nature", "literature")
+S3$interest <- sample(interest, 100, replace = TRUE)
+S3$stid <- seq(1,100,1)
 ```
 
 2. Using base R commands, draw a histogram of the scores. Change the breaks in your histogram until you think they best represent your data.
 
 ```{r}
-
-```
+hist(S3$score, breaks = 10)
+``` 
 
 
 3. Create a new variable that groups the scores according to the breaks in your histogram.
 
 ```{r}
 #cut() divides the range of scores into intervals and codes each score according to the interval it falls in. We use a vector called `letters` as the labels; `letters` is a built-in vector made up of the lowercase letters of the alphabet.
 
+label<-letters[1:10]
+S3$breaks<-cut(S3$score, breaks=10, labels=label)
 ```
 
 4. Now using the colorbrewer package (RColorBrewer; http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3) design a palette and assign it to the groups in your data on the histogram.
@@ -121,45 +137,50 @@ library(RColorBrewer)
 
 #The top section of palettes are sequential, the middle section are qualitative, and the lower section are diverging.
 #Make RColorBrewer palette available to R and assign to your bins
-
+S3$colors <- brewer.pal(10, "Set3")[as.integer(S3$breaks)]
 #Use named palette in histogram
-
+hist(S3$score, breaks = 10, col = brewer.pal(10, "Set3"))
 ```
 
 
 5. Create a boxplot that visualizes the scores for each interest group and color each interest group a different color.
 
 ```{r}
 #Make a vector of the colors from RColorBrewer
-
+interest.col<-brewer.pal(4, "Dark2")
+boxplot(score ~ interest, S3, col=interest.col)
 ```
 
 
 6. Now simulate a new variable that describes the number of logins that students made to the educational game. They should vary from 1-25.
 
 ```{r}
-
+S3$login<-sample(1:25, 100, replace=TRUE)
 ```
 
 7. Plot the relationships between logins and scores. Give the plot a title and color the dots according to interest group.
 
 ```{r}
+login_score <- ggplot(S3, aes(x=login, y=score, color=interest)) +
+  geom_point()
 
-
+print(login_score + labs(title = "Student Logins vs. Scores"))
 ```
 
-
 8. R contains several inbuilt data sets, one of these is called AirPassengers. Plot a line graph of the airline passengers over time using this data set.
 
 ```{r}
-
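+#AirPassengers is a time series (ts) object, so plot() draws it as a line graph over time
+plot(AirPassengers, type = "l", ylab = "Monthly airline passengers (thousands)")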
 ```
 
 
 9. Using another inbuilt data set, iris, plot the relationships between all of the variables in the data set. Which of these relationships is it appropriate to run a correlation on? 
 
 ```{r}
-
+IR<-data.frame(iris)
+plot(iris)
+#It is appropriate to run a correlation on each pair of the four numeric variables: Sepal.Length and Sepal.Width, Sepal.Length and Petal.Length, Sepal.Length and Petal.Width, Sepal.Width and Petal.Length, Sepal.Width and Petal.Width, and Petal.Length and Petal.Width. Species is categorical, so a correlation is not appropriate for it.
 ```
 
 # Part III - Analyzing Swirl
@@ -183,18 +204,44 @@ The variables are:
 `skipped` - whether the student skipped the question  
 `datetime` - the date and time the student attempted the question  
 `hash` - anonymized student ID  
+```{r}
+DF1 <- read.csv("swirl-data.csv", header = TRUE)
+```
 
 3. Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called `DF2`
+```{r}
+DF2 <- select(DF1, hash, lesson_name, attempt)
+```
 
 4. Use the `group_by` function to create a data frame that sums all the attempts for each `hash` by each `lesson_name` called `DF3`
+```{r}
+DF3 <- DF2 %>% group_by(hash, lesson_name) %>% summarise(sum_attempt = sum(attempt))
+```
 
 5. On a scrap piece of paper draw what you think `DF3` would look like if all the lesson names were column names
+![All text](/Users/wyr/Desktop/HUDK 4050 Core methods in educational data mining/assignment2/assignment 2.jpeg)
 
 6. Convert `DF3` to this format  
+```{r}
+DF3 <- DF3[-c(14, 43, 53, 54, 91, 118, 128, 139, 166, 207, 226), ]
+DF3 %>%
+  pivot_wider(names_from = lesson_name, values_from = sum_attempt)
+```
+
+
+
 
 7. Create a new data frame from `DF1` called `DF4` that only includes the variables `hash`, `lesson_name` and `correct`
+```{r}
+DF4 <- select(DF1, hash, lesson_name, correct)
+```
+
 
 8. Convert the `correct` variable so that `TRUE` is coded as the **number** `1` and `FALSE` is coded as `0`  
+```{r}
+#as.logical() parses TRUE/FALSE values (logical or text); as.numeric() then codes them as 1/0
+DF4$correct <- as.numeric(as.logical(DF4$correct))
+```
 
 9. Create a new data frame called `DF5` that provides a mean score for each student on each course