Updated wrangling from master

codeasmanuscript · Nov 9, 2015 · 48d501f
1 parent 703606f
commit 48d501f

Showing 5 changed files with 94 additions and 55 deletions.
diff --git a/lessons/r-wrangling/assignment.md b/lessons/r-wrangling/assignment.md
@@ -23,34 +23,46 @@ output:
 ## Challenges: Try these out for yourself! ##
 
 Try each of these challenges using only one continuous chain of `%>%` pipes,
-from raw data to final output.
+from raw data to final output. Create a file in the `practice` repo under
+`your-name/wrangling` called `challenge.R`. The file location should look like
+`your-name/wrangling/challenge.R`. To get more practice with Git, **add and
+commit** after completing each challenge.
 
-1. Make a new dataframe with the means of Agriculture, Examination, Education, and 
-Infant.Mortality for each category of Fertility (hint: convert it into a factor
-by values of >50 vs <50), when Catholic is less than 60 (hint, use `dplyr` commands
-and `gather`).  Have the Fertility groups as two columns.
+1. Make a new dataframe with the means of Agriculture, Examination, Education,
+and Infant.Mortality for each category of Fertility (hint: convert it into a
+factor by values of >50 vs <50) and when Catholic is less than 60 (hint, use
+`dplyr` commands and `gather`).  Have the Fertility groups as two columns in the
+new dataframe ('Fertile', 'Infertile').
 
-2. Do the same thing as above, but instead make a new dataframe with one column
-with that contains the mean and standard deviation in this format: '00.00 (00.00
-SD)'. Notice that there are two digits after the period.
+2. Do the same thing as above, but instead make a new dataframe with one column 
+that contains the mean and standard deviation in this format: '00.00 (00.00 
+SD)'. Notice that there are two digits after the period. Hint: You'll need to
+use the `paste0()` function to combine means and SD.
 
-3. Create a new dataframe with the first column containing the variable names,
-the second column containing which county has the lowest value of the variable,
-and a third column containing the county with the highest value of the variable.
-For example, this is how it should approximately look like:
+3. Create a new dataframe that has only counties that have either the lowest or
+the highest value within *each* variable (i.e., there should be *at least* 12
+counties). The final dataframe should have the columns County, Variable, and
+Value, with at least 12 rows (there may be 1 or 2 more).
 
+4. Find the minimum, mean, and max of each of the variables using `dplyr`
+commands and pipes. The final dataframe should have 4 columns: 1) variable
+names, 2) min, 3) mean, and 4) max.
 
-|Variable    |Lowest   |Highest    |
-|:-----------|:--------|:----------|
-|Fertility   |Moutier  |Courtelary |
-|Agriculture |Delemont |Porrentruy |
+5. List all counties with measures that would typically be associated with 
+health: high education (greater than or equal to 8), low infant mortality (less
+than 18), and mid fertility (between 50 and 60). Keep only the county names in
+the final dataframe.
 
-### Creating plots (based on the last workshops material)
+6. (Advanced) Get a dataframe of correlation coefficients of all numeric
+variables, grouped by a new factor variable called `Educated` that splits
+Education at greater than 8. Hint: You will need to use the `do()` and
+`broom::tidy()` commands as seen at the bottom of the [Introduction page](../intro/). The final dataframe should have four columns: Educated ('yes'
+and 'no'), Var1, Var2, and Value (correlation coefficient). The dataframe should
+*not* contain any comparisons of the same variable (e.g., no Fertility with
+Fertility, which is the same as having a correlation coefficient of 1).
 
-1. Create a point plot of the means of each variable (not the county).  Have the
-variable on the y-axis and the means on the x-axis.  As a bonus/option, make the
-graph prettier.
-
-2. Expand on challenge 4, but split the means up by fertility (like in challenge
-1).  The graph should have two dots for each variable representing the means for
-each group of fertility.
+7. (Advanced) For those looking for a bigger challenge, try to run a linear
+regression on multiple independent variables and multiple dependent variables
+using what I described in [my blog post](http://www.lukewjohnston.com/blog/loops-forests-multiple-linear-regressions/).
+Include additional covariates. It's also briefly shown in the [Introduction to Data Wrangling page](../intro/). Work through and describe what is going on at
+each step.
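
A note on the '00.00 (00.00 SD)' format from challenge 2: `round()` drops trailing zeros once the number is turned into text, so `paste0()` combined with `round()` does not always give two digits after the period. The sketch below is one possible way around that, not the lesson's own solution. It swaps in `sprintf()` for `paste0()`, and both the helper name `format_mean_sd` and the `Educated` cut-off of 8 are hypothetical choices made only for illustration.

```r
library(dplyr)

# Hypothetical helper (not part of the lesson): formats a mean and SD as
# '00.00 (00.00 SD)', always keeping two digits after the period.
format_mean_sd <- function(x) {
    sprintf('%.2f (%.2f SD)', mean(x), sd(x))
}

# round() alone loses trailing zeros when the number becomes text
paste0(round(16, 2))    # "16"
sprintf('%.2f', 16)     # "16.00"

# Assumed grouping for illustration only: Education split at 8
result <- swiss %>%
    mutate(Educated = ifelse(Education >= 8, 'yes', 'no')) %>%
    group_by(Educated) %>%
    summarise(meanSD = format_mean_sd(Agriculture))

result
```

The same helper can be dropped into any `summarise()` call in a pipe chain, so the grouping variable can be replaced with the Fertility factor the challenge asks for.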
diff --git a/lessons/r-wrangling/cheatsheet.md b/lessons/r-wrangling/cheatsheet.md
@@ -243,7 +243,7 @@ swiss %>%
     group_by(EarlyDeath)
 ```
 
-## `summarise` ##
+## `summarise`, `mean`, `sd`, `median`, `quantile` ##
 
 > Create a new column of values, usually using a descriptive statistic function
 such as `mean()` or `median()`, as well as informational functions like `n()`
@@ -257,18 +257,26 @@ library(dplyr)
 swiss %>%
     mutate(Educated = ifelse(Education >= 50, 'yes', 'no')) %>%
     group_by(Educated) %>%
-    str()
-    summarise(mean = mean(Agriculture))
+    summarise(
+        mean = mean(Agriculture),
+        sd = sd(Agriculture),
+        median = median(Agriculture),
+        firstQuartile = quantile(Agriculture, 1 / 4),
+        meanSD = paste0(round(mean(Agriculture), 2),
+                        ' (', round(sd(Agriculture), 2),
+                        ')')
+        )
 ```
 
 
-## `gather` ##
+## `gather`, `add_rownames` ##
 
 > Convert a wide dataframe to a long dataframe.  This creates a dataframe with
 at least two new variables, one containing the names of the original variables
 and the other containing the values of the variables.  You can include or
 exclude certain variables by selecting the variable you want to include or
-exclude (with a `-`) after the name of the two new variables.
+exclude (with a `-`) after the name of the two new variables. The `add_rownames`
+function is also useful if the dataframe has a rowname attribute.
 
 > Example code:
 

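The `gather` description above, combined with moving row names into a column, can be sketched roughly as follows. This is a minimal illustration on the built-in `swiss` dataset, assuming a current `dplyr`/`tidyr` install. In recent versions `add_rownames()` is deprecated, so the sketch uses `tibble::rownames_to_column()` in its place; the column names `County`, `Variable`, and `Value` are assumptions chosen to match the assignment wording.

```r
library(dplyr)
library(tidyr)

long <- swiss %>%
    # Move the row names (county names) into a regular column.
    # tibble::rownames_to_column() stands in for the deprecated add_rownames().
    tibble::rownames_to_column('County') %>%
    # Wide to long: every column except County becomes a Variable/Value pair
    gather(Variable, Value, -County)

head(long)
```

Each of the 47 counties then appears once per original variable, which is the shape that `group_by(Variable)` followed by `summarise()` expects.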
diff --git a/lessons/r-wrangling/intro.md b/lessons/r-wrangling/intro.md
@@ -36,14 +36,14 @@ etc).
 1. How to import/export your data
 2. How to view the structure of your data
 3. How to wrangle data into an analyzable format
+4. How to use basic statistics to summarize your data
 
 # Materials for this lesson:
 
 - [Slides](slides/)
 - [Cheatsheet](cheatsheet/)
 - [Assignment](assignment/)
-
-Other resources can be found [here](../resources/).
+- [Resources](../resources/)
 
 # Let's get wrangling, the basics
 
@@ -130,21 +130,14 @@ function (via the `magrittr` package), which works similar to how the Bash shell
 `|` pipe works (for those familiar with Bash, ie. those who use Mac or Linux).
 The command on the right-hand side takes the output from the command on the
 left-hand side, just like how a plumbing pipe works for water.  `tbl_df` makes
-the object into a `tbl` class, making printing of the output nicer.
+the object into a `tbl` class, making printing of the output nicer. The other
+nice thing about `dplyr` is that it can connect to SQL and other types of
+databases and is typically faster at wrangling data than base R. Check out the
+[resources page](../resources/) for links to more about this.
 
 
 ```r
 library(dplyr)
-## 
-## Attaching package: 'dplyr'
-## 
-## The following objects are masked from 'package:stats':
-## 
-##     filter, lag
-## 
-## The following objects are masked from 'package:base':
-## 
-##     intersect, setdiff, setequal, union
 library(tidyr)
 
 ## Compare
@@ -567,17 +560,19 @@ ds2 %>%
   gather(Measure, Value) %>%
   group_by(Measure) %>%
   summarise(mean = mean(Value),
+            sd = sd(Value),
+            median = median(Value),
             sampleSize = n())
-## Source: local data frame [6 x 3]
+## Source: local data frame [6 x 5]
 ## 
-##            Measure     mean sampleSize
-##             (fctr)    (dbl)      (int)
-## 1        Fertility 70.14255         47
-## 2      Agriculture 50.65957         47
-## 3      Examination 16.48936         47
-## 4        Education 10.97872         47
-## 5         Catholic 41.14383         47
-## 6 Infant.Mortality 19.94255         47
+##            Measure     mean        sd median sampleSize
+##             (fctr)    (dbl)     (dbl)  (dbl)      (int)
+## 1        Fertility 70.14255 12.491697  70.40         47
+## 2      Agriculture 50.65957 22.711218  54.10         47
+## 3      Examination 16.48936  7.977883  16.00         47
+## 4        Education 10.97872  9.615407   8.00         47
+## 5         Catholic 41.14383 41.704850  15.14         47
+## 6 Infant.Mortality 19.94255  2.912697  20.00         47
 ```
 
 ## Other useful and powerful examples
@@ -600,5 +595,26 @@ ds2 %>%
     group_by(Dep, Indep) %>% 
     do(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .) %>% 
            broom::tidy())
-## Error in tidy.lm(.): could not find function "is"
+## Source: local data frame [16 x 7]
+## Groups: Dep, Indep [4]
+## 
+##          Dep       Indep             term    estimate   std.error
+##       (fctr)      (fctr)            (chr)       (dbl)       (dbl)
+## 1  Education   Fertility      (Intercept) 17.62086823  9.48591843
+## 2  Education   Fertility           Xvalue -0.34172055  0.11177490
+## 3  Education   Fertility Infant.Mortality  0.44332139  0.36837072
+## 4  Education   Fertility      Examination  0.51463767  0.16015339
+## 5  Education Agriculture      (Intercept) 12.75076261  9.86830392
+## 6  Education Agriculture           Xvalue -0.13540736  0.06135889
+## 7  Education Agriculture Infant.Mortality -0.21468852  0.35014860
+## 8  Education Agriculture      Examination  0.56818921  0.17549554
+## 9   Catholic   Fertility      (Intercept) 35.13701855 51.27597790
+## 10  Catholic   Fertility           Xvalue  0.39943317  0.60419739
+## 11  Catholic   Fertility Infant.Mortality  1.00336670  1.99122191
+## 12  Catholic   Fertility      Examination -2.54831847  0.86570652
+## 13  Catholic Agriculture      (Intercept) 48.62020731 51.22779001
+## 14  Catholic Agriculture           Xvalue  0.08447534  0.31852283
+## 15  Catholic Agriculture Infant.Mortality  1.69138478  1.81767192
+## 16  Catholic Agriculture      Examination -2.75852964  0.91102265
+## Variables not shown: statistic (dbl), p.value (dbl)
 ```
diff --git a/lessons/r-wrangling/slides.md b/lessons/r-wrangling/slides.md
@@ -22,11 +22,13 @@ output: slidy_presentation
 - How to import/export your data
 - How to view the structure of your data
 - How to wrangle data into an analyzable format
+- How to use basic statistics to summarize your data
 
-## 4 main concepts: ##
+## 5 main concepts: ##
 
 - **Getting the data**: read.table, write.table
 - **View the data**: str, summary, names, head
-- **Working the data (dplyr):** (tbl\_df), select, filter, mutate,
-  summarise, arrange, rename, group\_by, `%>%` pipe
+- **Working the data (dplyr):** (tbl\_df), select, filter, mutate, summarise,
+arrange, rename, group\_by, `%>%` pipe
 - **(Re)Organize the data (tidyr):** gather, spread
+- **Basic statistics**: mean, sd, median, quantile, var, max, min, range
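
The basic statistics functions listed in the new bullet can be tried directly on a single column. The following is a small sketch using only base R and the built-in `swiss` dataset; the list name `edu_stats` is a hypothetical name used here for illustration.

```r
# Education column from the built-in swiss dataset used in the lesson
education <- swiss$Education

edu_stats <- list(
    mean = mean(education),
    sd = sd(education),
    median = median(education),
    firstQuartile = quantile(education, 1 / 4),
    var = var(education),
    min = min(education),
    max = max(education),
    range = range(education)    # a vector of length two: c(min, max)
)

# var() is the square of sd(), up to floating point error
all.equal(edu_stats$var, edu_stats$sd^2)

edu_stats
```

Inside a `dplyr` chain, the same functions go into `summarise()`, one named argument per statistic.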
diff --git a/lessons/resources.md b/lessons/resources.md
@@ -96,6 +96,7 @@ categories:
 * [Shorter intro to `tidyr`](http://blog.rstudio.org/2014/07/22/introducing-tidyr/)
 * [Regular expressions](http://www.regular-expressions.info/)
 * [Regular expression symbol meaning](http://www.endmemo.com/program/R/gsub.php)
+* [`dplyr` with databases in SQL and other formats](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)
 
 ## R Markdown ##