From 48d501fdf7ec9c67be1b2b35170a371e7af58343 Mon Sep 17 00:00:00 2001 From: Luke W Johnston Date: Mon, 9 Nov 2015 11:14:26 -0500 Subject: [PATCH] Updated wrangling from master --- lessons/r-wrangling/assignment.md | 60 ++++++++++++++++++------------ lessons/r-wrangling/cheatsheet.md | 18 ++++++--- lessons/r-wrangling/intro.md | 62 +++++++++++++++++++------------ lessons/r-wrangling/slides.md | 8 ++-- lessons/resources.md | 1 + 5 files changed, 94 insertions(+), 55 deletions(-) diff --git a/lessons/r-wrangling/assignment.md b/lessons/r-wrangling/assignment.md index dd9f82e..a8ffa0c 100644 --- a/lessons/r-wrangling/assignment.md +++ b/lessons/r-wrangling/assignment.md @@ -23,34 +23,46 @@ output: ## Challenges: Try these out for yourself! ## Try each of these challenges using only one continuous chain of `%>%` pipes, -from raw data to final output. +from raw data to final output. Create a file in the `practice` repo under +`your-name/wrangling` called `challenge.R`. The file location should look like +`your-name/wrangling/challenge.R`. To get more practice with Git, **add and +commit** after completing each challenge. -1. Make a new dataframe with the means of Agriculture, Examination, Education, and -Infant.Mortality for each category of Fertility (hint: convert it into a factor -by values of >50 vs <50), when Catholic is less than 60 (hint, use `dplyr` commands -and `gather`). Have the Fertility groups as two columns. +1. Make a new dataframe with the means of Agriculture, Examination, Education, +and Infant.Mortality for each category of Fertility (hint: convert it into a +factor by values of >50 vs <50) and when Catholic is less than 60 (hint, use +`dplyr` commands and `gather`). Have the Fertility groups as two columns in the +new dataframe ('Fertile', 'Infertile'). -2. Do the same thing as above, but instead make a new dataframe with one column -with that contains the mean and standard deviation in this format: '00.00 (00.00 -SD)'. 
Notice that there are two digits after the period. +2. Do the same thing as above, but instead make a new dataframe with one column +that contains the mean and standard deviation in this format: '00.00 (00.00 +SD)'. Notice that there are two digits after the period. Hint: You'll need to +use the `paste0()` function to combine means and SD. -3. Create a new dataframe with the first column containing the variable names, -the second column containing which county has the lowest value of the variable, -and a third column containing the county with the highest value of the variable. -For example, this is how it should approximately look like: +3. Create a new dataframe that has only counties that have either the lowest or +the highest value within *each* variable (ie. there should be *at least* 12 +counties). The final dataframe should have the columns County, Variable, and +Value, with at least 12 rows (there may be 1 or 2 more). +4. Find the minimum, mean, and max of each of the variables using `dplyr` +commands and pipes. The final dataframe should have 4 columns: 1) variable +names, 2) min, 3) mean, and 4) max. -|Variable |Lowest |Highest | -|:-----------|:--------|:----------| -|Fertility |Moutier |Courtelary | -|Agriculture |Delemont |Porrentruy | +5. List all counties with measures that would typically be associated with +health: high education (greater than or equal to 8), low infant mortality (less +than 18), and mid fertility (between 50 to 60). Keep only the county names in +the final dataframe. -### Creating plots (based on the last workshops material) +6. (Advanced) Get a dataframe of correlation coefficients of all numeric +variables by a new variable called `Educated`, which is split at greater than 8 +for Education and will be a factor. Hint: You will need to use the `do()` and +`broom::tidy()` commands as seen at the bottom of the [Introduction page](../intro/).
The final dataframe should have four columns: Educated ('yes' +and 'no'), Var1, Var2, and Value (correlation coefficient). The dataframe should +*not* contain any comparisons of the same variable (eg. no Fertility with +Fertility, which is the same as having a correlation coefficient of 1) -1. Create a point plot of the means of each variable (not the county). Have the -variable on the y-axis and the means on the x-axis. As a bonus/option, make the -graph prettier. - -2. Expand on challenge 4, but split the means up by fertility (like in challenge -1). The graph should have two dots for each variable representing the means for -each group of fertility. +7. (Advanced) For those looking for a bigger challenge, try to run a linear +regression on multiple independent variables and multiple dependent variables +using what I described in [my blog post](http://www.lukewjohnston.com/blog/loops-forests-multiple-linear-regressions/). +Include additional covariates. It's also briefly shown in the [Introduction to Data Wrangling page](../intro/). Work through and describe what is going on at +each step. 
diff --git a/lessons/r-wrangling/cheatsheet.md b/lessons/r-wrangling/cheatsheet.md index b7a197d..9bd41c2 100644 --- a/lessons/r-wrangling/cheatsheet.md +++ b/lessons/r-wrangling/cheatsheet.md @@ -243,7 +243,7 @@ swiss %>% group_by(EarlyDeath) ``` -## `summarise` ## +## `summarise`, `mean`, `sd`, `median`, `quantile` ## > Create a new column of values, usually using a descriptive statistic function such as `mean()` or `median()`, as well as informational functions like `n()` @@ -257,18 +257,26 @@ library(dplyr) swiss %>% mutate(Educated = ifelse(Education >= 50, 'yes', 'no')) %>% group_by(Educated) %>% - str() - summarise(mean = mean(Agriculture)) + summarise( + mean = mean(Agriculture), + sd = sd(Agriculture), + median = median(Agriculture), + firstQuartile = quantile(Agriculture, 1 / 4), + meanSD = paste0(round(mean(Agriculture), 2), + ' (', round(sd(Agriculture), 2) + , ')') + ) ``` -## `gather` ## +## `gather`, `add_rownames` ## > Convert a wide dataframe to a long dataframe. This creates a dataframe with at least two new variables, one containing the names of the original variables and the other containing the values of the variables. You can include or exclude certain variables by selecting the variable you want to include or -exclude (with a `-`) after the name of the two new variables. +exclude (with a `-`) after the name of the two new variables. The `add_rownames` +function is also useful if the dataframe has a rowname attribute. > Example code: diff --git a/lessons/r-wrangling/intro.md b/lessons/r-wrangling/intro.md index 98f7af0..4fb4f08 100644 --- a/lessons/r-wrangling/intro.md +++ b/lessons/r-wrangling/intro.md @@ -36,14 +36,14 @@ etc). 1. How to import/export your data 2. How to view the structure of your data 3. How to wrangle data into an analyzable format +4.
How to use basic statistics to summarize your data # Materials for this lesson: - [Slides](slides/) - [Cheatsheet](cheatsheet/) - [Assignment](assignment/) - -Other resources can be found [here](../resources/). +- [Resources](../resources/) # Let's get wrangling, the basics @@ -130,21 +130,14 @@ function (via the `magrittr` package), which works similar to how the Bash shell `|` pipe works (for those familiar with Bash, ie. those who use Mac or Linux). The command on the right-hand side takes the output from the command on the left-hand side, just like how a plumbing pipe works for water. `tbl_df` makes -the object into a `tbl` class, making printing of the output nicer. +the object into a `tbl` class, making printing of the output nicer. The other +nice thing about `dplyr` is that it can connect to SQL and other types of +databases and is very fast at wrangling data, unlike base R. Check out the +[resources page](../resources/) for links to more about this. ```r library(dplyr) -## -## Attaching package: 'dplyr' -## -## The following objects are masked from 'package:stats': -## -## filter, lag -## -## The following objects are masked from 'package:base': -## -## intersect, setdiff, setequal, union library(tidyr) ## Compare @@ -567,17 +560,19 @@ ds2 %>% gather(Measure, Value) %>% group_by(Measure) %>% summarise(mean = mean(Value), + sd = sd(Value), + median = median(Value), sampleSize = n()) -## Source: local data frame [6 x 3] +## Source: local data frame [6 x 5] ## -## Measure mean sampleSize -## (fctr) (dbl) (int) -## 1 Fertility 70.14255 47 -## 2 Agriculture 50.65957 47 -## 3 Examination 16.48936 47 -## 4 Education 10.97872 47 -## 5 Catholic 41.14383 47 -## 6 Infant.Mortality 19.94255 47 +## Measure mean sd median sampleSize +## (fctr) (dbl) (dbl) (dbl) (int) +## 1 Fertility 70.14255 12.491697 70.40 47 +## 2 Agriculture 50.65957 22.711218 54.10 47 +## 3 Examination 16.48936 7.977883 16.00 47 +## 4 Education 10.97872 9.615407 8.00 47 +## 5 Catholic 41.14383
41.704850 15.14 47 +## 6 Infant.Mortality 19.94255 2.912697 20.00 47 ``` ## Other useful and powerful examples @@ -600,5 +595,26 @@ ds2 %>% group_by(Dep, Indep) %>% do(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .) %>% broom::tidy()) -## Error in tidy.lm(.): could not find function "is" +## Source: local data frame [16 x 7] +## Groups: Dep, Indep [4] +## +## Dep Indep term estimate std.error +## (fctr) (fctr) (chr) (dbl) (dbl) +## 1 Education Fertility (Intercept) 17.62086823 9.48591843 +## 2 Education Fertility Xvalue -0.34172055 0.11177490 +## 3 Education Fertility Infant.Mortality 0.44332139 0.36837072 +## 4 Education Fertility Examination 0.51463767 0.16015339 +## 5 Education Agriculture (Intercept) 12.75076261 9.86830392 +## 6 Education Agriculture Xvalue -0.13540736 0.06135889 +## 7 Education Agriculture Infant.Mortality -0.21468852 0.35014860 +## 8 Education Agriculture Examination 0.56818921 0.17549554 +## 9 Catholic Fertility (Intercept) 35.13701855 51.27597790 +## 10 Catholic Fertility Xvalue 0.39943317 0.60419739 +## 11 Catholic Fertility Infant.Mortality 1.00336670 1.99122191 +## 12 Catholic Fertility Examination -2.54831847 0.86570652 +## 13 Catholic Agriculture (Intercept) 48.62020731 51.22779001 +## 14 Catholic Agriculture Xvalue 0.08447534 0.31852283 +## 15 Catholic Agriculture Infant.Mortality 1.69138478 1.81767192 +## 16 Catholic Agriculture Examination -2.75852964 0.91102265 +## Variables not shown: statistic (dbl), p.value (dbl) ``` diff --git a/lessons/r-wrangling/slides.md b/lessons/r-wrangling/slides.md index 7657f66..206e033 100644 --- a/lessons/r-wrangling/slides.md +++ b/lessons/r-wrangling/slides.md @@ -22,11 +22,13 @@ output: slidy_presentation - How to import/export your data - How to view the structure of your data - How to wrangle data into an analyzable format +- How to use basic statistics to summarize your data -## 4 main concepts: ## +## 5 main concepts: ## - **Getting the data**: read.table, write.table - **View 
the data**: str, summary, names, head -- **Working the data (dplyr):** (tbl\_df), select, filter, mutate, - summarise, arrange, rename, group\_by, `%>%` pipe +- **Working the data (dplyr):** (tbl\_df), select, filter, mutate, summarise, +arrange, rename, group\_by, `%>%` pipe - **(Re)Organize the data (tidyr):** gather, spread +- **Basic statistics**: mean, sd, median, quantile, var, max, min, range diff --git a/lessons/resources.md b/lessons/resources.md index 4f8d6db..1197234 100644 --- a/lessons/resources.md +++ b/lessons/resources.md @@ -96,6 +96,7 @@ categories: * [Shorter intro to `tidyr`](http://blog.rstudio.org/2014/07/22/introducing-tidyr/) * [Regular expressions](http://www.regular-expressions.info/) * [Regular expression symbol meaning](http://www.endmemo.com/program/R/gsub.php) +* [`dplyr` with databases in SQL and other formats](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html) ## R Markdown ##