Updated wrangling from master
lwjohnst86 committed Nov 9, 2015
1 parent 703606f commit 48d501f
Showing 5 changed files with 94 additions and 55 deletions.
60 changes: 36 additions & 24 deletions lessons/r-wrangling/assignment.md
@@ -23,34 +23,46 @@ output:
## Challenges: Try these out for yourself! ##

Try each of these challenges using only one continuous chain of `%>%` pipes,
from raw data to final output.
from raw data to final output. Create a file in the `practice` repo under
`your-name/wrangling` called `challenge.R`. The file location should look like
`your-name/wrangling/challenge.R`. To get more practice with Git, **add and
commit** after completing each challenge.

1. Make a new dataframe with the means of Agriculture, Examination, Education, and
Infant.Mortality for each category of Fertility (hint: convert it into a factor
by values of >50 vs <50), when Catholic is less than 60 (hint, use `dplyr` commands
and `gather`). Have the Fertility groups as two columns.
1. Make a new dataframe with the means of Agriculture, Examination, Education,
and Infant.Mortality for each category of Fertility (hint: convert it into a
factor by values of >50 vs <50) and when Catholic is less than 60 (hint, use
`dplyr` commands and `gather`). Have the Fertility groups as two columns in the
new dataframe ('Fertile', 'Infertile').

2. Do the same thing as above, but instead make a new dataframe with one column
with that contains the mean and standard deviation in this format: '00.00 (00.00
SD)'. Notice that there are two digits after the period.
2. Do the same thing as above, but instead make a new dataframe with one column
that contains the mean and standard deviation in this format: '00.00 (00.00
SD)'. Notice that there are two digits after the period. Hint: You'll need to
use the `paste0()` function to combine means and SD.

3. Create a new dataframe with the first column containing the variable names,
the second column containing which county has the lowest value of the variable,
and a third column containing the county with the highest value of the variable.
For example, this is how it should approximately look like:
3. Create a new dataframe that has only counties that have either the lowest or
the highest value within *each* variable (ie. there should be *at least* 12
counties). The final dataframe should have the columns County, Variable, and
Value, with at least 12 rows (there may be 1 or 2 more).

4. Find the minimum, mean, and max of each of the variables using `dplyr`
commands and pipes. The final dataframe should have 4 columns: 1) variable
names, 2) min, 3) mean, and 4) max.

|Variable |Lowest |Highest |
|:-----------|:--------|:----------|
|Fertility |Moutier |Courtelary |
|Agriculture |Delemont |Porrentruy |
5. List all counties with measures that would typically be associated with
health: high education (greater than or equal to 8), low infant mortality (less
than 18), and mid fertility (between 50 and 60). Keep only the county names in
the final dataframe.

### Creating plots (based on the last workshops material)
6. (Advanced) Get a dataframe of correlation coefficients of all numeric
variables by a new variable called `Educated`, which is split at greater than 8
for Education and will be a factor. Hint: You will need to use the `do()` and
`broom::tidy()` commands as seen at the bottom of the [Introduction page](../intro/). The final dataframe should have four columns: Educated ('yes'
and 'no'), Var1, Var2, and Value (correlation coefficient). The dataframe should
*not* contain any comparisons of the same variable (eg. no Fertility with
Fertility, which is the same as having a correlation coefficient of 1).

1. Create a point plot of the means of each variable (not the county). Have the
variable on the y-axis and the means on the x-axis. As a bonus/option, make the
graph prettier.

2. Expand on challenge 4, but split the means up by fertility (like in challenge
1). The graph should have two dots for each variable representing the means for
each group of fertility.
7. (Advanced) For those looking for a bigger challenge, try to run a linear
regression on multiple independent variables and multiple dependent variables
using what I described in [my blog post](http://www.lukewjohnston.com/blog/loops-forests-multiple-linear-regressions/).
Include additional covariates. It's also briefly shown in the [Introduction to Data Wrangling page](../intro/). Work through and describe what is going on at
each step.
18 changes: 13 additions & 5 deletions lessons/r-wrangling/cheatsheet.md
@@ -243,7 +243,7 @@ swiss %>%
group_by(EarlyDeath)
```

## `summarise` ##
## `summarise`, `mean`, `sd`, `median`, `quantile` ##

> Create a new column of values, usually using a descriptive statistic function
such as `mean()` or `median()`, as well as informational functions like `n()`
@@ -257,18 +257,26 @@ library(dplyr)
swiss %>%
mutate(Educated = ifelse(Education >= 50, 'yes', 'no')) %>%
group_by(Educated) %>%
str()
summarise(mean = mean(Agriculture))
summarise(
mean = mean(Agriculture),
sd = sd(Agriculture),
median = median(Agriculture),
firstQuartile = quantile(Agriculture, 1 / 4),
meanSD = paste0(round(mean(Agriculture), 2),
' (', round(sd(Agriculture), 2)
, ')')
)
```
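
The `round()` call above drops trailing zeros once the number is pasted into text, so a mean of 2.5 prints as '2.5' rather than '2.50'. As a minimal sketch (the `toy` dataframe and the use of `sprintf()` are assumptions for illustration, not part of the cheatsheet), fixed two-decimal formatting could look like this:

```r
library(dplyr)

# Hypothetical toy data: the mean is exactly 2.5 and the SD is exactly 1
toy <- data.frame(x = c(1.5, 2.5, 3.5))

formatted <- toy %>%
    summarise(
        # round() then paste0(): trailing zeros are lost
        rounded = paste0(round(mean(x), 2), ' (', round(sd(x), 2), ' SD)'),
        # sprintf() with '%.2f' always prints two digits after the period
        fixed = sprintf('%.2f (%.2f SD)', mean(x), sd(x))
    )
formatted
```

Here `rounded` comes out as '2.5 (1 SD)' while `fixed` comes out as '2.50 (1.00 SD)'.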


## `gather` ##
## `gather`, `add_rownames` ##

> Convert a wide dataframe to a long dataframe. This creates a dataframe with
at least two new variables, one containing the names of the original variables
and the other containing the values of the variables. You can include or
exclude certain variables by selecting the variable you want to include or
exclude (with a `-`) after the name of the two new variables.
exclude (with a `-`) after the name of the two new variables. The `add_rownames`
function is also useful if the dataframe has a rowname attribute.

> Example code:
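
A minimal sketch of the wide-to-long conversion described above, assuming the built-in `swiss` dataset (which stores the county names as row names). The column names `County`, `Measure`, and `Value` are chosen here for illustration. Note that `add_rownames()` is deprecated in recent `dplyr` releases in favour of `tibble::rownames_to_column()`, so it may print a warning.

```r
library(dplyr)
library(tidyr)

long <- swiss %>%
    # Move the row names into a regular column called County
    add_rownames('County') %>%
    # Stack every column except County into name/value pairs
    gather(Measure, Value, -County)

head(long)
```

Each county now appears once per variable, so the 47 rows by 6 variables of `swiss` become 282 rows with three columns.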
62 changes: 39 additions & 23 deletions lessons/r-wrangling/intro.md
@@ -36,14 +36,14 @@ etc).
1. How to import/export your data
2. How to view the structure of your data
3. How to wrangle data into an analyzable format
4. How to use basic statistics to summarize your data

# Materials for this lesson:

- [Slides](slides/)
- [Cheatsheet](cheatsheet/)
- [Assignment](assignment/)

Other resources can be found [here](../resources/).
- [Resources](../resources/)

# Let's get wrangling, the basics

@@ -130,21 +130,14 @@ function (via the `magrittr` package), which works similar to how the Bash shell
`|` pipe works (for those familiar with Bash, ie. those who use Mac or Linux).
The command on the right-hand side takes the output from the command on the
left-hand side, just like how a plumbing pipe works for water. `tbl_df` makes
the object into a `tbl` class, making printing of the output nicer.
the object into a `tbl` class, making printing of the output nicer. The other
nice thing about `dplyr` is that it can connect to SQL and other types of
databases, and it is usually much faster than base R at wrangling data. Check
out the [resources page](../resources/) for links to more about this.
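
As a small sketch of how the pipe passes the left-hand side into the right-hand side, the two versions below select the same columns and keep the first three rows. The choice of columns and the object names `nested` and `piped` are only for illustration.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(select(swiss, Education, Fertility), 3)

# Piped calls: read from top to bottom, each result feeds the next command
piped <- swiss %>%
    select(Education, Fertility) %>%
    head(3)

# Both approaches give the same dataframe
identical(nested, piped)
```

The piped version does the same work, but the order of the steps matches the order you read them in.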


```r
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)

## Compare
@@ -567,17 +560,19 @@ ds2 %>%
gather(Measure, Value) %>%
group_by(Measure) %>%
summarise(mean = mean(Value),
sd = sd(Value),
median = median(Value),
sampleSize = n())
## Source: local data frame [6 x 3]
## Source: local data frame [6 x 5]
##
## Measure mean sampleSize
## (fctr) (dbl) (int)
## 1 Fertility 70.14255 47
## 2 Agriculture 50.65957 47
## 3 Examination 16.48936 47
## 4 Education 10.97872 47
## 5 Catholic 41.14383 47
## 6 Infant.Mortality 19.94255 47
## Measure mean sd median sampleSize
## (fctr) (dbl) (dbl) (dbl) (int)
## 1 Fertility 70.14255 12.491697 70.40 47
## 2 Agriculture 50.65957 22.711218 54.10 47
## 3 Examination 16.48936 7.977883 16.00 47
## 4 Education 10.97872 9.615407 8.00 47
## 5 Catholic 41.14383 41.704850 15.14 47
## 6 Infant.Mortality 19.94255 2.912697 20.00 47
```

## Other useful and powerful examples
@@ -600,5 +595,26 @@ ds2 %>%
group_by(Dep, Indep) %>%
do(lm(Yvalue ~ Xvalue + Infant.Mortality + Examination, data = .) %>%
broom::tidy())
## Error in tidy.lm(.): could not find function "is"
## Source: local data frame [16 x 7]
## Groups: Dep, Indep [4]
##
## Dep Indep term estimate std.error
## (fctr) (fctr) (chr) (dbl) (dbl)
## 1 Education Fertility (Intercept) 17.62086823 9.48591843
## 2 Education Fertility Xvalue -0.34172055 0.11177490
## 3 Education Fertility Infant.Mortality 0.44332139 0.36837072
## 4 Education Fertility Examination 0.51463767 0.16015339
## 5 Education Agriculture (Intercept) 12.75076261 9.86830392
## 6 Education Agriculture Xvalue -0.13540736 0.06135889
## 7 Education Agriculture Infant.Mortality -0.21468852 0.35014860
## 8 Education Agriculture Examination 0.56818921 0.17549554
## 9 Catholic Fertility (Intercept) 35.13701855 51.27597790
## 10 Catholic Fertility Xvalue 0.39943317 0.60419739
## 11 Catholic Fertility Infant.Mortality 1.00336670 1.99122191
## 12 Catholic Fertility Examination -2.54831847 0.86570652
## 13 Catholic Agriculture (Intercept) 48.62020731 51.22779001
## 14 Catholic Agriculture Xvalue 0.08447534 0.31852283
## 15 Catholic Agriculture Infant.Mortality 1.69138478 1.81767192
## 16 Catholic Agriculture Examination -2.75852964 0.91102265
## Variables not shown: statistic (dbl), p.value (dbl)
```
8 changes: 5 additions & 3 deletions lessons/r-wrangling/slides.md
@@ -22,11 +22,13 @@ output: slidy_presentation
- How to import/export your data
- How to view the structure of your data
- How to wrangle data into an analyzable format
- How to use basic statistics to summarize your data

## 4 main concepts: ##
## 5 main concepts: ##

- **Getting the data**: read.table, write.table
- **View the data**: str, summary, names, head
- **Working the data (dplyr):** (tbl\_df), select, filter, mutate,
summarise, arrange, rename, group\_by, `%>%` pipe
- **Working the data (dplyr):** (tbl\_df), select, filter, mutate, summarise,
arrange, rename, group\_by, `%>%` pipe
- **(Re)Organize the data (tidyr):** gather, spread
- **Basic statistics**: mean, sd, median, quantile, var, max, min, range
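
A quick sketch of the basic statistics functions listed above, run on one column of the built-in `swiss` dataset. The choice of `Education` and the object name `x` are only for illustration.

```r
# Education column of the built-in swiss dataset
x <- swiss$Education

mean(x)                       # arithmetic mean
sd(x)                         # standard deviation
var(x)                        # variance, equal to sd(x)^2
median(x)                     # middle value
quantile(x, c(0.25, 0.75))    # first and third quartiles
min(x)                        # smallest value
max(x)                        # largest value
range(x)                      # smallest and largest value together
```

All of these work on a plain numeric vector, which is why they can be dropped straight into `summarise()`.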
1 change: 1 addition & 0 deletions lessons/resources.md
@@ -96,6 +96,7 @@ categories:
* [Shorter intro to `tidyr`](http://blog.rstudio.org/2014/07/22/introducing-tidyr/)
* [Regular expressions](http://www.regular-expressions.info/)
* [Regular expression symbol meaning](http://www.endmemo.com/program/R/gsub.php)
* [`dplyr` with databases in SQL and other formats](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html)

## R Markdown ##
