# Introduction to R
This introduction to R is very brief and provides only the basics needed to understand and run the code associated with the content. It is aimed at an audience that likely has no programming experience whatsoever, but may have had some exposure to traditional statistics packages. If you have some basic familiarity with R you may skip this chapter, though it might serve as a refresher for some.
## Getting Started
### Installation
As mentioned previously, to begin using R on your own machine, you just need to go to the [R website](https://www.r-project.org/), download it for your operating system, and install it. Then go to [RStudio](https://www.rstudio.com/), and download and install it as well. From there on you only need to open RStudio to use R.
Updates occur every few months, and you should update R whenever a new version is released.
### Packages
As soon as you install it, R is already the most powerful statistical environment within which to work. However, its real strength comes from the community, which has added thousands of <span class="emph">packages</span> that provide additional or enhanced functionality. You will regularly find packages that specifically do something you want, and will need to install them in order to use them.
RStudio provides a **Packages** tab, but it is usually at least as efficient to use the <span class="func">install.packages</span> function.
<img src="img/packagesTab.png" style="display:block; margin: 0 auto;">
```{r installpackage, eval=F}
install.packages('mynewfavorite')
```
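In scripts that get shared or rerun, it can help to install only what is missing. The following is a minimal sketch of that idea; `mynewfavorite` is the same placeholder name used above rather than a real package, and the object names are only for illustration.

```{r installIfMissing}
# 'stats' ships with R; 'mynewfavorite' is a placeholder, not a real package
wanted = c('stats', 'mynewfavorite')

# requireNamespace() returns FALSE, rather than an error, if a package is absent
installed = vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missingPkgs = wanted[!installed]
missingPkgs

# install.packages(missingPkgs)   # uncomment to install real packages from CRAN
```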
At this point there are over 7000 packages available through standard sources, and many more through unofficial ones. To start getting some ideas of what you want to use, you will want to spend time at places like [CRAN Task Views](https://cran.r-project.org/web/views/), [Rdocumentation](http://www.rdocumentation.org/), or with a list [like this one](https://github.com/qinwf/awesome-R).
The main thing to note is that if you want to use a package, you have to load it with the <span class="func">library</span> function.
```{r loadPackage, eval=FALSE}
library(lazerhawk)
```
Sometimes, you only need the package for one thing and there is no reason to keep it loaded, in which case you can use the following approach.
```{r packageoneoff, eval=FALSE}
packagename::packagefunction(args)
```
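As a concrete illustration of the `package::function` pattern, the sketch below uses two packages that come with every R installation, so nothing needs to be installed. The object names are made up for the example.

```{r packageoneoffDemo}
# stats is attached by default, but the prefix makes the source explicit
middleValue = stats::median(c(1, 5, 9))
middleValue

# tools ships with R but is not attached by default;
# the prefix lets us use one function without calling library(tools)
extension = tools::file_ext('myfile.csv')
extension
```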
If you are new to R, I suggest sticking with the <span class="func">library</span> function.
You'll get the hang of finding, installing, and using packages quite quickly. However, note that the increasing popularity and ease of using R means that packages can vary quite a bit in terms of quality, so you may need to try out a couple packages with seemingly similar functionality to find the best for your situation.
### RStudio
RStudio is an <span class="emph">integrated development environment</span> (**IDE**) specifically geared toward R (though it works for other languages too). At the very least it will make your programming far easier and more efficient. At best you can create publish-ready documents, manage projects, create interactive websites regarding your research, use version control, and much more. I have an overview [here](https://m-clark.github.io/workshops/introRstudio.html).
See [Emacs Speaks Statistics](http://ess.r-project.org/) for an alternative. The point is, the bare R console that comes with the default installation is not an efficient way to use R, and you have at least two very powerful options to make your coding experience easier and more efficient.
### Importing Data
It is as easy to import data into R from other programs and text files as it is in any other statistical program. As a first step, feel free to use the menu-based approach via the Environment tab.
<img src="img/importData.png" style="display:block; margin: 0 auto;">
Note that you have easy access to the code that actually does the job. You should get used to a code-based approach as soon as possible. That way you don't waste time clicking and hunting for files, and can instead run a single line of code. It is easier to select specific options as well. Some key packages to note:
- <span class="pack">base</span>: Base R can already read in tab-delimited, comma-separated, and similar text files
- <span class="pack">foreign</span>: Base R comes with the foreign package, which can read in files from other statistical packages. However, its support for Stata files ended with Stata version 12.
Newer packages
- <span class="pack">readr</span>: enhanced versions of Base R read functionality
- <span class="pack">haven</span>: replacement for foreign that can also read current Stata files
- <span class="pack">readxl</span>: for Excel specifically
There are many other packages that would be of use for special situations and other less common file types and structures.
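To make the code-based approach concrete, here is a small sketch that writes a temporary CSV file and reads it back with base R. The temporary file simply stands in for your own data; the commented lines show roughly equivalent calls from the newer packages, which would need to be installed first, and their file paths are placeholders.

```{r importDemo}
# create a small comma-separated file in a temporary location
tmpFile = tempfile(fileext = '.csv')
write.csv(data.frame(x = 1:3, y = c(2.5, 4.1, 6.2)), tmpFile, row.names = FALSE)

# base R
importedData = read.csv(tmpFile)
str(importedData)

# rough equivalents from the newer packages (placeholder paths)
# importedData = readr::read_csv(tmpFile)
# importedData = haven::read_dta('location/myfile.dta')      # Stata
# importedData = readxl::read_excel('location/myfile.xlsx')  # Excel
```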
## Key things to know about R
### R is a programming language, not a 'stats package'
The first thing to note for those new to R is that R is a language that is oriented toward, but not specific to, dealing with statistics. This means it's highly flexible, but does take some getting used to. If you go in with a spreadsheet style mindset, you'll have difficulty, as well as miss out on the power it has to offer.
### Never ask if R can do what you want. It can.
The only limitation to R is the user's programming sophistication. The better you get at statistical programming the further you can explore your data and take your research. This holds for any statistical endeavor whether using R or not.
### Main components: script, console, graphics device
With R, the *script* is where you write your R code. While you could do everything at the console, this would be difficult at best and unreproducible. The *console* is where the results are produced from running the script. Again you can do one-liner stuff there, such as getting help for a function. The *graphics device* is where visualizations are produced, and in RStudio you have two, one for static plots, and a **viewer** for potentially interactive ones.
### R is easy to use, but difficult to master.
If you only want to use R in an applied fashion, as we will do here, R can be very easy to use. As an example the following code would read in data and run a standard linear regression with the <span class="func">lm</span> (i.e. linear model) function.
```{r Reasy, eval=F}
mydata = read.csv(file='location/myfile.csv') # some file with an 'x' and 'y' variable
regModel = lm(y ~ x, data=mydata) # standard regression
summary(regModel) # summary of results
```
The above code demonstrates that R can be as easy to use as anything else, and in my opinion, it is for any standard analysis. In addition, the formula syntax is utilized by the vast majority of modeling packages (including <span class="pack">lavaan</span>), as is the <span class="func">summary</span> function for printing model results. The nice part is that R can still be *easier* to use with more complex analyses you either won't find elsewhere or will only get with crippled functionality.
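Since the file in the example above is only a placeholder, the following sketch shows the same formula syntax with the `mtcars` data that ships with R. The choice of variables is arbitrary and only meant to show how terms are added to a formula.

```{r formulaDemo}
fitOne   = lm(mpg ~ wt, data = mtcars)        # one predictor
fitTwo   = lm(mpg ~ wt + hp, data = mtcars)   # two predictors
fitInter = lm(mpg ~ wt * hp, data = mtcars)   # both main effects plus their interaction

coef(fitInter)      # extract the estimated coefficients
summary(fitTwo)     # the same summary function works for each model
```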
### Object-oriented
This section may be a little more than a newcomer to R wants or even needs to know to get started with R. However, most of the newbie's errors will be due to not knowing a little about this aspect of R programming, and I've seen some make the same mistakes even after using R for a long time as a result.
R is <span class="emph">object-oriented</span>. For our purposes this means that we create what are called objects and use <span class="emph">functions</span> to manipulate those objects in some fashion. In the above code we created two objects, an object that held the data and an object that held the regression results.
Objects can be ***anything***, a character string, a list of 1000 data sets, the results of an analysis, ***anything***.
#### Assignment
In order to create an object, we must <span class="emph">assign</span> something to it. You'll come across two ways to do this.
```{r assign, eval=F}
myObject = something
myObject <- something
```
These result in the same thing, an object called <span class="objclass">myObject</span> that contains 'something' in it. For practical purposes, whether you use `=` or `<-` is a matter of personal preference. The point is that typical practice in R entails that one assigns something to an object and then uses functions on that object to get something more from it.
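A runnable version of the assign-then-use pattern might look like the following; the object names are arbitrary.

```{r assignDemo}
myNumbers = c(2, 4, 6)       # create an object with =
myMean <- mean(myNumbers)    # create another with <-, using a function on the first
myMean

# inside a function call, = names an argument rather than creating an object:
# mean(x = myNumbers) does not leave an object called x behind
```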
#### Functions
Functions are objects that take input of various kinds and produce some output (simply called a *value*). In the above code we used three functions: <span class="func">read.csv</span>, <span class="func">lm</span>, and <span class="func">summary</span>. Functions (usually) come with named <span class="emph">arguments</span>, that note what types of inputs the function can take. With <span class="func">read.csv</span>, we merely gave it one argument, the file name, but there are actually a couple dozen, each with some default. Type `?read.csv` at the console to see the help file.
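Besides the help file, you can inspect a function's arguments and defaults directly. A brief sketch, using <span class="func">round</span> as an arbitrary example of naming an argument versus relying on position:

```{r argsDemo}
names(formals(read.csv))                   # arguments read.csv declares itself; ... passes the rest on
headerDefault = formals(read.csv)$header   # the default value for the header argument
headerDefault

roundedNamed      = round(3.14159, digits = 2)   # argument given by name
roundedPositional = round(3.14159, 2)            # same argument, matched by position
roundedNamed
```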
#### Classes
All objects have a certain class, which means that some functions will work in certain ways depending on the class. Consider the following code.
```{r classDemo}
x = rnorm(100)
y = x + rnorm(100)
mod = lm(y ~ x)
summary(x)
summary(mod)
```
In both cases we used the `summary` function, but get different results. There are two <span class="emph">methods</span> for summary at work here: <span class="">summary.default</span> and <span class="">summary.lm</span>. Which is used depends on the class of the object given as an argument to the summary function. Most of the time however, there are unique functions that will only work on one class of object.
```{r classDemo2}
class(x)
class(mod)
```
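To see how a method is tied to a class, the sketch below invents a class called `bob` and writes a <span class="func">summary</span> method for it. Both the class and the method are made up purely for illustration.

```{r methodDemo}
# an object with a made-up class
bobObject = structure(list(name = 'Bob Ross', place = 'Dirt'), class = 'bob')

# a method named generic.class is what summary() looks for
summary.bob = function(object, ...) {
  cat(object$name, 'is associated with', object$place, '\n')
  invisible(object)
}

class(bobObject)
summary(bobObject)    # dispatches to summary.bob because of the class
```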
One of the most common classes of objects used is the <span class="objclass">data.frame</span>. Data frames are matrix-like, rectangular objects (technically lists of equal-length vectors) in which the columns are the variables and the rows are the observations of interest. Unlike a matrix, the columns can be of different types: numeric, factors (categorical variables), character strings, and other types of objects.
```{r dataframeDemo}
nams = c('Bob Ross', 'Bob Marley', 'Bob Odenkirk')
places = c('Dirt', 'Dirt', 'New Mexico')
yod = c(1995, 1981, 2028)
bobs = data.frame(nams, places, yod)
bobs
```
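A few common ways to inspect and subset a data frame are sketched below. The data frame is recreated so the chunk stands on its own, and `dirtBobs` is just an illustrative name.

```{r dataframeInspect}
bobs = data.frame(
  nams   = c('Bob Ross', 'Bob Marley', 'Bob Odenkirk'),
  places = c('Dirt', 'Dirt', 'New Mexico'),
  yod    = c(1995, 1981, 2028)
)

is.list(bobs)        # a data frame is a list of columns...
is.matrix(bobs)      # ...not a matrix
sapply(bobs, class)  # each column has its own class

bobs$yod                                # extract one column by name
dirtBobs = bobs[bobs$places == 'Dirt', ]  # keep rows meeting a condition
dirtBobs
```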
### Case sensitive
Many of the errors you will get when you start learning R will result from not typing something correctly. For example, what if we try `summary(X)`?
```{r error, error=TRUE}
summary(X)
```
Errors are merely messages or calls for help from R. It can't do what you ask and needs something else in order to perform the task. In this case, R is telling you it can't find anything called X, which makes sense because the object name is lowercase $x$. This error message is somewhat straightforward, but error messages for all programming languages typically aren't, and depending on what you're trying to do, it may be fairly vague. If you get an error, your first thought as you start out with R should be to check the names of things.
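One way to check names when you hit this kind of error is sketched below, using a made-up object called `myValue`. The <span class="func">tryCatch</span> call is only there to capture the error message as text rather than stopping.

```{r errorCheck}
myValue = 10

exists('myValue')    # the object as we named it
exists('MyValue')    # same letters, different case: a different name entirely

# capture the message R gives when asked for the misspelled name
errorText = tryCatch(
  summary(MyValue),
  error = function(e) conditionMessage(e)
)
errorText

ls()    # lists the objects you have created, handy for spotting typos
```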
### The lavaan package
You'll see more later, but in order to use <span class="pack">lavaan</span> for structural equation modeling, you'll need two things primarily: an object that represents the **data**, and an object that represents the **model**. It will look something like the following.
```{r lavaanExample0, eval=FALSE}
modelCode = "
y ~ x1 + x2 + lv # structural/regression model
lv =~ z1 + z2 + z3 + z4 # measurement model with latent variable lv
"
library(lavaan)
mymodel = sem(modelCode, data=mydata)
```
The character string is our model, and it can actually be kept as a separate file (without the assignment) if desired. Using the <span class="pack">lavaan</span> library, we can then use one of its specific functions like <span class="func">cfa</span>, <span class="func">sem</span>, or <span class="func">growth</span> to use the model code object (<span class="objclass">modelCode</span> above) and the data object (<span class="objclass">mydata</span> above).
The way to define things in lavaan can be summarized as follows.
```{r lavaanCodeTable, echo=FALSE}
formulaType = c('latent variable definition', 'regression', '(residual) (co)variance', 'intercept', 'fixed parameter', 'named parameter')
operator = c('=~', '~', '~~', '~ 1', '_*', 'abc*')
mnemonic = c('is measured by', 'is regressed on', 'is correlated with', 'intercept', 'is fixed at the value of _', 'is named abc')
data.frame(formulaType, operator, mnemonic) %>%
kable(align = 'ccc', format = 'html') %>%
kable_styling(bootstrap_options = c('condensed','basic'), position = 'center', full_width = F)
```
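Several of these operators can be seen together in a small model you can actually run. This sketch assumes <span class="pack">lavaan</span> is installed and uses the `HolzingerSwineford1939` data that comes with it; the model itself is only an illustration, and the label `a` is arbitrary.

```{r lavaanRunnable}
if (requireNamespace('lavaan', quietly = TRUE)) {
  library(lavaan)

  hsModel = "
    visual  =~ x1 + x2 + x3      # latent variable definition
    textual =~ x4 + a*x5 + x6    # 'a' names the loading for x5
    visual  ~~ textual           # covariance between the two latent variables
  "

  hsFit = cfa(hsModel, data = HolzingerSwineford1939)
  summary(hsFit, standardized = TRUE)
}
```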
##### Matching lavaan and R to MPlus
With <span class="pack">lavaan</span> and related packages there is almost complete overlap with MPlus. However, there are some things to note.
```{r lavaanvsmplus, echo=FALSE}
model = c('standard CFA, path, SEM, growth', 'missing values', 'multigroup',
'mixture models', 'Bayesian', 'survey', 'multilevel', 'factor score regression', 'visualization')
packages = c('lavaan', 'lavaan (FIML), semTools (MI +)', 'lavaan, semTools', 'flexmix, poLCA, many others', 'blavaan', 'lavaan, lavaan.survey', 'lavaan (new), mediation', 'lavaan', 'lavaanPlot')
data.frame(model, packages) %>%
kable(align = 'lc', format = 'html') %>%
kable_styling(bootstrap_options = c('condensed','basic'), position = 'center', full_width = F)
```
The one glaring lack <span class="pack">lavaan</span> has regards mixture models, i.e. those with latent categorical variables. However, I'd argue there is far more functionality in R generally, with results that are far easier to work with than you would have with MPlus. Note that <span class="pack">lavaan</span> handles *observed* categorical variables just fine, though oddly with only a probit link function (no logit).
Up until version 0.6-1 <span class="pack">lavaan</span> had no support for multilevel models. However, it now can do two-level SEM, and the <span class="pack">mediation</span> package has long been able to do single mediator mixed/multilevel models[^multilevelsem]. In addition, <span class="pack">lavaan</span> has added some survey support, but you'll have plenty with <span class="pack">lavaan.survey</span>.
Latent variable interactions are not conducted automatically in <span class="pack">lavaan</span>. One way to do it is to create a product term for all pairs of items pertaining to two factors, and then specify a latent variable where the manifest variables are the product terms. There is an example provided [here](https://rpubs.com/njaalf/workshop2014), and while the syntax is inefficient, it is conceptually clear.
Unless you have a complicated mediation structure (rarely sufficiently justified by SEM practitioners), there is no reason that I can see to do growth curve models within SEM. There is far, far more functionality in R for mixed models (e.g. <span class="pack">lme4</span>, <span class="pack">mgcv</span>, <span class="pack">brms</span>), which are the far more common and flexible approach to modeling longitudinal data (and otherwise equivalent). Such models handle nonlinear trends, complicated residual structure, alternative distributions, spatial and phylogenetic correlational structures, and more. MPlus simply doesn't compete in this realm, and has terribly tedious syntax for either growth curve or multilevel models.
On top of everything else, the [lavaan ecosystem](https://m-clark.github.io/R-models/#lavaan-extensions), i.e. packages that work with and enhance its models, continues to grow. In addition, with R there are utilities for SEM you won't find with MPlus, such as model search, regularization, and more. And finally, with the <span class="pack">MplusAutomation</span> package it's even easier to use MPlus with R than it is to use MPlus itself!
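For a sense of how compact the mixed model approach is, here is a minimal sketch of a linear growth model with random intercepts and slopes. It assumes <span class="pack">lme4</span> is installed and uses the `sleepstudy` data that comes with it, which is not part of this document's own examples.

```{r mixedGrowthDemo}
if (requireNamespace('lme4', quietly = TRUE)) {
  # Reaction time over Days, with each Subject getting its own intercept and slope
  growthMod = lme4::lmer(
    Reaction ~ Days + (1 + Days | Subject),
    data = lme4::sleepstudy
  )
  summary(growthMod)
}
```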
### Getting help
Use `?` to get the help file for a specific function, or `??` to do a search, possibly on a quoted phrase.
Examples:
```{r gethelp, eval=FALSE}
?lm
??regression
??'nonlinear regression'
```
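A few other base R helpers can be useful alongside `?` and `??`. The sketch below uses only functions that come with R; `matchingNames` and `lmArguments` are illustrative names.

```{r gethelpMore, eval=FALSE}
help('lm')                  # the same as ?lm
help(package = 'stats')     # index of everything in a package

lmArguments = args(lm)      # quick reminder of a function's arguments
lmArguments

matchingNames = apropos('^read\\.csv')   # find objects whose names match a pattern
matchingNames

example(mean)               # run the examples from a help file
```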
## Moving forward
Hopefully you'll get the hang of things as we run the code in later chapters. You'll essentially only be asked to tweak code that's already provided. This document (and associated workshop) is not the place to learn R. However, here are some exercises to make sure we start to get comfortable with it.
### Exercises
Interactive version [here](exercises/exercise_introR.nb.html)
1. Create an object that consists of the numbers one through five, using `c(1,2,3,4,5)` or `1:5`.
2. Create a different object, that is the same as that object, but plus 1 (i.e. the numbers two through six).
3. Without creating an object, use <span class="func">cbind</span> and <span class="func">rbind</span>, feeding the objects you just created as arguments. For example `cbind(obj1, obj2)`.
4. Create a new object using <span class="objclass">data.frame</span> just as you did with <span class="func">rbind</span> or <span class="func">cbind</span> in #3.
5. Inspect the class of the object, and use the <span class="func">summary</span> function on it.
### Summary
There's a lot to learn with R, and the more time you spend with R the easier your research process will likely go, and the more you will learn about statistical analysis.
[^multilevelsem]: In practice, these models are problematic even for MPlus.