-
Notifications
You must be signed in to change notification settings - Fork 15
/
Copy pathchapter4.Rmd
440 lines (296 loc) · 14.5 KB
/
chapter4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
---
courseTitle : Introduction to R v2
chapterTitle : Factors
description : Data falls into a limited number of categories very often. For example, humans are either male or female. In R, categorical data is stored in factors. Given the importance of factors in data analysis, start learning how to create, subset and compare factors now!
framework : datamind
mode: selfcontained
---
## What's a factor and why would you use it?
In this chapter you dive into the wonderful world of **factors**.
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a **limited number of categories**. A continuous variable on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it's dealing with a continuous or a categorical variable. This, since the statistical models you will develop in the future treat both in a different way. (You will see later why this is the case).
A good example of a categorical variable, is the variable 'Gender'. As you hopefully know by now, a human individual can either be "Male" or "Female". So "Male" and "Female" are here the (only two) values of the categorical variable "Gender", and every observation can be assigned to either the value "Male" of "Female".
*** =instructions
1. Assign to variable `theory` the value `"R uses factors for categorical variables!"`
*** =hint
Just click the Run button and look at the console.
*** =sample_code
```{r eval=FALSE}
# Assign to the variable theory what this chapter is about!
```
*** =solution
```{r eval=FALSE}
# Assign to the variable theory what this chapter is about!
theory <- "R uses factors for categorical variables!"
```
*** =sct
```{r eval=FALSE}
DM.result <- closed_test("theory","c('R uses factors for categorical variables!')")
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## What's a factor and why would you use it? (2)
To create factors in R you make use of the function `factor()`. First thing to do is create a vector that contains all the observations belonging to a limited number of categories. For example, gender vector contains the sex of 5 different individuals: `gender.vector <- c("Male","Female","Female","Male","Male")`.
It is clear, that there are 2 categories, or in R-terms **'factor levels'**, here: "Male" and "Female".
The function `factor()` will encode the vector as a factor: `factor.gender.vector <- factor(gender.vector)`.
*** =instructions
1. Assign to `factor.gender.vector` the character vector `gender.vector` converted to a **factor**. Look at the console and note that R prints out the factor levels below the actual values.
*** =hint
Just click the Run button and look at the console.
*** =sample_code
```{r eval=FALSE}
gender.vector <- c("Male","Female","Female","Male","Male")
factor.gender.vector <-
factor.gender.vector
```
*** =solution
```{r eval=FALSE}
gender.vector <- c("Male","Female","Female","Male","Male")
factor.gender.vector <- factor(gender.vector)
factor.gender.vector
```
*** =sct
```{r eval=FALSE}
DM.result <- closed_test( "factor.gender.vector", "factor(gender.vector)" )
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## What's a factor and why would you use it? (3)
There are two types of categorical variabels: a **nominal categorical variable** and an **ordinal categorical variable**.
A nominal variable is a categorical variable without an implied order. This means it is impossible to say that 'one is worth more than the other'. Think for example of the categorical variable `hair.color.vector`, with the categories `"Ginger", "Blonde", "Brunette"` and `"Grey"`. Here, it is impossible to say one stands above or below the other. (Note: some of you might disagree ;-) ).
In contrast, ordinal variables do have a natural ordering, consider for example the categorical variable `temperature.vector` with the categories: `"Low"`, `"Medium"` and `"High"`. Here it is easy to see that `"Medium"` stand above `"Low"`, and `"High"` stands above `"Medium"`.
*** =instructions
1. Click Submit Answer to check how R contructs and prints nominal and ordinal variables. Don't worry if you don't understand all the code just yet, we'll get to that.
*** =hint
Just click the Submit Answer button and look at the console. Notice how R indicates the ordering of the factor levels for ordinal categorical variables.
*** =sample_code
```{r eval=FALSE}
hair.color.vector <- c("Blonde","Blonde","Brunette","Ginger","Grey","Brunette")
temperature.vector <- c("High","Low","High","Low","Medium")
factor.hair.color.vector <- factor(hair.color.vector)
factor.temperature.vector <- factor( temperature.vector,order=TRUE,levels=c("Low","Medium","High") )
factor.temperature.vector
factor.hair.color.vector
```
*** =solution
```{r eval=FALSE}
# Just click the Submit Answer button and look at the console.
```
*** =sct
```{r eval=FALSE}
DM.result <- TRUE
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## Factor levels
When you first get a dataset, you often see that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function `levels()`:
`levels(factor.vector) <- c("name1","name2",...)`
A good illustration is the raw data provided to you by a survey. A standard question for every questionnaire is the gender of the respondent. You remember from the previous question this is a factor and when performing the questionnaire on the streets its levels are often coded as `"M"` and `"F"`.
`survey.vector<- c("M","F","F","M","M")`
Next, when you want to start your data analysis your main concern is to keep a nice overview of all the variables and what they mean. At that point, you will often want to change the factor levels to `"Male"` and `"Female"` instead of `"M"` and `"F"` to make your life easier.
*** =instructions
1. Convert the character vector `survey.vector` into a factor vector. Assign to `factor.survey.vector`
2. Change the factor levels of `factor.survey.vector` to `"Male"` and `"Female"`.
*** =hint
Mind the order in which you have to type in the factor levels. Hint: look at the order in which the levels are printed).
*** =sample_code
```{r eval=FALSE}
survey.vector <- c("M","F","F","M","M")
factor.survey.vector <- factor( survey.vector )
factor.survey.vector #Print to console
levels(factor.survey.vector) <- #Your code here
factor.survey.vector
```
*** =solution
```{r eval=FALSE}
survey.vector <- c("M","F","F","M","M")
factor.survey.vector <- factor( survey.vector )
factor.survey.vector #Print to console
levels(factor.survey.vector) <- c("Female","Male")
factor.survey.vector
```
*** =sct
```{r eval=FALSE}
names <- c("survey.vector","factor.survey.vector","levels(factor.survey.vector)")
values <- c("survey.vector","factor.survey.vector","c('Female','Male')")
DM.result <- closed_test(names,values,check.existence=c(T,T,F))
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## Summarizing a factor
After finishing this course, one of your favourite functions in R will be `summary()`. This will give you a quick overview of `some.variable`:
`summary(some.variable)`
Going back to our survey, you would like to know how many `"Male"` responses you have in your study, and how many `"Female"`. The `summary()` function gives the answer to this question.
*** =instructions
1. Ask a `summary()` of the `survey.vector` and `factor.survey.vector`. Interpret the result for both vectors, are they both equally useful in this case?
The fact that you identified `"Male"` and `"Female"` as factor levels in `factor.survey.vector`, enables R to show the number of elements for each category.
*** =hint
Type this in the console: `summary(survey.vector)` `summary(factor.survey.vector)`
*** =sample_code
```{r eval=FALSE}
survey.vector <- c("M","F","F","M","M")
factor.survey.vector <- factor( survey.vector )
levels(factor.survey.vector) <- c("Female","Male")
factor.survey.vector
# Type your code here for survey.vector
# Type your code here for factor.survey.vector
```
*** =solution
```{r eval=FALSE}
survey.vector <- c("M","F","F","M","M")
factor.survey.vector <- factor( survey.vector )
levels(factor.survey.vector) <- c("Female","Male")
factor.survey.vector
summary(survey.vector)
summary(factor.survey.vector)
```
*** =sct
```{r eval=FALSE}
DM.result <- code_test(c("summary(survey.vector)","summary(factor.survey.vector)"))
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## Battle of the sexes
> "All animals are equal, but some animals are more equal than others"
>--George Orwell
In `factor.survey.vector` we have a factor with two levels: male & female. But how does R value these relatively to each other? In other words, who sees R to be worth more, males or females?
*** =instructions
1. Read the code in the editor and click Submit Answer to see whether males are worth more than females.
*** =hint
Just click Run.
*** =sample_code
```{r eval=FALSE}
survey.vector <- c("M","F","F","M","M")
factor.survey.vector <- factor( survey.vector )
levels(factor.survey.vector) <- c("Female","Male")
# Battle of the sexes:
factor.survey.vector[1] #Male
factor.survey.vector[2] #Female
factor.survey.vector[1] > factor.survey.vector[2] #Male larger than female?
```
*** =solution
```{r eval=FALSE}
survey.vector <- c("M","F","F","M","M")
factor.survey.vector <- factor( survey.vector )
levels(factor.survey.vector) <- c("Female","Male")
# Battle of the sexes:
factor.survey.vector[1] #Male
factor.survey.vector[2] #Female
factor.survey.vector[1] > factor.survey.vector[2] #Male larger than female?
```
*** =sct
```{r eval=FALSE}
DM.result <- list(TRUE,"Phew, it seems that R is gender-neutral. Or maybe just wants to stay out of these discussions ;-).")
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## Ordered factors
Since `"Male"` and `"Female"` are unordered (or nominal) factor levels, R returns a error message. As seen before, for such factors R attaches an equal value to the levels.
This is not always the case! Sometimes you will deal with factors that do have a natural ordering between its categories. If this is the case, we have to make sure we pass this information to R...
Let's say your leading a research team of 5 data analysts and want to evaluate their performance. To do this, you track their speed, evaluate each analyst as `"Slow"`, `"Fast"` or `"Ultra-Fast"`, and save the results in `speed.vector`.
*** =instructions
1. As a first step, assign `speed.vector` knowing that:
Analyst 1: Fast,
Analyst 2: Slow,
Analyst 3: Slow,
Analyst 4: Fast and
Analyst 5: Ultra-fast,
(Just as characters for now)
*** =hint
Assign to `speed.vector` : `c("Fast","Slow",?`
*** =sample_code
```{r eval=FALSE}
# Create speed.vector
speed.vector #<-
```
*** =solution
```{r eval=FALSE}
# Create speed.vector
speed.vector <- c("Fast","Slow","Slow","Fast","Ultra-fast")
```
*** =sct
```{r eval=FALSE}
names <- "speed.vector"
values <- 'c("Fast","Slow","Slow","Fast","Ultra-fast")'
DM.result <- closed_test(names,values)
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## Ordered factors (2)
`speed.vector` should be converted to an ordinal factor since its categories have a natural ordening. By default, the function `factor()` transforms `speed.vector` into an unordered factor. To create an ordered factor, you have to add two additional arguments:
`factor(some.vector,ordered=TRUE,levels=c("Level 1","Level 2"...))`
By setting the argument `ordered=TRUE` in the function `factor()`, you indicate the factor is ordered. With the argument `levels` you give the values of the factor in the correct order, e.g. `levels = c("Low","Medium","High")`.
*** =instructions
1. Rewrite the code for `speed.factor.vector`, this time taking into account that there is a specific order for the factor levels.
*** =hint
Use the function `factor()` to create `speed.factor.vector` based on `speed.character.vector`. The argument order should be set to TRUE since there is a natural ordering. The factor levels are in this case: `c("Slow","Fast","Ultra-fast")`.
*** =sample_code
```{r eval=FALSE}
speed.vector <- c("Fast","Slow","Slow","Fast","Ultra-fast")
factor.speed.vector <- #Your code here!
factor.speed.vector
summary(factor.speed.vector) # R prints automagically in the right order
```
*** =solution
```{r eval=FALSE}
speed.vector <- c("Fast","Slow","Slow","Fast","Ultra-fast")
factor.speed.vector <- factor(speed.vector, ordered=TRUE,levels=c("Slow","Fast","Ultra-fast") )
factor.speed.vector
summary(factor.speed.vector) # R prints automagically in the right order
```
*** =sct
```{r eval=FALSE}
names <- "factor.speed.vector"
values <- 'factor(speed.vector, ordered=TRUE,levels=c("Slow","Fast","Ultra-fast") )'
DM.result <- closed_test(names,values)
```
*** =pre_exercise_code
```{r eval=FALSE}
```
---
## Comparing ordered factors
Having a bad day at work, 'data analyst number two' enters your office and starts complaining that 'data analyst number five' is slowing down the entire project. Since you know 'data analyst number two' has the reputation of being a smarty-pants, you first decide to check if his statement is true.
The fact that `speed.factor.vector` is now ordered, enables us to compare different elements (i.e. data analysts in this case). You can simply do this by using the well-known operators.
*** =instructions
1. Check whether data analyst 2 is faster than data analyst 5 and assign the result to `compare.them`. Remember the `>` operator allows to check whether one element is larger than the other.
*** =hint
You can compare the elements of the vector with the `>` operator. Relevant for this case: `vector[1] > vector[2]` checks whether the first elements of vector is larger than the second element.
*** =sample_code
```{r eval=FALSE}
speed.vector <- c("Fast","Slow","Slow","Fast","Ultra-fast")
speed.factor.vector <- factor(speed.vector, ordered=TRUE,levels=c("Slow","Fast","Ultra-fast") )
compare.them <- # Your code here
# Is data analyst 2 faster than data analyst 5?
compare.them
```
*** =solution
```{r eval=FALSE}
speed.vector <- c("Fast","Slow","Slow","Fast","Ultra-fast")
speed.factor.vector <- factor(speed.vector, ordered=TRUE,levels=c("Slow","Fast","Ultra-fast") )
compare.them <- speed.factor.vector[2] > speed.factor.vector[5]
# Is data analyst 2 faster than data analyst 5?
compare.them
```
*** =sct
```{r eval=FALSE}
names <- "compare.them"
values <- "speed.factor.vector[2] > speed.factor.vector[5]"
DM.result <- closed_test(names,values)
```
*** =pre_exercise_code
```{r eval=FALSE}
```