---
title: "R Notebook"
output: html_notebook
---
# 3.7 Vectors
In R, the most basic objects available to store data are vectors. In a data frame, each column is a vector.
## 3.7.1 Creating vectors
We can create vectors using the function `c`, which stands for concatenate.
```{r}
codes<-c(380, 124, 818)
codes
country<-c("italy", "canada", "egypt")
#single quote ' is okay, not back quote `
```
## 3.7.2 Names
Sometimes it is useful to name the entries of a vector. For example, when defining a vector of country codes, we can use the names to connect the two:
```{r}
codes<-c(italy=380, canada=124, egypt=818)
# it's okay to use quotes
codes
```
```{r}
class(codes)
names(codes)
```
```{r}
codes<-c(380, 124, 818)
country<-c("italy", "canada", "egypt")
names(codes)<-country
codes
```
## 3.7.3 Sequences
Another useful function for creating vectors generates sequences:
```{r}
seq(1,10)
```
The default is to go up in increments of 1, but a third argument lets us tell it how much to jump by:
```{r}
seq(1,10,2)
```
If we want consecutive integers, we can use the following shorthand:
```{r}
1:10
```
With `seq(1, 10)` and `1:10`, R produces integers, not numerics, because these sequences are typically used to index something.
```{r}
class(1:10)
```
However, if we create a sequence including non-integers, the class changes:
```{r}
class(seq(1, 10, 0.5))
```
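As a quick check of how the arguments affect the class, the sketch below compares three sequences. The object names (`seq_default`, `seq_by_double`, `seq_by_int`) are made up for this illustration and are not used elsewhere in the notebook.

```{r}
# Hypothetical names, used only for this illustration
seq_default   <- seq(1, 10)        # no `by` argument: same result as 1:10
seq_by_double <- seq(1, 10, 2)     # `by` given as an ordinary (double) number
seq_by_int    <- seq(1L, 10L, 2L)  # every argument is an integer (note the L suffix)

class(seq_default)    # "integer"
class(seq_by_double)  # "numeric"
class(seq_by_int)     # "integer"
```

So a sequence built with a `by` argument is only an integer vector when the arguments themselves are integers.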
## 3.7.4 Subsetting
We use square brackets to access specific elements of a vector. For the vector `codes` we defined above, we can access the second element, or several elements at once, using:
```{r}
codes[2]
codes[c(1,3)]
codes[1:2]
```
If the elements have names, we can also access the entries using these names.
```{r}
codes["canada"]
codes[c("egypt","italy")]
```
# 3.8 Coercion
In general, coercion is an attempt by R to be flexible with data types. When an entry does not match the expected, some of the prebuilt R functions try to guess what was meant before throwing an error. This can also lead to confusion. Failing to understand coercion can drive programmers crazy when attempting to code in R since it behaves quite differently from most other languages in this regard.
We said that vectors must be all of the same type. So if we try to combine, say, numbers and characters, you might expect an error. But we don’t get one, not even a warning! What happened? Look at `x` and its class:
```{r}
x<-c(1,"canada",3)
x
class(x)
```
R coerced the data into characters. It guessed that because you put a character string in the vector, you meant the 1 and 3 to actually be character strings `"1"` and `"3"`. The fact that not even a warning is issued is an example of how coercion can cause many unnoticed errors in R.
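One possible way to catch this kind of silent coercion is to check the type of a vector right after building it. The following is only a sketch; the object names `mixed` and `all_num` are made up for the example.

```{r}
# Hypothetical example objects
mixed   <- c(1, "canada", 3)   # silently coerced to character
all_num <- c(1, 2, 3)

is.numeric(mixed)     # FALSE
is.character(mixed)   # TRUE
is.numeric(all_num)   # TRUE

# stopifnot() turns a silent surprise into an explicit error if the check fails
stopifnot(is.numeric(all_num))
```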
R also offers functions to change from one type to another. For example, you can turn numbers into characters with:
```{r}
x<-1:5
y<-as.character(x)
y
```
You can turn it back with `as.numeric`:
```{r}
as.numeric(y)
```
This function is actually quite useful since datasets that include numbers as character strings are common.
## 3.8.1 Not available (NA)
When a function tries to coerce one type to another and encounters an impossible case, it usually gives us a warning and turns the entry into a special value called an `NA` for “not available”. For example:
```{r}
x<-c("1","b","3")
as.numeric(x)
```
R does not have any guesses for what number you want when you type `"b"`, so it does not try.
As a data scientist you will encounter the `NA`s often as they are generally used for missing data, a common problem in real-world datasets.
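To find out which entries failed to convert, one option is to combine `as.numeric` with `is.na`. This is a minimal sketch with made-up names (`raw_values`, `converted`); the warning is silenced on purpose because the `NA`s are inspected right afterwards.

```{r}
raw_values <- c("1", "b", "3")   # hypothetical input
converted  <- suppressWarnings(as.numeric(raw_values))

is.na(converted)               # FALSE TRUE FALSE
raw_values[is.na(converted)]   # "b", the entry that could not be converted
```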
# 3.9 Exercises
1. Use the function `c` to create a vector with the average high temperatures in January for Beijing, Lagos, Paris, Rio de Janeiro, San Juan and Toronto, which are 35, 88, 42, 84, 81, and 30 degrees Fahrenheit. Call the object `temp`.
```{r}
temp<-c(35,88,42,84,81,30)
```
2. Now create a vector with the city names and call the object `city`.
```{r}
city<-c("Beijing", "Lagos", "Paris","Rio de Janeiro","San Juan","Toronto")
```
3. Use the `names` function and the objects defined in the previous exercises to associate the temperature data with its corresponding city.
```{r}
names(temp)<-city
temp
```
4. Use the `[` and `:` operators to access the temperature of the first three cities on the list.
```{r}
temp[1:3]
```
5. Use the `[` operator to access the temperature of Paris and San Juan.
```{r}
temp[c("Paris","San Juan")]
```
6. Use the `:` operator to create a sequence of numbers 12, 13, 14, ..., 73.
```{r}
12:73
```
7. Create a vector containing all the positive odd numbers smaller than 100.
```{r}
odd<-seq(1,100,2)
odd
```
8. Create a vector of numbers that starts at 6, does not pass 55, and adds numbers in increments of 4/7: 6, 6 + 4/7, 6 + 8/7, and so on. How many numbers does the list have?
```{r}
even_odd<-seq(6,55,4/7)
length(even_odd)
```
9. What is the class of the following object `a <- seq(1, 10, 0.5)`?
```{r}
a<-seq(1,10,0.5)
class(a)
```
10. What is the class of the following object `a <- seq(1, 10)`?
```{r}
a<-seq(1,10)
class(a)
```
11. The class of `a <- 1` is numeric, not integer. R defaults to numeric; to force an integer, you need to add the letter `L`. Confirm that the class of `1L` is integer.
```{r}
class(a<-1L)
```
12. Define the following vector and coerce it to integers:
```{r}
x<-c("1","3","5")
as.integer(x)
```
# 3.10 Sorting
Now that we have mastered some basic R knowledge, let’s try to gain some insights into the safety of different states in the context of gun murders.
## 3.10.1 `sort`
```{r}
library(dslabs)
data(murders)
sort(murders$total)
```
However, this does not give us information about which states have which murder totals. For example, we don’t know which state had 1257.
## 3.10.2 `order`
The function `order` is closer to what we want. It takes a vector as input and returns the vector of indexes that sorts the input vector. This may sound confusing so let’s look at a simple example. We can create a vector and sort it:
```{r}
x<-c(31,4,15,92,65)
sort(x)
```
Rather than sorting the input vector, the function `order` returns the index that sorts the input vector:
```{r}
index<-order(x)
x[index]
```
This is the same output as that returned by `sort(x)`. If we look at this index, we see why it works:
```{r}
x
order(x)
```
The second entry of `x` is the smallest, so `order(x)` starts with `2`. The next smallest is the third entry, so the second entry is `3` and so on.
How does this help us order the states by murders? First, remember that the entries of vectors you access with `$` follow the same order as the rows in the table. For example, these two vectors containing state names and abbreviations respectively are matched by their order:
```{r}
murders$state[1:10]
murders$abb[1:10]
```
This means we can order the state names by their total murders. We first obtain the index that orders the vectors according to murder totals and then index the state names vector:
```{r}
ind<-order(murders$total)
murders$abb[ind]
```
According to the above, California had the most murders.
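The same idea works on any pair of matched vectors. Here is a small sketch with made-up data (`toy_name` and `toy_total` are not part of `murders`). It also uses the `decreasing` argument of `order` to put the largest value first.

```{r}
# Hypothetical matched vectors: the i-th name goes with the i-th total
toy_name  <- c("apple", "banana", "cherry", "date")
toy_total <- c(12, 3, 40, 7)

toy_ind <- order(toy_total, decreasing = TRUE)
toy_name[toy_ind]    # "cherry" "apple" "date" "banana"
toy_total[toy_ind]   # 40 12 7 3
```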
## 3.10.3 `max` and `which.max`
If we are only interested in the entry with the largest value, we can use `max` for the value:
```{r}
max(murders$total)
```
and `which.max` for the index of the largest value:
```{r}
i_max<-which.max(murders$total)
murders$state[i_max]
```
For the minimum, we can use `min` and `which.min` in the same way.
Does this mean California is the most dangerous state? In an upcoming section, we argue that we should be considering rates instead of totals. Before doing that, we introduce one last order-related function: `rank`.
## 3.10.4 `rank`
Although not as frequently used as `order` and `sort`, the function `rank` is also related to order and can be useful. For any given vector it returns a vector with the rank of the first entry, second entry, etc., of the input vector. Here is a simple example:
```{r}
x <- c(31, 4, 15, 92, 65)
rank(x)
```
To summarize, let’s look at the results of the three functions we have introduced:

| original | sort | order | rank |
|---------:|-----:|------:|-----:|
|       31 |    4 |     2 |    3 |
|        4 |   15 |     3 |    1 |
|       15 |   31 |     1 |    2 |
|       92 |   65 |     5 |    5 |
|       65 |   92 |     4 |    4 |

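The summary can also be rebuilt directly in R. The sketch below stores the same values under a new, made-up name `v` so that it does not overwrite `x`.

```{r}
v <- c(31, 4, 15, 92, 65)   # same values as above, under a hypothetical name

# One column per function, so the results can be compared row by row
data.frame(original = v, sort = sort(v), order = order(v), rank = rank(v))
```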
## 3.10.5 Beware of recycling
Another common source of unnoticed errors in R is the use of recycling. We saw that vectors are added elementwise. So if the vectors don’t match in length, it is natural to assume that we should get an error. But we don’t. Notice what happens:
```{r}
x <- c(1,2,3)
y <- c(10, 20, 30, 40, 50, 60, 70)
x+y
```
We do get a warning but no error. For the output, R has recycled the numbers in `x`. Notice the last digit of numbers in the output.
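The warning only appears when the longer length is not a multiple of the shorter one. When it is a multiple, R recycles without saying anything. The sketch below shows this with made-up vectors, and adds a hypothetical helper, `add_same_length`, that refuses to recycle. It is one possible guard, not a standard R function.

```{r}
short <- c(1, 2, 3)
long6 <- c(10, 20, 30, 40, 50, 60)   # length 6 is a multiple of 3

short + long6   # 11 22 33 41 52 63, recycled with no warning at all

# Hypothetical helper: stop with an error unless the lengths match
add_same_length <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}

add_same_length(short, c(10, 20, 30))   # 11 22 33
```

Calling `add_same_length(short, long6)` would stop with an error instead of recycling.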
# 3.11 Exercises
```{r}
library(dslabs)
data("murders")
```
1. Use the `$` operator to access the population size data and store it as the object `pop`. Then use the `sort` function to redefine `pop` so that it is sorted. Finally, use the `[` operator to report the smallest population size.
```{r}
pop<-murders$population
s_pop<-sort(pop)
s_pop[1]
```
2. Now instead of the smallest population size, find the index of the entry with the smallest population size. Hint: use `order` instead of `sort`.
```{r}
o_pop<-order(pop)
o_pop[1]
```
3. We can actually perform the same operation as in the previous exercise using the function `which.min`. Write one line of code that does this.
```{r}
which.min(murders$population)
```
4. Now we know how small the smallest state is and we know which row represents it. Which state is it? Define a variable `states` to be the state names from the `murders` data frame. Report the name of the state with the smallest population.
```{r}
states<-murders$state
states[which.min(murders$population)]
```
5. You can create a data frame using the `data.frame` function.
Use the `rank` function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called `ranks`, then create a data frame with the state name and its rank. Call the data frame `my_df`.
```{r}
ranks<-rank(murders$population)
my_df<-data.frame(name=states, pop_rank=ranks)
```
6. Repeat the previous exercise, but this time order `my_df` so that the states are ordered from least populous to most populous. Hint: create an object `ind` that stores the indexes needed to order the population values. Then use the bracket operator `[` to re-order each column in the data frame.
```{r}
ind<-order(murders$population)
# re-order both columns with the same index
my_df<-data.frame(name=states[ind], pop_rank=ranks[ind])
my_df
```
7. The `na_example` vector represents a series of counts. You can quickly examine the object using:
```{r}
data("na_example")
str(na_example)
```
However, when we compute the average with the function `mean`, we obtain an `NA`:
```{r}
mean(na_example)
```
The `is.na` function returns a logical vector that tells us which entries are `NA`. Assign this logical vector to an object called `ind` and determine how many `NA`s `na_example` has.
```{r}
ind<-is.na(na_example)
sum(ind)
```
8. Now compute the average again, but only for the entries that are not `NA`. Hint: remember the `!` operator.
```{r}
mean(na_example[!ind])
```
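Base R's `mean` also accepts an `na.rm` argument that drops the `NA`s before averaging, which gives the same result as indexing with the negated `is.na` vector. A small sketch with a made-up vector `counts`:

```{r}
counts <- c(2, NA, 4, 6, NA)   # hypothetical data with two missing entries

mean(counts)                   # NA
mean(counts[!is.na(counts)])   # 4
mean(counts, na.rm = TRUE)     # 4
sum(is.na(counts))             # 2
```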
# 3.12 Vector arithmetics
California had the most murders, but does this mean it is the most dangerous state? What if it just has many more people than any other state? We can quickly confirm that California indeed has the largest population:
```{r}
library(dslabs)
data("murders")
murders$state[which.max(murders$population)]
```
with over 37 million inhabitants. It is therefore unfair to compare the totals if we are interested in learning how safe the state is. What we really should be computing is the murders per capita. The reports we describe in the motivating section used murders per 100,000 as the unit. To compute this quantity, the powerful vector arithmetic capabilities of R come in handy.
## 3.12.1 Rescaling a vector
In R, arithmetic operations on vectors occur element-wise. For a quick example, suppose we have height in inches:
```{r}
inches<-c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
```
and want to convert to centimeters. Notice what happens when we multiply `inches` by 2.54:
```{r}
inches * 2.54
```
In the line above, we multiplied each element by 2.54. Similarly, if for each entry we want to compute how many inches taller or shorter it is than 69 inches, the average height for males, we can subtract 69 from every entry like this:
```{r}
inches-69
```
## 3.12.2 Two vectors
If we have two vectors of the same length, and we sum them in R, they will be added entry by entry.
The same holds for other mathematical operations, such as `-`, `*` and `/`.
This implies that to compute the murder rates we can simply type:
```{r}
murder_rate<-murders$total/murders$population*100000
```
Once we do this, we notice that California is no longer near the top of the list. In fact, we can use what we have learned to order the states by murder rate:
```{r}
murders$state[order(murder_rate)]
```
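To see the entry-by-entry arithmetic on numbers that are easy to check by hand, here is a sketch with made-up totals and populations (`toy_total` and `toy_pop` are not taken from `murders`):

```{r}
toy_total <- c(10, 50, 200)      # hypothetical counts
toy_pop   <- c(1e6, 2e6, 4e6)    # hypothetical population sizes

# Each total is divided by the population in the same position
toy_rate <- toy_total / toy_pop * 100000
toy_rate   # 1.0 2.5 5.0
```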
# 3.13 Exercises
1. Previously we created this data frame:
```{r}
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
```
Remake the data frame using the code above, but add a line that converts the temperature from Fahrenheit to Celsius.
```{r}
temp <- c(35, 88, 42, 84, 81, 30)
temp<-5/9*(temp-32)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
city_temps$temperature
```
2. What is the following sum: $1 + 1/2^2 + 1/3^2 + \dots + 1/100^2$?
```{r}
n<-1:100
sum(1/n^2)
pi^2/6
```
3. Compute the per 100,000 murder rate for each state and store it in the object `murder_rate`. Then compute the average murder rate for the US using the function `mean`. What is the average?
```{r}
murder_rate<-murders$total/murders$population*100000
mean(murder_rate)
```
# 3.14 Indexing
R provides a powerful and convenient way of indexing vectors. We can, for example, subset a vector based on properties of another vector.
## 3.14.1 Subsetting with logicals
We have now calculated the murder rate.
Imagine you are moving from Italy where, according to an ABC news report, the murder rate is only 0.71 per 100,000. You would prefer to move to a state with a similar murder rate. Another powerful feature of R is that we can use logicals to index vectors. If we compare a vector to a single number, it actually performs the test for each entry. The following is an example related to the question above:
```{r}
ind<-murder_rate<0.71
ind
ind<-murder_rate<=0.71
ind
```
Note that we get back a logical vector with `TRUE` for each entry smaller than or equal to 0.71. To see which states these are, we can leverage the fact that vectors can be indexed with logicals.
```{r}
murders$state[ind]
```
To count how many entries are `TRUE`, we can use the function `sum`, which returns the sum of the entries of a vector. Logical vectors get coerced to numeric, with `TRUE` coded as 1 and `FALSE` as 0. Thus we can count the states using:
```{r}
sum(ind)
```
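The same coercion means that `mean` of a logical vector gives the proportion of `TRUE` entries. A minimal sketch with made-up rates (`toy_rates` and `is_low` are names chosen for this example):

```{r}
toy_rates <- c(0.5, 3.2, 0.7, 5.1)   # hypothetical rates
is_low <- toy_rates <= 0.71

sum(is_low)    # 2 entries are at or below 0.71
mean(is_low)   # 0.5, the proportion of such entries
```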
## 3.14.2 Logical operators
Suppose we like the mountains and we want to move to a safe state in the western region of the country. We want the murder rate to be at most 1. In this case, we want two different things to be true. Here we can use the logical operator and, which in R is represented with `&`. This operation results in `TRUE` only when both logicals are `TRUE`. To see this, consider this example:
```{r}
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE
```
For our example, we can form two logicals:
```{r}
west<-murders$region=="West"
safe<-murder_rate<=1
```
and we can use the `&` to get a vector of logicals that tells us which states satisfy both conditions:
```{r}
ind<-safe&west
murders$state[ind]
```
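Because `&` works entry by entry on vectors, it can be compared side by side with the related operators `|` (or) and `!` (not). The vectors `p` and `q` below are made up to cover all four combinations.

```{r}
p <- c(TRUE, TRUE, FALSE, FALSE)   # hypothetical logical vectors
q <- c(TRUE, FALSE, TRUE, FALSE)

p & q   # TRUE FALSE FALSE FALSE: both must be TRUE
p | q   # TRUE TRUE TRUE FALSE: at least one must be TRUE
!p      # FALSE FALSE TRUE TRUE: flips each entry
```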
## 3.14.3 `which`
Suppose we want to look up California’s murder rate. For this type of operation, it is convenient to convert vectors of logicals into indexes instead of keeping long vectors of logicals. The function `which` tells us which entries of a logical vector are TRUE. So we can type:
```{r}
ind<-which(murders$state=="California")
murder_rate[ind]
```
## 3.14.4 `match`
If instead of just one state we want to find out the murder rates for several states, say New York, Florida, and Texas, we can use the function `match`. This function tells us which indexes of a second vector match each of the entries of a first vector:
```{r}
ind<-match(c("New York","Florida","Texas"),murders$state)
ind
```
Now we can look at the murder rates:
```{r}
murder_rate[ind]
```
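When an entry of the first vector has no match in the second, `match` returns `NA` for it, and indexing with that `NA` yields `NA` as well. A small sketch with a made-up lookup vector `toy_states`:

```{r}
toy_states <- c("Alabama", "Alaska", "Arizona")   # hypothetical lookup vector

toy_idx <- match(c("Arizona", "Boston", "Alabama"), toy_states)
toy_idx               # 3 NA 1
toy_states[toy_idx]   # "Arizona" NA "Alabama"
```

Checking the index with `is.na` before using it is one way to spot names that were misspelled or are missing from the table.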
## 3.14.5 `%in%`
If rather than an index we want a logical that tells us whether or not each element of a first vector is in a second, we can use the function `%in%`. Let’s imagine you are not sure if Boston, Dakota and Washington are states. You can find out like this:
```{r}
c("Boston", "Dakota", "Washington") %in% murders$state
```
Advanced: There is a connection between `match` and `%in%` through `which`. To see this, notice that the following two lines produce the same index (although in different order):
```{r}
match(c("New York", "Florida", "Texas"), murders$state)
which(murders$state%in%c("New York", "Florida", "Texas"))
```
# 3.15 Exercises
1. Compute the per 100,000 murder rate for each state and store it in an object called `murder_rate`. Then use logical operators to create a logical vector named `low` that tells us which entries of `murder_rate` are lower than 1.
```{r}
murder_rate<-murders$total/murders$population*100000
low<-murder_rate<1
low
```
2. Now use the results from the previous exercise and the function `which` to determine the indices of `murder_rate` associated with values lower than 1.
```{r}
which(low)
```
3. Use the results from the previous exercise to report the names of the states with murder rates lower than 1.
```{r}
murders$state[which(low)]
```
4. Now extend the code from exercise 2 and 3 to report the states in the Northeast with murder rates lower than 1. Hint: use the previously defined logical vector `low` and the logical operator `&`.
```{r}
north<-murders$region=="Northeast"
ind<-north&low
murders$state[ind]
```
5. In a previous exercise we computed the murder rate for each state and the average of these numbers. How many states are below the average?
```{r}
ind<-murder_rate<mean(murder_rate)
sum(ind)
```
6. Use the match function to identify the states with abbreviations AK, MI, and IA. Hint: start by defining an index of the entries of `murders$abb` that match the three abbreviations, then use the `[` operator to extract the states.
```{r}
ind<-match(c("AK","MI","IA"),murders$abb)
murders$state[ind]
```
7. Use the `%in%` operator to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?
```{r}
c("MA","ME","MI","MO","MU") %in% murders$abb
```
8. Extend the code you used in exercise 7 to report the one entry that is not an actual abbreviation. Hint: use the `!` operator, which turns `FALSE` into `TRUE` and vice versa, then `which` to obtain an index.
```{r}
actual<-c("MA","ME","MI","MO","MU") %in% murders$abb
which(!actual)
```
# 3.16 Basic plots
In chapter 8 we describe an add-on package that provides a powerful approach to producing plots in R. We then have an entire part on Data Visualization in which we provide many examples. Here we briefly describe some of the functions that are available in a basic R installation.
## 3.16.1 `plot`
The `plot` function can be used to make scatterplots. Here is a plot of total murders versus population.
```{r}
x<-murders$population/10^6
y<-murders$total
plot(x,y)
```
For a quick plot that avoids accessing variables twice, we can use the `with` function:
```{r}
with(murders, plot(population,total))
```
## 3.16.2 `hist`
We will describe histograms as they relate to distributions in the Data Visualization part of the book. Here we will simply note that histograms are a powerful graphical summary of a list of numbers that gives you a general overview of the types of values you have. We can make a histogram of our murder rates by simply typing:
```{r}
x<-with(murders,total/population*100000)
hist(x)
```
We can see that there is a wide range of values with most of them between 2 and 3 and one very extreme case with a murder rate of more than 15:
```{r}
murders$state[which.max(x)]
```
## 3.16.3 `boxplot`
Boxplots will also be described in the Data Visualization part of the book. They provide a more terse summary than histograms, but they are easier to stack with other boxplots. For example, here we can use them to compare the different regions:
```{r}
murders$rate<-with(murders,total/population*100000)
boxplot(rate~region,data=murders)
```
We can see that the South has higher murder rates than the other three regions.
## 3.16.4 `image`
The `image` function displays the values in a matrix using color. Here is a quick example:
```{r}
x<-matrix(1:120,12,10)
image(x)
```
# 3.17 Exercises
1. We made a plot of total murders versus population and noted a strong relationship. Not surprisingly, states with larger populations had more murders.
```{r}
library(dslabs)
data(murders)
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)
```
Keep in mind that many states have populations below 5 million and are bunched up. We may gain further insights from making this plot in the log scale. Transform the variables using the `log10` transformation and then plot them.
```{r}
log_pop<-log10(population_in_millions)
log_total<-log10(total_gun_murders)
plot(log_pop, log_total)
```
2. Create a histogram of the state populations.
```{r}
population<-with(murders,population)
hist(population)
```
3. Generate boxplots of the state populations by region.
```{r}
boxplot(population~region, data=murders)
```