-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathrbasic_20190906_123_chang.Rmd
395 lines (303 loc) · 12 KB
/
rbasic_20190906_123_chang.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
---
title: "R Notebook"
output: html_notebook
---
# 3.2 The very basics
## 3.2.1 Objects
When solving quadratic equations, the solutions change depending on the values of a, b and c. Programming language can define variables and write expressions with these variables.
```{r}
a <- 1
b <- 1
c <- -1
```
We can also use = , but <- is recommended.
To see the value stored in variable,
```{r}
a
```
A more explicit way
```{r}
print(a)
```
We use the term 'object' to describe stuff that is stored in R. Objects can also be functions.
## 3.2.2 The workspace
As we define objects in the console, we are actually changing the workspace.
```{r}
ls()
```
the Environment tab shows the values.
```{r}
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
## 3.2.3 Functions
The data analysis process can usually be described as a series of functions applied to the data. R includes predefined functions. They are available for immediate use. Unlike ls, most functions require one or more arguments.
```{r}
log(8)
log(a)
```
```{r}
help("log")
?log
```
The help page will show you what arguments the function is expecting. You can determine which arguments are optional by noting in the help document that a default value is assigned with =.
If you want a quick look at the arguments without opening the help system,
```{r}
args(log)
```
You can change the default values by
```{r}
log(8, base = 2)
log(x=8, base=2)
```
by not using the names, it assumes the arguments are x, base
```{r}
log(8,2)
```
If using the argumenst' names, then we can include them in whatever order we want
```{r}
log(base =2, x=8)
```
To specify arguments, we must use =, and cannot use <-.
```{r}
2^3
```
```{r}
help("+")
help("<")
```
## 3.2.4 Other prebuilt objects
You can see available datasets by
```{r}
data()
```
```{r}
co2
pi
Inf+1
```
## 3.2.5 Variable names
Some basic rules in R: variable names have to start with a letter, can't contain spaces and should not be predefined variables.
Convention: use meaningful words, use only lower case, and use underscores(_) as a substitute for spaces.
For example,
```{r}
solution_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
solution_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
## 3.2.6 Saving your workspace
Values remain in the workspace until you end or erase them with the function 'rm'.
Assign the workspace a specific name. To load, use the function 'load'. When saving a workspace, use the suffix 'rda' or 'RData'.
## 3.2.7 Motivating scripts
To solve another equation, we can copy and paste the code and redefine the variables and recompute the solution
```{r}
a <- 3
b <- 2
c <- -1
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
## 3.2.8 Commenting your code
If a line of R code starts with the symbol #, it is not evaluated. We can use this to write reminders of why we wrote particular code.
For example,
```{r}
##Code to compute solution to quadratic equation of the form ~~
##define the variables
a <- 3
b <- 2
c <- -1
## now compute the solution
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
```
# 3.3 Exercises
1. What is the sum of the first 100 positive integers?
```{r}
n <- 100
n*(n+1)/2
```
2. Now use the same formula to compute the sum of the integers from 1 through 1,000.
```{r}
n <- 1000
n*(n+1)/2
```
3. Look at the result of typing the following code into R:
```{r}
n <- 1000
x <- seq(1, n)
sum(x)
```
Based on the result, what do you think the functions seq and sum do?
<B.> seq creates a list of numbers and sum adds them up.
4. In math and programming, we say that we evaluate a function when we replace the argument with a given number. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.
```{r}
log(sqrt(100),10)
```
5. Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.
<C.> log(exp(x))
# 3.4 Data types
Variables in R can be of different types. For example, we need to distinguish numbers from character strings and tables from simple lists of numbers. The function class helps us determine what type of object we have:
```{r}
a <- 2
class(a)
```
To work efficiently in R, it is important to learn the different types of variables and what we can do with these.
# 3.5 Data frames
Up to now, the variables we have defined are just one number. This is not very useful for storing data. The most common way of storing a dataset in R is in a data frame. Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.
A large proportion of data analysis challenges start with data stored in a data frame. For example, we stored the data for our motivating example in a data frame. You can access this dataset by loading the dslabs library and loading the murders dataset using the data function:
```{r}
library(dslabs)
data(murders)
```
```{r}
class(murders)
```
## 3.5.1 Examinig an Object
The function 'str' is useful for finding out more about the structure of an object:
```{r}
str(murders)
```
```{r}
head(murders)
```
## 3.5.2 The accessor: $
```{r}
murders$population
```
```{r}
names(murders)
```
It is important to know that the order of the entries in murders$population preserves the order of the rows in our data table. This will later permit us to manipulate one variable based on the results of another. For example, we will be able to order the state names by the number of murders.
## 3.5.3 Vectors: numerics, characters, and logical
The object 'murders$population' is not one number but several. We call these types of objects 'vectors'. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function 'length' tells you how many entries are in the vector:
```{r}
pop<-murders$population
length(pop)
```
This particular vector is numeric since population sizes are numbers:
```{r}
class(pop)
```
In a numeric vector, every entry must be a number.
To store character strings, vectors can also be of class character. For example, the state names are characters:
```{r}
class(murders$state)
```
As with numeric vectors, all entries in a character vector need to be a character.
Another important type of vectors are logical vectors. These must be either TRUE or FALSE.
```{r}
z<-3 ==2
z
class(z)
```
Here the == is a relational operator asking if 3 is equal to 2. In R, if you just use one =, you actually assign a variable, but if you use two == you test for equality.
You can see the other relational operators by typing:
```{r}
?Comparison
```
Advanced: Mathematically, the values in pop are integers and there is an integer class in R. However, by default, numbers are assigned class numeric even when they are round integers. For example, class(1) returns numeric. You can turn them into class integer with the 'as.integer()' function or by adding an L like this: 1L. Note the class by typing:
```{r}
class(1L)
class(1)
```
## 3.5.4 Factors
In the murders dataset, we might expect the region to also be a character vector. However, it is not:
```{r}
class(murders$region)
levels(murders$region)
```
In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
Note that the levels have an order that is different from the order of appearance in the factor object. The default is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. We will see several examples of this in the Data Visualization part of the book. The function 'reorder' lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector.
Suppose we want the levels of the region by the total number of murders rather than alphabetical order. If there are values associated with each level, we can use the 'reorder' and specify a data summary to determine the order. The following code takes the sum of the total murders in each region, and reorders the factor following these sums.
```{r}
region<-murders$region
value<-murders$total
region<-reorder(region, value, FUN = sum)
levels(region)
```
Warning:
Factors can be a source of confusion since sometimes they behave like characters and sometimes they do not. As a result, confusing factors and characters are a common source of bugs.
## 3.5.5 Lists
Data frames are a special case of lists. We will cover lists in more detail later, but know that they are useful because you can store any combination of different types. Below is an example of a list:
```{r}
record<-list("John Doe", 1234, c(95, 82, 91, 97, 93), "A")
class(record)
record
names(record)<-c("name", "student_id", "grades", "final_grade")
record
```
As with data frames, you can extract the components of a list with the accessor $. In fact, data frames are a type of list.
```{r}
record$student_id
record[["student_id"]]
```
## 3.5.6 Matrices
Matrices are another type of object that are common in R. Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type. For this reason data frames are much more useful for storing data, since we can have characters, factors and numbers in them.
Yet matrices have a major advantage over data frames: we can perform a matrix algebra operations, a powerful type of mathematical technique.
We can define a matrix using the 'matrix' function. We need to specify the number of rows and columns.
```{r}
mat<-matrix(1:12,4,3)
mat
```
You can access specific entries in a matrix using square brackets ([). If you want the second row, third column, you use:
```{r}
mat[2,3]
```
If you want the entire second row, you leave the column spot empty:
```{r}
mat[2, ]
```
Notice that this returns a vector, not a matrix.
```{r}
mat[,3]
```
This is also a vector, not a matrix.
You can access more than one column or more than one row if you like. This will give you a new matrix.
```{r}
mat[,2:3]
```
You can subset both rows and columns:
```{r}
mat[1:2, 2:3]
```
We can convert matrices into data frames using the function 'as.data.frame':
```{r}
as.data.frame(mat)
```
You can also use single square brackets ([) to access rows and columns of a data frame:
```{r}
data("murders")
murders[25,1]
murders[2:3,]
```
# 3.6 Exercises
1. Load the US murders dataset.
```{r}
library(dslabs)
data(murders)
str(murders)
```
Use the function str to examine the structure of the murders object. We can see that this object is a data frame with 51 rows and five columns. Which of the following best describes the variables represented in this data frame?
<C.> The state name, the abbreviation of the state name, the state’s region, and the state’s population and total number of murders for 2010.
2. What are the column names used by the data frame for these five variables?
<"state", "abb", "region", "population", "total">
3. Use the accessor $ to extract the state abbreviations and assign them to the object a. What is the class of this object?
```{r}
a<-murders$abb
a
class(a)
```
<character>
4. Now use the square brackets to extract the state abbreviations and assign them to the object b. Use the identical function to determine if a and b are the same.
```{r}
b<-murders[["abb"]]
b
identical(a,b)
```
5. We saw that the region column stores a factor. With one line of code, use the function levels and length to determine the number of regions defined by this dataset.
```{r}
length(levels(murders$region))
```
6. The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.
```{r}
table(murders$region)
```