-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprogramming_basics_20190911_152_chang.Rmd
156 lines (125 loc) · 6.93 KB
/
programming_basics_20190911_152_chang.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
title: "R Notebook"
output: html_notebook
---
---
title: "R Notebook"
output: html_notebook
---
# 4 Programming basics
We teach R because it greatly facilitates data analysis, the main topic of this book. By coding in R, we can efficiently perform exploratory data analysis, build data analysis pipelines and prepare data visualization to communicate results. However, R is not just a data analysis environment but a programming language. Advanced R programmers can develop complex packages and even improve R itself, but we do not cover advanced programming in this book. Nonetheless, in this section, we introduce three key programming concepts: conditional expressions, for-loops and functions. These are not just key building blocks for advanced programming, but are sometimes useful during data analysis. We also note that there are several functions that are widely used to program in R but that we will not cover in this book. These include `split`, `cut`, `do.call` and `Reduce`. These are worth learning if you plan to become an expert R programmer.
# 4.1 Conditional expressions
Conditional expressions are one of the basic features of programming. They are used for what is called flow control. The most common conditional expression is the if-else statement. In R, we can actually perform quite a bit of data analysis without conditionals. However, they do come up occasionally, and you will need them once you start writing your own functions and packages.
Here is a very simple example showing the general structure of an if-else statement. The basic idea is to print the reciprocal of `a` unless `a` is 0:
```{r}
a<-0
if(a!=0){
print(1/a)
} else{
print("No reciprocal for 0.")
}
```
Let’s look at one more example using the US murders data frame:
```{r}
library(dslabs)
data(murders)
murder_rate<-murders$total/murders$population*100000
```
Here is a very simple example that tells us which states, if any, have a murder rate lower than 0.5 per 100,000. The `if` statement protects us from the case in which no state satisfies the condition.
```{r}
ind<-which.min(murder_rate)
if(murder_rate[ind]<0.5){
print(murders$state[ind])
} else{
print("No state has murder rate that low")
}
```
If we try it again with a rate of 0.25, we get a different answer:
```{r}
if(murder_rate[ind] < 0.25){
print(murders$state[ind])
} else{
print("No state has a murder rate that low.")
}
```
A related function that is very useful is `ifelse`. This function takes three arguments: a logical and two possible answers. If the logical is `TRUE`, the value in the second argument is returned and if `FALSE`, the value in the third argument is returned. Here is an example:
```{r}
a<-0
ifelse(a>0,1/a,NA)
```
The function is particularly useful because it works on vectors. It examines each entry of the logical vector and returns elements from the vector provided in the second argument, if the entry is `TRUE`, or elements from the vector provided in the third argument, if the entry is `FALSE`.
```{r}
a<-c(0,1,2,-4,5)
result<-ifelse(a>0,1/a,NA)
result
```
Here is an example of how this function can be readily used to replace all the missing values in a vector with zeros:
```{r}
data(na_example)
no_nas<-ifelse(is.na(na_example),0,na_example)
sum(is.na(no_nas))
```
Two other useful functions are `any` and `all`. The any function takes a vector of logicals and returns `TRUE` if any of the entries is `TRUE`. The `all` function takes a vector of logicals and returns `TRUE` if all of the entries are `TRUE`. Here is an example:
```{r}
z<-c(TRUE,TRUE,FALSE)
any(z)
all(z)
```
# 4.2 Defining functions
As you become more experienced, you will find yourself needing to perform the same operations over and over. A simple example is computing averages. We can compute the average of a vector `x` using the `sum` and `length` functions: `sum(x)/length(x)`. Because we do this repeatedly, it is much more efficient to write a function that performs this operation. This particular operation is so common that someone already wrote the `mean` function and it is included in base R. However, you will encounter situations in which the function does not already exist, so R permits you to write your own. A simple version of a function that computes the average can be defined like this:
```{r}
avg<-function(x){
s<-sum(x)
n<-length(x)
s/n
}
```
Now `avg` is a function that computes the mean:
```{r}
x<-1:100
identical(mean(x),avg(x))
```
Notice that variables defined inside a function are not saved in the workspace. So while we use `s` and `n` when we call `avg`, the values are created and changed only during the call. Here is an illustrative example:
```{r}
s<-3
avg(1:10)
s
```
Note how `s` is still 3 after we call `avg`.
In general, functions are objects, so we assign them to variable names with `<-`. The function `function` tells R you are about to define a function. The general form of a function definition looks like this:
```{r}
my_function <- function(VARIABLE_NAME){
perform operations on VARIABLE_NAME and calculate VALUE
VALUE
}
```
The functions you define can have multiple arguments as well as default values. For example, we can define a function that computes either the arithmetic or geometric average depending on a user defined variable like this:
```{r}
avg<-function(x,arithmatic=TRUE){
n<-length(x)
ifelse(arithmatic,sum(x)/n, prod(x)^(1/n))
}
```
# 4.3 Namespaces
Once you start becoming more of an R expert user, you will likely need to load several add-on packages for some of your analysis. Once you start doing this, it is likely that two packages use the same name for two different functions. And often these functions do completely different things. In fact, you have already encountered this becuase both dplyr and the R-base stats package define a `filter` function. There are five other examples in dplyr. We know this becasue when we first load dplyr we see the following message:
```{r}
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
```
So what does R do when we type `filter`? Does it use the dplyr function or the stats function? From our previous work we know it uses the dplyr one. But what if we want to use the stats version?
These function live in different namespaces. R will follow a certain order when searching for a function in these namespaces. You can see the order by typing:
```{r}
search()
```
The first entry in this list is the global environment which includes all the objects you define.
So what if we want to use the stats `filter` instead of the dplyr filter but dplyr appears first in the search list? You can force the use of a specific name space by using double colons (`::`) like this:
```{r}
stats::filter
```
If we want to be absolutely sure we use the dplyr `filter` we can use
```{r}
dplyr::filter
```
Also note that if we want to use a function in a package without loading the entire package, we can use the double colon as well.