---
title: "Homework One"
author: "Xinyi Lin"
date: "9/18/2018"
output: github_document
---
```{r setup, include = FALSE}
library(tidyverse)
library(ggplot2)
```
## Problem 1
### Create data frame
First, I create a data frame comprised of:
* A random sample of size 10 from a uniform [0, 5] distribution
* A logical vector indicating whether elements of the sample are greater than 2
* A (length-10) character vector
* A (length-10) factor vector
```{r create_df_p1}
set.seed(13)
rs_p1 = runif(10,0,5) # create random_sample
lv_p1 = rs_p1 > 2 # create logical vector
cv_p1 = c("A", "N", "A", "P", "P", "L", "E", "A", "N", "D") # create character vector
fv_p1 = factor(c("A", "N", "A", "P", "P", "L", "E", "A", "N", "D")) # create factor vector
df_p1 = tibble(rs_p1, lv_p1, cv_p1, fv_p1) # create data frame
df_p1 # show data frame
```
The data frame df_p1 is shown above.
### Calculate the mean
Then, I try to calculate the mean of each variable.
```{r get_mean}
mean(rs_p1) #get the mean of rs_p1
mean(lv_p1) #get the mean of lv_p1
mean(cv_p1) #get the mean of cv_p1
mean(fv_p1) #get the mean of fv_p1
```
Only the means of the random sample and the logical vector can be calculated. `mean` is defined only for numeric and logical input, so for the character vector and the factor vector it returns `NA` with a warning. For the logical vector, R converts `TRUE` to 1 and `FALSE` to 0, so the mean is the proportion of `TRUE` values.
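The coercion behind the logical mean can be checked with a small sketch. The vector `flags` is a hypothetical example and is not part of the homework data.

```r
# Hypothetical logical vector, used only for illustration
flags = c(TRUE, FALSE, TRUE, TRUE)

# mean() coerces TRUE to 1 and FALSE to 0, so the result is the share of TRUE values
flag_mean = mean(flags)
flag_share = sum(flags) / length(flags)

# Letters have no numeric interpretation; mean() returns NA and emits a warning,
# which is suppressed here so the sketch runs quietly
letter_mean = suppressWarnings(mean(c("A", "N", "A")))
```

Both `flag_mean` and `flag_share` describe the same quantity, which is why the mean of a logical vector is a convenient way to compute a proportion.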
### Convert variables
I apply the `as.numeric` function to the logical, character, and factor variables, but the chunk is shown without being evaluated, so only the R code appears. When I run the code, the logical vector and the factor vector are converted to numeric vectors. Each element of the factor vector is converted to the integer position of its level. In the logical vector, `TRUE` is converted to 1 and `FALSE` to 0. The character vector contains letters that cannot be read as numbers, so every element of the result is `NA`.
```{r test_as.numeric, eval = FALSE}
as.numeric(lv_p1)
as.numeric(cv_p1)
as.numeric(fv_p1)
```
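Since the chunk above is not evaluated, the following sketch makes the factor and character results concrete. It reuses the same letters as `cv_p1`; the names `letters_vec`, `letters_fac`, `codes`, and `parsed` are hypothetical and introduced only for illustration. It assumes the default level ordering, which sorts these uppercase letters alphabetically.

```r
# Same letters as the character and factor vectors in Problem 1
letters_vec = c("A", "N", "A", "P", "P", "L", "E", "A", "N", "D")
letters_fac = factor(letters_vec)

# By default the levels are the sorted unique values: A, D, E, L, N, P
levels(letters_fac)

# as.numeric() on a factor returns the position of each element's level
codes = as.numeric(letters_fac)

# Letters cannot be parsed as numbers, so every element becomes NA;
# the coercion warning is suppressed to keep the sketch quiet
parsed = suppressWarnings(as.numeric(letters_vec))
```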
Then, I convert the character variable from character to factor to numeric, and the factor variable from factor to character to numeric. The character variable is converted to a factor and then to numeric successfully, because characters can be converted to factors and factors can be converted to their level positions. The factor variable is converted to a character variable, but, as shown above, a character vector of letters cannot be converted to numbers, so the final conversion to numeric returns `NA` for every element.
```{r factor_convert}
cv_p1_fac = as.factor(cv_p1) # first convert to factor
cv_p1_fac
as.numeric(cv_p1_fac) # then convert to numeric
fv_p1_cha = as.character(fv_p1) # first convert to character
fv_p1_cha
as.numeric(fv_p1_cha) # then convert to numeric
```
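The order of conversion matters most when the factor labels look like numbers. The sketch below uses a hypothetical factor `dose_fac` that is not part of the homework data. It assumes the default level ordering, in which labels are sorted as strings.

```r
# Hypothetical factor whose labels look like numbers
dose_fac = factor(c("10", "20", "10", "5"))

# Direct conversion returns level positions; the levels sort as strings: "10", "20", "5"
dose_codes = as.numeric(dose_fac)

# Converting to character first recovers the values written in the labels
dose_values = as.numeric(as.character(dose_fac))
```

Here `dose_codes` holds level positions while `dose_values` holds the labelled numbers, so factor to character to numeric is the route to take when the labels themselves are the data.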
## Problem 2
### Create data frame
First, I create a data frame comprised of:
* x: a random sample of size 1000 from a standard Normal distribution
* y: a random sample of size 1000 from a standard Normal distribution
* A logical vector indicating whether the x + y > 0
* A numeric vector created by coercing the above logical vector
* A factor vector created by coercing the above logical vector
```{r create_df_p2}
set.seed(13)
x_p2 = rnorm(1000) # create x
y_p2 = rnorm(1000) # create y
lv_p2 = x_p2 + y_p2 > 0 # create logical vector
nv_p2 = as.numeric(lv_p2) # create numeric vector
fv_p2 = as.factor(lv_p2) # create factor vector
df_p2 = tibble(x_p2, y_p2, lv_p2, nv_p2, fv_p2) # create data frame
head(df_p2) #show first few lines of data frame
```
The first few rows of the data frame are shown above.
### Short description
The data frame df_p2 has `r nrow(df_p2)` observations and `r ncol(df_p2)` variables.
The mean of x is `r mean(x_p2)` and the median of x is `r median(x_p2)`.
The proportion of cases for which the logical vector is TRUE is `r sum(lv_p2)/length(lv_p2)`.
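The proportion can also be computed with `mean`, because a logical vector is coerced to 0 and 1. The following is a minimal sketch with hypothetical names (`x_demo`, `y_demo`, `positive_sum`); the seed is chosen only to mirror the chunk above.

```r
set.seed(13)                      # any seed works; 13 mirrors the chunk above
x_demo = rnorm(1000)
y_demo = rnorm(1000)
positive_sum = x_demo + y_demo > 0

# Two equivalent ways to get the share of TRUE values
prop_by_sum = sum(positive_sum) / length(positive_sum)
prop_by_mean = mean(positive_sum)
```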
### Print scatterplots
The first scatterplot shows y vs x, with points colored by the logical variable.
```{r make_pic1_p2}
pic1_p2 = ggplot(df_p2, aes(x = x_p2, y = y_p2, color = lv_p2)) + geom_point()
pic1_p2
```
The second scatterplot colors points by the numeric variable.
```{r make_pic2_p2}
pic2_p2 = ggplot(df_p2, aes(x = x_p2, y = y_p2, color = nv_p2)) + geom_point()
pic2_p2
```
The third scatterplot colors points by the factor variable.
```{r make_pic3_p2}
pic3_p2 = ggplot(df_p2, aes(x = x_p2, y = y_p2, color = fv_p2)) + geom_point()
pic3_p2
```
Although the three plots look the same apart from their colors, the logic behind them differs. In the first and third plots, the point colors are determined by the logical and factor vectors, so the points fall into two groups and each plot uses two distinct colors. In the second plot, the point colors are determined by the numeric variable, so they are drawn on a continuous color scale whose shades represent different numbers. Because the variable takes only the values 0 and 1, only the two ends of that scale appear.
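The difference between the continuous and discrete color scales can be reproduced on a small data set. This is a sketch under stated assumptions: `demo_df` and its columns are hypothetical and not the homework data, and `ggplot2` is assumed to be installed.

```r
library(ggplot2)

# Hypothetical small data set, used only for illustration
set.seed(1)
demo_df = data.frame(x = rnorm(50), y = rnorm(50))
demo_df$flag_num = as.numeric(demo_df$x + demo_df$y > 0)  # 0/1 stored as numeric
demo_df$flag_fac = factor(demo_df$flag_num)               # same values stored as a factor

# Numeric color variable: ggplot2 uses a continuous gradient and a color bar legend
plot_continuous = ggplot(demo_df, aes(x = x, y = y, color = flag_num)) + geom_point()

# Factor color variable: ggplot2 uses a discrete palette with one color per level
plot_discrete = ggplot(demo_df, aes(x = x, y = y, color = flag_fac)) + geom_point()
```

Printing `plot_continuous` and `plot_discrete` shows the same points with different legends, which is the only visible trace of the underlying variable type.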
Finally, I use `ggsave` to export the first scatterplot to my project directory.
```{r export_pic1_p2}
ggsave(filename = "pic1_p2.jpeg", plot = pic1_p2)
```