p8105_hw1_xl2836.Rmd

---
title: "Homework One"
author: "Xinyi Lin"
date: "9/18/2018"
output: github_document
---

```{r setup, include = FALSE}
library(tidyverse)
library(ggplot2)
```

## Problem 1

### Create data frame

First, I create a data frame comprised of:

* A random sample of size 10 from a uniform [0, 5] distribution

* A logical vector indicating whether elements of the sample are greater than 2

* A (length-10) character vector

* A (length-10) factor vector

```{r create_df_p1}
set.seed(13)
rs_p1 = runif(10,0,5)     # create random_sample

lv_p1 = rs_p1 > 2   # create logical vector

cv_p1 = c("A", "N", "A", "P", "P", "L", "E", "A", "N", "D")    # create character vector

fv_p1 = factor(c("A", "N", "A", "P", "P", "L", "E", "A", "N", "D"))      # create factor vector

df_p1 = tibble(rs_p1, lv_p1, cv_p1, fv_p1)   # create data frame

df_p1  # show data frame
```

The data frame df_p1 is shown above.

### Calculate the mean

Then, I try to calculate the mean of each variable.

```{r get_mean}
mean(rs_p1)   #get the mean of rs_p1

mean(lv_p1)   #get the mean of lv_p1

mean(cv_p1)   #get the mean of cv_p1

mean(fv_p1)   #get the mean of fv_p1
```

I found out that only the mean of the random sample and the logical vector can be calculated. As only numbers have a mean, I cannot get the mean of the character vector or the factor vector. But when I get the mean of the logical vector, as R convert "TRUE" to 1 and "FALSE" to 0, I can get one.

### Convert variables
I try to apply the `as.numeric` function to the logical, character, and factor variables, but only show the chunk without output. So only the R code can be seen. But if I run the R code, I found out that the factor vector and the logical vector can be converted to numeric vectors. Each element in factor vector was converted to the order of that level. While in logical vector, The "TRUE" was converted to 1 and the "FALSE" was converted to 0. The character vector canot be converted to any number, so the result is combination of "NA".

```{r test_as.numeric, eval = FALSE}
as.numeric(lv_p1)  

as.numeric(cv_p1)

as.numeric(fv_p1)
```

Then, I try to convert character variable from character to factor to numeric and convert factor variable from factor to character to numeric. I found out that the character variable was converted to factor then to numeric successfuly, as characters can be converted to factors and factors can be converted to numbers. However, a character vector cannot be converted to a numeric vector as shown above. So the factor variable was converted to character variable but fail to convert to numeric variable.

```{r factor_convert}
cv_p1_fac = as.factor(cv_p1)    # first convert to factor
cv_p1_fac
as.numeric(cv_p1_fac)    # then convert to numeric

fv_p1_cha = as.character(fv_p1)    # first convert to character
fv_p1_cha
as.numeric(fv_p1_cha)    # then convert to numeric
```

## Problem 2

### Create data fram

First, I create a data frame comprised of:

* x: a random sample of size 1000 from a standard Normal distribution

* y: a random sample of size 1000 from a standard Normal distribution

* A logical vector indicating whether the x + y > 0

* A numeric vector created by coercing the above logical vector

* A factor vector created by coercing the above logical vector

```{r create_df_p2}
set.seed(13)

x_p2 = rnorm(1000)       # create x

y_p2 = rnorm(1000)       # create y

lv_p2 = x_p2 + y_p2 > 0   # create logical vector

nv_p2 = as.numeric(lv_p2)     # create numeric vector

fv_p2 = as.factor(lv_p2)    # create factor vector

df_p2 = tibble(x_p2, y_p2, lv_p2, nv_p2, fv_p2)   # create data frame
head(df_p2)      #show first few lines of data frame
```

First few lines of the data frame is shown above.

### Short discription

The observations number of data frame df_p2 is `r nrow(df_p2)` and the variables number of data frame df_p2 is `r ncol(df_p2)`.

The mean of x is `r mean(x_p2)` and the median of x is `r median(x_p2)`.

The proportion of cases for which the logical vector is TRUE is `r sum(lv_p2)/length(lv_p2)`

### Print scatterplots

I make first scatterplot of y vs x which use logical variable to decide color points.

```{r make_pic1_p2}
pic1_p2 = ggplot(df_p2, aes(x = x_p2, y = y_p2, color = lv_p2)) + geom_point()
pic1_p2
```

I make second scatterplot that color points using the numeric variables.

```{r make_pic2_p2}
pic2_p2 = ggplot(df_p2, aes(x = x_p2, y = y_p2, color = nv_p2)) + geom_point()
pic2_p2
```

I make third scatterplot that color points using the factor variables.

```{r make_pic3_p2}
pic3_p2 = ggplot(df_p2, aes(x = x_p2, y = y_p2, color = fv_p2)) + geom_point()
pic3_p2
```

Even though these three pictures look the same except for difference in color, the logic behind then are different. The first and third pictures' color points are decieded by logical and factor vectors, so the color points in these two pictures are both seperated into two groups, thus they both show in two different colors. However, second picture's color points use the numeric variable, so they are shown in different color shades which represent different numbers.

At last, I use `ggsave` to export my first scatterplot to my project directory.

```{r export_pic1_p2}
ggsave(filename = "pic1_p2.jpeg", plot = pic1_p2)
```