assignment-2.Rmd

---
title: "assignment-2"
author: "Gabriele Chignoli"
date: "10/9/2017"
output: html_document
---

##Exercise 1: Writing R functions

All functions written for this assignment will be provided from the source file:
```{r}
source("functions.R")
```

You will find there all commentaries on functions, what they do and why.

###Question 1a: filling in the blanks

The codes below provide the results of the first filling.
In order to verify the presence of numeric values in a certain column of the dataset and then to sum them all, an ```If``` condition was added to the function ```sum_column```:

```{r}
print(sum_column(iris, "Sepal.Length"))
print(sum_column(iris, "Species"))
print(sum_column(warpbreaks, "breaks"))
```

###Question 1b: a function with one argument and one return value

As a result of cheating on first filling by adding the built-in **R** ```sum()```, here we use a basic code to recreate a function which can do the sum for us. We test it using the code chunk below:

```{r}
print(my_sum(iris$Sepal.Length))
print(my_sum(iris$Species))
print(my_sum(warpbreaks$breaks))
```

We lately add this function to the first one.

###Question 1c: a function with two arguments and one return value

This time we have a new function to make, it uses our sum function too and it divides the sum for a number. As done earlier here is a chunk in order to verify the result:

```{r}
print(sum_divided_by(iris$Sepal.Length, 12))
print(sum_divided_by(iris$Species, 22))
print(sum_divided_by(iris$Sepal.Length, "Not numeric"))
print(sum_divided_by(warpbreaks$breaks, -12))
```

###Question 1d: one more function

Last function we create for this exercise calculates the mean of a vector not cheating using the built-in ```mean()```. Chunk below show the result:

```{r}
print(my_mean(iris$Sepal.Length))
print(my_mean(iris$Species))
print(my_mean(warpbreaks$breaks))
```

##Exercise 2: Working with ggplot

Here we'll work with the ```ggplot``` package and as we did earlier all functions and code will be provided in

```{r}
source("functions.R")
```

###Question 2a: creating violin plots

We'll test this first question with the following chunk:

```{r}
print(grouped_violin_plot(iris, "Sepal.Length", "Species"))
```

###Question 2b: modifying plots

This time we will just modify the last plot in order to add a main title and to change plot colours, thanks to the chunk you see here:

```{r}
p <- grouped_violin_plot(iris, "Sepal.Length", "Species")
p <- p + ggplot2::scale_fill_brewer(palette="Accent")
p <- p + ggplot2::labs(title="Iris data")
print(p)
```

##Exercise 3: Permutation tests

As we did in Exercise 1, here we will add some new functions to our `functions.R` file and every new line will come with commentaries to explain what we wanted to obtain or what we do.

###Question 3a: writing a more generic function for taking a test statistic

Our first function calculate the difference between medians of a selected group. Median correspond to the middle value of a sequence, in our case of our column, but what happens when the sequence is unpair ? We take the two middle values and calculate the halfway between them, that's what happened to us, where our columns are length 50.  
In **R** a built-in function provide to estimate the median value, but we had to do some adjustments before calling it because our column presented a String list of numeric values.  
Once we did everything necessary to make possible our computing, we obtain :  

* Versicolor median = 2.8  
* Virginica median = 3.0  

```{r}
difference_in_medians(iris, "Sepal.Width", "Species", "versicolor", "virginica")
difference_in_medians(iris, "Sepal.Width", "Species", "virginica", "virginica")
```

###Question 3b: writing a more generic function of the randomization (i.e., permutation) function

This time we work on columns too but instead of using values in columns to obtain a result, we take all the column dataset and randomly replace them.

```{r}
iris$Sepal.Width[1:10]
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
randomize(iris, "Sepal.Width")$Sepal.Width[1:10]
randomize(iris, "Species")$Species[1:10]
randomize(iris, "Species")$Sepal.Width[1:10]
set.seed(previous_seed)
```

###Question 3c: a function to get a statistic for multiple permutations

Next functions does a permutation for two groups of our dataset, we use here all others functions we wrote. This time it took a little more for me to understand how to familiarize with parameters, how to place them and variables inside the main function but it finally works:

```{r}
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_1 <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
                                 "virginica", difference_in_medians,
                                 n_samples=10)
ptest_2 <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
                                 "virginica", difference_in_medians,
                                 n_samples=10)
ptest_3 <- permutation_twogroups(randomize(iris, "Sepal.Width"), "Sepal.Width",
                                 "Species", "versicolor", "virginica",
                                 difference_in_medians, n_samples=10)
set.seed(previous_seed)
print(ptest_1)
print(ptest_2)
print(ptest_3)
print(ptest_3$observed)
print(ptest_3[["observed"]])
```

###Question 3d

So here we discuss, just a bit, the change we obtain if we take in account `var` or `grouping_var` for the permutation test inside our function explained in question 3c.  
There is no better way to understand a result change than to see it, so here it is:

```
print(ptest_1)
## $observed
## [1] -0.2
## 
## $permuted
##  [1] 0.00 0.10 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.05
print(ptest_2)
## $observed
## [1] -0.2
## 
## $permuted
##  [1]  0.10  0.00  0.00  0.00  0.00  0.00 -0.05 -0.10  0.05  0.10
print(ptest_3)
## $observed
## [1] 0
## 
## $permuted
##  [1]  0.10  0.00  0.00 -0.10 -0.10  0.10  0.00  0.00  0.10  0.05
print(ptest_3$observed)
## [1] 0
print(ptest_3[["observed"]])
## [1] 0
```

If the last two results remain the same, mostly beacause they acces a value which doesn't change, the `$observed`result, we can see a great change in all `$permuted` results. This is preatty understandable if we think to the difference between the values we access using the two variables we consider or `grouping_var`.  

-  `var` give us the access to values contained in the "Sepal.Width" column in our case;  
-  `grouping_var` is associated to the "Species" column.

###Question 3e: plotting the sampling distribution

```{r echo=F, cache=T}
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
                              "virginica", difference_in_medians)
set.seed(previous_seed)
```

```{r echo=F}
ptest_p <- tibble::as_tibble(ptest["permuted"])
ptest_o <- tibble::as_tibble(ptest["observed"])
plot_line_at_observed(ptest_p, ptest_o, "Permuted data plot")
```

We now observe the histogram of data we permuted earlier. We add a line to emphasise the value we also observed on our data.  
Mostly we can say that the data has a sort of symmetric distribution basically around the value 0.0. The value we observe corresponding to -0.2 it seems really difficult for the two groups we take in account to be similar.

###Question 3f: test statistics

```{r echo=F, cache=T}
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_new <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
                                   "virginica", new_test_statistic)
set.seed(previous_seed)
```

```{r echo=F}
ptest_new_p <- tibble::as_tibble(ptest_new["permuted"])
ptest_new_o <- tibble::as_tibble(ptest_new["observed"])
plot_line_at_observed(ptest_new_p, ptest_new_o, "Permuted data plot")
```

Here we will test a new kind of statistic. We'll take in account just the values corresponding to intersection of the two groups in the dataset we analize. It would appear interesting to see how the common values change from a subset to another, this could be lead us to consider some more fundamental values which for example remain the same in all permutations or can reveal which permutation lose every link to values with the preceeding ones.  
In our case for exemple we see that value 0 is the one which doesn't change in the new set we take.

###Question 3g: calculating p-values

```{r}
permutation_pvalue_right(ptest)
permutation_pvalue_left(ptest)
```

- **Hypothesis A.** The two groups are drawn from equivalent sources of variability if we get a *p-value* "extreme" (a really small fraction) in positive or negative direction.  
- **Hypothesis B.** Our obtained *p-values* are not "extreme" or just in a certain direction if the two groups are drawn from different sources of variability.

Let's take the example of the surface form of the vowel "e" in italian, it corresponds to the phonemes /e/ and /ɛ/, we'd expect that all native speakers (taken from North and South regions) can make the two different realisations without problems because they are just two underlying forms in the phonological system of this language.  

We'd predict that the difference in the medians statistic would be extreme on the right side if our *Hypothesis B* were that our dataset presents native speakers who makes the realisation of just one phoneme not or never realising the other for the vowel "e". In our *Hypothesis A* we would expect a normal distribution of the two realisations.  
In which case, we would use the function `permutation_pvalue_left` or
`permutation_pvalue_right` ?  
If we assign two different median values to the two phonemes, based on their features or their distribution in language, we would consider the left value for a greater realisation of a phoneme, we say the /e/ one and the right one for the phoneme /ɛ/.

###Question 3h

```{r echo=F, cache=T}
#subset1
iris_subset_1 <- iris[c(89:94, 108:112),]
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_sub1 <- permutation_twogroups(iris_subset_1, "Sepal.Width", "Species",
                                    "versicolor", "virginica",
                                    difference_in_medians)
set.seed(previous_seed)
ptest_sub1_p <- tibble::as_tibble(ptest_sub1["permuted"])
ptest_sub1_o <- tibble::as_tibble(ptest_sub1["observed"])

#subset2
iris_subset_2 <- iris[88:114,]
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_sub2 <- permutation_twogroups(iris_subset_2, "Sepal.Width", "Species",
                                    "versicolor", "virginica",
                                    difference_in_medians)
set.seed(previous_seed)
ptest_sub2_p <- tibble::as_tibble(ptest_sub2["permuted"])
ptest_sub2_o <- tibble::as_tibble(ptest_sub2["observed"])
```

We'll now create two new subsets of our data and proceed as earlier in permutation, we take an observed value and we plot everything in an histogram in order to discuss later. We will show every observed value in every histogram, thanks to our improved plot function:

```{r echo=F}
plot_line_at_observed(ptest_sub1_p, ptest_o,
                      "Subset 1 data plot",
                      ptest_sub1_o, ptest_sub2_o)
```

For this first subset the observed line is the first appearing on the plot, the brown line at -0.3, the median difference look greater here than on the other two datasets. We can also notice a more harmonic distribution of values compared to the first permuted set we observed earlier, here we don't have any symmetric distribution but great populations of values in different places.

```{r echo=F}
plot_line_at_observed(ptest_sub2_p, ptest_o,
                      "Subset 2 data plot",
                      ptest_sub1_o, ptest_sub2_o)
```

Our second subset presents once again a different distribution both in median observation and in vlues themselves. We can see a slightly return of symmetrical representation with increasing values from (almost) -0.4 to almost (-0.2), a great decreasing count and an increaed one, we see the inverse path from little more than 0 to the last values.  
Median value here is between -0.2 and -0.3, perfectly between the two preceeding subsets. As the original dataset is the same for the three examples, we can firmly assume that our three permutation really produced something interesting to observe, three subsets completely different which can lead us to confirm that the original set is composed of data which basically differ in characters and values (species, sepal width, etc.) from one another.