This repository has been archived by the owner on Jan 18, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathassignment-2.Rmd
283 lines (220 loc) · 12.5 KB
/
assignment-2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
title: "assignment-2"
author: "Gabriele Chignoli"
date: "10/9/2017"
output: html_document
---
##Exercise 1: Writing R functions
All functions written for this assignment will be provided from the source file:
```{r}
source("functions.R")
```
You will find there all commentaries on functions, what they do and why.
###Question 1a: filling in the blanks
The codes below provide the results of the first filling.
In order to verify the presence of numeric values in a certain column of the dataset and then to sum them all, an ```If``` condition was added to the function ```sum_column```:
```{r}
print(sum_column(iris, "Sepal.Length"))
print(sum_column(iris, "Species"))
print(sum_column(warpbreaks, "breaks"))
```
###Question 1b: a function with one argument and one return value
As a result of cheating on first filling by adding the built-in **R** ```sum()```, here we use a basic code to recreate a function which can do the sum for us. We test it using the code chunk below:
```{r}
print(my_sum(iris$Sepal.Length))
print(my_sum(iris$Species))
print(my_sum(warpbreaks$breaks))
```
We lately add this function to the first one.
###Question 1c: a function with two arguments and one return value
This time we have a new function to make, it uses our sum function too and it divides the sum for a number. As done earlier here is a chunk in order to verify the result:
```{r}
print(sum_divided_by(iris$Sepal.Length, 12))
print(sum_divided_by(iris$Species, 22))
print(sum_divided_by(iris$Sepal.Length, "Not numeric"))
print(sum_divided_by(warpbreaks$breaks, -12))
```
###Question 1d: one more function
Last function we create for this exercise calculates the mean of a vector not cheating using the built-in ```mean()```. Chunk below show the result:
```{r}
print(my_mean(iris$Sepal.Length))
print(my_mean(iris$Species))
print(my_mean(warpbreaks$breaks))
```
##Exercise 2: Working with ggplot
Here we'll work with the ```ggplot``` package and as we did earlier all functions and code will be provided in
```{r}
source("functions.R")
```
###Question 2a: creating violin plots
We'll test this first question with the following chunk:
```{r}
print(grouped_violin_plot(iris, "Sepal.Length", "Species"))
```
###Question 2b: modifying plots
This time we will just modify the last plot in order to add a main title and to change plot colours, thanks to the chunk you see here:
```{r}
p <- grouped_violin_plot(iris, "Sepal.Length", "Species")
p <- p + ggplot2::scale_fill_brewer(palette="Accent")
p <- p + ggplot2::labs(title="Iris data")
print(p)
```
##Exercise 3: Permutation tests
As we did in Exercise 1, here we will add some new functions to our `functions.R` file and every new line will come with commentaries to explain what we wanted to obtain or what we do.
###Question 3a: writing a more generic function for taking a test statistic
Our first function calculate the difference between medians of a selected group. Median correspond to the middle value of a sequence, in our case of our column, but what happens when the sequence is unpair ? We take the two middle values and calculate the halfway between them, that's what happened to us, where our columns are length 50.
In **R** a built-in function provide to estimate the median value, but we had to do some adjustments before calling it because our column presented a String list of numeric values.
Once we did everything necessary to make possible our computing, we obtain :
* Versicolor median = 2.8
* Virginica median = 3.0
```{r}
difference_in_medians(iris, "Sepal.Width", "Species", "versicolor", "virginica")
difference_in_medians(iris, "Sepal.Width", "Species", "virginica", "virginica")
```
###Question 3b: writing a more generic function of the randomization (i.e., permutation) function
This time we work on columns too but instead of using values in columns to obtain a result, we take all the column dataset and randomly replace them.
```{r}
iris$Sepal.Width[1:10]
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
randomize(iris, "Sepal.Width")$Sepal.Width[1:10]
randomize(iris, "Species")$Species[1:10]
randomize(iris, "Species")$Sepal.Width[1:10]
set.seed(previous_seed)
```
###Question 3c: a function to get a statistic for multiple permutations
Next functions does a permutation for two groups of our dataset, we use here all others functions we wrote. This time it took a little more for me to understand how to familiarize with parameters, how to place them and variables inside the main function but it finally works:
```{r}
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_1 <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
"virginica", difference_in_medians,
n_samples=10)
ptest_2 <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
"virginica", difference_in_medians,
n_samples=10)
ptest_3 <- permutation_twogroups(randomize(iris, "Sepal.Width"), "Sepal.Width",
"Species", "versicolor", "virginica",
difference_in_medians, n_samples=10)
set.seed(previous_seed)
print(ptest_1)
print(ptest_2)
print(ptest_3)
print(ptest_3$observed)
print(ptest_3[["observed"]])
```
###Question 3d
So here we discuss, just a bit, the change we obtain if we take in account `var` or `grouping_var` for the permutation test inside our function explained in question 3c.
There is no better way to understand a result change than to see it, so here it is:
```
print(ptest_1)
## $observed
## [1] -0.2
##
## $permuted
## [1] 0.00 0.10 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.05
print(ptest_2)
## $observed
## [1] -0.2
##
## $permuted
## [1] 0.10 0.00 0.00 0.00 0.00 0.00 -0.05 -0.10 0.05 0.10
print(ptest_3)
## $observed
## [1] 0
##
## $permuted
## [1] 0.10 0.00 0.00 -0.10 -0.10 0.10 0.00 0.00 0.10 0.05
print(ptest_3$observed)
## [1] 0
print(ptest_3[["observed"]])
## [1] 0
```
If the last two results remain the same, mostly beacause they acces a value which doesn't change, the `$observed`result, we can see a great change in all `$permuted` results. This is preatty understandable if we think to the difference between the values we access using the two variables we consider or `grouping_var`.
- `var` give us the access to values contained in the "Sepal.Width" column in our case;
- `grouping_var` is associated to the "Species" column.
###Question 3e: plotting the sampling distribution
```{r echo=F, cache=T}
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
"virginica", difference_in_medians)
set.seed(previous_seed)
```
```{r echo=F}
ptest_p <- tibble::as_tibble(ptest["permuted"])
ptest_o <- tibble::as_tibble(ptest["observed"])
plot_line_at_observed(ptest_p, ptest_o, "Permuted data plot")
```
We now observe the histogram of data we permuted earlier. We add a line to emphasise the value we also observed on our data.
Mostly we can say that the data has a sort of symmetric distribution basically around the value 0.0. The value we observe corresponding to -0.2 it seems really difficult for the two groups we take in account to be similar.
###Question 3f: test statistics
```{r echo=F, cache=T}
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_new <- permutation_twogroups(iris, "Sepal.Width", "Species", "versicolor",
"virginica", new_test_statistic)
set.seed(previous_seed)
```
```{r echo=F}
ptest_new_p <- tibble::as_tibble(ptest_new["permuted"])
ptest_new_o <- tibble::as_tibble(ptest_new["observed"])
plot_line_at_observed(ptest_new_p, ptest_new_o, "Permuted data plot")
```
Here we will test a new kind of statistic. We'll take in account just the values corresponding to intersection of the two groups in the dataset we analize. It would appear interesting to see how the common values change from a subset to another, this could be lead us to consider some more fundamental values which for example remain the same in all permutations or can reveal which permutation lose every link to values with the preceeding ones.
In our case for exemple we see that value 0 is the one which doesn't change in the new set we take.
###Question 3g: calculating p-values
```{r}
permutation_pvalue_right(ptest)
permutation_pvalue_left(ptest)
```
- **Hypothesis A.** The two groups are drawn from equivalent sources of variability if we get a *p-value* "extreme" (a really small fraction) in positive or negative direction.
- **Hypothesis B.** Our obtained *p-values* are not "extreme" or just in a certain direction if the two groups are drawn from different sources of variability.
Let's take the example of the surface form of the vowel "e" in italian, it corresponds to the phonemes /e/ and /ɛ/, we'd expect that all native speakers (taken from North and South regions) can make the two different realisations without problems because they are just two underlying forms in the phonological system of this language.
We'd predict that the difference in the medians statistic would be extreme on the right side if our *Hypothesis B* were that our dataset presents native speakers who makes the realisation of just one phoneme not or never realising the other for the vowel "e". In our *Hypothesis A* we would expect a normal distribution of the two realisations.
In which case, we would use the function `permutation_pvalue_left` or
`permutation_pvalue_right` ?
If we assign two different median values to the two phonemes, based on their features or their distribution in language, we would consider the left value for a greater realisation of a phoneme, we say the /e/ one and the right one for the phoneme /ɛ/.
###Question 3h
```{r echo=F, cache=T}
#subset1
iris_subset_1 <- iris[c(89:94, 108:112),]
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_sub1 <- permutation_twogroups(iris_subset_1, "Sepal.Width", "Species",
"versicolor", "virginica",
difference_in_medians)
set.seed(previous_seed)
ptest_sub1_p <- tibble::as_tibble(ptest_sub1["permuted"])
ptest_sub1_o <- tibble::as_tibble(ptest_sub1["observed"])
#subset2
iris_subset_2 <- iris[88:114,]
if(!exists(".Random.seed")) set.seed(NULL)
previous_seed <- .Random.seed
set.seed(1)
ptest_sub2 <- permutation_twogroups(iris_subset_2, "Sepal.Width", "Species",
"versicolor", "virginica",
difference_in_medians)
set.seed(previous_seed)
ptest_sub2_p <- tibble::as_tibble(ptest_sub2["permuted"])
ptest_sub2_o <- tibble::as_tibble(ptest_sub2["observed"])
```
We'll now create two new subsets of our data and proceed as earlier in permutation, we take an observed value and we plot everything in an histogram in order to discuss later. We will show every observed value in every histogram, thanks to our improved plot function:
```{r echo=F}
plot_line_at_observed(ptest_sub1_p, ptest_o,
"Subset 1 data plot",
ptest_sub1_o, ptest_sub2_o)
```
For this first subset the observed line is the first appearing on the plot, the brown line at -0.3, the median difference look greater here than on the other two datasets. We can also notice a more harmonic distribution of values compared to the first permuted set we observed earlier, here we don't have any symmetric distribution but great populations of values in different places.
```{r echo=F}
plot_line_at_observed(ptest_sub2_p, ptest_o,
"Subset 2 data plot",
ptest_sub1_o, ptest_sub2_o)
```
Our second subset presents once again a different distribution both in median observation and in vlues themselves. We can see a slightly return of symmetrical representation with increasing values from (almost) -0.4 to almost (-0.2), a great decreasing count and an increaed one, we see the inverse path from little more than 0 to the last values.
Median value here is between -0.2 and -0.3, perfectly between the two preceeding subsets. As the original dataset is the same for the three examples, we can firmly assume that our three permutation really produced something interesting to observe, three subsets completely different which can lead us to confirm that the original set is composed of data which basically differ in characters and values (species, sepal width, etc.) from one another.