---
title: "Assignment 1"
author: "Gabriele Chignoli"
date: "9/28/2017"
output: html_document
---

## Exercise 1

**RStudio** is a coding environment that gives you access to the full potential of **R**.  
Thanks to its **Git** integration, all your work can be shared and used for collaboration.

## Exercise 2

```{r}
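# The six values a data point can take, and the probability of each
possible_outcomes <- c(0, 1, 2, 3, 4, 5)
outcome_probabilities <- c(0.1, 0.5, 0.2, 0.1, 0.05, 0.05)
n_data_points <- 400

# Fix the seed so the "random" sample is reproducible
set.seed(1)
fake_data_points <- sample(possible_outcomes,
                           n_data_points,
                           replace = TRUE,
                           prob = outcome_probabilities)
# Re-initialise the generator so later chunks are not tied to this seed
set.seed(NULL)

# tibble::tibble() replaces the deprecated tibble::data_frame()
fake_data_set <- tibble::tibble(`Fake measurement` = fake_data_points)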
```

*Explanation of the statements*  
    I once read about some **R** basics on a hot summer day two years ago, while I was deciding which programming language to study. I ended up learning some Swift instead, so what I remember from that short reading is the main basis for my answers:

* **Statement 1.** `possible_outcomes` is the first variable created in this code chunk. The `<-` operator after a variable name assigns to it the values on its right (in a vector, all values must be of the same type).
    + Unlike a scalar variable in many programming languages, a variable in **R** can hold several values at once, and an operation on the variable is applied to each of its values;
    + Such variables are *vectors*, built from the values passed to `c()`. We can access a specific value by position, as in an ordinary array (the index starts at 1): `possible_outcomes[3]` returns `[1] 2`;
    + When we *"print"* a variable in the *RStudio* console, a number in brackets also appears. It gives the position in the vector of the first value shown on that line;  
  
* **Statement 2.** `outcome_probabilities`: like the first one, this variable holds six numerical values, but this time they are decimal numbers rather than whole numbers. From its name we can guess that this vector gives the probabilities that will be applied to the outcomes defined earlier.

* **Statement 3.** `n_data_points`: this time the variable holds a single whole number. Its value probably indicates how many data points will be drawn from our outcomes.

* **Statement 4.** `set.seed()`: the help in **R** files this function under random number generation, and my first doubt came from the word *seed*. A seed doesn't normally grow randomly; we usually know what it will grow into.
Starting from this I did some research[^1] and found some interesting facts:  

    > The seed number you choose is the starting point used in the generation of a sequence of random numbers, which is why (provided you use the same pseudo-random number generator) you'll obtain the same results given the same seed number.[^2]  

    + In fact the integer argument is the starting point, and the same seed always produces the same result, as in the second example in this reference[^3];
    + Calling the function again with a `NULL` argument (as a few lines further down in the chunk) resets it, so that we don't keep obtaining the same "random" numbers.

* **Statement 5.** Here we use the `sample()` function (I remembered what it returns but not how to specify its arguments). Basically, it takes the values given in its first argument and returns a sample of them of a given size. For the arguments, I found the **RStudio** help useful enough.
    1. The set of values to sample from = our *statement 1*;
    2. The size of the resulting sample = our *statement 3*;
    3. A Boolean value `replace` indicating whether a value may be drawn more than once. In our case it has to be `TRUE`, because the requested size (400) is larger than the number of available values (6);
    4. A vector of probability weights = our *statement 2*;  
    
* **Statement 6.** In our last statement the most important element is the package `tibble`. It was my first obstacle in this assignment, with the message:  
    
    > Line 15 Error in loadNamespace(name) : there is no package called 'tibble' Calls: <Anonymous> ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous> Execution halted
    
    I asked a classmate for help and she told me how to install this package. I then did some research to understand how it works and why running `fake_data_set` returns the line `# A tibble: 400 x 1`, followed by 10 rows each holding a number from 0 to 5 (our possible outcomes) and the note `# ... with 390 more rows`.  
    With *help("tibble")* the topic shown is **Build a list**, which basically explains how the *list()* and *data.frame()* functions can be used to hold data. Searching further on the web, I found out that the *tibble* package derives in a way from *data.frame()*, being a modern reimagining[^4] of it.
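
Statements 1 to 5 can be tried out in a few lines. This is only a minimal sketch: the names `third_outcome`, `doubled_outcomes`, `first_draw`, `second_draw` and `oversample_fails` are introduced here for illustration and are not part of the assignment code.

```{r}
# Same vector as in the chunk above
possible_outcomes <- c(0, 1, 2, 3, 4, 5)

# Indexing starts at 1, so this picks the third value of the vector
third_outcome <- possible_outcomes[3]

# An operation on a vector is applied to each of its values
doubled_outcomes <- possible_outcomes * 2

# Setting the same seed twice gives the same "random" draws twice
set.seed(1)
first_draw <- sample(possible_outcomes, 10, replace = TRUE)
set.seed(1)
second_draw <- sample(possible_outcomes, 10, replace = TRUE)
identical(first_draw, second_draw)

# Asking for more values than the vector holds only works with replace = TRUE;
# with replace = FALSE, sample() stops with an error, caught here by try()
oversample_fails <- inherits(
  try(sample(possible_outcomes, 400, replace = FALSE), silent = TRUE),
  "try-error"
)
oversample_fails

# Re-initialise the generator, as in the assignment chunk
set.seed(NULL)
```

Wrapping the failing call in `try()` keeps the document knitting even though the call itself raises an error.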


```{r, echo=FALSE}
ggplot2::ggplot(fake_data_set, ggplot2::aes(x=`Fake measurement`)) +
  ggplot2::geom_histogram(bins=5, colour="black", fill="lightgrey")
```
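
One way to check what the `prob` argument of `sample()` does is to compare the share of each outcome in the sample with the probability it was given. A minimal sketch, assuming the same seed and values as in the first chunk; `draws`, `observed_share` and `comparison` are names chosen for this illustration only.

```{r}
possible_outcomes <- c(0, 1, 2, 3, 4, 5)
outcome_probabilities <- c(0.1, 0.5, 0.2, 0.1, 0.05, 0.05)

set.seed(1)
draws <- sample(possible_outcomes, 400, replace = TRUE,
                prob = outcome_probabilities)
set.seed(NULL)

# factor() with explicit levels keeps an outcome even if it was never drawn
observed_share <- prop.table(table(factor(draws, levels = possible_outcomes)))

# Expected probability next to the observed share, one row per outcome
comparison <- data.frame(
  outcome = possible_outcomes,
  expected = outcome_probabilities,
  observed = as.numeric(observed_share)
)
comparison
```

With 400 draws the observed shares should be close to the expected ones without matching them exactly.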

## Exercise 3 {#data_list}

**Question 3a.** The two histograms here show a subset of the data from the Iris study. We have data for two of the three species, with 50 samples each. The vertical axis shows the number of irises with a given sepal width, while the horizontal axis shows the sepal width itself, from 2 cm (the minimum sepal width for versicolor) to 3.8 cm (the maximum sepal width for virginica).  

We have for *versicolor*:  
- 2cm on 1 sample;  
- 2.2cm on 2 samples;  
- 2.3cm on 3 samples;  
- 2.4cm on 3 samples;  
- 2.5cm on 4 samples;  
- 2.6cm on 3 samples;  
- 2.7cm on 5 samples;  
- 2.8cm on 6 samples;  
- 2.9cm on 7 samples;  
- 3cm on 8 samples;  
- 3.1cm on 3 samples;  
- 3.2cm on 3 samples;  
- 3.3cm on 1 sample;  
- 3.4cm on 1 sample;  
  
For *virginica*:  
- 2.2cm on 1 sample;  
- 2.5cm on 4 samples;  
- 2.6cm on 2 samples;  
- 2.7cm on 4 samples;  
- 2.8cm on 8 samples;  
- 2.9cm on 2 samples;  
- 3cm on 12 samples;  
- 3.1cm on 4 samples;  
- 3.2cm on 5 samples;  
- 3.3cm on 3 samples;  
- 3.4cm on 2 samples;  
- 3.6cm on 1 sample;  
- 3.8cm on 2 samples;  
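
The counts listed above can also be obtained directly from the built-in `iris` data set instead of being read off the histograms. A minimal sketch in base R, where `iris_two` and `width_counts` are names introduced only for this illustration:

```{r}
# Keep the two species of interest and drop the unused "setosa" level
iris_two <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# One row per sepal width, one column per species
width_counts <- table(Sepal.Width = iris_two$Sepal.Width,
                      Species = iris_two$Species)
width_counts

# Each species should add up to its 50 samples
colSums(width_counts)
```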

```{r, echo=FALSE}
iris_groups23 <- dplyr::filter(iris, Species %in% c("versicolor", "virginica"))
ggplot2::ggplot(iris_groups23, ggplot2::aes(x=Sepal.Width)) +
  ggplot2::geom_histogram(colour="black", fill="lightgrey", binwidth=0.1) +
  ggplot2::facet_grid(Species ~ .)
```

This is my drawing for the *versicolor* subset:

```{r, out.width='30%', echo=FALSE}
knitr::include_graphics("exo3a.jpg")
```

**Question 3b.** Here is the histogram with both groups pooled together. We can confirm that it contains all the data from the other two thanks to the [data list](#data_list) at the beginning. Across all 100 samples we have:  

- 2cm for 1 sample;  
- 2.2cm for 3 samples;  
- 2.3cm for 3 samples;  
- 2.4cm for 3 samples;  
- 2.5cm for 8 samples;  
- 2.6cm for 5 samples;  
- 2.7cm for 9 samples;  
- 2.8cm for 14 samples;  
- 2.9cm for 9 samples;  
- 3cm for 20 samples;  
- 3.1cm for 7 samples;  
- 3.2cm for 8 samples;  
- 3.3cm for 4 samples;  
- 3.4cm for 3 samples;  
- 3.6cm for 1 sample;  
- 3.8cm for 2 samples;  
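
The claim that the pooled histogram contains exactly the data of the two separate ones can be checked by adding up the per-species counts. A minimal sketch in base R; `iris_two`, `pooled_counts` and `per_species` are names introduced for this illustration.

```{r}
iris_two <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# Counts per sepal width, ignoring the species
pooled_counts <- table(iris_two$Sepal.Width)

# Counts per sepal width (rows) and species (columns)
per_species <- table(iris_two$Sepal.Width, iris_two$Species)

# The pooled count of each width equals the sum over both species
all(pooled_counts == rowSums(per_species))
sum(pooled_counts)
```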

```{r, echo=FALSE}
iris_groups23 <- dplyr::filter(iris, Species %in% c("versicolor", "virginica"))
ggplot2::ggplot(iris_groups23, ggplot2::aes(x=Sepal.Width)) +
  ggplot2::geom_histogram(colour="black", fill="lightgrey", binwidth=0.1)
```

**Question 3c-d.**

Now that we have described the histograms, we can start to formulate hypotheses and discuss what our sepal width measurements really mean:

* **Hypothesis A**: The virginica and versicolor iris species are the same in terms of sepal width.  
    Given our three histograms and our 100 measurements, we can consider these two species as one large set. For versicolor, sepal width varies from 2 cm to 3.4 cm, with most samples between 2.7 cm and 3 cm, while for virginica it varies from 2.2 cm to 3.8 cm, with most samples at 2.8 cm and 3 cm. Looking only at where most of the data lies, we can assume that both species mostly fall between 2.7 cm and 3 cm. The single histogram showing all measurements reveals this core of sepal widths for the irises, and the two sets of measurements also appear to complete one common pattern, each filling the gaps of the other.  
    We can also consider that both species have a similar average and share some isolated larger or smaller measurements. Two completely different species would be expected to have developed features of their own, so on sepal width alone we cannot firmly assert that virginica and versicolor are two different species.
    
* **Hypothesis B**: The virginica and versicolor iris species are different in terms of sepal width.  
    Now, why would we assume that virginica and versicolor really are two different species with different features?  
    The first reason is the contrast between the two histograms. Versicolor looks like the more regular species: its counts rise gradually with sepal width up to 3 cm, and the few wider samples are not numerous enough to matter. Virginica has two large peaks at 2.8 cm and 3 cm, and all its other measurements have small to medium counts.  
    Another fact that could help distinguish versicolor from virginica is the minimum and maximum values the two species can reach. I'm going to make a cross-domain comparison to explain why I find this hypothesis more realistic. In phonetics, we can use the F2 values of different vowels to determine a boundary between the plosives that follow them (I mostly took the idea from another class[^5]). F2 mostly defines a starting and an ending point where the plosive segment changes. In the same way, a feature shared by all the samples we analyse and apparently similar on average (F2 in the phonetics example, sepal width here for irises) could help determine the starting and ending points of a specific segment in order to distinguish it from another.
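
Both hypotheses lean on the minimum, the maximum and the average sepal width of each species, which can be computed rather than estimated from the plots. A minimal sketch in base R; `iris_two` and `width_summary` are names introduced for this illustration.

```{r}
iris_two <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# One column per species, one row per summary statistic
width_summary <- sapply(
  split(iris_two$Sepal.Width, iris_two$Species),
  function(widths) c(min = min(widths), mean = mean(widths), max = max(widths))
)
width_summary
```

These numbers only describe the samples; they do not decide between the two hypotheses on their own.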

## Exercise 4
The following bar plots show how many times the word **permit**, as a *noun* or as a *verb*, was marked as having the stress on the first or on the second syllable. The samples are taken from a large collection of English dictionaries, with 46 for each category.  

```{r, echo=FALSE}
ggplot2::ggplot(stressshift::stress_shift_permit,
                ggplot2::aes(x=Category, fill=Syllable)) +
  ggplot2::geom_bar(position="dodge", colour="black") + 
  ggplot2::scale_fill_brewer(palette="Set3")
```

Here is the merged data:

```{r, echo=FALSE}
ggplot2::ggplot(stressshift::stress_shift_permit, ggplot2::aes(x=0, fill=Syllable)) +
  ggplot2::geom_bar(position="dodge", colour="black") + 
  ggplot2::scale_fill_brewer(palette="Set3") +
  ggplot2::xlab("") +
  ggplot2::theme(axis.text.x=ggplot2::element_blank(),
                 axis.ticks.x=ggplot2::element_blank()) +
  ggplot2::xlim(c(-1,1))
```

We are now going to discuss two hypotheses, as we did before with the iris example:

* **Hypothesis A**: **Permit (noun)** and **permit (verb)** are the same in terms of their stress.  
    To support this first assumption we could start from a phonological point of view. We state that the word permit /pɜrmɪt/ can be realised on the surface with the stress on either the first or the second syllable. As the merged data show, stress on the second syllable is more common but not really dominant. For hypothesis A to prevail, the underlying form of this word would have to be free of stress constraints, allowing the stress to be realised more or less at random. Looking at the separate plots, however, we notice that each of the two word categories has a dominant stress realisation.  
    
* **Hypothesis B**: **Permit (noun)** and **permit (verb)** are different in terms of their stress.  
    The last idea from hypothesis A is the starting point for assuming that hypothesis B is true. Looking at the bar plots for the stress realisation of permit as a noun and as a verb separately, we can firmly say that the verb is marked by stress on syllable 2 (stress on syllable 1 is attested only once), while the noun is mostly stressed on syllable 1. Why is there such a clear difference between two words that are nearly the same in several respects (semantic, morphological, phonological)? The explanation will not take semantic or morphological elements into account, because they do not produce the variation we want to focus on.  
    We can consider syntactic and phonological variation to be strongly linked in our case. When the segment /pɜrmɪt/ changes its syntactic status from noun to verb, it also moves its stress from the first to the second syllable. Doing some research, I found out that most two-syllable verbs in English are stressed on syllable 2 and most two-syllable nouns on syllable 1[^6]. In my view there could also be a prosodic reason for the verb to be stressed on syllable 2: it could be a kind of inheritance from the infinitive form `to *verb`, where the stress mostly goes to the last syllable because of the clitic element at the beginning.
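
The bar plots can be backed by a cross-tabulation of word category against stressed syllable. A minimal sketch: `stress_by_category()` is a hypothetical helper written for this illustration, and it assumes only the `Category` and `Syllable` columns already used in the plotting code above.

```{r}
# Hypothetical helper: counts per word category (rows) and stressed syllable (columns)
stress_by_category <- function(stress_data) {
  table(Category = stress_data$Category, Syllable = stress_data$Syllable)
}

# Only runs where the stressshift package is installed
if (requireNamespace("stressshift", quietly = TRUE)) {
  permit_counts <- stress_by_category(stressshift::stress_shift_permit)
  print(permit_counts)
  # Share of each stress pattern within a category (each row sums to 1)
  print(prop.table(permit_counts, margin = 1))
}
```

Under hypothesis A the two rows of shares should look alike; under hypothesis B they should differ clearly.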

## Exercise 5
```{r, echo=FALSE}
library(magrittr)
set.seed(1)
ver_balanced <- languageR::ver %>%
  dplyr::group_by(SemanticClass) %>%
  dplyr::sample_n(198)
set.seed(NULL)
ggplot2::ggplot(ver_balanced, ggplot2::aes(x=Frequency)) +
  ggplot2::geom_histogram(fill="lightgrey", colour="black", binwidth=250) +
  ggplot2::facet_grid(SemanticClass ~ .)
```

**Question 5a.** We now face a histogram of frequency counts from a Dutch corpus, more precisely the frequency of verbs with the prefix *ver-* (it is close to the English verb prefix *for-*). The aspect we want to analyse here is the semantic transparency of ver- verbs. With 198 samples in each group, we distinguish between semantically transparent verbs, whose meaning can be recognized from their surface structure and morphological elements[^7], and opaque verbs, whose morphology does not help much in recognizing the meaning[^8]. 

**Question 5b-c.** We discuss the hypotheses as usual:  

* **Hypothesis A**: Semantically transparent and opaque ver- verbs are the same in terms of their frequency.  
    For this hypothesis to hold, we would expect the plots for transparent and opaque verbs to look pretty much the same. Either the opaque verbs would show many more counts at low frequencies, with the high-frequency area completely empty, or the transparent verbs would spread their counts out instead of being attested only in the low-frequency zone.  
    
* **Hypothesis B**: Semantically transparent and opaque ver- verbs are different in terms of their frequency.  
    As we can see, the two histograms aren't really similar. For transparent verbs, low-frequency counts make up a large majority of the data, with the first bar higher than 150 counts. Since transparent ver- verbs do not reach high frequency counts, we can assume they play a marginal role, while the distribution of the opaque verbs shows a slightly stronger presence in our corpus. This suggests that ver- verbs in Dutch don't really link their morphology to their semantics.  
    In my view this hypothesis is more interesting and closer to reality, not only because the histograms show a slight difference in frequency between the two kinds of verbs we analyse, but also because, taking the *ver-* verb prefix as a rough counterpart of the English verb prefix *for-*, we can use the English vocabulary as a reference, and in my personal overall impression opaque verbs are much more frequent than transparent ones with this construction.
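
The difference between the two distributions can also be summarised with a couple of numbers per semantic class. A minimal sketch: `frequency_summary()` is a hypothetical helper written for this illustration, its default `cutoff` of 250 is an arbitrary choice equal to the bin width of the histogram, and it is applied to the full `languageR::ver` data rather than to the balanced sample plotted above.

```{r}
# Hypothetical helper: per semantic class, the median frequency and the
# share of verbs whose frequency lies below the cutoff
frequency_summary <- function(verbs, cutoff = 250) {
  medians <- tapply(verbs$Frequency, verbs$SemanticClass, median)
  shares <- tapply(verbs$Frequency < cutoff, verbs$SemanticClass, mean)
  data.frame(
    semantic_class = names(medians),
    median_frequency = as.vector(medians),
    share_below_cutoff = as.vector(shares)
  )
}

# Only runs where the languageR package is installed
if (requireNamespace("languageR", quietly = TRUE)) {
  print(frequency_summary(languageR::ver))
}
```

The median is used instead of the mean because a few very frequent verbs would otherwise dominate the summary.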

<!--references-->
[^1]: <http://www.talkstats.com/showthread.php/16833-the-function-set.seed()-in-R>
[^2]: <https://stats.stackexchange.com/questions/86285/random-number-set-seedn-in-r>
[^3]: <https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function>
[^4]: <https://blog.rstudio.com/2016/03/24/tibble-1-0-0/>
[^5]: *last four slides* <http://jacqueline.vaissiere.ilpga.fr/COURS2017/COURS1.pdf>
[^6]: <https://www.englishclub.com/pronunciation/word-stress-rules.htm>
[^7]: <https://www.thoughtco.com/semantic-transparency-1691939>
[^8]: <https://link.springer.com/article/10.3758/s13421-014-0430-1?no-access=true>