-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment_B4.Rmd
229 lines (184 loc) · 6.44 KB
/
Assignment_B4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
---
title: "Assignment_B4"
author: "Jérôme Plumier"
date: "2023-12-07"
output: github_document
---
# Option A - Strings and functional programming in R
For the completion of the fourth assignment for STAT545B, option A was chosen.
To complete option A, exercise 1 and exercise 2 will be done.
First, let's load some libraries we will use.
```{r}
library(tidyverse)
library(dplyr)
library(tidytext)
library(ggplot2)
library(stringr)
library(roxygen2)
library(testthat)
```
## Exercise 1 - Most common words (37.5 points)
For the completion of exercise 1, we shall use the Jane Austen books, using the
`janeaustenr` and `stopwords`package.
```{r}
library(janeaustenr)
library(stopwords)
```
Now, we create our new dataframe `books_words` that contains Jane Austen's
6 books.
```{r}
books<-austen_books() # Place all 6 books in a books dataframe
# Make a new dataframe where each observation is a single word
books_words <- unnest_tokens(books, # Select our Jane Austen book dataframe
word, # Name the output as words
text, # Input is the variable named text
token = "words", # Name the output words
to_lower = TRUE, # Make all words lower case
drop = TRUE) # Drop the original text variable
head(books_words)
```
This permits us to create another dataframe named `stopwords_austen` with all
the English stopwords provided by the `stopwords`package removed.
We can see, comparing with the previous `head()` function results,
that the stopwords were removed.
```{r}
stopwords_austen <- books_words %>%
filter(!word %in% stopwords("en"))
head(stopwords_austen)
```
To obtain our finalised dataframe, we now want to add the count of each words.
```{r}
count_austen <- stopwords_austen %>%
group_by(book, word) %>%
summarise(.groups="keep", n=n())
tail(count_austen)
```
We can now plot the most common words. For the sake of legibility, we will plot
only the ten most common words. We will use as an example all six books first:
```{r}
plot_austen <- ggplot(data = (count_austen %>%
arrange(desc(n)) %>% head(n=10))) +
geom_bar(aes(y=word, x=n, fill = word), stat= "identity") +
theme_bw() +
theme(legend.position="none") +
ylab("Ten most common words") +
xlab("Number of word appearances")+
ggtitle("All six Jane Austen books")
plot_austen
```
And as another example, here we looked at the ten most common words in the
`Sense & Sensibility` book:
```{r}
plot_austen_SandS <- ggplot(data = (count_austen %>%
filter(book =="Sense & Sensibility") %>%
arrange(desc(n)) %>% head(n=10))) +
geom_bar(aes(y=word, x=n, fill = word), stat= "identity") +
theme_bw() +
theme(legend.position="none") +
ylab("Ten most common words") +
xlab("Number of word appearances")+
ggtitle("Sense & Sensibility")
plot_austen_SandS
```
## Exercise 2 - Pig Latin function (37.5 points)
This section will be further divided into three subsections.
### Exercise 2.1: Make a function
Let's first create a function that will convert words into my own Pig Latin
version, and document it using `roxygen2`.
It has a rearrangement component:
For words that end with a vowel, that vowel is duplicated at the beginning of
the word sequence. In all other cases, the end of a word sequence from its
last vowel onward is placed at the beginning of the word sequence.
It has an addition component:
All words receive a "o" that follows the new added beginning of the word
sequence, and a "y" that comes at the new end of the word sequence.
```{R}
#' @title Transform a singular word into its pig Latin equivalent
#'
#' @description Given a string of letters, this function will return a pig Latin
#' word sequence based on personal rules: For words that end with a vowel, that
#' vowel is duplicated at the beginning of the word sequence. In all other
#' cases, the end of a word sequence from its last vowel onward is placed at
#' the beginning of the word sequence. Moreover, all words receive a "o" that
#' follows the new added beginning of the word sequence, and a "y" that comes
#' at the end of the new word sequence.
#'
#' @param input, a string. It must be composed solely out of letters, and
#' contain both at least one vowel and one consonant.
#'
#' @return a new word sequence with only lower case letters. It is the
#' inputed string modified by personal pig Latin rules.
#'
#' @export
pigLatin <- function(input) {
if(!(is.character(input))) {
stop('The word sequence needs to be inputed as a string.')}
if(!grepl("^[A-Za-z]+$", input)) {
stop('The word sequence needs to be made out of letters only (no space).')}
input<-tolower(input)
vowel_position <- grep("[aeiouy]", strsplit (input, NULL)[[1]])
if(length(vowel_position)==0){
stop('The word sequence contains no vowel.')}
consonant_check <- grep("[zrtpqsdfghjklmwxcvbn]", strsplit (input, NULL)[[1]])
if(length(consonant_check)==0){
stop('The word sequence contains only vowels.')}
vowel_position <- tail(vowel_position, 1)
if(str_sub(input, start=-1)=="vowel_position"){
str_c(str_sub(input, start = vowel_position), "o", str_sub(input, start = 1L, end = 1L), "y")
}else{
str_c(str_sub(input, start = vowel_position), "o", str_sub(input, start = 1L, end = vowel_position-1), "y")
}
}
```
### Exercise 2.2: Function examples
Here is an example of using the function to transform the word`hello`, which
ends with a vowel:
```{R}
test1 = "hello"
pigLatin(test1)
```
Here is an example of using the function to transform the word`transform`, which
ends with a consonant:
```{R}
test2 = "transform"
pigLatin(test2)
```
Here is an example of using the function to transform the word`HAMBURGER`, with
only capital letters:
```{R}
test3 = "HAMBURGER"
pigLatin(test3)
```
### Exercise 2.3: Function tests
For this exercise, I tested my function using the `expect_error()` function
from the `testthat` package with non-redundant inputs. They all passed:
```{R}
test_that('Not a string', {
x=c("a", "b", "c")
expect_error(pigLatin(x))
})
```
```{R}
test_that('Only one word', {
x="Hello world"
expect_error(pigLatin(x))
})
```
```{R}
test_that('The string is only made out of letters', {
x="Hello1"
expect_error(pigLatin(x))
})
```
```{R}
test_that('The string has at least one consonant', {
x="aeouiy"
expect_error(pigLatin(x))
})
```
```{R}
test_that('The string has at least one vowel', {
x="zrtpqsdfg"
expect_error(pigLatin(x))
})
```