---
title: "Sentiment Analysis of Bernie Sanders Tweets"
date: "`r Sys.Date()`"
author: Nils Indreiten
output:
  rmdformats::readthedown:
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: false
    highlight: tango
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE, warning=FALSE}
pacman::p_load(twitteR, ROAuth, tidyverse, tm, SnowballC, topicmodels,
               sentiment, data.table, syuzhet, plotly, gridGraphics)
```
# Project Goal
The goal of this project is to perform basic text processing on
tweets. The data was obtained using the [Twitter
API](https://developer.twitter.com/en), which allowed the Bernie
Sanders tweets to be retrieved from the
[BernieSanders](https://twitter.com/BernieSanders) Twitter account.
## Twitter API
Once you have a Twitter developer account, you can create a Twitter
app to generate the API keys and access tokens. To retrieve the tweets
you need the following:
- Consumer Key
- Consumer Secret
- Access Token
- Access Token Secret
All of these can be found once you have created an app in the Twitter
Developer portal. Once you have them, you can assign them to R objects
as below and authorize them using setup_twitter_oauth():
```{r, eval=FALSE}
library(twitteR)
library(ROAuth)
consumer_key <-'your consumer key'
consumer_secret <- 'your consumer secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
access_secret)
```
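Hard-coding credentials in a script is risky. A safer pattern is to
keep them in environment variables (for example in your .Renviron
file) and read them at run time; a minimal sketch, where the variable
names are placeholders:
```{r, eval=FALSE}
# Read the keys from environment variables instead of hard-coding them.
# The names below are hypothetical; use whatever you set in .Renviron.
consumer_key    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token    <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret   <- Sys.getenv("TWITTER_ACCESS_SECRET")
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
                    access_secret)
```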
The twitteR package has parameters that enable us to retrieve data
from a specific user. To do so we have to specify the Twitter ID and
the number of tweets; in this case we specify 'BernieSanders' and 583
tweets, covering 2020-10-19 to 2021-08-17 (Twitter policy changes may
affect the number of tweets you can pull):
```{r, eval=FALSE}
twitter_user <- 'BernieSanders'
twitter_max <- 583
```
Next we can use the userTimeline() function to download the timeline
(tweets) of the specified twitter_user. The function also lets us
decide whether to retrieve retweets and replies by setting the
relevant parameters; consult the help documentation of the twitteR
package for more details.
```{r, eval=FALSE}
tweets <- userTimeline(twitter_user, n = twitter_max, includeRts = FALSE)
# Get the number of tweets pulled:
length(tweets)
# Convert tweets to a data frame:
tweets.df <- twListToDF(tweets)
# Save the downloaded tweets to a csv file for future use
file_name = "Bernie_tweets.csv"
write.csv(tweets.df, file = file_name)
```
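If you also want to drop replies, userTimeline() takes an
excludeReplies argument alongside includeRts; a sketch of that
variant:
```{r, eval=FALSE}
# Timeline without retweets and without replies
tweets <- userTimeline(twitter_user, n = twitter_max,
                       includeRts = FALSE, excludeReplies = TRUE)
```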
Once you have retrieved the tweets to your local environment or saved
them as a csv file, you are ready to start exploring and pre-processing
the data.
# Data Pre-processing
Load the Bernie Sanders tweets you previously saved:
```{r}
tweets.df <- read.csv('Bernie_tweets.csv', stringsAsFactors = FALSE)
```
Let's take a quick look at some of the tweets; for instance, say we
are interested in the 150th tweet:
```{r, eval=FALSE}
# Tweet number 150
display_n <- 150
tweets.df[display_n, c("text")]
```
```{r, include=FALSE}
# Tweet number 150
display_n <- 150
tweets.df[display_n, c("text")]
```
> "The way to rebuild the crumbling middle class in this country is by
> growing the trade union movement."
Next, in order to work with the text data properly, we have to create
a corpus using the Corpus() function from the tm package. It allows us
to specify the source to be a character vector; we assign the result
to a new object, myCorpus:
```{r}
myCorpus_raw <- Corpus(VectorSource(tweets.df$text))
myCorpus <- myCorpus_raw
myCorpus
```
Using the lapply function we can index the corpus object, returning the
first three tweets:
```{r}
lapply(myCorpus[1:3], as.character)
```
Now we would like to remove non-graphical characters, URLs, and
non-English words.
## Non-graphical characters
To replace non-graphical characters we will define an operation that
substitutes them with a space. We will name it toSpace, built from the
gsub function with content_transformer() as a wrapper:
```{r}
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern," ",x))})
```
To apply this to all non-visible characters we will use
`[^[:graph:]]`, a regular expression matching everything that is not a
visible character ([follow this link for more information on regular
expressions](http://www.regular-expressions.info/posixbrackets.html)):
```{r, warning=FALSE}
myCorpus<- tm_map(myCorpus,toSpace,"[^[:graph:]]")
# Convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
```
## Removing URLs
To remove URLs we again use a POSIX character class: `[:space:]`
matches whitespace. The pattern matches 'http' followed by
`[^[:space:]]*` (non-space, zero or more times) as a way to identify a
URL. We will call this operation removeURL and apply it to our corpus
object using the tm_map function:
```{r,warning=FALSE}
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
```
## Removing Non-English words
To remove everything except letters and spaces, we take advantage of
the `[^[:alpha:][:space:]]*` regular expression, assigning this
operation to removeNumPunct and then applying it to our corpus as in
the previous step:
```{r, warning=FALSE}
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, stopwords())
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
```
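Tweets often contain tokens that survive this cleaning but carry
little meaning, such as "amp" (left over from the HTML entity &amp;
once punctuation is stripped). The stopword list can be extended with
your own terms; a small sketch, where the extra words are only
illustrative:
```{r, eval=FALSE}
# Remove additional, domain-specific stopwords (example terms only)
myStopwords <- c("amp", "will", "get", "just")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
```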
## Text stemming
Text stemming reduces words to their root form; the SnowballC package
allows us to do so:
```{r, warning=FALSE}
library("SnowballC")
myCorpus <- tm_map(myCorpus, stemDocument)
```
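Stemming can leave truncated forms (e.g. "peopl", "countri") in plots
and tables. If readability matters, tm's stemCompletion() can map each
stem back to the most frequent full word, using a copy of the corpus
taken *before* stemming as the dictionary. A sketch under that
assumption (myCorpusCopy is hypothetical and must be saved before
stemDocument is applied):
```{r, eval=FALSE}
# myCorpusCopy <- myCorpus  # run this before the stemDocument step above
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  paste(x, collapse = " ")
}
myCorpus <- Corpus(VectorSource(
  sapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)))
```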
## Term Document Matrix
Our final data preparation step is to build a term-document matrix, in
which each row is a word and each column a tweet:
```{r}
tdm <- TermDocumentMatrix(myCorpus,control = list(wordLengths = c(1, Inf)))
tdm
nrow(tdm) # number of words
ncol(tdm) # number of tweets
```
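Before moving on, it can help to peek at a corner of the matrix; tm's
inspect() prints a small slice, here the first five terms across the
first five tweets:
```{r, eval=FALSE}
# Look at a small slice of the term-document matrix
inspect(tdm[1:5, 1:5])
```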
We can inspect the frequent words by specifying a frequency threshold,
in this case 30:
```{r}
freq_thre = 30
# First few words:
head((freq.terms <- findFreqTerms(tdm, lowfreq = freq_thre)))
```
Next we want to calculate the word frequency using the rowSums
function, and filter the matrix to only include words that appear at
least as often as the specified threshold:
```{r}
# Calculate the word frequency
term.freq <- rowSums(as.matrix(tdm))
# Only keep frequencies of words (terms) that appeared at least freq_thre times
term.freq <- subset(term.freq, term.freq >= freq_thre)
```
We may wish to plot the word frequencies:
```{r, fig.cap="Term Frequency in Bernie Sanders Tweets"}
library(ggplot2)
# Plot the words (terms) that appeared at least freq_thre times
df <- data.frame(term = names(term.freq), freq = term.freq)
p1 <- ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme_light() + theme(axis.text = element_text(size = 7))
p1
```
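By default the bars are ordered alphabetically by term. If you would
rather rank them by frequency, you can reorder the factor inside
aes(); a small variant of the plot above:
```{r, eval=FALSE}
# Same plot, but with bars ordered by frequency
ggplot(df, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme_light() + theme(axis.text = element_text(size = 7))
```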
# Associations
We may want to see which words are associated with a specific word. We
can do so by specifying our word of interest and a correlation limit
when calling the findAssocs() function from the tm package:
```{r}
word <- 'covid'
cor_limit <- 0.2
(findAssocs(tdm,word,cor_limit))
```
# Topic Modelling
To identify themes in the tweets we will use the topicmodels package.
First we must convert our tdm object into a document-term matrix,
where each row is a document (i.e. a tweet) and each column a term.
Then we can specify the number of topics and the number of terms per
topic that we are interested in:
```{r}
library(topicmodels)
dtm <- as.DocumentTermMatrix(tdm)
topic_num = 6
term_num = 2
rowTotals <- apply(dtm, 1, sum) # sum of words in each document (tweet)
dtm <- dtm[rowTotals > 0, ]     # remove all documents without words
lda <- LDA(dtm, k = topic_num)  # find k topics
term <- terms(lda, term_num)    # first term_num terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
```
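Beyond the top terms per topic, topics() can also return the most
likely topic for each tweet, which gives a sense of how the topics are
distributed across the corpus; a short sketch:
```{r, eval=FALSE}
# Most likely topic for each tweet, and how often each topic occurs
tweet_topics <- topics(lda, 1)
table(tweet_topics)
```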
# Sentiment Analysis
Sentiment analysis can be useful to get an idea of the extent to which
the tweets in question are negative, positive, or neutral. For this we
will use the sentiment package, applied to the raw text in our
tweets.df object:
```{r, warning=FALSE}
library(sentiment)
# use the raw text for sentiment analysis
sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)
```
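A note on availability: the sentiment package loaded above is not on
CRAN. Assuming it is the sentiment140 wrapper (an assumption worth
verifying against the package you actually have installed), it can be
installed from GitHub:
```{r, eval=FALSE}
# Assumption: sentiment() comes from the sentiment140 package, a wrapper
# around the Sentiment140 API; install from GitHub since it is not on CRAN.
devtools::install_github("okugami79/sentiment140")
```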
We may wish to visualise the sentiment over time, where negative
values are associated with negative sentiment, positive values with
positive sentiment, and 0 with neutral:
```{r}
library(data.table)
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
# Plot the daily scores
p2 <- result %>% ggplot(aes(date, score)) + geom_line() + theme_light()
ggplotly(p2)
```
As we can see, there seems to be a peak in positive sentiment in
Bernie's tweets between May and July, on 2021-06-09 to be specific. We
might want to look at these tweets more closely to see why this was
the case; filtering on the date with a regular expression allows us to
do so:
```{r}
filter(tweets.df, grepl('2021-06-09',created)) %>% select(text)
```
## Emotion Lexicon
Alternatively, we may use the NRC emotion lexicon, which is a list of
words and their respective associations with eight emotions (anger,
fear, anticipation, trust, surprise, sadness, joy, and disgust) and
two sentiments (negative and positive). For this we will use the
syuzhet package. It is important that we use the cleaned data, as the
functions we will use from this package cannot handle non-graphic
data:
```{r}
library(syuzhet)
# use the cleaned text for emotion analysis
# since get_nrc_sentiment cannot deal with non-graphic data
tweet_clean <- data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
```
In contrast to the previous section, in this case we have a matrix,
where each row is a document (i.e. a tweet) and each column an emotion:
```{r}
# each row: a document (a tweet); each column: an emotion
emotion_matrix <- get_nrc_sentiment(tweet_clean$text)
```
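A quick look at the first few rows confirms the structure: one row per
tweet, with counts for the eight emotions plus the negative and
positive columns:
```{r, eval=FALSE}
# First tweets and their emotion/sentiment counts
head(emotion_matrix)
```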
We need to reshape the matrix so that instead of each row being a
tweet and each column an emotion, each row is an emotion and each
column a tweet. To do so we will transpose the matrix:
```{r}
# Matrix transpose
td <- data.frame(t(emotion_matrix))
# rowSums gives, for each emotion, the total count across all tweets
td_new <- data.frame(rowSums(td))
td_new
```
Next, some transformation and cleaning is necessary, in particular to
rename and rearrange the columns:
```{r}
#Transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
knitr::kable(td_new)
```
Finally, we may wish to visualise the emotions and sentiment, adding
an interactive element using the plotly package:
```{r, fig.cap="Bernie Sanders tweet emotion and sentiment"}
library(plotly)
library(ggplot2)
sentiment_plot <- qplot(sentiment, data = td_new, weight = count,
                        geom = "bar", fill = sentiment) +
  ggtitle("Tweets emotion and sentiment") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(sentiment_plot)
```
# Session Info
```{r}
sessionInfo()
```