---
title: "Text mining with sparklyr"
output: html_notebook
---
## 12.1 - Data Import
1. Open a Spark session
```{r}
library(sparklyr)
library(dplyr)
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
```
2. `spark_read_text()` works like `readLines()`, but loads the file into Spark through `sparklyr`. Use it to read the *mark_twain.txt* file into Spark.
```{r}
twain_path <- "file:///usr/share/bonus/mark_twain.txt"
twain <- spark_read_text(sc, "twain", twain_path)
```
3. Read the *arthur_doyle.txt* file into Spark
```{r}
doyle_path <- "file:///usr/share/bonus/arthur_doyle.txt"
doyle <- spark_read_text(sc, "doyle", doyle_path)
```
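For intuition about what ends up in the `line` column, here is a local base R sketch of the `readLines()` behaviour that `spark_read_text()` is compared to above. The temporary file and its contents are made up for illustration. It is an assumption here that the Spark reader also keeps blank lines as empty rows; that would explain why empty lines are filtered out in the next section.

```r
# Local illustration only: no Spark connection is needed.
tmp <- tempfile(fileext = ".txt")
writeLines(c("THE ADVENTURES OF TOM SAWYER", "", "CHAPTER I"), tmp)

# readLines() returns one character element per line of the file
local_lines <- readLines(tmp)

# Arrange the lines as a one-column data frame, similar in shape to
# the single `line` column used throughout this notebook
local_tbl <- data.frame(line = local_lines, stringsAsFactors = FALSE)

# Blank lines come through as empty strings
local_tbl$line == ""
unlink(tmp)
```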
## 12.2 - Prepare the data
1. Use `sdf_bind_rows()` to append the two files together
```{r}
all_words <- doyle %>%
  mutate(author = "doyle") %>%
  sdf_bind_rows({
    twain %>%
      mutate(author = "twain")
  }) %>%
  filter(nchar(line) > 0)
```
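The label-then-append pattern can be tried on ordinary data frames first. This is only a local sketch: `rbind()` stands in for `sdf_bind_rows()`, and the two small tables (`doyle_local`, `twain_local`) are made-up stand-ins for the Spark tables.

```r
# Local illustration only: two small made-up tables, one per author
doyle_local <- data.frame(
  line = c("To Sherlock Holmes she is always the woman.", ""),
  stringsAsFactors = FALSE
)
twain_local <- data.frame(
  line = c("", "Tom!", "No answer."),
  stringsAsFactors = FALSE
)

# Label each table with its author before stacking
doyle_local$author <- "doyle"
twain_local$author <- "twain"

# rbind() stacks the rows, playing the role of sdf_bind_rows()
all_lines_local <- rbind(doyle_local, twain_local)

# Drop empty lines, as filter(nchar(line) > 0) does
all_lines_local <- all_lines_local[nchar(all_lines_local$line) > 0, ]
all_lines_local
```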
2. Use Hive's *regexp_replace* to remove punctuation
```{r}
all_words <- all_words %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
```
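To check what the character class matches without a Spark session, the same pattern can be run through base R's `gsub()`. This is a local approximation: Spark evaluates the pattern with Java's regex engine, so `perl = TRUE` is used here to get comparable escaping rules, and the sample sentence is made up.

```r
# Same character class as the chunk above; perl = TRUE makes "\\-" an
# escaped hyphen, as it is in the Java regex engine that Spark uses
punctuation <- "[_\"'():;,.!?\\-]"

sample_line <- "\"Well, I lay if I get hold of you I'll--\""
cleaned <- gsub(punctuation, " ", sample_line, perl = TRUE)
cleaned
```

Each matched character is replaced by a single space, so the line keeps its length and runs of spaces can appear where punctuation was adjacent.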
3. Use `ft_tokenizer()` to separate each word.
```{r}
all_words <- all_words %>%
  # Current sparklyr releases name these arguments input_col / output_col
  ft_tokenizer(input_col = "line",
               output_col = "word_list")

head(all_words, 4)
```
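A rough local sketch of the tokenizing step, assuming the tokenizer lowercases each line and splits it on single whitespace characters. `tokenize_local()` is a hypothetical helper written for this illustration, not part of `sparklyr`, and the input lines are made up.

```r
# Local approximation of a lowercasing, whitespace-splitting tokenizer
tokenize_local <- function(lines) {
  strsplit(tolower(lines), "\\s")
}

lines <- c("Elementary  my dear Watson", "Tom  No answer")
word_list <- tokenize_local(lines)

# One character vector of tokens per input line
word_list[[1]]
```

Under that assumption, two consecutive spaces produce an empty token, which is one reason to filter on word length later on.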
4. Remove "stop words" with the `ft_stop_words_remover()` transformer
```{r}
all_words <- all_words %>%
  # Current sparklyr releases name these arguments input_col / output_col
  ft_stop_words_remover(input_col = "word_list",
                        output_col = "wo_stop_words")

head(all_words, 4)
```
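Stop word removal amounts to dropping every token that appears in a fixed list. The sketch below shows the idea on plain R lists. The list `stop_words_local` is a tiny made-up example; the default list used by Spark is much longer.

```r
# A tiny made-up stop word list for illustration
stop_words_local <- c("the", "a", "of", "my", "is", "and")

word_list <- list(
  c("the", "hound", "of", "the", "baskervilles"),
  c("my", "dear", "watson")
)

# Keep only the tokens that are not in the stop word list
wo_stop_words <- lapply(word_list, function(words) {
  words[!(words %in% stop_words_local)]
})
wo_stop_words
```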
5. Un-nest the tokens with **explode**
```{r}
all_words <- all_words %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, author) %>%
  filter(nchar(word) > 2)

head(all_words, 4)
```
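Exploding turns one row holding a list of tokens into one row per token, repeating the other columns. A base R sketch of that reshaping, using made-up data (`authors` and `tokens` are illustrative names):

```r
# Made-up nested data: one author and one token vector per input line
authors <- c("doyle", "twain")
tokens <- list(c("hound", "baskervilles"), c("tom", "", "answer"))

# "Explode": repeat each author once per token, then flatten the tokens
exploded <- data.frame(
  word   = unlist(tokens),
  author = rep(authors, lengths(tokens)),
  stringsAsFactors = FALSE
)

# Keep words longer than two characters, as the chunk above does
exploded <- exploded[nchar(exploded$word) > 2, ]
exploded
```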
6. Cache the *all_words* variable using `compute()`
```{r}
all_words <- all_words %>%
  compute("all_words")
```
## 12.3 - Data Analysis
1. Words used the most by author
```{r}
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n))

word_count
```
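The same count can be reproduced locally with base R, which helps when checking the logic on a handful of rows. The table `words_local` is made up, and `aggregate()` is used here as a stand-in for `group_by()` plus `tally()`.

```r
# Made-up exploded table: one row per (word, author) occurrence
words_local <- data.frame(
  word   = c("holmes", "watson", "holmes", "river", "tom", "river", "river"),
  author = c("doyle", "doyle", "doyle", "twain", "twain", "twain", "doyle"),
  stringsAsFactors = FALSE
)

# Count rows per author/word pair, the local analogue of group_by() + tally()
word_count_local <- aggregate(n ~ author + word,
                              data = transform(words_local, n = 1L),
                              FUN = sum)

# Most frequent pairs first, as arrange(desc(n)) does
word_count_local <- word_count_local[order(-word_count_local$n), ]
word_count_local
```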
2. Figure out which words are used by Doyle but not Twain
```{r}
doyle_unique <- filter(word_count, author == "doyle") %>%
  anti_join(filter(word_count, author == "twain"), by = "word") %>%
  arrange(desc(n)) %>%
  compute("doyle_unique")

doyle_unique
```
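An anti-join keeps the rows of the first table that have no match in the second. A minimal local sketch on made-up counts, using `%in%` in place of `anti_join()`:

```r
# Made-up per-author counts
word_count_local <- data.frame(
  author = c("doyle", "doyle", "doyle", "twain", "twain"),
  word   = c("holmes", "river", "watson", "river", "tom"),
  n      = c(12L, 3L, 9L, 7L, 20L),
  stringsAsFactors = FALSE
)

doyle_words <- word_count_local[word_count_local$author == "doyle", ]
twain_words <- word_count_local[word_count_local$author == "twain", ]

# Anti-join on `word`: keep Doyle rows whose word never appears for Twain
doyle_unique_local <- doyle_words[!(doyle_words$word %in% twain_words$word), ]
doyle_unique_local[order(-doyle_unique_local$n), ]
```

Here "river" is dropped because both authors use it, while "holmes" and "watson" remain.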
3. Use `wordcloud` to visualize the data in the previous step
```{r}
doyle_unique %>%
  # Sort again so the top 100 do not depend on stored row order
  arrange(desc(n)) %>%
  head(100) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
4. Find out how many times Twain used the word "sherlock"
```{r}
all_words %>%
  filter(author == "twain",
         word == "sherlock") %>%
  tally()
```
5. Against the `twain` variable, use Hive's *lower* to convert every line to lowercase, and then use *instr* to look for "sherlock" in each line
```{r}
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)
```
Most of these lines come from a Mark Twain short story, [A Double Barrelled Detective Story](https://www.gutenberg.org/files/3180/3180-h/3180-h.htm#link2H_4_0008). According to the [Wikipedia](https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story) page about the story, it is Twain's satire of the mystery novel genre, published in 1902.
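The `lower` and `instr` combination used to search the `twain` table can also be tried on a plain character vector. In this local sketch `tolower()` stands in for `lower`, and `regexpr(..., fixed = TRUE)` stands in for `instr`; the sample lines are made up.

```r
# Made-up lines standing in for the `twain` table
lines <- c("Sherlock Holmes entered the room.",
           "Tom appeared on the sidewalk.",
           "It was SHERLOCK again.")

# lower(): normalise case so the match is case-insensitive
lowered <- tolower(lines)

# instr(line, "sherlock") > 0: position of the first literal match,
# where regexpr() returns -1 (rather than 0) when there is no match
position <- regexpr("sherlock", lowered, fixed = TRUE)
lowered[position > 0]
```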
```{r}
spark_disconnect(sc)
```