---
title: "Assignment 12 - Data from the Web"
author: "Jack Blumenau"
output: html_document
---
# APIs
## Packages
You will need to load the following packages before beginning the assignment:
```{r, echo = TRUE, eval=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(quanteda)
```
# Web-Scraping
*Warning: Collecting data from the web ("web scraping") is usually really annoying. There is no single function that will give you the information you need from a webpage. Instead, you must carefully and painfully write code that will give you what you want. If that sounds OK, then continue on with this problem set. If it doesn't, stop here, and do something else.*
### Packages
You will need to load the following libraries to complete this part of the assignment (you may need to use `install.packages()` first):
```{r, eval=TRUE, message = FALSE, warning=FALSE}
library(rvest)
library(xml2)
```
1. `rvest` is a nice package which helps you to scrape information from web pages.
2. `xml2` is a package which includes functions that can make it (somewhat) easier to navigate through html data that is loaded into R.
### Overview
Throughout this course, the modal structure of a problem set has been that we give you a nice, clean, rectangular `data.frame` or `tibble`, which you use for the application of some fancy method. Here, we are going to walk through an example of getting the horrible, messy, and oddly-shaped data from a webpage, and turning it into a `data.frame` or `tibble` that is usable.
Since no two websites are the same, web scraping requires you to identify the relevant parts of the html code that lies behind websites. The goal here is to parse the HTML into usable data. Generally speaking, there are three main steps for web scraping:
1. Access a web page from R
2. Tell R where to "look" on the page
3. Manipulate the data into a usable format within R.
4. (We don't speak about step 4 so much, but it normally includes smacking your head against your desk, wondering where things went wrong and generally questioning all your life choices. But we won't dwell on that here.)
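Before turning to a real website, the three steps can be sketched on a tiny, made-up page. Everything in this example (the HTML string, the `people` class, the `toy_` object names) is invented for illustration, and the literal HTML string stands in for a URL so that the code runs without an internet connection.

```r
library(rvest)

# Step 1: "access" a page. A literal HTML string stands in for a real URL here.
toy_page <- read_html("
  <html><body>
    <table class='people'>
      <tr><th>Name</th><th>Party</th></tr>
      <tr><td>A. Example</td><td>Red</td></tr>
      <tr><td>B. Sample</td><td>Blue</td></tr>
    </table>
  </body></html>")

# Step 2: tell R where to look (here: the table with the class 'people')
toy_table <- html_element(toy_page, css = "table.people")

# Step 3: turn what we found into a rectangular object
toy_data <- html_table(toy_table)
toy_data
```

Real pages are rarely this tidy, which is why the rest of this section extracts individual tags rather than relying on `html_table()` alone.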
We are going to set ourselves a typical data science-type task in which we are going to scrape some data about politicians from their wiki pages. In particular, our task is to establish which universities were most popular amongst the crop of UK MPs who served in the House of Commons between 2017 and 2019. It is often useful to define in advance what the exact goal of the data collection task is. For us, we would like to finish with a `data.frame` or `tibble` that consists of one observation for each MP, and two variables: the MP's name, and where they went to university.
### Step 1: Scrape a list of MPs
First, we need to know which MPs were in parliament in this period. A bit of googling shows that [this wiki page](https://en.wikipedia.org/wiki/List_of_United_Kingdom_MPs_by_seniority_(2017–2019)) gives us what we need. Scroll down a little, and you will see that there is a table where each row is an MP. It looks like this:
![](figs/rt_hon.png)
The nice thing about this is that an html table like this should be reasonably easy to work with. We will need to be able to work with the underlying html code of the wiki page in what follows, so you will need to be able to see the source code of the website. If you don't know how to look at the source code, follow the relevant instructions on [this page](https://www.computerhope.com/issues/ch000746.htm) for the browser that you are using.
When you have figured that out, you should be able to see something that looks a bit like this:
![](figs/htmleg.png)
As you can see, html is horrible to look at. In R, we can read in the html code by using the `read_html` function from the `rvest` package:
```{r}
# Read in the raw html code of the wiki page
mps_list_page <- read_html("https://en.wikipedia.org/wiki/List_of_United_Kingdom_MPs_by_seniority_(2017–2019)")
```
`read_html` returns an XML document (to check, try running `class(mps_list_page)`), which makes navigating the different parts of the website (somewhat) easier.
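If you would like to see what such an object looks like without downloading anything, the following sketch parses a made-up HTML string instead of a URL. The string and the `toy_page` name are illustrative assumptions.

```r
library(rvest)

# read_html() also accepts a literal HTML string, which is handy for experiments
toy_page <- read_html("<html><body><p>Hello</p></body></html>")

# The result is an XML document object
class(toy_page)

# It can be queried like any other node
toy_text <- html_text(html_element(toy_page, "p"))
toy_text
```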
Now that we have the html code in R, we need to find the parts of the webpage that contain the table. Scroll through the source code that you should have open in your browser to see if you can find the parts of the code that contain the table we are interested in.
Around line 1154 (the exact line number will shift as the wiki page is edited), you should see something like `<table class="wikitable collapsible sortable" style="text-align: center; font-size: 85%; line-height: 14px;">`. This marks the beginning of the table that we are interested in, and we can ask `rvest` to extract that table from our `mps_list_page` object by using the `html_elements` function.
```{r, echo = TRUE, eval = TRUE}
# Extract table of MPs
mp_table <- html_elements(mps_list_page,
                          css = "table[class='wikitable collapsible sortable']")
```
Here, the string we pass to the `css` argument tells `rvest` that we would like to grab the `table` from the object `mps_list_page` that has the class `wikitable collapsible sortable`. The object we have created (`mp_table`) is itself an XML object, which is good, because we will need to navigate through that table to get the information we need.
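To get a feel for how these css expressions behave, here is a small sketch on a made-up page containing two tables. The HTML and object names are invented. The point is that `table[class='...']` matches the *whole* class attribute exactly, whereas the dot notation (`table.sortable`) matches any table that carries that class among others.

```r
library(rvest)

toy_page <- read_html("
  <html><body>
    <table class='wikitable collapsible sortable'><tr><td>target</td></tr></table>
    <table class='wikitable'><tr><td>other</td></tr></table>
  </body></html>")

# Exact match on the whole class attribute (as in the code above)
exact <- html_elements(toy_page, css = "table[class='wikitable collapsible sortable']")

# Any table that carries the class 'sortable', whatever else it has
by_class <- html_elements(toy_page, css = "table.sortable")

# Every table with the class 'wikitable'
all_wiki <- html_elements(toy_page, css = "table.wikitable")

c(exact = length(exact), by_class = length(by_class), all_wiki = length(all_wiki))
```

The exact-match form stops working if the page's authors add or reorder a class, so the dot notation may be the more forgiving choice on pages that change often.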
Now, within that table, we would like to extract two pieces of information for each MP: their name, and the link to their own individual wikipedia page. Looking back at the html source code, you should be able to see that each MP's entry in the table is contained within its own separate `<span>` tag, and the information we are after is further nested within an `<a>` tag. For example, line 1250 includes the following:
![](figs/bottomley.png)
Yes, Bottomley is a funny name.
We would like to extract all of these entries from the table, and we can do so by again using `html_elements` and using the appropriate css expression, which here is `"span a"`, because the information we want is included in the `a` tag which itself is nested within the `span` tag.
```{r, echo = TRUE, eval = TRUE}
# Extract MP names and urls
mp_table_entries <- html_elements(mp_table, "span a")
mp_table_entries
```
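The difference between `"a"` and `"span a"` is easiest to see on a small example. The table below is made up (the names and links are placeholders): each row has one link inside a `<span>` and one link that is not.

```r
library(rvest)

toy_table <- read_html("
  <table>
    <tr><td><span><a href='/wiki/Person_One'>Person One</a></span></td>
        <td><a href='/wiki/Some_Constituency'>Some Constituency</a></td></tr>
    <tr><td><span><a href='/wiki/Person_Two'>Person Two</a></span></td>
        <td><a href='/wiki/Other_Constituency'>Other Constituency</a></td></tr>
  </table>")

# Every link in the table
all_links <- html_elements(toy_table, "a")

# Only links that sit somewhere inside a <span>
span_links <- html_elements(toy_table, "span a")

length(all_links)
length(span_links)
```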
Finally, now that we have the entry for each MP, it is very simple to extract the name of the MP and the URL to their wikipedia page:
```{r, echo = TRUE, eval = TRUE}
# html_text returns the text between the tags (here, the MPs' names)
mp_names <- html_text(mp_table_entries)
# html_attr returns the attributes of the tags that you have named. Here we have
# asked for the "href", which will give us the link to each MP's own wiki page
mp_hrefs <- html_attr(mp_table_entries,
                      name = "href")
# Combine into a tibble. (tibble() never converts strings to factors, so no
# stringsAsFactors argument is needed; passing one would create an extra column.)
mps <- tibble(name = mp_names, url = mp_hrefs, university = NA_character_)
head(mps)
```
OK, OK, so those urls are not *quite* complete. We need to add "https://en.wikipedia.org" to the front of them first. We can do that using the `paste0()` function:
```{r, echo = TRUE, eval = TRUE}
mps$url <- paste0("https://en.wikipedia.org", mps$url)
head(mps)
```
That's better. Though, wait, how many observations are there in our `data.frame`?
```{r, echo = TRUE, eval = TRUE}
dim(mps)
```
`r nrow(mps)`? But there are only 650 MPs in the House of Commons! Oh, I know why, it's because some MPs will have left/died/[been caught in a scandal](https://www.theguardian.com/politics/2019/mar/22/tory-mp-christopher-davies-admits-expenses-offences) and therefore have been replaced...
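When a row count is higher than expected, it can also be worth ruling out accidental duplicates (for example, the same link being picked up twice). A minimal sketch of such a check, using a made-up `toy_mps` data frame rather than the real data:

```r
toy_mps <- data.frame(
  name = c("A. Example", "B. Sample", "A. Example"),
  url  = c("/wiki/A_Example", "/wiki/B_Sample", "/wiki/A_Example")
)

# How many rows share a url with an earlier row?
n_duplicates <- sum(duplicated(toy_mps$url))
n_duplicates

# Keep only the first occurrence of each url
toy_unique <- toy_mps[!duplicated(toy_mps$url), ]
nrow(toy_unique)
```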
Are you still here? Well done! We have something! We have...a list of MPs' names! But we don't have anything else. In particular, we still do not know where these people went to university. To find that, we have to move on to step 2.
### Step 2: Scrape the wiki page for each MP
Let's look at the page for the first MP in our list: <https://en.wikipedia.org/wiki/Kenneth_Clarke>. Scroll down the page, looking at the panel on the right-hand side. At the bottom of the panel, you will see this:
![](figs/clarke.png)
The bottom line gives Clarke's alma mater, which in this case is one of the Cambridge colleges. That is the information we are after. If we look at the html source code for this page, we can see that the alma mater line of the panel is enclosed in another `<a>` tag:
![](figs/alma.png)
Now that we know this, we can call in the html using `read_html` again:
```{r, echo = TRUE, eval = TRUE}
mp_text <- read_html(mps$url[1])
```
And then we can use `html_elements` and `html_text` to extract the name of the university. Here we use a somewhat more complicated argument to find the information we are looking for. The `xpath` argument tells `rvest` to look for the `a` tag with a title of `"Alma mater"`, and then to look for the *next* `a` tag that comes after it. This is because the name of the university is actually stored in the subsequent `a` tag.
```{r, echo = TRUE, eval = TRUE}
mp_university <- html_elements(mp_text,
                               xpath = "//a[@title='Alma mater']/following::a[1]") %>%
  html_text()
print(mp_university)
```
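If the xpath expression feels opaque, it may help to try it on a stripped-down, made-up version of the panel. The HTML below is an invented stand-in for a real infobox ("Example College" is a placeholder), but the xpath is the same one used above.

```r
library(rvest)

toy_infobox <- read_html("
  <table class='infobox'>
    <tr>
      <th><a href='/wiki/Alma_mater' title='Alma mater'>Alma mater</a></th>
      <td><a href='/wiki/Example_College'>Example College</a></td>
    </tr>
    <tr>
      <th>Profession</th>
      <td><a href='/wiki/Barrister'>Barrister</a></td>
    </tr>
  </table>")

# '//a[@title=...]' finds the label; 'following::a[1]' steps to the next <a>
found <- html_text(html_elements(toy_infobox,
                                 xpath = "//a[@title='Alma mater']/following::a[1]"))
found

# With no matching label, the result is an empty character vector
missing <- html_text(html_elements(toy_infobox,
                                   xpath = "//a[@title='No such label']/following::a[1]"))
length(missing)
```

The second query is worth remembering: a page without the label does not produce an error, just a result of length zero.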
Regardless of whether you followed that last bit: it works! We now know where Kenneth Clarke went to university. Finally, we can assign the university that he went to to the `mps` `tibble` that we created earlier:
```{r, echo = TRUE, eval = TRUE}
mps$university[1] <- mp_university
head(mps)
```
### Scraping exercises
1. Figure out how to collect this university information for all of the other MPs in the data. You will need to write a for-loop, which iterates over the URLs in the `data.frame` we just constructed and pulls out the relevant information from each MP's wiki page. You will find very quickly that web-scraping is a messy business, and your loop will probably fail. You might want to use the `stop`, `next`, `try` and `if` functions to help avoid problems.
<details>
<summary>Show solution</summary>
A for-loop is pretty easy to set up given the code provided above. We just need to loop over each row of the `mps` object, read in the html, find the university, and assign it to the relevant cell in the data.frame. E.g.
```{r, eval=FALSE, echo=TRUE}
for(i in 1:nrow(mps)){
  cat('.')
  mp_text <- read_html(mps$url[i])
  mp_university <- html_elements(mp_text,
                                 xpath = "//a[@title='Alma mater']/following::a[1]") %>%
    html_text()
  mps$university[i] <- mp_university
}
```
Here, `cat('.')` is just a piece of convenience code that will print out a dot to the console on every iteration of the loop. This helps us to know that R hasn't crashed and that something is still happening. It's also quite satisfying to know that every time a dot appears, that means that you have collected some new data.
However, if you try running that code, you'll see that it will cut out after a short while with an error.
The main difficulty with this exercise is that there are essentially an infinite number of ways in which data scraping can go wrong. Here, the main problem is that some of the MPs do not actually have any information recorded in their wiki profiles about the university that they attended. Look at the page for [Ronnie Campbell](https://en.wikipedia.org/wiki/Ronnie_Campbell) for example. Never went to university, but certainly looks like a happy chap.
Because of that, we need to build in some code into the loop that says 'OK, if you can't find any information about this MP's university, just code it as `NA`.' I've added a line that does this to the loop.
```{r, eval = FALSE, echo=TRUE}
for(i in 1:nrow(mps)){
  cat(".")
  mp_text <- read_html(mps$url[i])
  mp_university <- xml_text(xml_find_all(mp_text, xpath = "//a[@title='Alma mater']/following::a[1]"))
  if(length(mp_university)==0) mp_university <- NA
  mps$university[i] <- mp_university
}
```
```{r, echo=FALSE}
load("mps_alma_mater.Rdata")
```
Now the loop runs without breaking! Hooray!
(It is worth noting that this is a very simple example. In the typical web-scraping exercise, you should expect considerably more frustration than you have encountered here. :) Enjoy!)
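The exercise also mentioned `try`. A missing university is only one way for the loop to fail: a page might not load at all. One possible way to guard against both problems is to wrap the scraping step in a small helper that uses `tryCatch`. The function name `scrape_first` and the made-up inputs below are assumptions for illustration, not part of the original solution.

```r
library(rvest)

# Hypothetical helper: returns the first match for `xpath`, or NA if the page
# cannot be read or nothing on it matches.
scrape_first <- function(page_source, xpath) {
  page <- tryCatch(read_html(page_source), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)
  found <- html_text(html_elements(page, xpath = xpath))
  if (length(found) == 0) return(NA_character_)
  found[1]
}

alma_xpath <- "//a[@title='Alma mater']/following::a[1]"

# A made-up page that has the information
ok <- scrape_first(
  "<html><body><a title='Alma mater'>Alma mater</a> <a>Example College</a></body></html>",
  alma_xpath)

# A made-up page without it
none <- scrape_first("<html><body><p>No panel here</p></body></html>", alma_xpath)

c(ok, none)
```

Inside the loop, `mps$university[i] <- scrape_first(mps$url[i], alma_xpath)` would then replace the three scraping lines, and a page that fails to load would be coded as `NA` rather than stopping the loop.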
</details>
2. Which was the modal university for this set of UK MPs?
<details>
<summary>Show solution</summary>
```{r, eval=TRUE, echo = TRUE}
sort(table(mps$university), decreasing = TRUE)[1]
```
So, LSE is the most popular university for MPs? That seems...unlikely... And indeed it is. Remember the Kenneth Clarke example: wiki lists the college he attended in Cambridge, not just the university. Maybe lots of MPs went to Cambridge, but they all just went to different colleges? Let's check:
```{r, eval=TRUE, echo = TRUE}
unique(mps$university[grep("Cambridge",mps$university)])
```
Oh dear. Maybe it is the same for Oxford?
```{r, eval=TRUE, echo = TRUE}
unique(mps$university[grep("Oxford",mps$university)])
```
Yup.
Right, so we need to do some recoding. Let's create a new variable that we can use to simplify the universities coding:
```{r, eval=TRUE, echo = TRUE}
mps$university_new <- mps$university
mps$university_new[grep("Cambridge",mps$university)] <- "Cambridge"
mps$university_new[grep("Oxford",mps$university)] <- "Oxford"
mps$university_new[grep("London School of Economics",mps$university)] <- "LSE"
head(sort(table(mps$university_new), decreasing = TRUE))
```
Looks like the Oxbridge connection is still pretty strong!
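One caveat with recoding by simple pattern matching is that the pattern may catch more than intended: any entry containing the word "Oxford" is recoded, including institutions that are not the University of Oxford. The sketch below uses a made-up vector of entries to show the issue, along with a stricter pattern. The stricter pattern is an assumption about how entries are written, so check it against the real data before relying on it.

```r
toy_universities <- c("Trinity College, Cambridge",
                      "Balliol College, Oxford",
                      "University of Oxford",
                      "Oxford Brookes University",
                      NA)

# The simple pattern also catches Oxford Brookes
loose <- grepl("Oxford", toy_universities)
loose

# A stricter (assumed) pattern: a college name ending in ', Oxford',
# or the university's own name
is_oxford <- grepl(", Oxford$|^University of Oxford$", toy_universities)
is_oxford

toy_recoded <- toy_universities
toy_recoded[is_oxford] <- "Oxford"
table(toy_recoded, useNA = "ifany")
```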
</details>
3. Go back to the scraping code and see if you can add some more variables to the `tibble`. Can you scrape the MPs' party affiliations? Can you scrape their date of birth? Doing so will require you to look carefully at the html source code, and work out the appropriate xpath expression to use. For guidance on xpath, see [here](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples).
<details>
<summary>Show solution</summary>
```{r, eval=FALSE, echo = TRUE}
mps$university <- NA
mps$party <- NA
mps$birthday <- NA
for(i in 1:nrow(mps)){
  cat(".")
  mp_text <- read_html(mps$url[i])
  mp_university <- html_elements(mp_text, xpath = "//a[@title='Alma mater']/following::a[1]") %>%
    html_text()
  # Find the 'Political party' row header, then take the next link after it
  mp_party <- html_elements(mp_text, xpath = "//tr/th[text()='Political party']/following::a[1]") %>%
    html_text()
  # Birthdays are stored in a <span> with the class 'bday'
  mp_birthday <- html_elements(mp_text, xpath = "//span[@class='bday']") %>%
    html_text()
  if(length(mp_university)==0) mp_university <- NA
  if(length(mp_party)==0) mp_party <- NA
  if(length(mp_birthday)==0) mp_birthday <- NA
  mps$university[i] <- mp_university
  mps$party[i] <- mp_party
  mps$birthday[i] <- mp_birthday
}
save(mps, file = "mps_alma_mater.Rdata")
```
```{r, eval=TRUE, echo = FALSE}
load(file = "mps_alma_mater.Rdata")
```
```{r, eval=TRUE, echo = TRUE}
head(mps)
```
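To see how the two new xpath expressions work, here they are applied to a made-up panel. The HTML, "Example Party" and the date are all placeholders. The sketch assumes that the `bday` span holds a date in year-month-day form, in which case `as.Date()` can convert it directly.

```r
library(rvest)

toy_infobox <- read_html("
  <table class='infobox'>
    <tr><th>Born</th>
        <td>Example Person (<span class='bday'>1950-01-31</span>)</td></tr>
    <tr><th>Political party</th>
        <td><a href='/wiki/Example_Party'>Example Party</a></td></tr>
  </table>")

# The row header with the exact text 'Political party', then the next link
toy_party <- html_text(html_elements(toy_infobox,
  xpath = "//tr/th[text()='Political party']/following::a[1]"))

# The span with the class 'bday'
toy_birthday <- html_text(html_elements(toy_infobox,
  xpath = "//span[@class='bday']"))

toy_party
# Convert the string to a Date so that it can be used in calculations
toy_birth_date <- as.Date(toy_birthday)
format(toy_birth_date, "%Y")
```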
</details>