Commit

TDM 20200 Project 1 Question 3
mdw333 committed Jan 11, 2025
1 parent bfca948 commit cbb9029
Showing 1 changed file with 24 additions and 5 deletions.
@@ -94,23 +94,42 @@ For instance, the row for Indiana should have `'Indiana'` in the left column and

=== Question 3 (2 pts)

For this question, we only pay attention to the state (not the city) for each airport. Which state has the largest number of airports? How many airports are located in that state? We can extract (only) the states from each airport by writing:
This website http://www.scrapethissite.com/pages/forms/ has data about hockey teams, which students can use to practice scraping tables.

We can view 100 rows of this data at a time. For instance, http://www.scrapethissite.com/pages/forms/?page_num=4&per_page=100 gives the 4th page of the data; in other words, it shows rows 301 through 400.

There are only 582 rows altogether. On this particular website, asking for 582 or more rows per page returns all 582 rows at once, like this: https://www.scrapethissite.com/pages/forms/?per_page=600

(This is website dependent! Not every website will allow you to do this.)
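
The relationship between `page_num`, `per_page`, and the rows shown can be checked with a little arithmetic. The sketch below is only an illustration: the helper names `rows_on_page` and `pages_needed` are hypothetical, and it assumes rows are numbered starting from 1, as in the description above.

```python
import math

TOTAL_ROWS = 582  # total number of rows reported for this table


def rows_on_page(page_num, per_page, total_rows=TOTAL_ROWS):
    """Return (first_row, last_row), counting from 1, for a given page."""
    first = (page_num - 1) * per_page + 1
    last = min(page_num * per_page, total_rows)
    return first, last


def pages_needed(per_page, total_rows=TOTAL_ROWS):
    """Return how many pages are needed to cover every row."""
    return math.ceil(total_rows / per_page)


print(rows_on_page(4, 100))  # (301, 400)
print(pages_needed(100))     # 6
print(pages_needed(600))     # 1
```

With `per_page=600`, a single page is enough to cover all 582 rows, which is why the URL above returns the whole table.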

Now we can extract the entire table from this website. First we need to import Pandas, and also `StringIO` from `io`:

[source, python]
----
myDF['state']
import pandas as pd
from io import StringIO
----

and then the `value_counts` function gives the number of airports in each state:
Then, as in the previous two questions, we can extract the contents of the website as follows:

[source, python]
----
myDF['state'].value_counts()
import requests
from bs4 import BeautifulSoup
myresponse = requests.get("https://www.scrapethissite.com/pages/forms/?per_page=600")
mysoup = BeautifulSoup(myresponse.content, 'html.parser')
----

and then we can read the entire table, using `StringIO` and Pandas, as follows:

[source, python]
----
pd.read_html(StringIO(str(myresponse.text)))[0]
----

Because the data frame has more rows than Pandas displays by default, the output is truncated: it will show rows 0 through 4 and also rows 577 through 581.
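
To see what `pd.read_html` returns without contacting the website, here is a small sketch on a made-up HTML table. The team names and numbers are invented for illustration and are not taken from the hockey data. It assumes an HTML parser that Pandas can use, such as `lxml`, is installed.

```python
import pandas as pd
from io import StringIO

# A tiny, made-up HTML table standing in for the page's contents.
html = """
<table>
  <tr><th>Team</th><th>Year</th><th>Wins</th></tr>
  <tr><td>Team A</td><td>1990</td><td>44</td></tr>
  <tr><td>Team B</td><td>1990</td><td>31</td></tr>
</table>
"""

# read_html returns a list with one data frame per <table> it finds,
# which is why the result is indexed with [0].
tables = pd.read_html(StringIO(html))
myDF = tables[0]

print(len(tables))  # number of tables found
print(myDF.shape)   # (rows, columns)
print(myDF.head())  # first rows of the data frame
```

On a larger data frame, `myDF.head()` and `myDF.tail()` return the first five and last five rows explicitly, rather than relying on the truncated default display.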

.Deliverables
====
- Use the `value_counts` function to find the number of airports in each state.
- Extract all 582 rows and 9 columns of the hockey data into a Pandas data frame. Display rows 0 through 4 and also rows 577 through 581.
- Be sure to document your work from Question 3, using some comments and insights about your work.
====
