Commit

TDM 20200 Project 1 Question 2
mdw333 committed Jan 11, 2025
1 parent 5d4ee7d commit bfca948
Showing 1 changed file with 10 additions and 42 deletions.
@@ -47,9 +47,9 @@ and then we can parse this content using `BeautifulSoup`:
mysoup = BeautifulSoup(myresponse.content, 'html.parser')
----

Afterwards, we can use `select` statements to extract elements from the website. For instance, the names of The Data Mine staff members are contained in `p` tags with `class` attribute `purdue-home-cta-grid__card-name`.
Afterwards, we can use `select` statements to extract elements from the website. For instance, the names of The Data Mine staff members are given as the text after the `p` tags with `class` attribute `purdue-home-cta-grid__card-name`.

We can extract the names of all 22 staff members at once, by selecting the data in these tags from the page:
We can extract the names of all 22 staff members at once, by selecting the data in these `p` tags with `class` attribute `purdue-home-cta-grid__card-name` from the page, as follows:

[source, python]
----
@@ -63,64 +63,32 @@ With a list comprehension, we can get these 22 names into a list:
[element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')]
----

Now that you have the 22 staff members' names in a list, use a similar operation to extract the 22 staff members' job titles.
Now that you have the 22 staff members' names in a list, *use a similar operation* to extract the 22 staff members' job titles.

Finally, make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.


.Deliverables
====
- Use Python to make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.
- Be sure to document your work from Question 1, using some comments and insights about your work.
====

=== Question 2 (2 pts)

In Python, we often use the Pandas library for loading DataFrames. Pandas allows us to check some properties of our data frame. For instance, we can use the `shape` property to see how many rows and columns our DataFrame has:
The National Park Service homepage at https://www.nps.gov lists 56 states and territories. Their names are given as the text after the `a` tags with `class` attribute `dropdown-item dropdown-state`.

[source, python]
----
myDF.shape
----

Notice that Python starts counting from 0 (as opposed to R, which starts counting from 1). So the initial row of the Pandas DataFrame is row 0. In the head of the DataFrame, as we saw in Question 1, we see rows 0, 1, 2, 3, 4.

Now load the tail of the DataFrame. It displays rows 3371, 3372, 3373, 3374, 3375. As indicated by the `shape` property, this DataFrame has 3376 rows altogether, so this makes sense.

[source, python]
----
myDF.tail()
----

We can select rows of the data frame that meet certain conditions. For instance, we can extract the airports located in New York City as follows:

[source, python]
----
myDF[(myDF['city'] == 'New York') & (myDF['state'] == 'NY')]
----
Extract the 56 names of the states and territories, and remove the leading and trailing whitespace from the `text` of each tag, using the `strip()` method.
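
As a minimal sketch of this step, the snippet below parses a small made-up HTML fragment instead of fetching the live page. The fragment, the second entry, and the variable names are assumptions for illustration only. It uses the class selector `a.dropdown-item.dropdown-state`, which should match the same tags as the attribute form described above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the real page content.
# The second entry is a placeholder, not taken from the actual site.
sample_html = """
<ul>
  <li><a class="dropdown-item dropdown-state" href="/state/in/index.htm">
      Indiana
  </a></li>
  <li><a class="dropdown-item dropdown-state" href="/state/xx/index.htm">  Placeholder Territory </a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Select every `a` tag that carries both classes.
elements = soup.select('a.dropdown-item.dropdown-state')

# .text keeps the surrounding whitespace; strip() removes it from both ends.
names = [element.text.strip() for element in elements]
print(names)
```

On the real page, the same list comprehension would be applied to the soup built from the downloaded response.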

Now try it yourself! After you display the airports in New York City, then please display the airports from Indianapolis, IN, and also from Houston, TX. Please note that you need to specify the city and the state for this to work. If you forget the state on the Indianapolis query, it will be OK, because no other city in the country is named Indianapolis. BUT if you forget the state for Houston, you will get some airports that are not in Houston, TX, but instead, are from other states. For this reason, you need to always include the conditions on the city and the state of the desired location.
Now that you have these 56 names, we can get the locations of the webpages devoted to each state and territory, by extracting the `href` attribute from each tag. If a tag is stored in `element`, then its `href` attribute can be retrieved as `element['href']`. Prepend the string `'https://www.nps.gov'` to each of these strings.
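
The attribute lookup and the URL prefix can be sketched on a single tag as follows. The one-line HTML string is a made-up stand-in for the real page, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical single-tag fragment; the href mirrors the Indiana example
# given in this question.
sample_html = '<a class="dropdown-item dropdown-state" href="/state/in/index.htm">Indiana</a>'

soup = BeautifulSoup(sample_html, 'html.parser')
element = soup.select_one('a.dropdown-item.dropdown-state')

# Tag attributes are accessed like dictionary keys.
relative_url = element['href']

# Put the site address in front of the relative path.
full_url = 'https://www.nps.gov' + relative_url
print(full_url)
```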

*For each question in The Data Mine*, please always be sure to put some comments after your cells, which describe all of the work that you are doing in the cells, as well as your thinking and insights about the results.
Finally, make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.

[NOTE]
====
Some common Jupyter notebook shortcuts:
- Instead of clicking the `play button`, you can press ctrl+enter (or cmd+enter on Mac) to run a cell.
- If you want to run a cell and then move immediately to the next cell, you can use shift+enter. This is often more useful than ctrl+enter.
- If you want to run the current cell and then immediately create a new code cell below it, you can press alt+enter (or option+enter on Mac) to do so.
- When a cell is selected (this means you clicked next to it, and it should show a blue bar to its left to signify this), pressing the `d` key twice will delete that cell.
- When a cell is selected, pressing the `a` key will create a new code cell `a`bove the currently selected cell.
- When a cell is selected, pressing the `b` key will create a new code cell `b`elow the selected cell.
====
For instance, the row for Indiana should have `'Indiana'` in the left column and `'https://www.nps.gov/state/in/index.htm'` in the right column.
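
One possible way to assemble two parallel lists into such a data frame is sketched below. The column labels `name` and `url`, the variable names, and the second row are assumptions made for illustration; only the Indiana row comes from the text above.

```python
import pandas as pd

# Short illustrative lists; the real ones would each hold 56 entries.
# The second entry is a placeholder, not a real state or URL.
names = ['Indiana', 'Placeholder Territory']
urls = [
    'https://www.nps.gov/state/in/index.htm',
    'https://www.nps.gov/state/xx/index.htm',
]

# Each dictionary key becomes a column; the lists must have equal length.
parks_df = pd.DataFrame({'name': names, 'url': urls})

print(parks_df.shape)
print(parks_df.head())
```

Checking `shape` afterwards is a quick way to confirm that the full data frame has the expected 56 rows and 2 columns.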

.Deliverables
====
- Use the `shape` property to see how many rows and columns are in the data frame with the airports data.
- Display the `tail` of the DataFrame.
- Display the airports located in New York, NY.
- Display the airports located in Indianapolis, IN.
- Display the airports located in Houston, TX.
- Use Python to make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.
- Be sure to document your work from Question 2, using some comments and insights about your work.
====
