diff --git a/projects-appendix/modules/spring2025/pages/20200/project1-teachinglearning.adoc b/projects-appendix/modules/spring2025/pages/20200/project1-teachinglearning.adoc deleted file mode 100644 index e39fa3275..000000000 --- a/projects-appendix/modules/spring2025/pages/20200/project1-teachinglearning.adoc +++ /dev/null @@ -1,215 +0,0 @@ -= TDM 20200: Web Scraping Project 1 -- Spring 2025 - -**Motivation:** Happy New Year! In this project, we will get comfortable scraping some data from the internet in Python. - -**Context:** In this project, we will use BeautifulSoup, documented here: https://pypi.org/project/beautifulsoup4/ - -**Scope:** Web Scraping in Python - -.Learning Objectives: -**** -- We learn the basics about web scraping using Beautiful Soup in Python -**** - -Make sure to read about, and use the template found xref:ROOT:templates.adoc[here], and the important information about project submissions xref:ROOT:submissions.adoc[here]. - -== Dataset(s) - -In this project we will scrape data from the following websites: - -- https://datamine.purdue.edu/about/about-welcome/ -- https://www.nps.gov -- https://books.toscrape.com -- https://www.scrapethissite.com/pages/forms/ - -== Questions - -=== Question 1 (2 pts) - -When scraping data from websites using Beautiful Soup, we first import `requests` and `BeautifulSoup`: - -[source, python] ----- -import requests -from bs4 import BeautifulSoup ----- - -Afterwards, we can use `requests` to scrape the data from a website, such as (for instance) The Data Mine staff directory: - -[source, python] ----- -myresponse = requests.get("https://datamine.purdue.edu/about/about-welcome/") ----- - -and then we can parse this content using `BeautifulSoup`: - -[source, python] ----- -mysoup = BeautifulSoup(myresponse.content, 'html.parser') ----- - -Afterwards, we can use `select` statements to extract elements from the website. For instance, the names of The Data Mine staff members are given as the text after the `p` tags with `class` attribute `purdue-home-cta-grid__card-name`. - -We can extract the names of all 22 staff members at once, by selecting the data in these `p` tags with `class` attribute `purdue-home-cta-grid__card-name` from the page, as follows: - -[source, python] ----- -mysoup.select('p[class = "purdue-home-cta-grid__card-name"]') ----- - -With a list comprehension, we can get these 22 names into a list: - -[source, python] ----- -[element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')] ----- - -Now that you have the 22 staff members' names in a list, *use a similar operation* to extract the 22 staff members' job titles. - -Finally, make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column. - - -.Deliverables -==== -- Use Python to make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column. -- Be sure to document your work from Question 1, using some comments and insights about your work. -==== - -=== Question 2 (2 pts) - -The National Park Service homepage at https://www.nps.gov lists 56 states and territories. Their names are given as the text after the `a` tags with `class` attribute `dropdown-item dropdown-state`. - -Extract the 56 names of the states and territories, and remove the whitespace from the `text`, using the `strip()` function. 
- -Now that you have these 56 names, we can get the locations of the webpages devoted to each state and territory, by extracting the `href` attribute from each tag. If your data is stored in `element`, then the `href` attribute can be retrieved as `element['href']`. Append the string `'https://www.nps.gov'` to the front of each string. - -Finally, make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column. - -For instance, the row for Indiana should have `'Indiana'` in the left column and `'https://www.nps.gov/state/in/index.htm'` in the right column. - -.Deliverables -==== -- Use Python to make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column. -- Be sure to document your work from Question 2, using some comments and insights about your work. -==== - -=== Question 3 (2 pts) - -The demo website Books To Scrape does not have real prices for books. It is only a demonstration website, located at https://books.toscrape.com/ - -This website has numerous categories in the left-hand sidebar. The names of the categories are given in a double set of `li` tags, and then an `li` tag, and then an `a` tag. The names of the categories are the text after the `a` tags. - -Extract the 50 category types as the text after the `a` tags, and remove the whitespace from the `text`, using the `strip()` function. Hint: `'Travel'` should be the first category, and `'Crime'` should be the last category. - -Now that you have these 50 categories, we can get the locations of the webpages devoted to each category, by extracting the `href` attribute from each tag. If your data is stored in `element`, then the `href` attribute can be retrieved as `element['href']`. Append the string `'https://books.toscrape.com/'` to the front of each string. - -(As a very minor point for sharp readers: In question 2, we appended `'https://www.nps.gov'` without an additional forward slash, because in the NPS website, the slash was already in the `href` attribute.) - -Finally, make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column. - -For instance, the row for Poetry should have `'Poetry'` in the left column and `'https://books.toscrape.com/catalogue/category/books/poetry_23/index.html'` in the right column. - -.Deliverables -==== -- Use Python to make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column. -- Be sure to document your work from Question 3, using some comments and insights about your work. -==== - -=== Question 4 (2 pts) - -This website http://www.scrapethissite.com/pages/forms/ has data about hockey teams, which students can use to practice scraping tables. - -We can view 100 rows of this data at a time, for instance, as follows: http://www.scrapethissite.com/pages/forms/?page_num=4&per_page=100 which gives the 4th page of the data. In other words, this page shows rows 301 through 400. - -Indeed, there are only 582 rows altogether. 
By asking for 582 or more rows at a time, in this particular website, we can actually get all 582 rows at once, like this: https://www.scrapethissite.com/pages/forms/?per_page=600 - -(This is website dependent! Not every website will allow you to do this.) - -Now we can extract the entire table from this website. First we need to import Pandas, and also `io` from `StringIO`: - -[source, python] ----- -import pandas as pd -from io import StringIO ----- - -Then, as in the previous two questions, we can extract the contents of the website as follows: - -[source, python] ----- -myresponse = requests.get("https://www.scrapethissite.com/pages/forms/?per_page=600") -mysoup = BeautifulSoup(myresponse.content, 'html.parser') ----- - -and then we can read the entire table, using `StringIO` and Pandas, as follows: - -[source, python] ----- -pd.read_html(StringIO(str(myresponse.text)))[0] ----- - -which will show rows 0 through 4 and also rows 577 through 581. - -.Deliverables -==== -- Extract all 582 rows and 9 columns of the hockey data into a Pandas data frame. Display rows 0 through 4 and also rows 577 through 581. -- Be sure to document your work from Question 4, using some comments and insights about your work. -==== - - -=== Question 5 (2 pts) - -For *academic purposes only* now we extract a Snoopy comic from the internet. As many students know, Dr Ward loves the Woodstock character from the Peanuts comic strip. Although Woodstock first appeared on March 4, 1966, he was not named until June 22, 1970. We can extract the comic from June 22, 1970, as follows: - -Load the comic at this website: https://www.gocomics.com/peanuts/1970/06/22 - -In Firefox, right click on the comic (or Control-click on a Mac), and "Inspect" the image in Firefox. If we look into some of the html content for the picture, we will see: - -[source, html] ----- -Peanuts Comic Strip for June 22, 1970 ----- - -In particular, if we look for an `img` tag with `alt` attribute that has value `'Peanuts Comic Strip for June 22, 1970 '` then we can extract the `src` attribute. Hint: It is necessary to put the space after the year in the string, on this website. - -Verify that this URL contains the comic for the day that Woodstock got named: https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47 - -Now load the Peanuts comic for two other days, and explain your steps. In particular, specify which two other days you explored, and give the location of the comic image for those two days, just like for June 22, 1970, the comic image is located here: https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47 - -.Deliverables -==== -- Verify that this URL contains the comic for the day that Woodstock got named: https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47 -- For two additional days of your choice, give the days and the locations of the Peanuts comic image for those two days. -- Be sure to document your work from Question 5, using some comments and insights about your work. -==== - -== Submitting your Work - -Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template. - -Congratulations! Assuming you've completed all the above questions, you've just finished your first project for TDM 10200! 
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours. - -Prior to submitting your work, you need to put your work xref:ROOT:templates.adoc[into the project template], and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the xref:ROOT:submissions.adoc[detailed instructions on how to ensure that your submission is formatted correctly]. To download your completed project, you can right-click on the file in the file explorer and click 'download'. - -Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don't lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!! - -.Items to submit -==== -- firstname_lastname_project1.ipynb -==== - -[WARNING] -==== -It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template. - -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. - -**Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== - diff --git a/projects-appendix/modules/spring2025/pages/20200/project1.adoc b/projects-appendix/modules/spring2025/pages/20200/project1.adoc index e06c37a56..d1fd7607e 100644 --- a/projects-appendix/modules/spring2025/pages/20200/project1.adoc +++ b/projects-appendix/modules/spring2025/pages/20200/project1.adoc @@ -1,73 +1,209 @@ -= TDM 20200: Project Project 1 -- Spring 2025 += TDM 20200: Web Scraping Project 1 -- Spring 2025 -**Motivation:** Put some motivation here +**Motivation:** Happy New Year! In this project, we will get comfortable scraping some data from the internet in Python. -**Context:** Put some context here +**Context:** In this project, we will use BeautifulSoup, documented here: https://pypi.org/project/beautifulsoup4/ -**Scope:** Put a scope here +**Scope:** Web Scraping in Python .Learning Objectives: **** -- Objective 1 -- Objective 2 -- Objective 3 +- We learn the basics about web scraping using Beautiful Soup in Python **** -Put preliminary stuff here +Make sure to read about, and use the template found xref:ROOT:templates.adoc[here], and the important information about project submissions xref:ROOT:submissions.adoc[here]. 
+
+
+== Dataset(s)
+
+In this project we will scrape data from the following websites:
+
+- https://datamine.purdue.edu/about/about-welcome/
+- https://www.nps.gov
+- https://books.toscrape.com
+- https://www.gocomics.com/peanuts/
+- https://www.scrapethissite.com/pages/forms/
== Questions
=== Question 1 (2 pts)
-Put question 1 here
+When scraping data from websites using Beautiful Soup, we first import `requests` and `BeautifulSoup`:
+
+[source, python]
+----
+import requests
+from bs4 import BeautifulSoup
+----
+
+Afterwards, we can use `requests` to scrape the data from a website, such as The Data Mine staff directory:
+
+[source, python]
+----
+myresponse = requests.get("https://datamine.purdue.edu/about/about-welcome/")
+----
+
+and then we can parse this content using `BeautifulSoup`:
+
+[source, python]
+----
+mysoup = BeautifulSoup(myresponse.content, 'html.parser')
+----
+
+Then we can use `select` statements to extract elements from the website. For instance, the names of The Data Mine staff members are given as the text within the `p` tags with `class` attribute `purdue-home-cta-grid__card-name`.
+
+We can extract the names of all 22 staff members at once, by selecting the data in these `p` tags with `class` attribute `purdue-home-cta-grid__card-name` from the page, as follows:
+
+[source, python]
+----
+mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')
+----
+
+With a list comprehension, we can get these 22 names into a list:
+
+[source, python]
+----
+[element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')]
+----
+
+Now that you have the 22 staff members' names in a list, *use a similar operation* to extract the 22 staff members' job titles.
+
+[TIP]
+====
+Just like Dr Ward in the video, you might need to use Command-U (Mac) or Control-U (Windows, Unix) in Firefox to inspect the page and find the attributes that you want.
+====
+
+Finally, make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.
+
.Deliverables
====
-Put deliverables for question 1 here.
+- Use Python to make a Pandas data frame with 22 rows and 2 columns, namely, 1 row per staff member, with their name in the left column and their job title in the right column.
+- Be sure to document your work from Question 1, using some comments and insights about your work.
====
=== Question 2 (2 pts)
-Put question 2 here
+The National Park Service homepage at https://www.nps.gov lists 56 states and territories. Their names are given as the text within the `a` tags with `class` attribute `dropdown-item dropdown-state`.
+
+Extract the 56 names of the states and territories, and remove the whitespace from the `text`, using the `strip()` function.
+
+Now that you have these 56 names, we can get the locations of the webpages devoted to each state and territory, by extracting the `href` attribute from each tag. If your data is stored in `element`, then the `href` attribute can be retrieved as `element['href']`. Append the string `'https://www.nps.gov'` to the front of each string.
+
+Finally, make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.
+
+For instance, the row for Indiana should have `'Indiana'` in the left column and `'https://www.nps.gov/state/in/index.htm'` in the right column.
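+
+[TIP]
+====
+Here is a minimal sketch (not the only way to do it) of how the pieces of this question can fit together, assuming the `a` tags with `class` attribute `dropdown-item dropdown-state` described above. The variable names and the column labels `'name'` and `'url'` are just placeholders; adjust the details to match what you actually find when you inspect the page.
+
+[source, python]
+----
+import requests
+import pandas as pd
+from bs4 import BeautifulSoup
+
+# download and parse the NPS homepage
+myresponse = requests.get("https://www.nps.gov")
+mysoup = BeautifulSoup(myresponse.content, 'html.parser')
+
+# select the state/territory links by their class attribute
+myelements = mysoup.select('a[class = "dropdown-item dropdown-state"]')
+
+# strip the whitespace from each name, and build the full URL from each href
+mynames = [element.text.strip() for element in myelements]
+myurls = ['https://www.nps.gov' + element['href'] for element in myelements]
+
+# one row per state or territory: name in the left column, URL in the right column
+mydf = pd.DataFrame({'name': mynames, 'url': myurls})
+----
+
+Building the two lists first, and then passing them to `pd.DataFrame`, keeps each step easy to check on its own (for instance, you can confirm that both lists have 56 entries before you build the data frame).
+====
+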
.Deliverables
====
-Put deliverables for question 2 here.
+- Use Python to make a Pandas data frame with 56 rows and 2 columns, namely, 1 row per state or territory, with their name in the left column and a string displaying the URL for that state or territory in the right column.
+- Be sure to document your work from Question 2, using some comments and insights about your work.
====
=== Question 3 (2 pts)
-Put question 3 here
+The demo website Books To Scrape does not have real prices for books. It is only a demonstration website, located at https://books.toscrape.com/
+
+This website has numerous categories in the left-hand sidebar. The names of the categories are given in a double set of `li` tags, and then an `li` tag, and then an `a` tag. The names of the categories are the text within the `a` tags.
+
+Extract the 50 category types as the text within the `a` tags, and remove the whitespace from the `text`, using the `strip()` function. Hint: `'Travel'` should be the first category, and `'Crime'` should be the last category.
+
+Now that you have these 50 categories, we can get the locations of the webpages devoted to each category, by extracting the `href` attribute from each tag. If your data is stored in `element`, then the `href` attribute can be retrieved as `element['href']`. Append the string `'https://books.toscrape.com/'` to the front of each string.
+
+(As a very minor point for sharp readers: In question 2, we appended `'https://www.nps.gov'` without an additional forward slash, because in the NPS website, the slash was already in the `href` attribute.)
+
+Finally, make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.
+
+For instance, the row for Poetry should have `'Poetry'` in the left column and `'https://books.toscrape.com/catalogue/category/books/poetry_23/index.html'` in the right column.
.Deliverables
====
-Put deliverables for question 3 here.
+- Use Python to make a Pandas data frame with 50 rows and 2 columns, namely, 1 row per category, with their name in the left column and a string displaying the URL for that category in the right column.
+- Be sure to document your work from Question 3, using some comments and insights about your work.
====
+
=== Question 4 (2 pts)
-Put question 4 here
+For *academic purposes only*, we now extract a Snoopy comic from the internet. As many students know, Dr Ward loves the Woodstock character from the Peanuts comic strip. Although Woodstock first appeared on March 4, 1966, he was not named until June 22, 1970. We can extract the comic from June 22, 1970, as follows:
+
+Load the comic at this website: https://www.gocomics.com/peanuts/1970/06/22
+
+In Firefox, right-click on the comic (or Control-click on a Mac), and "Inspect" the image. If we look into some of the html content for the picture, we will see an `img` tag along these lines:
+
+[source, html]
+----
+<img ... alt="Peanuts Comic Strip for June 22, 1970 " src="https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47">
+----
+
+In particular, if we look for an `img` tag with `alt` attribute that has value `'Peanuts Comic Strip for June 22, 1970 '` then we can extract the `src` attribute. Hint: It is necessary to put the space after the year in the string on this website.
+
+Verify that this URL contains the comic for the day that Woodstock got named: https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47
+
+Now load the Peanuts comic for two other days, and explain your steps; a sketch of one possible approach is given in the tip below.
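+
+[TIP]
+====
+Here is a minimal sketch of one possible approach, using the June 22, 1970 page as the example. It assumes the page for a given day serves an `img` tag whose `alt` attribute has the form shown above; the variable names are just placeholders, and for other days you will need to adjust both the URL and the `alt` string (including the trailing space).
+
+[source, python]
+----
+import requests
+from bs4 import BeautifulSoup
+
+# download and parse the page for one specific day of the comic
+myresponse = requests.get("https://www.gocomics.com/peanuts/1970/06/22")
+mysoup = BeautifulSoup(myresponse.content, 'html.parser')
+
+# look for the img tag by its alt attribute (note the space after the year)
+myimages = mysoup.select('img[alt = "Peanuts Comic Strip for June 22, 1970 "]')
+
+# the src attribute of the first match is the location of the comic image
+print(myimages[0]['src'])
+----
+====
+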
+In particular, specify which two other days you explored, and give the location of the comic image for those two days, just as the comic image for June 22, 1970 is located here: https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47
.Deliverables
====
-Put deliverables for question 4 here.
+- Verify that this URL contains the comic for the day that Woodstock got named: https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47
+- For two additional days of your choice, give the days and the locations of the Peanuts comic image for those two days.
+- Be sure to document your work from Question 4, using some comments and insights about your work.
====
=== Question 5 (2 pts)
-Put question 5 here
+This website http://www.scrapethissite.com/pages/forms/ has data about hockey teams, which students can use to practice scraping tables.
+
+We can view 100 rows of this data at a time, for instance, as follows: http://www.scrapethissite.com/pages/forms/?page_num=4&per_page=100 which gives the 4th page of the data. In other words, this page shows rows 301 through 400.
+
+Indeed, there are only 582 rows altogether. By asking for 582 or more rows at a time on this particular website, we can actually get all 582 rows at once, like this: https://www.scrapethissite.com/pages/forms/?per_page=600
+
+(This is website-dependent! Not every website will allow you to do this.)
+
+Now we can extract the entire table from this website. First we need to import Pandas, and also `StringIO` from the `io` module:
+
+[source, python]
+----
+import pandas as pd
+from io import StringIO
+----
+
+Then, as in the previous questions, we can extract the contents of the website as follows:
+
+[source, python]
+----
+myresponse = requests.get("https://www.scrapethissite.com/pages/forms/?per_page=600")
+mysoup = BeautifulSoup(myresponse.content, 'html.parser')
+----
+
+and then we can read the entire table, using `StringIO` and Pandas, as follows:
+
+[source, python]
+----
+pd.read_html(StringIO(str(myresponse.text)))[0]
+----
+
+which will show rows 0 through 4 and also rows 577 through 581.
.Deliverables
====
-Put deliverables for question 5 here.
+- Extract all 582 rows and 9 columns of the hockey data into a Pandas data frame. Display rows 0 through 4 and also rows 577 through 581.
+- Be sure to document your work from Question 5, using some comments and insights about your work.
====
-
== Submitting your Work
-Put any final comments here.
+Please make sure that you added comments for each question, explaining your thinking and your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.
+
+Congratulations! Assuming you've completed all the above questions, you've just finished your first project for TDM 20200! If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
+
+Prior to submitting your work, you need to put your work xref:ROOT:templates.adoc[into the project template], and re-run all of the code in your Jupyter notebook and make sure that the results of running that code are visible in your template. Please check the xref:ROOT:submissions.adoc[detailed instructions on how to ensure that your submission is formatted correctly]. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
+ +Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don't lose any points. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!! .Items to submit ==== @@ -76,7 +212,7 @@ Put any final comments here. [WARNING] ==== -It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please be mindful of the +It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template. You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. @@ -84,3 +220,4 @@ You _must_ double check your `.ipynb` after submitting it in gradescope. A _very You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. ==== +